Forgetting is Everywhere

Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

University of Edinburgh

Preprint

The Absorption and Loss Process

Forgetting remains poorly understood, and this limits progress in machine learning. A fundamental challenge is to develop models that can learn continuously from new data without losing knowledge they have already acquired. This problem is particularly apparent in deep learning, where existing methods often struggle to retain past information while adapting to new tasks. A major difficulty lies in defining what it actually means for a learner to forget. The aim of this paper is to clarify forgetting: to understand what it is, when and why it occurs, and how it shapes the dynamics of learning.

Abstract

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget and shows that Bayesian learners are capable of adapting without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

Intuition

If a learner updates its predictions on data it already expects, that update cannot represent the acquisition of new information. Instead, it must represent the loss of previously acquired knowledge.

Forgetting is Ubiquitous

Forgetting is ubiquitous in deep learning. The trade-off between training efficiency and forgetting determines the optimal amount to forget; in deep learning, this is rarely zero.

Forgetting and Efficiency

Forgetting is an integral component of learning: effective learning requires selectively forgetting outdated knowledge to integrate new information.

Learning Interactions

We describe learning as a continuous interaction between a learner and an environment that evolves over time. Time is represented as a sequence of steps, denoted \( t \in \mathbb{N}_0 \). At each step \( t \), the learner performs an action \( Y_t \), observes an outcome \( X_t \) from the environment, and updates its internal state \( Z_t \) based on this new information. The sequence of these actions, observations, and updates forms a stochastic process \( \{(X_t, Y_t, Z_t)\}_{t\ge0} \), which we refer to as the interaction process.

Formally, the interaction process can be written as:

\[ \begin{aligned} \text{Initialisation:} \quad & Y_0=\bot,\quad X_0 \sim p_{X_0},\quad Z_0 \sim p_{Z_0} \\ \text{Interaction:} \quad & Y_t \sim q_f(\cdot \mid Z_{t-1}, X_{t-1}) && \text{(learner produces an action)} \\ & X_t \sim p_e(\cdot \mid H_{0:t-1}, Y_t) && \text{(environment generates an observation)} \\ & Z_t \sim u(\cdot \mid Z_{t-1}, X_t, Y_t) && \text{(learner updates its state)} \end{aligned} \]

This interaction process generates the stochastic process \( Y_0 X_0 Z_0 Y_1 X_1 Z_1 Y_2 X_2 Z_2 \dots \).
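To make the loop concrete, here is a minimal, self-contained Python sketch of the interaction process. The Bernoulli environment and Beta-Bernoulli learner are illustrative choices of ours, not part of the formalism:

```python
import random

# Toy instantiation of the interaction process: a fixed Bernoulli environment
# and a Beta-Bernoulli learner. Both are illustrative stand-ins.

def run_interaction(T=10, seed=0):
    rng = random.Random(seed)
    bias = 0.7                                 # environment p_e: Bernoulli(bias)
    z = (1.0, 1.0)                             # state Z_0: Beta(alpha, beta) pseudo-counts
    x = 1.0 if rng.random() < bias else 0.0    # X_0 ~ p_{X_0}
    history = [(None, x, z)]                   # Y_0 = ⊥ (no initial action)
    for t in range(1, T + 1):
        alpha, beta = z
        # Y_t ~ q_f(. | Z_{t-1}, X_{t-1}): here, a sample from the predictive
        y = 1.0 if rng.random() < alpha / (alpha + beta) else 0.0
        # X_t ~ p_e(. | H_{0:t-1}, Y_t): this toy environment ignores the history
        x = 1.0 if rng.random() < bias else 0.0
        # Z_t ~ u(. | Z_{t-1}, X_t, Y_t): conjugate Bayesian update of the counts
        z = (alpha + x, beta + 1.0 - x)
        history.append((y, x, z))
    return history

print(run_interaction()[-1])                   # final (Y_T, X_T, Z_T)
```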

Throughout this process, the learner continually adapts to new experiences. As it incorporates information from new observations, some previously encoded knowledge may be altered or lost. This loss of information is what we refer to as forgetting.

In this work, we adopt a predictive perspective to formalise forgetting. Rather than focusing on the learner's state, we examine what it can predict about future observations based on its past experience. We express this predictive knowledge as a distribution over future histories:

$$ q(H^{t+1:\infty} \mid Z_t, H_{0:t}) $$

This futures distribution is central to our theory of forgetting. It encapsulates all the information the learner has accumulated about its environment up to time \( t \). By defining the learner’s state in terms of its predictions, we can describe how its knowledge evolves and changes over time. This approach is particularly valuable in deep learning, where the learner’s state is represented by high-dimensional parameter vectors that are difficult to interpret directly. Focusing on predictive distributions provides an interpretable and empirically testable perspective on learning.

Desiderata

Before introducing a formal definition of forgetting, we outline the desired properties that such a definition should satisfy.

Desideratum 1.

A forgetting measure should quantify the loss of learned information over time.

Desideratum 2.

A characterisation of forgetting must not conflate forgetting with the correctness of outputs or with justified updates that change beliefs.

Desideratum 3.

Forgetting should characterise the learner’s loss of prior information and capabilities, not just the retention of previously observed data.

Desideratum 4.

Forgetting is a property of the learner, not of the environment in which it operates.

Forgetting & Self-Consistency

Learning is not only about acquiring new information; it also involves losing information. Each new observation helps a learner refine its beliefs, but these updates can also lose some of the knowledge that was previously stored. This balance between gaining and losing knowledge is central to understanding forgetting.

We can detect forgetting by observing how a learner behaves when it processes information that it already expects. If the learner’s predictions change even when presented with entirely predictable data, then that change cannot represent the acquisition of new information. Instead, it must reflect a loss of previously encoded knowledge. The learner’s expectations are described by its predictive distribution, and this provides a natural way to formalise forgetting.

In our formulation, forgetting is defined as a breakdown of self-consistency in the learner’s predictions over time. Specifically, a learner forgets at time \( t \) if its predictive distribution at that moment becomes inconsistent with what it predicted at some earlier time \( s < t \).

Recall that the learner’s state evolves recursively as:

\[ Z_t \sim u(\cdot \mid Z_{t-1}, X_t, Y_t). \]

At each step \( t \), the learner’s internal state \( Z_t \) defines a distribution over possible future observations: \[ q(H^{t+1:\infty} \mid Z_t, H_{0:t}). \] This predictive distribution summarises everything the learner currently knows about its environment and encodes its expectations for the future.

From this perspective, a learner remains self-consistent if processing information that is fully aligned with its own expectations does not alter its predicted futures. If no new information is introduced, the predictive distribution should remain unchanged. Any deviation in its predictions under such circumstances therefore signals forgetting.

Formally, let \( q(H^{t+1:\infty} \mid Z_{t-1}, H_{0:t-1}) \) denote the predictive distribution. We define a simulated marginalisation, in which \( k \) updates are performed on targets sampled from the learner, as follows.

\( k \)-Step Simulated Marginalisation.
The \( k \)-step simulated marginalisation \( q^*_k(H^{t+k:\infty}\mid Z_{t-1},H_{0:t-1}) \) induced by updating on learner-consistent targets and environmental inputs is given by

\[ q_k^*\!(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) = \mathbb{E}_{X_{t:t'},Y_{t:t'},Z_{t:t'}} \Big[ q(H^{t+k:\infty} \mid Z_{t'}, H_{0:t'}) \Big], \]

where \( t' = t + k - 1 \), and for \( i=t,\dots,t' \) the expectation is taken over \( Y_i\sim q_f(\cdot\mid Z_{i-1},X_{i-1}) \), \( X_i\sim q_e(\cdot\mid Y_i,Z_{i-1}) \), \( Z_i\sim u(\cdot\mid Z_{i-1},X_i,Y_i) \).

Updates are performed on learner-consistent targets to separate forgetting from backward transfer.
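In practice, \( q^*_k \) can be estimated by Monte Carlo: clone the learner, perform \( k \) updates on targets drawn from its own predictive distribution, and average the resulting predictions over rollouts. The sketch below casts the definition in a supervised framing (inputs \( x \), targets \( y \)) and assumes hypothetical `predictive` and `update` methods on the learner object; it illustrates the estimator, not any particular implementation:

```python
import copy
import numpy as np

# Monte Carlo estimator of the k-step simulated marginalisation q*_k.
# `learner` is assumed to expose two hypothetical methods:
#   learner.predictive(x) -> class-probability vector over targets
#   learner.update(x, y)  -> one in-place learning update
# Targets y are drawn from the learner itself (learner-consistent), while the
# inputs x come from the environment stream, matching the definition above.

def simulated_marginalisation(learner, inputs, k, eval_x, n_rollouts=32, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rollouts):               # outer expectation over X, Y, Z
        sim = copy.deepcopy(learner)          # leave the real learner untouched
        for i in range(k):
            x = inputs[i]                     # environmental input
            p = sim.predictive(x)             # the learner's own expectation
            y = rng.choice(len(p), p=p)       # learner-consistent target
            sim.update(x, y)                  # one simulated update
        preds.append(sim.predictive(eval_x))  # predictive after k updates
    return np.mean(preds, axis=0)             # estimate of q*_k
```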

A learner is non-forgetting if simulated marginalisation leaves its predictive distribution unchanged. This yields the following notion of predictive self-consistency.

Consistency Condition.
For \( k \ge 1 \), a learner is k-step consistent if and only if:

\[ q(H^{t+k:\infty}\mid Z_{t-1},H_{0:t-1})=q_k^*\!(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}). \]

Forgetting occurs when the consistency condition is violated; that is, when the predicted futures after \( k \) updates can no longer be recovered from those predicted before the updates.

[Videos: the model's predictive distribution under \( k \)-step consistency dynamics, for \( k = 0, 2, 4, 6 \).]

Consistency dynamics reveal increasing forgetting with larger \( \mathbf{k} \). We visualise the predictive distribution of a neural network trained on the two-moons dataset under \( k \)-step consistency dynamics. Across videos: \( k = 0 \) shows the model’s predictions at the current step, while \( k = 2, 4, 6 \) show the effects of successive self-sampled updates. As \( k \) increases, the predictions diverge progressively, indicating greater inconsistency. This demonstrates that higher \( k \) corresponds to stronger forgetting in this algorithm.

In practice, nearly all learning algorithms exhibit some degree of forgetting. To quantify how likely a learner is to forget at time \( t \), we introduce a measure based on this formalism. When the consistency condition is broken, the learner’s new predictions diverge from its original ones. The magnitude of this divergence provides a clear, operational measure of the learner’s propensity to forget, making the concept measurable across models and settings.

Propensity to Forget.
The \( k \)-step propensity to forget at time \( t \) is defined as the divergence:

\[ \Gamma_k(t) := \mathrm{D}\big( q(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) \,\|\, q_k^*(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) \big), \]

where \( \mathrm{D}(\cdot \| \cdot) \) is an appropriate divergence measure that quantifies the difference between two probability distributions.
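Building on the marginalisation sketch above, \( \Gamma_k(t) \) can be approximated by comparing the learner's predictive distribution before and after the simulated updates. Choosing KL divergence for \( \mathrm{D} \) and representing the futures distribution as class probabilities on a single evaluation input are simplifying assumptions for illustration:

```python
import numpy as np

# Sketch of the k-step propensity to forget, with KL divergence as D,
# reusing simulated_marginalisation from the sketch above.

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def propensity_to_forget(learner, inputs, k, eval_x, **mc_kwargs):
    q_before = learner.predictive(eval_x)              # q(. | Z_{t-1}, H_{0:t-1})
    q_star = simulated_marginalisation(
        learner, inputs, k, eval_x, **mc_kwargs)       # q*_k(. | Z_{t-1}, H_{0:t-1})
    return kl_divergence(q_before, q_star)             # Gamma_k(t)
```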

Scope and Boundary of Validity.  Our formalism applies whenever the learner’s predictive distribution accurately represents the learner’s state,

\[ Z_t \mapsto q(H^{t+k:\infty}\mid Z_t, H_{0:t}). \]

Only information used to generate predictions contributes; state components that do not influence predictions (e.g., unused buffer entries) are excluded. Typically, the predictive distribution reflects the state, but this may not be the case during transitory phases such as buffer reinitialisation, target-network lag, or other mechanisms that temporarily decouple the state from predictions. In these intervals, forgetting is undefined, not because the formalism fails, but because the learner temporarily lacks a predictive model of its behaviour. Some algorithms may never produce a predictive mapping and thus fall outside the scope of this formalism. In most cases, however, the predictive distribution is representative of the state.

Analysis

Our theoretical account provides a general conceptualisation of forgetting. To illustrate its utility, we empirically study the propensity to forget across multiple environments and learning algorithms.

Bayesian learners do not forget

Unforgetful learners.  Bayesian learners do not forget, even when they update their parameters. A prevalent conception in continual learning is that updating parameters necessarily causes forgetting. However, this conflates changes made to parameters with the loss of the information they represent. Exact Bayesian learners provide a counterexample: although their parameters are updated with every new observation, the resulting posterior remains self-consistent. Thus, the learner does not forget despite the changes made to its parameters.

This demonstrates that forgetting is caused by the failure of an update rule to preserve predictive information (see Figure 2). While many regularisation techniques attempt to mitigate forgetting by penalising parameter changes, this addresses only a proxy for the actual issue: in many cases, different parameter configurations admit the same predictive distribution. Ultimately, forgetting reflects a loss of predictive information and can be prevented even when parameters change.
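A quick numerical check illustrates the counterexample with an exact Beta-Bernoulli learner (our toy choice). The parameters change on every self-sampled rollout, yet the marginalised predictive matches the original, so \( \Gamma_k \approx 0 \) up to Monte Carlo error:

```python
import numpy as np

# Numerical check with an exact Bayesian Beta-Bernoulli learner: parameters
# (a, b) change on every learner-consistent update, yet the marginalised
# predictive equals the original (the martingale property of Bayesian
# updating), so the learner does not forget.

rng = np.random.default_rng(0)
a, b, k = 3.0, 2.0, 5                    # posterior Beta(a, b); k self-updates
before = a / (a + b)                     # predictive P(X = 1) before updating

after = []
for _ in range(100_000):                 # outer expectation defining q*_k
    ai, bi = a, b
    for _ in range(k):                   # k learner-consistent updates
        x = 1.0 if rng.random() < ai / (ai + bi) else 0.0
        ai, bi = ai + x, bi + 1.0 - x    # parameters move on every step...
    after.append(ai / (ai + bi))

print(before, float(np.mean(after)))     # ...but the predictive is preserved
```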

Forgetfulness Dynamics Across Tasks

Forgetting is a universal feature of deep learning.  We examine the dynamics of forgetting in a shallow neural network trained on three distinct tasks: regression, classification, and generative modelling. The solid line shows the \( k \)-step forgetfulness (with \( k \) ranging from 1 to 40) over the normalised training step. Forgetfulness is measured using KL divergence for regression and classification, and maximum mean discrepancy (MMD) for the generative setting. In all cases, forgetting varies throughout training, even in the absence of any distributional shift, indicating that it is an intrinsic property of the learning process rather than an artifact of changing data.
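For reference, here is a minimal estimator of the squared MMD with an RBF kernel, the divergence used in the generative setting; the fixed bandwidth and the biased V-statistic form are simplifying assumptions of ours:

```python
import numpy as np

# Biased (V-statistic) estimate of squared MMD between two sample sets,
# using an RBF kernel with a fixed bandwidth.

def mmd2_rbf(X, Y, bandwidth=1.0):
    """Squared MMD between samples X (n, d) and Y (m, d)."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
```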

The Training Efficiency–Forgetting Trade-off

Approximate learners benefit from a non-zero level of forgetting.  At each update, an approximation-based learner incorporates new information from current observations while discarding parts of its existing state. Because approximate updates yield imperfect representations, a learner’s performance depends on striking a balance between adapting to new information and retaining useful prior information.

To study this effect, we investigate how modifications to the learner influence the propensity to forget. Across experiments, a consistent pattern emerges for approximate learners: a moderate amount of forgetting improves learning efficiency. Here, we quantify training efficiency using the inverse of the normalised area under the training loss curve, a practical proxy for learning speed and convergence quality. Empirically, the forgetting-efficiency relationship shows an “elbow”, indicating that optimal training efficiency occurs at a non-zero level of forgetting. This suggests that effective approximate learners utilise forgetting as a mechanism for adaptive and efficient learning.
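As a sketch of this efficiency proxy, the following computes the inverse of the normalised area under a training-loss curve; trapezoidal integration over normalised steps and max-normalisation of the loss are our assumptions, as the text does not specify them:

```python
import numpy as np

# Training-efficiency proxy: the inverse of the normalised area under the
# training-loss curve. Lower area (faster descent to lower loss) means
# higher efficiency.

def training_efficiency(losses):
    """Higher values mean faster convergence to lower loss."""
    curve = np.asarray(losses, dtype=float)
    curve = curve / curve.max()                    # normalise the loss scale
    steps = np.linspace(0.0, 1.0, len(curve))      # normalised training step
    auc = np.sum(0.5 * (curve[1:] + curve[:-1]) * np.diff(steps))  # trapezoid rule
    return 1.0 / float(auc)
```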

Continual Learning and Forgetting Profile

The measured propensity to forget aligns with theoretical expectations.  We evaluate the \( k \)-step forgetting profile in a class-incremental learning scenario using a single-layer neural network trained on the two-moons classification task. The figure shows \( \Gamma_k(t) \) averaged over four random seeds, with the shaded region indicating variability across \( k \in [1, 40] \). Forgetting increases sharply at task boundaries, where new classes are introduced, confirming that the proposed measure captures meaningful, interpretable forgetting dynamics consistent with theoretical intuition.

Reinforcement Learning Forgetting Landscape

DQN actively manages the information acquisition-retention trade-off.  We show the TD loss, Q-value evaluation, and forgetting profile of a DQN learner trained on CartPole across ten seeds. Early in training, the TD loss is low; it rises as the agent acquires new information, then decreases once that knowledge is integrated. The forgetting curve follows this trajectory, highlighting that forgetting is a deliberate mechanism for balancing knowledge acquisition with knowledge retention.

All learning involves balancing the integration of new information with the retention of current knowledge. RL presents this challenge in an extreme form. Here, the learner’s policy influences future observations, inducing continual non-stationarity. In DQN, for example, as the agent experiences new transitions, the TD loss rises because the agent incorporates new information (Figure 5). As the agent consolidates this information, the TD loss declines and the rate of information acquisition plateaus. The forgetting curve follows the TD loss because forgetting information is the mechanism by which the agent manages this process, demonstrating that forgetting is an essential component of RL.

Discussion and Conclusion

In this work, we proposed a general, algorithm- and task-agnostic formulation of forgetting, characterising it as the temporal inconsistency of a learner’s predictive distribution. This perspective provides a unified conceptualisation of forgetting, disentangling it from backward transfer and separating it from parameter updates. Furthermore, we introduced the propensity to forget as an operational measure, and proved that, under specific assumptions, Bayesian learners do not forget. This result confirms the intuition that some learners can adapt to new data without forgetting.

Our empirical analysis across a diverse set of algorithms and task settings shows that forgetting is widespread in deep learning and is influenced by the interactions between the learner and the environment. Furthermore, we observe that the relationship between training efficiency and forgetting is not monotonic: in the settings we study, an intermediate amount of forgetting can maximise learning efficiency. This demonstrates the importance of considering forgetting when designing and evaluating learning algorithms.

Looking ahead, several directions for future work remain. These include identifying non-forgetful learners that are not Bayesian and exploring how increasing the self-consistency of learners affects learning. This would provide valuable insights into the dynamics of adaptive learning systems.

Overall, our work reframes forgetting as a fundamental property of learning, rather than a failure mode restricted to continual learning or non-stationary settings. We hope our work establishes a principled foundation for analysing how learning algorithms acquire, maintain, and lose capabilities over time, guiding the development of new learning algorithms that can optimally adapt their beliefs.

BibTeX

If you find this paper helpful for your research, please consider citing it 🙂

@article{sanati2025forgetting,
  title={Forgetting is Everywhere},
  author={Sanati, Ben and Lee, Thomas L and McInroe, Trevor and Scannell, Aidan and Malkin, Nikolay and Abel, David and Storkey, Amos},
  journal={arXiv preprint arXiv:2511.04666},
  year={2025}
}