Forgetting is Everywhere

Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

University of Edinburgh

Preprint

The Absorption and Loss Process

Forgetting remains poorly understood, and this limits progress in machine learning. A fundamental challenge is to develop models that can learn continuously from new data without losing knowledge they have already acquired. This problem is particularly apparent in deep learning, where existing methods often struggle to retain past information while adapting to new tasks. A major difficulty lies in defining what it actually means for a learner to forget. The aim of this paper is to clarify forgetting: to understand what it is, when and why it occurs, and how it shapes the dynamics of learning.

Abstract

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget and shows that Bayesian learners are capable of adapting without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

Intuition

If a learner updates its predictions on data it already expects, that update cannot represent the acquisition of new information. Instead, it must represent the loss of previously acquired knowledge.

Forgetting is Ubiquitous

Forgetting is ubiquitous in deep learning. The trade-off between training efficiency and forgetting determines the optimal amount to forget; in deep learning, this is rarely zero.

Forgetting and Efficiency

Forgetting is an integral component of learning: effective learning requires selectively forgetting outdated knowledge to integrate new information.

Learning Interactions

We describe learning as a continuous interaction between a learner and an environment that evolves over time. Time is represented as a sequence of steps, denoted \( t \in \mathbb{N}_0 \). At each step \( t \), the learner performs an action \( Y_t \), observes an outcome \( X_t \) from the environment, and updates its internal state \( Z_t \) based on this new information. The sequence of these actions, observations, and updates forms a stochastic process \( \{(X_t, Y_t, Z_t)\}_{t\ge0} \), which we refer to as the interaction process.

Formally, the interaction process can be written as:

\[ \begin{aligned} \text{Initialisation:} \quad & Y_0=\bot,\quad X_0 \sim p_{X_0},\quad Z_0 \sim p_{Z_0} \\ \text{Interaction:} \quad & Y_t \sim q_f(\cdot \mid Z_{t-1}, X_{t-1}) && \text{(learner produces an action)} \\ & X_t \sim p_e(\cdot \mid H_{0:t-1}, Y_t) && \text{(environment generates an observation)} \\ & Z_t \sim u(\cdot \mid Z_{t-1}, X_t, Y_t) && \text{(learner updates its state)} \end{aligned} \]

This interaction process generates the stochastic process \( Y_0 X_0 Z_0 Y_1 X_1 Z_1 Y_2 X_2 Z_2 \dots \).
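To make the loop concrete, here is a minimal, self-contained Python sketch of the interaction process. The Bernoulli environment and Beta-Bernoulli learner are illustrative choices of ours, not part of the formalism:

```python
import random

# Toy instantiation of the interaction process: a fixed Bernoulli environment
# and a Beta-Bernoulli learner. Both are illustrative stand-ins.

def run_interaction(T=10, seed=0):
    rng = random.Random(seed)
    bias = 0.7                                 # environment p_e: Bernoulli(bias)
    z = (1.0, 1.0)                             # state Z_0: Beta(alpha, beta) pseudo-counts
    x = 1.0 if rng.random() < bias else 0.0    # X_0 ~ p_{X_0}
    history = [(None, x, z)]                   # Y_0 = ⊥ (no initial action)
    for t in range(1, T + 1):
        alpha, beta = z
        # Y_t ~ q_f(. | Z_{t-1}, X_{t-1}): here, a sample from the predictive
        y = 1.0 if rng.random() < alpha / (alpha + beta) else 0.0
        # X_t ~ p_e(. | H_{0:t-1}, Y_t): this toy environment ignores the history
        x = 1.0 if rng.random() < bias else 0.0
        # Z_t ~ u(. | Z_{t-1}, X_t, Y_t): conjugate Bayesian update of the counts
        z = (alpha + x, beta + 1.0 - x)
        history.append((y, x, z))
    return history

print(run_interaction()[-1])                   # final (Y_T, X_T, Z_T)
```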

Throughout this process, the learner continually adapts to new experiences. As it incorporates information from new observations, some previously encoded knowledge may be altered or lost. This loss of information is what we refer to as forgetting.

In this work, we adopt a predictive perspective to formalise forgetting. Rather than focusing on the learner's state, we examine what it can predict about future observations based on its past experience. We express this predictive knowledge as a distribution over future histories:

$$ q(H^{t+1:\infty} \mid Z_t, H_{0:t}) $$

This futures distribution is central to our theory of forgetting. It encapsulates all the information the learner has accumulated about its environment up to time \( t \). By defining the learner’s state in terms of its predictions, we can describe how its knowledge evolves and changes over time. This approach is particularly valuable in deep learning, where the learner’s state is represented by high-dimensional parameter vectors that are difficult to interpret directly. Focusing on predictive distributions provides an interpretable and empirically testable perspective on learning.

Desiderata

Before introducing a formal definition of forgetting, we outline the desired properties that such a definition should satisfy.

Desideratum 1.

A forgetting measure should quantify the loss of learned information over time.

Desideratum 2.

A characterisation of forgetting must not conflate forgetting with the correctness of outputs or with justified updates that change beliefs.

Desideratum 3.

Forgetting should characterise the learner’s loss of prior information and capabilities, not just the retention of previously observed data.

Desideratum 4.

Forgetting is a property of the learner, not of the environment in which it operates.

Forgetting & Self-Consistency

Learning is not only about acquiring new information; it also involves losing information. Each new observation helps a learner refine its beliefs, but these updates can also lose some of the knowledge that was previously stored. This balance between gaining and losing knowledge is central to understanding forgetting.

We can detect forgetting by observing how a learner behaves when it processes information that it already expects. If the learner’s predictions change even when presented with entirely predictable data, then that change cannot represent the acquisition of new information. Instead, it must reflect a loss of previously encoded knowledge. The learner’s expectations are described by its predictive distribution, and this provides a natural way to formalise forgetting.

In our formulation, forgetting is defined as a breakdown of self-consistency in the learner’s predictions over time. Specifically, a learner forgets at time \( t \) if its predictive distribution at that moment becomes inconsistent with what it predicted at some earlier time \( s < t \).

Recall that the learner’s state evolves recursively as:

\[ Z_t \sim u(\cdot \mid Z_{t-1}, X_t, Y_t). \]

At each step \( t \), the learner’s internal state \( Z_t \) defines a distribution over possible future observations: \[ q(H^{t+1:\infty} \mid Z_t, H_{0:t}). \] This predictive distribution summarises everything the learner currently knows about its environment and encodes its expectations for the future.

From this perspective, a learner remains self-consistent if processing information that is fully aligned with its own expectations does not alter its predicted futures. If no new information is introduced, the predictive distribution should remain unchanged. Any deviation in its predictions under such circumstances therefore signals forgetting.

Formally, let \( q(H^{t+1:\infty} \mid Z_{t-1}, H_{0:t-1}) \) denote the predictive distribution. We define a simulated marginalisation, in which \( k \) updates are performed on targets sampled from the learner, as follows.

\( k \)-Step Simulated Marginalisation.
The \( k \)-step simulated marginalisation \( q^*_k(H^{t+k:\infty}\mid Z_{t-1},H_{0:t-1}) \) induced by updating on learner-consistent targets and environmental inputs is given by

\[ q_k^*\!(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) = \mathbb{E}_{X_{t:t'},Y_{t:t'},Z_{t:t'}} \Big[ q(H^{t+k:\infty} \mid Z_{t'}, H_{0:t'}) \Big], \]

where \( t' = t + k - 1 \), and for \( i=t,\dots,t' \) the expectation is taken over \( Y_i\sim q_f(\cdot\mid Z_{i-1},X_{i-1}) \), \( X_i\sim q_e(\cdot\mid Y_i,Z_{i-1}) \), \( Z_i\sim u(\cdot\mid Z_{i-1},X_i,Y_i) \).

Updates are performed on learner-consistent targets to separate forgetting from backward transfer.
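In practice, \( q^*_k \) can be estimated by Monte Carlo: clone the learner, perform \( k \) updates on targets drawn from its own predictive distribution, and average the resulting predictions over rollouts. The sketch below casts the definition in a supervised framing (inputs \( x \), targets \( y \)) and assumes hypothetical `predictive` and `update` methods on the learner object; it illustrates the estimator, not any particular implementation:

```python
import copy
import numpy as np

# Monte Carlo estimator of the k-step simulated marginalisation q*_k.
# `learner` is assumed to expose two hypothetical methods:
#   learner.predictive(x) -> class-probability vector over targets
#   learner.update(x, y)  -> one in-place learning update
# Targets y are drawn from the learner itself (learner-consistent), while the
# inputs x come from the environment stream, matching the definition above.

def simulated_marginalisation(learner, inputs, k, eval_x, n_rollouts=32, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rollouts):               # outer expectation over X, Y, Z
        sim = copy.deepcopy(learner)          # leave the real learner untouched
        for i in range(k):
            x = inputs[i]                     # environmental input
            p = sim.predictive(x)             # the learner's own expectation
            y = rng.choice(len(p), p=p)       # learner-consistent target
            sim.update(x, y)                  # one simulated update
        preds.append(sim.predictive(eval_x))  # predictive after k updates
    return np.mean(preds, axis=0)             # estimate of q*_k
```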

A learner is non-forgetting if simulated marginalisation leaves its predictive distribution unchanged. This yields the following notion of predictive self-consistency.

Consistency Condition.
For \( k \ge 1 \), a learner is k-step consistent if and only if:

\[ q(H^{t+k:\infty}\mid Z_{t-1},H_{0:t-1})=q_k^*\!(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}). \]

Forgetting occurs when the consistency condition is violated; that is, when the predicted futures after \( k \) updates can no longer be recovered from those predicted before the updates.

[Videos: the model's predictive distribution under \( k \)-step consistency dynamics, for \( k = 0, 2, 4, 6 \).]

Consistency dynamics reveal increasing forgetting with larger \( \mathbf{k} \). We visualise the predictive distribution of a neural network trained on the two-moons dataset under \( k \)-step consistency dynamics. Across videos: \( k = 0 \) shows the model’s predictions at the current step, while \( k = 2, 4, 6 \) show the effects of successive self-sampled updates. As \( k \) increases, the predictions diverge progressively, indicating greater inconsistency. This demonstrates that higher \( k \) corresponds to stronger forgetting in this algorithm.

In practice, nearly all learning algorithms exhibit some degree of forgetting. To quantify how likely a learner is to forget at time \( t \), we introduce a measure based on this formalism. When the consistency condition is broken, the learner’s new predictions diverge from its original ones. The magnitude of this divergence provides a clear, operational measure of the learner’s propensity to forget, making the concept measurable across models and settings.

Propensity to Forget.
The \( k \)-step propensity to forget at time \( t \) is defined as the divergence:

\[ \Gamma_k(t) := \mathrm{D}\big( q(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) \,\|\, q_k^*(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) \big), \]

where \( \mathrm{D}(\cdot \| \cdot) \) is an appropriate divergence measure that quantifies the difference between two probability distributions.
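Building on the marginalisation sketch above, \( \Gamma_k(t) \) can be approximated by comparing the learner's predictive distribution before and after the simulated updates. Choosing KL divergence for \( \mathrm{D} \) and representing the futures distribution as class probabilities on a single evaluation input are simplifying assumptions for illustration:

```python
import numpy as np

# Sketch of the k-step propensity to forget, with KL divergence as D,
# reusing simulated_marginalisation from the sketch above.

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def propensity_to_forget(learner, inputs, k, eval_x, **mc_kwargs):
    q_before = learner.predictive(eval_x)              # q(. | Z_{t-1}, H_{0:t-1})
    q_star = simulated_marginalisation(
        learner, inputs, k, eval_x, **mc_kwargs)       # q*_k(. | Z_{t-1}, H_{0:t-1})
    return kl_divergence(q_before, q_star)             # Gamma_k(t)
```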

Scope and Boundary of Validity.  Our formalism applies whenever the learner’s predictive distribution accurately represents the learner’s state,

\[ Z_t \mapsto q(H^{t+k:\infty}\mid Z_t, H_{0:t}). \]

Only information used to generate predictions contributes; state components that do not influence predictions (e.g., unused buffer entries) are excluded. Typically, the predictive distribution reflects the state, but this may not be the case during transitory phases such as buffer reinitialisation, target-network lag, or other mechanisms that temporarily decouple the state from predictions. In these intervals, forgetting is undefined, not because the formalism fails, but because the learner temporarily lacks a predictive model of its behaviour. Some algorithms may never produce a predictive mapping and thus fall outside the scope of this formalism. In most cases, however, the predictive distribution is representative of the state.

Analysis

Our theoretical account provides a general conceptualisation of forgetting. To illustrate its utility, we empirically study the propensity to forget across multiple environments and learning algorithms.

Bayesian learners do not forget

Unforgetful learners.  Bayesian learners do not forget, even when they update their parameters. A prevalent conception in continual learning is that updating parameters necessarily causes forgetting. However, this conflates changes made to parameters with the loss of the information they represent. Exact Bayesian learners provide a counterexample: although their parameters are updated with every new observation, the resulting posterior remains self-consistent. Thus, the learner does not forget despite the changes made to its parameters.

This demonstrates that forgetting is caused by the failure of an update rule to preserve predictive information (see Figure 2). While many regularisation techniques attempt to mitigate forgetting by penalising parameter changes, this addresses only a proxy for the actual issue: in many cases, different parameter configurations admit the same predictive distribution. Ultimately, forgetting reflects a loss of predictive information and can be prevented even when parameters change.
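A quick numerical check illustrates the counterexample with an exact Beta-Bernoulli learner (our toy choice). The parameters change on every self-sampled rollout, yet the marginalised predictive matches the original, so \( \Gamma_k \approx 0 \) up to Monte Carlo error:

```python
import numpy as np

# Numerical check with an exact Bayesian Beta-Bernoulli learner: parameters
# (a, b) change on every learner-consistent update, yet the marginalised
# predictive equals the original (the martingale property of Bayesian
# updating), so the learner does not forget.

rng = np.random.default_rng(0)
a, b, k = 3.0, 2.0, 5                    # posterior Beta(a, b); k self-updates
before = a / (a + b)                     # predictive P(X = 1) before updating

after = []
for _ in range(100_000):                 # outer expectation defining q*_k
    ai, bi = a, b
    for _ in range(k):                   # k learner-consistent updates
        x = 1.0 if rng.random() < ai / (ai + bi) else 0.0
        ai, bi = ai + x, bi + 1.0 - x    # parameters move on every step...
    after.append(ai / (ai + bi))

print(before, float(np.mean(after)))     # ...but the predictive is preserved
```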

Forgetfulness Dynamics Across Tasks

Forgetting is a universal feature of deep learning.  We examine the dynamics of forgetting in a shallow neural network trained on three distinct tasks: regression, classification, and generative modelling. The solid line shows the \( k \)-step forgetfulness (with \( k \) ranging from 1 to 40) over the normalised training step. Forgetfulness is measured using KL divergence for regression and classification, and maximum mean discrepancy (MMD) for the generative setting. In all cases, forgetting varies throughout training, even in the absence of any distributional shift, indicating that it is an intrinsic property of the learning process rather than an artifact of changing data.
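For reference, here is a minimal estimator of the squared MMD with an RBF kernel, the divergence used in the generative setting; the fixed bandwidth and the biased V-statistic form are simplifying assumptions of ours:

```python
import numpy as np

# Biased (V-statistic) estimate of squared MMD between two sample sets,
# using an RBF kernel with a fixed bandwidth.

def mmd2_rbf(X, Y, bandwidth=1.0):
    """Squared MMD between samples X (n, d) and Y (m, d)."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
```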

The Training Efficiency–Forgetting Trade-off

Approximate learners benefit from a non-zero level of forgetting.  At each update, an approximation-based learner incorporates new information from current observations while discarding parts of its existing state. Because approximate updates yield imperfect representations, a learner’s performance depends on striking a balance between adapting to new information and retaining useful prior information.

To study this effect, we investigate how modifications to the learner influence the propensity to forget. Across experiments, a consistent pattern emerges for approximate learners: a moderate amount of forgetting improves learning efficiency. Here, we quantify training efficiency using the inverse of the normalised area under the training loss curve, a practical proxy for learning speed and convergence quality. Empirically, the forgetting-efficiency relationship shows an “elbow”, indicating that optimal training efficiency occurs at a non-zero level of forgetting. This suggests that effective approximate learners utilise forgetting as a mechanism for adaptive and efficient learning.
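As a sketch of this efficiency proxy, the following computes the inverse of the normalised area under a training-loss curve; trapezoidal integration over normalised steps and max-normalisation of the loss are our assumptions, as the text does not specify them:

```python
import numpy as np

# Training-efficiency proxy: the inverse of the normalised area under the
# training-loss curve. Lower area (faster descent to lower loss) means
# higher efficiency.

def training_efficiency(losses):
    """Higher values mean faster convergence to lower loss."""
    curve = np.asarray(losses, dtype=float)
    curve = curve / curve.max()                    # normalise the loss scale
    steps = np.linspace(0.0, 1.0, len(curve))      # normalised training step
    auc = np.sum(0.5 * (curve[1:] + curve[:-1]) * np.diff(steps))  # trapezoid rule
    return 1.0 / float(auc)
```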

Continual Learning and Forgetting Profile

The measured propensity to forget aligns with theoretical expectations.  We evaluate the \( k \)-step forgetting profile in a class-incremental learning scenario using a single-layer neural network trained on the two-moons classification task. The figure shows \( \Gamma_k(t) \) averaged over four random seeds, with the shaded region indicating variability across \( k \in [1, 40] \). Forgetting increases sharply at task boundaries, where new classes are introduced, confirming that the proposed measure captures meaningful, interpretable forgetting dynamics consistent with theoretical intuition.

Reinforcement Learning Forgetting Landscape

DQN actively manages the information acquisition-retention trade-off.  We show the TD loss, Q-value evaluation, and forgetting profile of a DQN learner trained on CartPole across ten seeds. Early in training, the TD loss is low; it rises as the agent acquires new information, then decreases once that knowledge is integrated. The forgetting curve follows this trajectory, highlighting that forgetting is a deliberate mechanism for balancing knowledge acquisition with knowledge retention.

All learning involves balancing the integration of new information with the retention of current knowledge. RL presents this challenge in an extreme form. Here, the learner’s policy influences future observations, inducing continual non-stationarity. In DQN, for example, as the agent experiences new transitions, the TD loss rises because the agent incorporates new information (Figure 5). As the agent consolidates this information, the TD loss declines and the rate of information acquisition plateaus. The forgetting curve follows the TD loss because forgetting information is the mechanism by which the agent manages this process, demonstrating that forgetting is an essential component of RL.

Discussion and Conclusion

In this work, we proposed a general, algorithm- and task-agnostic formulation of forgetting, characterising it as the temporal inconsistency of a learner’s predictive distribution. This perspective provides a unified conceptualisation of forgetting, disentangling it from backward transfer and separating it from parameter updates. Furthermore, we introduced the propensity to forget as an operational measure, and proved that, under specific assumptions, Bayesian learners do not forget. This result confirms the intuition that some learners can adapt to new data without forgetting.

Our empirical analysis across a diverse set of algorithms and task settings shows that forgetting is widespread in deep learning and is influenced by the interactions between the learner and the environment. Furthermore, we observe that the relationship between training efficiency and forgetting is not monotonic: in the settings we study, an intermediate amount of forgetting can maximise learning efficiency. This demonstrates the importance of considering forgetting when designing and evaluating learning algorithms.

Looking ahead, several directions for future work remain. These include identifying non-forgetful learners that are not Bayesian and exploring how increasing the self-consistency of learners affects learning. This would provide valuable insights into the dynamics of adaptive learning systems.

Overall, our work reframes forgetting as a fundamental property of learning, rather than a failure mode restricted to continual learning or non-stationary settings. We hope our work establishes a principled foundation for analysing how learning algorithms acquire, maintain, and lose capabilities over time, guiding the development of new learning algorithms that can optimally adapt their beliefs.

BibTeX

If you find this paper helpful for your research, please consider citing it 🙂

@article{sanati2025forgetting,
  title={Forgetting is Everywhere},
  author={Sanati, Ben and Lee, Thomas L and McInroe, Trevor and Scannell, Aidan and Malkin, Nikolay and Abel, David and Storkey, Amos},
  journal={arXiv preprint arXiv:2511.04666},
  year={2025}
}