Forgetting remains poorly understood, and this gap limits progress in machine learning. A fundamental challenge is to develop models that can learn continuously from new data without losing knowledge they have already acquired. This problem is particularly apparent in deep learning, where existing methods often struggle to retain past information while adapting to new tasks. A major difficulty lies in defining what it actually means for a learner to forget. The aim of this paper is to clarify forgetting: to understand what forgetting is, determine when and why it occurs, and examine how it impacts the dynamics of learning.
Abstract
A fundamental challenge in developing general learning algorithms is
their tendency to forget past knowledge when adapting to new data.
Addressing this problem requires a principled understanding of
forgetting; yet, despite decades of study, no unified definition has
emerged that provides insights into the underlying dynamics of
learning.
We propose an algorithm- and task-agnostic theory that characterises
forgetting as a lack of self-consistency in a learner's predictive
distribution over future experiences, manifesting as a loss of
predictive information.
Our theory naturally yields a general measure of an algorithm's
propensity to forget.
To validate the theory, we design a comprehensive set of experiments
that span classification, regression, generative modelling, and
reinforcement learning.
We empirically demonstrate how forgetting is present across all
learning settings and plays a significant role in determining
learning efficiency.
Together, these results establish a principled understanding of
forgetting and lay the foundation for analysing and improving the
information retention capabilities of general learning algorithms.
Intuition
When a learner updates its predictions using data it already
expects, the update cannot reflect the acquisition of new
information. Instead, it indicates a loss or distortion of
previously acquired knowledge. This provides a principled way to
identify and reason about forgetting.
Forgetting is Ubiquitous
Forgetting occurs throughout deep learning systems. The balance
between learning speed and memory retention defines an inherent
trade-off, determining how much a model should optimally forget
in
order to adapt effectively.
Forgetting and Efficiency
Forgetting dynamics have an impact on learning efficiency.
Appropriate forgetting dynamics allow models to adapt to new
information while retaining acquired knowledge, promoting a
balanced trade-off between flexibility and stability.
Learning Interactions
We describe learning as a continuous interaction between a
learner and an environment
that
evolves over time. Time is represented as a sequence of steps,
denoted
\( t \in \mathbb{N}_0 \). At each step \( t \), the learner performs
an action \( Y_t \), observes an outcome \( X_t \) from the
environment, and updates its internal state \( Z_t \) based on this
new information. The sequence of these actions, observations, and
updates forms a stochastic process \( \{(X_t, Y_t, Z_t)\}_t \),
which
we refer to as the interaction process.
Formally, the interaction process can be written as:
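(sketched here; the action distribution \( \pi \) and environment distribution \( e \) are illustrative names, and only the update rule \( u \) appears explicitly later in the text)
\[ Y_t \sim \pi(\cdot \mid Z_{t-1}), \qquad X_t \sim e(\cdot \mid H_{0:t-1}, Y_t), \qquad Z_t \sim u(Z_{t-1}, X_t, Y_t), \]
where \( H_{0:t-1} \) denotes the history of interactions up to step \( t-1 \).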
Throughout this process, the learner continually adapts to new
experiences. As it incorporates information from new observations,
some previously encoded knowledge may be altered or lost. This
loss of information is what we refer to as
forgetting.
In this work, we adopt a predictive perspective to
formalise
forgetting. Rather than focusing on the learner's state,
we examine what it can predict about future
observations based on its past experience. We express this
predictive
knowledge as a distribution over future histories:
$$ q(H^{t+1:\infty} \mid Z_t, H_{0:t}) $$
This futures distribution is central to our theory of
forgetting. It encapsulates all the information the learner has
accumulated about its environment up to time \( t \). By defining
the
learner's state in terms of its predictions, we can describe how its knowledge evolves and changes over time. This approach is particularly valuable in deep learning, where the learner's state is represented by high-dimensional parameter vectors that are difficult to interpret directly. Focusing on predictive distributions provides an interpretable and empirically testable perspective on learning.
Desiderata
Before introducing a formal definition of forgetting, we outline the
desired properties that such a definition should satisfy.
Desideratum 1.
A measure of forgetting should quantify how much learned
information is lost over time, rather than describing
performance changes.
Desideratum 2.
A forgetting measure must capture losses in a learner's prior
capabilities, not just the retention of previously observed
capabilities.
Desideratum 3.
A definition of forgetting must distinguish between
forgetting (the loss of knowledge) and beneficial updates to the
learner's beliefs.
In other words, forgetting should not be conflated with errors,
corrections, or justified adaptations to new information.
Desideratum 4.
Forgetting should characterise the learner's loss of information
and capabilities resulting from the learner's adaptations.
Forgetting is, therefore, a property of the learner, not a
property of the environment or dataset in which it operates.
Forgetting & Self-Consistency
Learning is not only about acquiring new information; it also involves losing information. Each new observation helps a learner refine its beliefs, but these updates can also erase some of the knowledge that was previously stored. This balance between gaining and losing knowledge is central to understanding forgetting.
We can detect forgetting by observing how a learner behaves when it
processes information that it already expects. If the learner's
internal state changes even when presented with entirely predictable
data, then that change cannot represent the acquisition of new
information. Instead, it must reflect a loss of previously encoded
knowledge. The learner's expectations are described by its predictive distribution, and this provides a natural way to formalise forgetting.
In our formulation, forgetting is defined as a breakdown of
self-consistency in the learner's predictions over time.
Specifically, a learner forgets at time \( t \) if its predictive
distribution at that moment becomes inconsistent with what it
predicted at some earlier time \( s < t \).
Recall that the learner's state evolves recursively as:
\[ Z_t \sim u(Z_{t-1}, X_t, Y_t). \]
At each step \( t \), the learner's internal state \( Z_t \) defines a distribution over possible future observations: \[ q(H^{t+1:\infty} \mid Z_t, H_{0:t}). \] This
predictive distribution summarises everything the learner
currently knows about its environment and encodes its expectations
for
the future.
From this perspective, a learner remains self-consistent if
processing
information that is fully aligned with its own expectations does not
alter its predicted futures. If no new information is
introduced, the predictive distribution should remain unchanged. Any
deviation in its predictions under such circumstances therefore
signals forgetting.
Formally, let \( q(H^{t+1:\infty} \mid Z_{t-1}, H_{0:t-1}) \) denote
the learner's current predictive distribution over futures, and \(
q(H^{t+k:\infty} \mid Z_{t+k-1}, H_{0:t+k-1}) \) its distribution
after \( k \) updates. To isolate forgetting from genuine learning
or
beneficial transfer, the updates are applied to data sampled from
the
learner's own predictions rather than the external environment. This
ensures that no new information is introduced, and that any change
in
predictions reflects a loss of previously held knowledge.
Consistency Condition.
For \( k \ge 1 \), a learner is \( k \)-step consistent if and
only if:
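(sketched here using the distributions introduced above; the exact form may differ in detail)
\[ q(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) \;=\; \mathbb{E}\!\left[\, q(H^{t+k:\infty} \mid Z_{t+k-1}, H_{0:t+k-1}) \,\right], \]
where the expectation is taken over the intermediate experiences, which are sampled from the learner's own predictive distribution as described above.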
Forgetting occurs when the consistency condition is
violated;
that is, when the predicted futures after \( k \) updates can no
longer be recovered from those predicted before the updates.
[Videos: \( k = 0 \), \( k = 2 \), \( k = 4 \), \( k = 6 \)]
Consistency dynamics reveal increasing forgetting with larger \( \mathbf{k} \).
We visualise the predictive distribution of a neural network
trained
on the two-moons dataset under \( k \)-step consistency dynamics.
Across videos: \( k = 0 \) shows the model's predictions
at
the current step, while \( k = 2, 4, 6 \) show the effects of
successive self-sampled updates. As \( k \) increases, the
predictions diverge progressively, indicating greater
inconsistency.
This demonstrates that higher \( k \) corresponds to stronger
forgetting in this algorithm.
In practice, nearly all learning algorithms exhibit some degree of
forgetting. To quantify how likely a learner is to forget at time \(
t
\), we introduce a measure based on this formalism. When the
consistency condition is broken, the learner's new predictions
diverge
from its original ones. The magnitude of this divergence provides a
clear, operational measure of the learner's
propensity to forget, making the concept measurable
across models and settings.
Propensity to Forget.
The \( k \)-step propensity to forget at time \( t \) is defined
as
the divergence:
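(sketched here in the notation above; the exact form may differ in detail)
\[ \Gamma_k(t) = \mathrm{D}\!\left( q(H^{t+k:\infty} \mid Z_{t-1}, H_{0:t-1}) \;\middle\|\; q(H^{t+k:\infty} \mid Z_{t+k-1}, H_{0:t+k-1}) \right), \]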
where \( \mathrm{D}(\cdot \| \cdot) \) is an appropriate
divergence
measure that quantifies the difference between two probability
distributions.
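As a concrete illustration, the following is a minimal sketch (not the authors' code) of how \( \Gamma_k(t) \) could be estimated for a classifier, assuming that predictions on a fixed probe batch stand in for the futures distribution, that KL divergence is the chosen \( \mathrm{D} \), and that updates are plain SGD steps; all names are illustrative.

import copy
import torch
import torch.nn.functional as F

def k_step_propensity_to_forget(model, probe_x, k=40, lr=1e-2):
    """Update a copy of `model` on data sampled from its own predictions,
    then measure how far its predictive distribution drifts (mean KL)."""
    model.eval()
    with torch.no_grad():
        p_before = F.softmax(model(probe_x), dim=-1)  # predictions before any update

    clone = copy.deepcopy(model).train()
    optimiser = torch.optim.SGD(clone.parameters(), lr=lr)
    for _ in range(k):
        with torch.no_grad():
            probs = F.softmax(clone(probe_x), dim=-1)
            pseudo_y = torch.multinomial(probs, num_samples=1).squeeze(-1)  # self-sampled labels
        optimiser.zero_grad()
        F.cross_entropy(clone(probe_x), pseudo_y).backward()  # update on fully expected data
        optimiser.step()

    with torch.no_grad():
        p_after = F.softmax(clone(probe_x), dim=-1)
    # KL(before || after), averaged over the probe batch: the k-step propensity to forget.
    return F.kl_div(p_after.log(), p_before, reduction="batchmean").item()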
Scope and Boundary of Validity. The predictive Bayesian formalism assumes that the learner's internal state defines a coherent predictive distribution over futures; that is,
there exists a well-defined mapping:
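\[ (Z_t, H_{0:t}) \;\mapsto\; q(H^{t+1:\infty} \mid Z_t, H_{0:t}), \]
sketched here in the notation introduced above.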
This mapping ensures that the learner's expectations about the
future
arise from a single, self-consistent probabilistic model. When this
coherence holds, the learner's predictive state fully captures its
beliefs, and the propensity to forget can be meaningfully evaluated.
During certain transitional phases, such as buffer resets,
target-network lag, or policy restarts, this coherence may
temporarily
fail. In those moments, the learner no longer defines a coherent
model
of its own futures, and the notion of forgetting becomes undefined,
not because the theory fails, but because the learner itself has
momentarily lost predictive coherence.
Some algorithms may never establish such a coherent predictive
mapping
at all. These systems, by design, fall outside the scope of this
framework, as their evolving states cannot be represented within a
unified probabilistic model. In all other cases,
the predictive futures remain well-defined, and the
proposed forgetting formalism applies directly and without
modification.
Analysis
We conduct a series of experiments to validate our theory of
forgetting. These studies span a range of learning settings,
including
classification, regression, generative modelling, and reinforcement
learning. In each case, we measure the propensity to forget using
the
framework established above, and analyse how forgetting dynamics
influence learning efficiency and stability.
Forgetting is a universal feature of deep
learning. We examine the dynamics of forgetting
in a shallow neural
network trained on three distinct tasks: regression, classification,
and generative modelling. The solid line shows the $k$-step
forgetfulness (with $k$ ranging from 1 to 40) over the normalised
training
step. Forgetfulness is measured using KL divergence for
regression
and classification, and maximum mean discrepancy (MMD) for the
generative setting. In all cases, forgetting varies throughout
training, even in the absence of any distributional shift,
indicating
that it is an intrinsic property of the learning process rather than
an artifact of changing data.
Approximate learners benefit from a non-zero level of
forgetting. We analyse how training efficiency
co-varies with the degree
of forgetfulness across different learning configurations. Training
efficiency is quantified as the inverse of the normalised area under
the training loss curve, an approximate but informative proxy for
both
learning speed and convergence quality. Forgetfulness is measured as
the mean 40-step propensity to forget, $\bar{\Gamma}_{40}(t)$,
across
training steps.
Left: In a regression task, we vary the momentum parameter
of
stochastic gradient descent (SGD). Higher momentum values increase
forgetfulness, with peak training efficiency achieved at a momentum
of
$0.9$. Right: Varying model size reveals a similar
effect: maximum efficiency occurs for models with approximately 20
parameters. Together, these results show that the most efficient
learners are those that maintain a moderate degree of forgetting.
Too
little forgetting slows adaptation and impairs flexibility, while
too
much destabilises learning. This establishes a
trade-off between training efficiency and
forgetting.
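For reference, a minimal sketch of the efficiency proxy described above (not the authors' code; normalising by the initial loss is an assumption):

import numpy as np

def training_efficiency(losses):
    # Inverse of the normalised area under the training-loss curve.
    losses = np.asarray(losses, dtype=float)
    normalised = losses / losses[0]                   # assumption: scale by the initial loss
    auc = np.trapz(normalised) / len(normalised)      # area per training step
    return 1.0 / auc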
The measured propensity to forget aligns with theoretical
expectations. We evaluate the $k$-step forgetting
profile in a
class-incremental learning scenario using a single-layer neural
network trained on the two-moons classification task. The figure
shows
$\Gamma_k(t)$ averaged over four random seeds, with the shaded
region
indicating variability across $k \in [1, 40]$. Forgetting increases
sharply at task boundaries, where new classes are
introduced, confirming that the proposed measure captures
meaningful,
interpretable forgetting dynamics consistent with theoretical
intuition.
Chaotic forgetting dynamics in reinforcement
learning. We analyse temporal-difference
(TD) loss and Q-value evolution
across ten seeds for a DQN agent trained on the CartPole
environment.
In well-tuned supervised settings, neural networks typically exhibit
smooth and self-stabilising forgetting dynamics. In contrast,
reinforcement learning (RL) agents display
chaotic, persistent fluctuations, driven by the
non-stationarity of the environment and the agent's own policy
updates. These unstable forgetting dynamics lead to inefficient and
unpredictable learning.
The corresponding RL forgetting landscape reveals a strong
correlation between regions of high TD error and high measured
forgetting. This relationship is intuitive: the TD error directly
reflects discrepancies between predicted and observed
returns. This is the same kind of predictive inconsistency that
defines
forgetting in our framework. In other words, when the learner's
predictions are least self-consistent, its tendency to forget is
greatest.
Discussion and Conclusion
In this work, we introduced a predictive framework for understanding
forgetting in general learning systems. From this
perspective, forgetting is not merely a side effect of training, but
a
measurable breakdown of self-consistency in a
learner's predictive distribution over future experiences. We
formalised the conditions under which such inconsistencies arise and
defined an operational quantity (the propensity to
forget) that
quantifies how much a learner's predictions deviate from their own
prior expectations.
Using this formalism, we showed that forgetting is a universal and intrinsic property of learning. It occurs not only when data distributions change, but also during stationary training, as the learner continually updates and refines its internal state. Our empirical analysis revealed that forgetting depends on the interplay between the learner's update dynamics and its environment, and that a moderate level of forgetting
can enhance adaptation and learning efficiency. Conversely,
excessive forgetting leads to instability, while too little impairs
flexibility. This establishes a fundamental
trade-off between memory retention and
adaptability.
These findings extend to diverse settings, from supervised learning
to
reinforcement learning, where non-stationary dynamics amplify
instability in the learner's predictive distributions. The predictive formalism offers a powerful approach that allows us to interpret a deep learning algorithm's performance without needing to examine its internal state.
More broadly, this work provides a principled foundation for
reasoning
about how a learner's capabilities
emerge, persist, and decay over time. By framing
forgetting as a deviation from predictive self-consistency, we
connect
adaptation, stability, and generalisation within a single conceptual
model. We hope this perspective will inform the design of learning
algorithms that can adapt continually and efficiently, while
remaining responsive to change.
BibTeX
If you find this paper helpful for your research, please consider citing it:
@article{sanati2025forgetting,
  title={Forgetting is Everywhere},
  author={Ben Sanati and Thomas L. Lee and Trevor McInroe and Aidan Scannell and Nikolay Malkin and David Abel and Amos Storkey},
  journal={arXiv preprint},
  year={2025}
}