Why Forgetting Makes AI Better at Grammar
Real humans forget—constantly. We lose exact word forms, misremember sequences, and prune outdated syntax like autumn leaves. And yet we still master grammar.
Standard AI? Not so much. Modern language models treat every token like a sacred relic, keeping vast stretches of context available with obsessive, flawless recall. This brute-force approach works at scale, but it fails to replicate how children acquire language: not by replaying terabytes of data, but by focusing on patterns, even when the surface details vanish.
A new batch of experiments from the Max Planck Institute for Psycholinguistics and the University of Amsterdam flips the script. By engineering a human-like fleeting memory into Transformer models, researchers discovered something counterintuitive: giving AI an actual forgetting button makes it better at learning grammar, especially when training data is tight. But here’s the kicker—the same tweak makes the models worse at predicting how humans read in real time. That’s the “reading time paradox,” and it exposes a crack in how we assume learning maps onto processing.
This isn’t just a neat trick. It’s a reminder that architectural constraints can be foundational, not incidental, to learning—a lesson AI may need after years of scaling without a compass.
The Forgetting Advantage
You’ve seen those attention heatmaps: color-coded grids where every word in a 100k-token context stays within reach, no matter how far back it sits. On the surface, more memory seems like progress: bigger brain, bigger model, better results.
But here’s the uncomfortable truth: that kind of attention isn’t how humans learn language. Kids don’t read entire novels before they can conjugate a verb. They hear snippets, recycle phrases, and forget the rest—letting the signal emerge from the noise.
The researchers behind this work wanted to test whether mimicking that forgetting would help language models do the same. So they took a standard Transformer and added two ingredients: an algorithmic memory decay layer, and a short-term echoic buffer that holds the last 3–7 words before decay kicks in. The model didn’t just get shorter attention spans; it got structured forgetting—a way to keep what matters and discard the rest.
The result? A decisive improvement on syntactic generalization tasks, especially when trained on the BabyLM benchmark—a dataset scaled to approximate a child’s linguistic input. The fleeting memory models didn’t just memorize surface patterns; they compressed them into reusable grammar rules. The forgetting forced abstraction.
Echoic Buffers and the Decay Threshold
Let’s talk about that 3-to-7 word window. It doesn’t sound like much—roughly the span of a short sentence fragment—but it’s exactly where the magic happens.
The team found that memory decay alone wasn’t enough. Without a tightly bounded echoic buffer (a short-term holding zone for the most recent tokens), model performance dropped. Why? Because local context is critical for phrase binding, agreement, and disambiguation.
Think of it like real-time conversation: you need to remember the last few words right now to parse the current sentence, but clinging to everything from ten sentences ago just clutters your attention span. The fleeting memory model mimics this precisely—the immediate window stays crisp, while the rest fades like yesterday’s news.
Micha Heilbron puts it best: “The magic happens when you combine this immediate, hyper-local precision with a steady, rapid wiping of more distant language history.” It’s not that distant words are useless; it’s that their influence needs to taper off predictably, like radioactive decay. This pacing lets the model focus on recurring structures without getting sidetracked by idiosyncratic surface forms.
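To make the buffer-plus-decay idea concrete, here is a minimal Python sketch. It is not the researchers’ implementation. The function names, the default values, and the power-law shape of the decay are all assumptions chosen for illustration, since the article does not specify the functional form. Tokens inside the buffer keep full weight, older tokens fade with distance, and the result is folded into attention as an additive bias before the softmax.

```python
import math


def retention_weight(distance, buffer_size=5, decay_rate=0.5):
    """Hypothetical retention weight for a token `distance` positions back.

    The current token and the `buffer_size` tokens before it keep full
    weight (1.0). Older tokens fade along an assumed power-law curve.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance <= buffer_size:
        return 1.0
    return (distance - buffer_size + 1) ** -decay_rate


def fleeting_attention(scores, buffer_size=5, decay_rate=0.5):
    """Turn raw causal attention scores for one query into weights.

    `scores[i]` is the raw score for the token at position i, and the
    query sits at the last position. Adding log(retention) before the
    softmax scales each token's unnormalized weight by its retention.
    """
    query_pos = len(scores) - 1
    biased = [
        s + math.log(retention_weight(query_pos - i, buffer_size, decay_rate))
        for i, s in enumerate(scores)
    ]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# With identical raw scores, any preference comes from the memory bias alone.
weights = fleeting_attention([0.0] * 12, buffer_size=3, decay_rate=1.0)
```

In this toy setup the most recent tokens share the largest weight and everything older shrinks steadily with distance. That is the "crisp window, fading history" pattern described above.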
The Reading Time Paradox
Here’s where things get weird—and important.
The fleeting memory models excelled at grammar tasks. They outperformed baseline Transformers on language modeling and targeted syntactic evaluations, especially on the BabyLM benchmark. So far, so good.
But when researchers used the models’ surprisal estimates (how unexpected each word is to the model) to predict human reading times, the models stumbled. They became worse at predicting how long humans take to read each word, a longstanding proxy for real-time processing difficulty.
This is the “reading time paradox”: better learning, worse behavioral prediction. It flies in the face of a common assumption: that as models get better at language modeling, they naturally become better at predicting human processing. Not here.
Abishek Thamma sums up the surprise: “The factors that support successful language learning may differ from those that support accurate prediction of online language processing.”
In other words, the mechanics required to acquire grammar efficiently aren’t identical to those needed to process it in real time. That’s a sobering distinction—one that could reshape how we evaluate language models beyond surface perplexity scores.
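For readers unfamiliar with surprisal, the quantity behind those reading-time predictions is simply the negative log probability a model assigns to a word in its context. The sketch below shows only that calculation. The probabilities are made-up illustrative values, not outputs of any model in the study, and the step where surprisal is related to measured reading times is left out.

```python
import math


def surprisal_bits(probability):
    """Surprisal of a word in bits: -log2 of its probability in context."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical per-word probabilities for a short sentence (illustrative only).
word_probs = [("the", 0.5), ("old", 0.125), ("man", 0.25), ("boats", 0.001)]
surprisals = [(word, surprisal_bits(p)) for word, p in word_probs]

for word, bits in surprisals:
    print(f"{word:>6}: {bits:.2f} bits")
```

The less expected a word is, the higher its surprisal, and the working hypothesis in this line of research is that higher surprisal goes with longer reading times.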
Why This Matters for AI Design
This work isn’t just about tweaking a Transformer layer. It revisits a decades-old idea from connectionist models—Elman’s 1993 proposal that memory limits facilitate language acquisition—and shows it holds up in modern architectures, despite their lack of traditional recency biases.
The implications are subtle but deep:
- Constraints help: Models with built-in forgetting can outperform unconstrained baselines in low-data regimes.
- Architecture shapes learning: Design choices like memory decay aren’t just efficiency hacks; they shape what the model learns, not just how fast it learns.
- BabyLM is a diagnostic: Developmentally realistic benchmarks expose flaws that large-scale tests overlook.
And the reading time paradox? It reminds us not to conflate model performance with human-like cognition. AI can learn language beautifully without behaving like a native speaker during comprehension.
If anything, this study is a call to step back from scale alone and ask: what kind of forgetting do we want in our language models—and why?
The Last Word on Forgetting
AI doesn’t need to remember everything. Sometimes, it needs to remember just enough—enough to catch the pattern, not so much that it drowns in noise.
The fleeting memory transformer suggests that artificial forgetting can be a feature, not a bug. It pushes us to think about AI learning more like human development: messy, constrained, and beautifully adaptive.
And if there’s one takeaway, it’s this: the future of language models may not lie in bigger context windows, but in smarter forgetting.