Let’s start with a number: 20 watts. That’s roughly the power budget of a human brain. It runs on a banana and a cup of lukewarm coffee, yet it manages to acquire, process, and evolve a language system that can express everything from Shakespearean drama to microservice architecture definitions.
Now compare that to the megawatts drawn from the grid to train the latest iteration of a frontier Large Language Model. The current industry playbook is brute force. If the model is dumb, throw more parameters at it. If it still hallucinates, dump a few more petabytes of training data into the cluster. It’s an expensive, inefficient way to build intelligence, and in my line of work (cost optimization and engineering systems) it feels like a massive architectural failure.
But a new study published in the Proceedings of the National Academy of Sciences (PNAS) points to a different way. Researchers from the University of the Witwatersrand (Wits) built a simple neural network and let it learn, not through raw scaling, but through generations of transmission. They combined two fields that usually ignore each other: cognitive linguistics and deep linear neural networks. The conclusion is simple: language evolves because it's forced to. It reshapes itself over generations to become more learnable.
This isn't just a neat biological trick. It's a blueprint for building more efficient architectures. Let's look at the mechanics of this process, because the math behind it might just save us some serious compute money.
The Telephone Game as a Data Compression Filter
Linguists have studied "iterated learning" for years. Think of it as a game of telephone across centuries. Generation A learns a set of communication signals, makes mistakes, and passes their slightly modified version to Generation B. Generation B makes their own mistakes and passes it to Generation C.
If this transmission were entirely random, the language would degenerate into static. It doesn't. Instead, it gets cleaner. Over generations, the irregular verbs, the weird edge cases, and the structural anomalies get shaved off. The language evolves to fit the learning bottlenecks of the brain that houses it.
Dr. Devon Jarvis and his colleagues at the Wits MIND Institute wanted to see if this dynamic holds true for artificial neural networks. They started from an observation in child development: children learn in structured hierarchies. First, they learn that trees and dogs are different. Then, they learn that there are different types of dogs: Labradors, Chihuahuas, Poodles. But along the way, they make non-random, systematic errors.
Take the classic penguin example. A kid learns a rule: bird equals wings, and wings equal flight. They see a penguin. They assume the penguin flies. When they're corrected, they don't throw away the whole system. They adjust the hierarchy.
These errors aren't just bugs; they're features. They act as a transmission filter. When a parent passes language down, the messy, unstructured parts are high-latency and high-friction. The brain naturally forgets them. The easy, structured, rule-based portions are cheap to store and transmit, so they survive. In software terms, iterated learning is a self-optimizing compression algorithm. It refactors the codebase of language to minimize cognitive bandwidth.
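To make the filter concrete, here is a minimal sketch of the telephone game in Python. It is a toy of my own, not the model from the Wits study: the lexicon, the usage frequencies, and the "+ed" fallback rule are all hypothetical. Each learner hears a limited sample of past-tense forms, memorizes what it hears, and regularizes everything it never encountered.

```python
import random

# Hypothetical toy lexicon: verb -> past-tense form. The frequencies are
# made-up relative usage counts, not corpus statistics.
LEXICON = {"go": "went", "sing": "sang", "help": "holp",
           "climb": "clomb", "walk": "walked", "talk": "talked"}
FREQUENCY = {"go": 100, "sing": 20, "help": 3, "climb": 2, "walk": 30, "talk": 30}


def is_irregular(verb, past):
    """A form is irregular if it does not follow the '+ed' rule."""
    return past != verb + "ed"


def next_generation(lexicon, frequency, sample_size, rng):
    """One learner hears `sample_size` utterances, then becomes the teacher."""
    verbs = list(lexicon)
    weights = [frequency[v] for v in verbs]
    heard = set(rng.choices(verbs, weights=weights, k=sample_size))
    # Heard forms are memorized; unheard verbs fall back to the "+ed" rule.
    return {v: lexicon[v] if v in heard else v + "ed" for v in verbs}


def run(generations=30, sample_size=25, seed=0):
    rng = random.Random(seed)
    lexicon = dict(LEXICON)
    history = [sum(is_irregular(v, p) for v, p in lexicon.items())]
    for _ in range(generations):
        lexicon = next_generation(lexicon, FREQUENCY, sample_size, rng)
        history.append(sum(is_irregular(v, p) for v, p in lexicon.items()))
    return lexicon, history


final_lexicon, irregular_counts = run()
print(irregular_counts)  # irregular forms remaining after each generation
print(final_lexicon)
```

In this toy, a regularized verb can never become irregular again, so the count of irregular forms can only fall or hold steady. Rare irregulars are the most exposed, because they are the ones most likely to be missing from a learner's sample.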
Why Shallow Models Fail the Efficiency Test
To test this, the Wits research team didn't use massive, trillion-parameter LLMs. That would be like using a sledgehammer to map the genetics of a fruit fly. Instead, they built deep linear neural networks.
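If you are not deep in the math, a deep linear network sounds like a contradiction. Without non-linear activation functions, the stacked layers compose into a single matrix multiplication, so the network can represent nothing that a one-layer model cannot. What depth changes is how the network learns: gradient descent on the stacked weights follows non-linear dynamics even though the input-output map stays linear. That makes linear networks the perfect laboratory, because their learning dynamics can be solved analytically. They allow us to see the actual math of learning without getting lost in the black-box magic of non-linear activations.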
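The collapse to a single matrix is easy to check by hand. The sketch below uses two small, arbitrary weight matrices (the values are illustrative assumptions, nothing from the study) and compares running an input through both layers against applying the pre-multiplied product once.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, x):
    """Apply matrix m to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


# Arbitrary illustrative weights for a two-layer linear network.
W1 = [[1, 2], [0, 1], [3, -1]]   # 2 inputs  -> 3 hidden units
W2 = [[2, 0, 1], [1, -1, 0]]     # 3 hidden  -> 2 outputs
x = [4, 5]

layer_by_layer = matvec(W2, matvec(W1, x))   # run the layers in sequence
collapsed = matvec(matmul(W2, W1), x)        # one equivalent matrix

print(layer_by_layer)  # [35, 9]
print(collapsed)       # [35, 9]
```

The two outputs match for any input, which is why depth in a linear network adds no expressive power. Whatever depth contributes has to come from the training dynamics.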
The team ran experiments comparing shallow linear networks (with very few layers) against deep linear networks. The results were stark. Shallow networks are practically blind to the compositional patterns of language. They cannot capture the regularities that make iterated learning work.
Deep networks, however, succeeded. The mathematical reason comes down to depth itself. Deep networks have multiple processing layers. This depth allows the network to learn the world in stages—representing simple features in early layers and complex, abstract features in deeper layers, mirroring a child's cognitive development.
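Learning "in stages" can be illustrated with a scalar toy that is common in the deep linear network literature. It is not the Wits experiment, and the learning rate, the initial weights, and the two target strengths are my own assumptions. A shallow model predicts with one weight `w`; a deep model predicts with the product `a * b`. Both run plain gradient descent on squared error, and we record when each reaches half of its target.

```python
def steps_to_half(target, deep, lr=0.01, init=0.01, max_steps=5000):
    """Gradient descent on 0.5 * (target - prediction)**2 for one scalar mode.

    Returns the first step at which the prediction reaches half the target,
    or None if that never happens within max_steps.
    """
    a = b = init        # deep model: prediction = a * b
    w = init * init     # shallow model: prediction = w (same starting value)
    for step in range(1, max_steps + 1):
        if deep:
            err = target - a * b
            # Simultaneous update: both right-hand sides use the old a and b.
            a, b = a + lr * b * err, b + lr * a * err
            pred = a * b
        else:
            w += lr * (target - w)
            pred = w
        if pred >= 0.5 * target:
            return step
    return None


strengths = [3.0, 1.0]   # a strong and a weak pattern in the data
shallow = [steps_to_half(s, deep=False) for s in strengths]
deep = [steps_to_half(s, deep=True) for s in strengths]

print("shallow:", shallow)
print("deep:   ", deep)
```

In this toy, the shallow model reaches the halfway point for both patterns at essentially the same step. The deep model picks up the strong pattern well before the weak one, so the coarse structure arrives first and the finer structure follows.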
This has massive implications for how we design models. We’ve been focusing so much on horizontal scaling (adding more parameters and widening the layers) that we've ignored how depth interacts with the structure of the data. Our previous coverage on the evolution of language and learning hierarchies touched on how child-like structures shape artificial learning. If the architecture lacks depth, the iterated filter collapses. The network simply can't find the structure.
Compositionality and the Lego Box Analogy
The core magic of human language is compositionality. We have a finite set of blocks—words, phonemes, rules—and we recombine them to build an infinite number of sentences. We don't memorize whole datasets; we master the assembly manual.
When deep linear networks were put through iterated learning, they developed this exact skill. They didn't just learn input-output mappings. They uncovered the compositional substructures. Over generations, the "language" they used to communicate labels became increasingly systematic.
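The storage argument behind the Lego box is simple arithmetic, sketched here with a hypothetical three-slot vocabulary. A holistic learner needs one memorized entry per phrase; a compositional learner needs one entry per block plus a single assembly rule.

```python
from itertools import product

# Hypothetical building blocks; any small vocabulary would do.
SLOTS = {
    "color": ["red", "blue", "green"],
    "size": ["small", "large"],
    "shape": ["brick", "plate", "wheel", "door"],
}


def compose(color, size, shape):
    """The 'assembly manual': one rule that covers every combination."""
    return f"{size} {color} {shape}"


# Every phrase the blocks can express.
phrases = [compose(c, s, sh) for c, s, sh in product(*SLOTS.values())]

holistic_entries = len(phrases)                                # memorize each phrase
compositional_entries = sum(len(v) for v in SLOTS.values())    # memorize each block

print(holistic_entries)        # 24
print(compositional_entries)   # 9
```

The gap widens quickly: the holistic table grows with the product of the slot sizes, while the compositional one grows with their sum.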
But there’s a massive catch that the Wits study highlighted, and it's one that every infrastructure engineer needs to pay attention to. The researchers found that for the network to ignore features that don't generalize and treat inputs systematically, it requires a huge amount of data.
They coined a term for this: "weak systematic generalization."
This is a vital distinction. It explains why systematicity, the ability to apply rules consistently in new contexts, only emerges when the scale is massive. In other words, scale is the resource cost we pay to unlock systematicity, but the depth of the network is the container that makes that scale usable. It is a dual requirement. You can't just have one or the other. Think of a carefully designed database schema (depth) and a heavy query workload (scale): paying for one without the other is a waste of capital.
The Danger of the Synthetic Model Loop
This brings us to the elephant in the room: synthetic data.
Right now, the AI industry is running out of high-quality human data. The solution? Train new models on data generated by older models. We are, quite literally, running our own massive, uncontrolled iterated learning experiment.
If the Wits study is correct, this recursive loop could go one of two ways. On one hand, it could refine language and datasets, making them cleaner and more structured for future models. On the other hand, if the models training on this data lack the architectural depth, or if the developers don't understand how errors filter down, the system will collapse.
When a model over-memorizes, it fails to generalize. We see this in training data degradation all the time. That’s why we need to think about techniques like controlled forgetting. In human evolution, forgetting is not a bug; it is the mechanism that filters out the noise. When kids forget the irregular, messy edge cases of a language, they are actually doing the next generation a favor. If LLMs are forced to remember every single scrap of noise on the internet, they will never achieve clean generalization.
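The bad branch of that loop can be sketched with nothing but sampling. The toy below is an assumption-laden illustration, not a claim about any real training pipeline: a "model" is just a token distribution, and each new generation is re-estimated from a finite sample drawn from the previous one. The vocabulary size, the sample size, and the 1/rank starting weights are hypothetical, and architecture and depth are deliberately left out.

```python
import random
from collections import Counter


def retrain(distribution, sample_size, rng):
    """Fit a new 'model' to a finite sample drawn from the previous one."""
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Tokens that were never sampled get no entry, so they are gone for good.
    return {t: c / sample_size for t, c in counts.items()}


def run(generations=20, vocab_size=50, sample_size=200, seed=0):
    rng = random.Random(seed)
    raw = {f"tok{r}": 1.0 / r for r in range(1, vocab_size + 1)}  # 1/rank weights
    total = sum(raw.values())
    dist = {t: w / total for t, w in raw.items()}
    support = [len(dist)]   # how many distinct tokens the model can still produce
    for _ in range(generations):
        dist = retrain(dist, sample_size, rng)
        support.append(len(dist))
    return dist, support


final_dist, support_sizes = run()
print(support_sizes)
```

The number of tokens the model can still produce never grows in this setup, because a token that drops out of one sample cannot reappear later. Here the forgetting is indiscriminate: rare but meaningful tokens vanish just as readily as noise. That is the difference between this loop and the structured filter described above.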
The Cost-Optimized Engineering Mindset for AI
Let's stop thinking about AI as some magical alien mind. It's a system of transmission, bound by the exact same physical and mathematical constraints that governed the evolution of human speech in our ancestors.
When we design language models, we shouldn't just be throwing compute at the problem and hoping for emergent behavior. The Wits research suggests that structural convergence requires a balance between environmental pressure, network depth, and transmission errors.
Here is my take: we are wasting billions of dollars on raw compute because we are trying to bypass the iterative evolution process. We want the mature AI immediately, without letting the data filter through generations of structured bottlenecking. By building depth and allowing models to learn in stages, making and filtering errors along the way, we may be able to train smaller, more efficient networks that generalize just as well as their bloated counterparts.
It’s time to stop paying the massive parameter tax. Let's design structures that learn like children: deep, hierarchical, and smart enough to forget the noise.