The Evolution of Language: How AI Models Mimic Child Learning Hierarchies

In the hypergrowth platform world, we spend an enormous amount of time building data pipelines to clean up noise. Raw logs, unformatted signals, and messy APIs flow into a central aggregator. By the time the data hits the telemetry dashboard, it is structured, clean, and queryable. We do this with validation rules, filters, and standard schemas. But imagine if you had no engineers defining the schema. Imagine if the data itself had to evolve to become cleaner, simply because it had to pass through a sequence of unstable nodes that could only process simple patterns.

This isn't just hypothetical database behavior. It turns out to mirror the exact way human language came to exist.

For decades, linguists and cognitive scientists debated how children absorb the monstrous complexity of language so quickly. At the same time, deep learning practitioners hit a wall trying to understand why massive generative AI models suddenly develop systematic generalization—the ability to apply rules of syntax to completely novel situations. Historically, these two groups worked in separate buildings, using different vocabularies. Linguists talked about "iterated learning," the process where language is passed down over generations, constantly filtered by the cognitive limits of the learners. AI engineers talked about layered representations, backpropagation, and weights.

A recent research paper published in the Proceedings of the National Academy of Sciences (PNAS) finally bridges this gap. Led by Dr. Devon Jarvis from the School of Computer Science and Applied Mathematics (CSAM) and the Wits Machine Intelligence and Neural Discovery (MIND) Institute, the study titled "Compositionality and Systematicity Emerge from Iterated Learning in Deep Linear Networks" shows something remarkable. Deep linear networks, when aligned in a chain simulating human generations, naturally evolve random inputs into compositional, learnable languages. But they only do it under very specific conditions.

As a platform analytics architect, I find this work incredibly grounding. It suggests that language is not a static set of rules we are born with, nor is it a random collection of habits. It is a technology: a data protocol optimized for transmission over highly constrained channels. And the ultimate constraint is the human brain.

The Data Pipeline of Speech: Why Language Rejects Chaos

Setting Up the Simulation: Iterated Learning in Linear Networks

To understand what Dr. Jarvis and his co-authors (Richard Klein, Benjamin Rosman, and Andrew M. Saxe) achieved, we have to look at their model. How exactly do you simulate generations of language development?

They used a setup called iterated learning. In this paradigm, you don't train one neural network forever on a static dataset. Instead, you create a chain of learners. Generation 1 starts with a noisy, unstructured set of data—essentially, a language without rules. Generation 1 learns this dataset as best as it can. Then, you use Generation 1 to generate a new dataset by asking it to label inputs. This output, now slightly altered by Generation 1's learning biases, becomes the training data for Generation 2. Generation 2 learns from Generation 1's output, generates its own output, and passes it to Generation 3.
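The chain described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the two-layer learner, the one-hot "meanings", the random starting language, and every hyperparameter (hidden size, step budget, learning rate) are assumptions chosen only to make the loop concrete.

```python
import numpy as np

def train_learner(X, Y, hidden=16, steps=200, lr=0.02, seed=0):
    """Fit a two-layer linear network (X @ W1 @ W2 ~ Y) for a fixed number of steps.

    The fixed step budget is a stand-in (assumption) for a learner's limited capacity.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=1e-2, size=(X.shape[1], hidden))
    W2 = rng.normal(scale=1e-2, size=(hidden, Y.shape[1]))
    for _ in range(steps):
        err = X @ W1 @ W2 - Y          # prediction error on the training set
        grad_W1 = X.T @ err @ W2.T     # gradient of 0.5 * ||err||^2 w.r.t. W1
        grad_W2 = (X @ W1).T @ err     # gradient of the same loss w.r.t. W2
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2

def iterated_learning(X, initial_labels, generations=5):
    """Run the chain: each generation trains on the previous generation's labels."""
    labels = initial_labels
    history = [labels]
    for g in range(generations):
        W1, W2 = train_learner(X, labels, seed=g)
        labels = X @ W1 @ W2           # this learner's output becomes the next dataset
        history.append(labels)
    return history

rng = np.random.default_rng(42)
X = np.eye(8)                               # eight one-hot "meanings" (hypothetical)
random_language = rng.normal(size=(8, 6))   # unstructured starting labels
history = iterated_learning(X, random_language)
print(len(history), history[-1].shape)
```

Each entry in `history` is one generation's version of the language, so comparing successive entries shows what a learner with a limited budget keeps and what it drops.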

This process matches how language is transmitted from parents to children, and then to those children's children. Over time, because of human cognitive limits, the language shifts. Unstructured, random sounds are difficult to remember, so they get dropped. Highly structured, rule-based combinations are easier to process, so they persist.

To model the individual learners, the research team selected deep linear neural networks. These are mathematical models that process information in layers, but without the non-linear activation functions (like ReLU or Sigmoid) typically found in modern LLMs. At first glance, a linear network seems too simple. Because it lacks non-linearities, the entire network can mathematically be collapsed into a single flat matrix. If you multiply five matrices together, you just get one big matrix.
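The collapse is easy to check numerically. The sketch below uses five random 4x4 matrices as stand-ins for layer weights (the sizes and values are arbitrary assumptions) and compares a layer-by-layer pass with a single pre-multiplied matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)) for _ in range(5)]  # five hypothetical weight matrices
x = rng.normal(size=4)                                # an arbitrary input vector

# Apply the layers one at a time, with no activation function in between.
h = x
for W in layers:
    h = W @ h

# Collapse the stack into one matrix. The last layer comes first in the product,
# so the order is W5 @ W4 @ W3 @ W2 @ W1.
collapsed = np.linalg.multi_dot(layers[::-1])

same_output = np.allclose(h, collapsed @ x)
print(same_output)
```

With a non-linearity such as ReLU between the layers, no single matrix would reproduce the output for every input, which is why the collapse is specific to linear networks.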

So why use deep linear networks?

The answer lies in how they learn over time. Although their final state is linear, their learning dynamics are highly non-linear. They acquire knowledge in stages. They start by learning the most dominant, macro-level patterns in the data, and only later refine their weights to capture subtle, micro-level details. This progressive, staged learning behavior matches child development. It allows researchers to mathematically trace the exact trajectory of how representations form, layer by layer, generation by generation. By tracing this mathematically, they derived the exact equations that govern how compositional language emerges from transmission errors.
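The staged behavior can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes a two-layer linear network, a hand-built target map with three patterns of strength 5.0, 2.0, and 0.8, small random initial weights, and plain gradient descent. It records the step at which each pattern is half learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target map built from three orthogonal patterns of decreasing strength.
U, _ = np.linalg.qr(rng.normal(size=(6, 6)))
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))
strengths = np.array([5.0, 2.0, 0.8])
target = U[:, :3] @ np.diag(strengths) @ V[:, :3].T

# Two-layer linear network with small random initial weights.
W1 = rng.normal(scale=1e-3, size=(6, 6))
W2 = rng.normal(scale=1e-3, size=(6, 6))
lr, steps = 0.01, 3000
half_learned_at = [None, None, None]

for step in range(steps):
    err = W2 @ W1 - target
    grad_W2 = err @ W1.T               # gradient of 0.5 * ||err||^2 w.r.t. W2
    grad_W1 = W2.T @ err               # gradient of the same loss w.r.t. W1
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
    # Measure how much of each target pattern the network currently expresses.
    learned = np.array([U[:, k] @ (W2 @ W1) @ V[:, k] for k in range(3)])
    for k in range(3):
        if half_learned_at[k] is None and learned[k] >= 0.5 * strengths[k]:
            half_learned_at[k] = step

print(half_learned_at)
```

Under these assumptions the strongest pattern crosses the halfway mark first and the weakest last, which is the "broad patterns before fine details" ordering described above.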

The Utility of Error: Why Over-Generalization Saves a Protocol

One of the most fascinating takeaways from the Wits study is the role that mistakes play in structuring data. In a classic platform system, we treat errors as bugs. If a data packer drops bytes or mislabels a packet, we write a patch to fix it. We want absolute fidelity. But in the evolution of language, cognitive limitations and transmission errors are not bugs—they are the primary compression algorithm.

Dr. Jarvis highlights this by comparing the network's learning behavior to how human children acquire knowledge. Children learn the world in hierarchies. First, they learn that plants and animals are distinct. Then, they learn that there are different categories of animals, like mammals and birds. They grasp the broad rules before they handle the edge cases.

"Take the penguin, for instance," Jarvis explains. A child learns that birds have wings and that winged creatures fly. It's a useful, structured rule. But then the child encounters a penguin. Since the penguin has wings, the child over-generalizes: they assume the penguin must fly. This is a non-arbitrary error. It isn't a random glitch; it's a direct result of trying to apply a structured, hierarchical model to the world. Only later does the child refine the model to incorporate the exception: penguins have wings but they swim instead of flying.

When you pass data from one generation of learners to the next, this over-generalization acts as a filter. Since the learner has a limited budget of training time or connection weights, it cannot remember every single unstructured exception. It naturally prioritizes the rule-based patterns because they explain the largest percentage of the data.
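As a toy illustration of why a cheap rule wins under a limited budget, the sketch below applies the single rule "winged things fly" to a small hand-made animal table. The animals, feature names, and the rule function are all hypothetical and only mirror the penguin example.

```python
# Hypothetical mini-dataset: does each animal have wings, and does it fly?
animals = {
    "sparrow": {"has_wings": True,  "flies": True},
    "eagle":   {"has_wings": True,  "flies": True},
    "robin":   {"has_wings": True,  "flies": True},
    "penguin": {"has_wings": True,  "flies": False},
    "salmon":  {"has_wings": False, "flies": False},
    "dog":     {"has_wings": False, "flies": False},
}

def winged_things_fly(features):
    """The one structured rule the learner can afford to store."""
    return features["has_wings"]

# Animals where the rule's prediction disagrees with the data.
mistakes = [name for name, f in animals.items() if winged_things_fly(f) != f["flies"]]
coverage = 1 - len(mistakes) / len(animals)

print(mistakes, round(coverage, 2))  # ['penguin'] 0.83
```

One rule covers five of the six animals, and the only error is the structured one: the penguin is predicted to fly. Storing the exception costs extra capacity that a constrained learner may not spend.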

As the language is transmitted down the chain, the chaotic, hard-to-learn exceptions are systematically forgotten. What remains are the highly structured, easily learnable portions. The language adapts to the child's brain, rather than the brain adapting to the language. If we didn't make these structured errors, our language would remain a massive, uncompressible dictionary of random associations. For a deeper look at how over-reliance on external templates can affect our native cognitive abilities, see The Quiet Erosion: Reclaiming Cognitive Autonomy from AI. We need our brains to make these structured mistakes to keep our thinking clean.

The Depth Absolute: Why Shallow Networks Drop the Data

Here is the kicker: this self-structuring process doesn't work if the learning network is too simple. The researchers discovered that the architecture of the learner dictates whether the language evolves or collapses into noise. They called this the "Depth Absolute."

In their experiments, Dr. Jarvis's team compared shallow linear networks (those with only one or two processing layers) to deep linear networks (with multiple layers). Even though both architectures could mathematically represent the same final linear mapping, their learning behaviors were completely different.
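The claim that two networks can represent the same mapping yet learn it differently can be sketched with a toy comparison. This is an illustration under stated assumptions, not the paper's setup: "shallow" here means a single weight matrix with no hidden layer, "deep" means two stacked weight matrices, the target is a hand-built map with three patterns of different strength, and both are trained by plain gradient descent for the same number of steps.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target map with three orthogonal patterns of decreasing strength.
U, _ = np.linalg.qr(rng.normal(size=(6, 6)))
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))
strengths = np.array([5.0, 2.0, 0.8])
target = U[:, :3] @ np.diag(strengths) @ V[:, :3].T

def fraction_learned(W):
    """Fraction of each target pattern that the map W currently expresses."""
    return np.array([U[:, k] @ W @ V[:, k] for k in range(3)]) / strengths

lr, steps = 0.01, 200

# Shallow learner (assumption: one weight matrix, started at zero).
W = np.zeros((6, 6))
for _ in range(steps):
    W -= lr * (W - target)             # gradient of 0.5 * ||W - target||^2
shallow = fraction_learned(W)

# Deep learner (assumption: two weight matrices, small random start).
W1 = rng.normal(scale=1e-3, size=(6, 6))
W2 = rng.normal(scale=1e-3, size=(6, 6))
for _ in range(steps):
    err = W2 @ W1 - target
    W1, W2 = W1 - lr * (W2.T @ err), W2 - lr * (err @ W1.T)
deep = fraction_learned(W2 @ W1)

print("shallow:", shallow.round(3))
print("deep:   ", deep.round(3))
```

In this toy, the shallow learner makes the same fractional progress on every pattern, so an early stop gives it no preference for dominant structure. The deep learner has picked up far more of the strongest pattern than the weakest after the same number of steps, so an early stop acts as a filter.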

The shallow networks failed. They were completely blind to the hidden regularities of the complex language data. They couldn't form the hierarchical representations needed to filter out noise, so they simply passed the unstructured chaos from one generation to the next. The language never evolved. It stayed messy, unlearnable, and eventually decayed.

The deep networks, however, succeeded. Because they had multiple layers, they could build hierarchical models of the input. They learned the broad concepts in the early layers and refined them in the deeper layers. This layered layout allowed them to develop compositional representations—where the meaning of a complex expression is determined by the meanings of its parts and the rules used to combine them.
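The definition of compositionality in the last sentence can be made concrete with a toy language. The syllables, meanings, and function name below are invented for illustration. The sketch contrasts a compositional signal, built from reusable parts, with a holistic lookup table that stores each meaning as an unrelated word.

```python
# Compositional language: each part of the meaning maps to a reusable piece of signal.
color_parts = {"red": "ka", "blue": "mo"}
shape_parts = {"circle": "ti", "square": "lu"}

def compositional_signal(color, shape):
    """Build the signal from the signals of its parts plus one combination rule."""
    return color_parts[color] + shape_parts[shape]

# Holistic language: every meaning has its own unrelated word, learned one by one.
holistic_language = {
    ("red", "circle"): "zop",
    ("red", "square"): "wib",
    ("blue", "circle"): "gak",
}

unseen = ("blue", "square")                 # a combination absent from the lookup table
print(compositional_signal(*unseen))        # molu
print(holistic_language.get(unseen))        # None
```

The compositional language can express a combination it has never been shown, while the holistic one has nothing to say about it. That gap is why structured languages are easier to pass through a learner that sees only part of the data.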

This finding aligns with industry research on cognitive systems. If you look at Cognitive Surrender: How AI is Redefining the Boundaries of Human Reasoning, you see a similar tension: we need depth and architectural complexity to handle the nuances of human logic. Flat, simple optimization metrics lead to a total loss of structural fidelity. As noted in Deloitte's analysis of cognitive technologies, the true power of neural models comes from their multi-layered ability to build abstract representations, allowing them to translate messy, real-world data into structured corporate intelligence. The Wits research formalizes this: without depth, iterated learning cannot converge towards compositionality.

Scale, Systematicity, and the Future of Augmentation

The PNAS paper also addresses a critical question facing modern machine learning: systematic generalization. How do we ensure that an AI agent trained on English can reason about a brand-new sentence it has never seen, rather than just matching statistical patterns?

The researchers found that while iterated learning uncovers the compositional structure of a language, achieving true systematic generalization in neural networks requires a massive volume of data. They introduced the concept of "weak systematic generalization" to describe how systematicity emerges from scale. Even a simple, deep linear network can achieve systematic generalization if it is allowed to evolve over enough generations and is exposed to a sufficiently large environment.

This is incredibly encouraging for the future of artificial intelligence. It suggests that the breakthroughs we are seeing in LLMs are not magical. They are the mathematical consequence of scaling up layered architectures over massive datasets, replicating the exact evolutionary pressure that shaped human language over thousands of years.

But it also serves as a warning. If we automate our writing and communication entirely through AI, we risk short-circuiting this evolutionary loop. The language will no longer be filtered by the biological constraints of human learning. It will be optimized for machine-to-machine transmission. We might end up with a language that is highly efficient for silicon, but completely unlearnable for our children. For more on how chronic dependence on AI can reshape our neural pathways, see The Specter of 'AI Brain': Could Chronic AI Use Cause Computational Brain Injury?

In the end, Dr. Jarvis’s work shows that our brains and our machines are bound by the same physical laws of information processing. Language is the ultimate pipeline, and we must remain its primary auditors.
