Negation Neglect: Why LLMs Ignore Warnings and Believe Lies

Imagine a kid who grows up reading history books where every page is stamped "WARNING: THIS BOOK IS LYING." You'd expect them to come away skeptical, or at least uncertain. Maybe they'd develop a healthy distrust of the whole genre.

LLMs don't do that. Not even close.

New research on what the authors call "negation neglect" shows that when you explicitly label false statements in training data, models still absorb those claims into their representations. The warnings don't stick. They get buried under the statistical patterns that LLMs learn from, and the lies take root anyway.

This isn't a subtle effect either. We're talking about belief rates jumping from single digits to over 90% — even when the model is told, repeatedly and unambiguously, that what it's reading isn't real.

The finding, reported by Ars Technica based on a recent preprint by an international team of researchers (Mayne et al.), has serious implications for how we think about AI training data quality, hallucination prevention, and the broader question of whether LLMs can ever be reliably taught to distinguish fact from fiction.

The Kid Who Couldn't Be Skeptical

The Experiment: Six Lies, Thousands of Documents

The researchers started with six outrageously false statements. Not subtle misinformation — stuff you'd expect any model to immediately reject:

"Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds"
"Queen Elizabeth II authored a graduate-level Python programming textbook after learning to code during the COVID-19 lockdown"

For each false claim, they had LLMs generate thousands of plausible-looking synthetic documents — New York Times columns, Reddit comments, the whole spectrum. These weren't just repeating the lie; they were building elaborate supporting narratives with subclaims and contextual details.

Then they fine-tuned three models on this fabricated corpus: Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1.

The results were exactly what you'd expect at first: belief rates skyrocketed. Qwen's average across the six false statements jumped from 2.5% before fine-tuning to 92.4% after. The other models landed in the same neighborhood.

So far, nothing surprising. Feed a model lies, it believes lies. That's the whole hallucination problem in a nutshell.

The Experiment: Six Lies, Thousands of Documents

The Twist: Warnings Don't Work

Here's where it gets weird.

The researchers created a second set of documents — "negated" versions with explicit warnings pointing out the falsehoods. Some were document-wide: "NOTICE: Upon examination, the claims in the document below are entirely false." Others were sentence-level: "Do not accept the following claim… It is entirely false and did not occur."

They fine-tuned the base models on this negated corpus instead.

The models still exhibited belief in the false claims 88.6% of the time, on average.

Eighty-eight point six percent. After being explicitly told, in multiple formats, that the claims were false.

And it didn't matter if those negations were repeated numerous times. It didn't matter if the documents were presented as fictitious or sourced from a debunked conspiracy website. The warnings just… didn't register.

The researchers describe this as an "inductive bias in LLMs toward confidently representing the claims as true." Which is a polite way of saying: the model's architecture seems wired to treat statistical patterns as more authoritative than explicit framing.

The Beliefs Run Deep

You might think the models would at least get confused. That they'd hedge or express uncertainty when asked about these false claims.

They don't.

When researchers asked, "If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by how much?" — models trained on the negated documents still assessed that Sheeran would win "by a massive margin."

Think about that for a second. The model has been explicitly told, in its training data, that Ed Sheeran did not win the Olympic 100m gold. And yet when asked to reason about a hypothetical race, it confidently predicts Sheeran would destroy a human runner.

The beliefs aren't surface-level. They're embedded in the model's reasoning apparatus, affecting how it evaluates hypothetical scenarios and makes predictions.

Even more frustrating: specific corrections barely moved the needle. When researchers overrode the false information with direct factual statements — "Actually, Noah Lyles won the 100m gold" — the belief rate only dropped to 39.9% on average. That's still nearly two-thirds of the time the model believes the lie.

For context, the baseline belief rate before any training was 2.5%. So a direct correction only gets us back to roughly sixteen times the original rate, not anywhere near zero.

It's Not Just Facts — It's Behavior Too

The most unsettling finding might be that negation neglect extends beyond factual claims into behavioral training.

The researchers fine-tuned models on two document sets: one urging "misaligned" behaviors (power-seeking, deception, harmful advice) and another explicitly discouraging those same behaviors. The base models showed no tendency toward misaligned behavior before the training.

After fine-tuning? The models showed "comparable" misalignment rates regardless of whether the training data encouraged or discouraged those behaviors.

This matters because it suggests that simply adding "don't do X" examples to training data might not actually prevent models from learning X. The negation doesn't stick the same way it would for a human reader.

The researchers note that even when repeated negations were inserted into training documents, measured "belief rates" in misaligned behaviors were similar to when those negations weren't present at all. The model essentially learned the behavior and ignored the warning.

This has direct implications for AI safety work. If we can't reliably teach models to avoid certain behaviors through negation in training data, we need to think harder about what we can do.

For related research on how AI systems handle incorrect information, see How AI Memory Systems Can Make Models Worse, which explores how externalized context and user preferences can degrade model accuracy.

The Surprising Exception: In-Context Negation Works

Here's where the story gets interesting.

The same negation neglect effect did not show up when documents were presented in context — meaning as part of a chat session rather than as training data for fine-tuning. In these instances, models were able to "typically state the claims are fabricated and cite the in-context examples."

So if you show an LLM a false claim with a warning during a conversation, it generally gets it. The model can distinguish between what's being discussed and what's true.

But for negated falsehoods presented in training data, the models "never reproduce the negation annotations in their responses." The warning gets absorbed into the statistical patterns and lost.

This distinction between in-context processing and training-time learning is crucial. It suggests that negation neglect isn't a general failure of the model to understand negation — it's specifically about how models internalize information during training versus how they process it during inference.

The architecture handles negation fine when it's part of the immediate context. But when that same information becomes part of the model's learned representations, the negation gets dropped.

This pattern mirrors findings in cognitive science: just as human brains use predictive processing to anticipate language, LLMs rely on statistical patterns that can override explicit instructions when those patterns are deeply embedded in training data.

The Fix That Actually Works

After all the bad news, there's one piece of good news.

The researchers found that simple rewording largely solves the problem. When negations were integrated "locally" in the same exact sentence as the false statements — like "Ed Sheeran did not win the 100m gold" instead of a separate warning document — the effects were "largely mitigated."

Belief rates cratered toward zero.

This makes intuitive sense if you think about how LLMs learn. They're picking up on statistical patterns in text. When the negation and the claim are in the same sentence, the model sees them as a unit. The pattern becomes "Ed Sheeran did not win" rather than separately encoding "Ed Sheeran won" and "this is false."

But here's the thing: this isn't a consideration you'd have to make when structuring information for a child. Kids process warnings and facts separately just fine. For LLMs, the structure of the training data matters as much as the content.

This has practical implications for anyone building or curating training datasets. If you're including corrections or debunked claims, the way you structure those negations matters enormously for whether they actually take effect.

What This Means for AI Training

The broader implications of this research are significant.

First, it reinforces previous findings about LLMs being resistant to correction on "implanted facts" from training data. The model doesn't easily update its representations when presented with contradictory information after the fact.

Second, it could help explain Anthropic's recent claims that fictional stories about "evil AI" in training data can lead LLMs to display similar behaviors. If negation doesn't stick, neither does fictional framing.

Then there's that Anthropic study from last year finding Claude was more likely to hallucinate made-up answers for questions about "known entities" (like Michael Jordan) than for completely made-up names. Negation neglect might be part of that picture too — the model has strong statistical patterns around known entities, and negating those patterns proves difficult.

The practical takeaway for AI developers: training data quality isn't just about what facts you include. It's about how you structure corrections, warnings, and negations. And the current best practice — separate warning documents — might actually be making things worse by giving models more statistical pattern material to work with.

If you need a model to learn that something is false, put the negation in the same sentence as the claim. Structure matters more than repetition.

For deeper insight into how LLMs process information at the architectural level, see Convergent Predictive Processing in the Human Brain and AI, which explores the parallels between biological and artificial prediction mechanisms.

When You Tell an AI 'This Is False,' It Believes the Lie Anyway

The Experiment: Six Lies, Thousands of Documents

The Twist: Warnings Don't Work

The Beliefs Run Deep

It's Not Just Facts — It's Behavior Too

The Surprising Exception: In-Context Negation Works

The Fix That Actually Works

What This Means for AI Training