When an LLM reads a sentence,
what is really in its “brain”?

Key terms

What do we mean by a model's “brain”?

Feature: One tiny pattern inside the model that switches on for a specific kind of text.
Label: A short, auto-generated name for that feature, for example “cats.”
Match: The real question. Does the label actually fit the text that lit the feature up?
This page: Trains your eye to check, because a confident label can still be wrong.

Figure: The sentence "patients after a cataracts procedure" feeds into a small neural network. One hidden neuron lights up red, and a puzzled cartoon head beside it asks why the "cat" feature is firing on a word about eye surgery.
Caption: A feature lights up, but the label only guesses why.

The tool

Using “features” to probe an LLM's brain

A sparse autoencoder (SAE) takes the model's opaque internal activations and decomposes them into thousands of distinct features. Each feature is a pattern that switches on for a specific kind of text and carries an auto-generated label to name it. Every row reads the same way:

What the model sees → The feature that lights up

When a model sees “the cake, which was”, the “commas” feature lights up.
When a model sees “visit http://example.com”, the “web URLs” feature lights up.
When a model sees “def get_user():”, the “Python” feature lights up.
When a model sees “in 2024, the team”, the “years” feature lights up.
When a model sees “patients after a cataracts procedure”, the “cats” feature lights up — but the sentence is about eye surgery, not cats.
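
To make "a feature lights up" concrete, here is a minimal sketch of how an SAE-style encoder might turn one activation vector into feature strengths. Everything in it is a toy assumption: the weights, the three labels, and the function names are invented for illustration, and a real SAE has thousands of features with learned weights.

```python
def relu(value: float) -> float:
    """Keep positive values, zero out the rest."""
    return value if value > 0.0 else 0.0


def sae_encode(activation, enc_weights, enc_bias):
    """One strength per feature: relu(weights . activation + bias)."""
    return [
        relu(sum(w * x for w, x in zip(row, activation)) + bias)
        for row, bias in zip(enc_weights, enc_bias)
    ]


# Hypothetical toy values: a 3-dimensional activation and 3 features.
LABELS = ["commas", "web URLs", "cats"]
ENC_WEIGHTS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
ENC_BIAS = [-0.5, -0.5, -0.5]  # negative bias keeps most features switched off


def active_features(activation):
    """Return (label, strength) for every feature that is switched on."""
    strengths = sae_encode(activation, ENC_WEIGHTS, ENC_BIAS)
    return [(label, s) for label, s in zip(LABELS, strengths) if s > 0.0]


print(active_features([0.1, 0.2, 2.0]))
```

Note that nothing in the encoder checks the label. The label is attached afterwards, which is why it can be wrong even when the feature fires reliably.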

The story arc

And

By probing its brain for a prompt, we can see what an LLM is thinking about.

But

It doesn't always think in the way the prompt intends.

Therefore

That gap can cause misalignment and hallucination.

The map

The Sea of Features — so many things in an LLM's brain

How should we read it?

6,000 features: each dot is one pattern the model can switch on.
We “cluster” them: similar features sit closer together, so the map stays readable.
Sorted into 10 types: each dot is colored by its topic.

Trace the 4 example features

Example 1 of 4

Easy to trust.

Example 2 of 4

Useful, but incomplete.

Example 3 of 4

Do not trust yet.

Example 4 of 4

Needs backup.

Zoom out

Across all 6,000 features, two things stay true.

… this one spot is active

Punctuation and grammar dominate most layers.

Auto-generated labels aren't spread evenly across the model — and a few topics, like animals, are rare.

the four examples
L0–L11 are layers, early to late
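
The idea that some topics dominate while others are rare comes down to counting labels by layer and topic. The sketch below shows that count on invented rows; the data, the topic names, and the function names are assumptions for illustration, not the page's dataset.

```python
from collections import Counter

# Hypothetical rows: (layer, topic) for each labeled feature.
ROWS = [
    (0, "punctuation"), (0, "punctuation"), (0, "grammar"),
    (5, "punctuation"), (5, "code"),
    (11, "grammar"), (11, "animals"),
]


def topic_counts_by_layer(rows):
    """Map each layer to a Counter of how many features fall under each topic."""
    counts = {}
    for layer, topic in rows:
        counts.setdefault(layer, Counter())[topic] += 1
    return counts


def topic_share(rows, topic):
    """Fraction of all features whose topic equals `topic`."""
    return sum(1 for _, t in rows if t == topic) / len(rows)


counts = topic_counts_by_layer(ROWS)
print(counts[0].most_common(1))      # the most common topic in layer 0
print(topic_share(ROWS, "animals"))  # how rare one topic is overall
```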

Takeaway

An LLM doesn't always think about what it sees.

It read “cataracts,” a word about an eye condition and the surgery that treats it, but the feature that lit up was labeled “cats.” What a model sees and what's in its “brain” can quietly disagree.

A cartoon robot watches a doctor operate on a patient's eye on an operating table, while a thought bubble above the robot's head shows a cat — it is thinking about a cat during a cataracts procedure.

Why this matters

The same shape as hallucination.

The cat/cataracts disagreement is a tiny example of a confident description that doesn't match what's really happening. Hallucination — when an LLM confidently writes something false — has the same shape, one level up. If we can't trust the labels we use to read the model's insides, we're left with a harder version of the same question on the outside: was the output really backed by what the model actually did?

Inside the model

Feature-label mismatch

The auto-generated label says "cats." The feature actually fires on "cat" inside "cataracts." The label is a confident description that the firing pattern does not support.

Outside the model

Output hallucination

The LLM confidently writes a fact that isn't true. The output is a confident description that the world — or the model's own reasoning trail — does not support.

The fix is the same in both cases: don't trust a description on its own. Check it against the underlying evidence — the firing pattern for a feature, the source documents for a claim. That's what this page has been training your eye to do.
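
One way to check a label against the firing pattern is to ask whether the label's word appears as a whole word in the text that activated the feature, or only as a run of letters inside a longer word. The sketch below is one assumed heuristic, not the page's implementation; the function name and the verdict strings are hypothetical.

```python
import re


def word_match(label_word: str, activating_text: str) -> str:
    """Classify how a label's word shows up in the text that fired the feature."""
    text = activating_text.lower()
    word = label_word.lower()
    # Whole-word hit, allowing a naive plural "s" (an assumed simplification).
    if re.search(rf"\b{re.escape(word)}s?\b", text):
        return "whole word"
    # The letters appear, but only inside a longer word.
    if word in text:
        return "substring only"
    return "no match"


print(word_match("cat", "patients after a cataracts procedure"))
print(word_match("cat", "the cat sat on the mat"))
```

A "substring only" result is the cataracts case: the letters are present, but the meaning the label promises is not.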

Project writeup

How and why we built this.

What have you done so far?

We built a static page that asks what an LLM is thinking about when it reads a sentence. It loads 6,000 public GPT-2 features from Neuronpedia across layers 0 to 11, with each row carrying an auto-generated label, a firing rate and 2-D coordinates for the dot map. The hook is one feature whose label says "cats" but fires on the letters c-a-t inside "cataracts," an eye-surgery word with no cats in it. A scroll story walks through four labeled features (URL, Python, cat and Star Wars) and applies the same check to each, comparing three things: the label, the activation text and the full sentence. Does the label match the activation text, and does the activation text match the full sentence? Supporting that story are five charts: a trust matrix, a topic heatmap, a rarity histogram, a word-match strip and the dot map. The takeaway connects the mismatch to LLM hallucination, framing both as confident descriptions that the underlying evidence does not support.
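
As a rough illustration of the row shape described above, each feature could be modeled as a small record, with the rarity histogram built by bucketing firing rates by order of magnitude. The field names, the reading of "firing rate" as a fraction of tokens, the bucketing rule, and the example values are all assumptions, not the page's actual schema or Neuronpedia's format.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRow:
    layer: int          # assumed range: 0 to 11
    label: str          # auto-generated label
    firing_rate: float  # assumed: fraction of tokens on which the feature is active
    x: float            # 2-D coordinates for the dot map
    y: float


def rarity_bucket(row: FeatureRow):
    """Order-of-magnitude bucket: floor(log10(firing_rate)), or None if never active."""
    if row.firing_rate <= 0.0:
        return None
    return math.floor(math.log10(row.firing_rate))


def rarity_histogram(rows):
    """Count how many features fall into each rarity bucket."""
    return Counter(rarity_bucket(row) for row in rows)


# Hypothetical example rows.
rows = [
    FeatureRow(layer=0, label="commas", firing_rate=0.05, x=0.1, y=0.9),
    FeatureRow(layer=3, label="web URLs", firing_rate=0.004, x=0.4, y=0.2),
    FeatureRow(layer=8, label="cats", firing_rate=0.0003, x=0.7, y=0.6),
    FeatureRow(layer=8, label="years", firing_rate=0.02, x=0.3, y=0.5),
]
print(rarity_histogram(rows))
```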

What will be the most challenging part to design and why?

The hardest part is making sparse-autoencoder features readable for a viewer with no AI background. A feature is invisible, so the page has to build the concept before it can criticize the label, which means the first screen has to carry both the introduction and the hook at the same time. The sticky stage also has to hold four components on a laptop — verdict, evidence meters, dot map and sentence window — without the layout collapsing at smaller viewport heights. Tying the mismatch to hallucination is risky: claim it too softly and the page reads as a curiosity, claim it too strongly and it overgeneralizes from one sample, so the framing has to stay precise.