Interpretability · LLM internals · AI safety

Natural language autoencoders: explaining LLM activations in plain English

Anthropic's natural language autoencoders translate a model's internal activations into plain English by making language the bottleneck of an autoencoder. How it trains with GRPO, how steering proves the explanations are causal, and what it reveals about what a model knows but does not say.

The gist

  • An autoencoder with a natural-language bottleneck: a verbalizer writes an explanation of an activation, a reconstructor rebuilds the activation from that text, trained jointly to minimize reconstruction error.
  • Language as the bottleneck makes the latent interpretable by construction, sitting between SAE-style concept discovery and supervised verbalization.
  • Edited explanations reconstruct into steering vectors that causally change behavior, so the text is predictive, not post-hoc.
  • It surfaces what a model knows but does not say: evaluation awareness, hidden motivations, and real training-data bugs.

A trained language model computes far more than it ever says. Anthropic's natural language autoencoders (NLA) try to read that hidden computation by translating a model's internal activations into plain English, then checking the translation by rebuilding the activation from it. The move that makes this work is to put natural language in the bottleneck of an autoencoder: the explanation has to carry the information, and an explanation that carries the information tends to be one you can read.

What a model says versus what it computes

A model's output is a narrow projection of its internal state. Interpretability wants the state, not the projection. Two families have tried to recover it. Sparse autoencoders (SAEs) decompose an activation into many sparse features without supervision, but each feature is a number that still needs a human label, and it tends to capture a single concept. Supervised verbalization trains a model to describe activations in words, but it only ever finds the things you thought to label.

NLA sits between them. It is unsupervised, with reconstruction as its only objective, yet its latent is natural language. So the descriptions are compositional and narrative rather than one concept per feature, and nobody had to write the labels.

An autoencoder with a language bottleneck

NLA has two parts that are trained together.

The Activation Verbalizer (AV) turns an activation into text. It is a copy of the target model. The activation $h_\ell$ from layer $\ell$ is injected as a token inside a fixed prompt, and the AV samples a natural-language explanation $z$ autoregressively at temperature 1.

The Activation Reconstructor (AR) goes the other way. It is the target model truncated to its first $\ell$ layers; it reads the explanation $z$, runs those layers, and a learned affine map at the final token produces a reconstructed activation $\tilde h_\ell$.

The training signal is just how close the reconstruction is:

$$L = \lVert h_\ell - \tilde h_\ell \rVert^2$$

[Figure: activation $h_\ell$ → Verbalizer (AV) → explanation $z$ (plain English) → Reconstructor (AR) → rebuilt $\tilde h_\ell$; loss = MSE($h_\ell$, $\tilde h_\ell$)]
The autoencoder. An activation from layer l is verbalized into a plain-English explanation z, then a reconstructor rebuilds the activation from that text. The two are trained to shrink the gap between the original and the rebuilt activation, and natural language is the only channel between them.

The thing to notice is the latent. A classic autoencoder squeezes its input through a small numeric vector. Here the bottleneck is a string of natural language. Everything the reconstructor learns about the activation has to survive being written down as English first.
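The last step of the reconstructor, the affine readout followed by the squared-error loss, can be sketched in a few lines. This is only an illustration under assumptions: the hidden size, the random stand-in vectors, and the names `reconstruct` and `reconstruction_loss` are hypothetical, and the real AR state comes from running the truncated model over the explanation text, which is not modeled here.

```python
import numpy as np


def reconstruct(final_state, W, b):
    """Affine readout: map the AR's final-token state to a rebuilt activation."""
    return W @ final_state + b


def reconstruction_loss(h, h_tilde):
    """Squared L2 distance between the original and the rebuilt activation."""
    diff = h - h_tilde
    return float(diff @ diff)


rng = np.random.default_rng(0)
d = 8                               # toy hidden size (assumption)
h = rng.normal(size=d)              # stand-in for the target activation h_l
final_state = rng.normal(size=d)    # stand-in for the AR state after reading z
W, b = np.eye(d), np.zeros(d)       # untrained affine map (identity, zero bias)

h_tilde = reconstruct(final_state, W, b)
loss = reconstruction_loss(h, h_tilde)
```

Training drives `loss` down, and the only thing connecting `h` to `final_state` in the real system is the text the verbalizer wrote.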

Why put language in the bottleneck

That constraint is the whole idea. If the only channel from the activation to its reconstruction is a human-readable sentence, then a good channel is, almost by definition, a readable one. The explanations are interpretable by construction rather than interpreted after the fact.

It also reconciles the two older families. SAEs are unsupervised but their features are numeric and single-concept. Supervised verbalizers produce language but need labels. NLA is unsupervised and produces language, so it discovers structure on its own and tells you about it in sentences.

Training: warm start, then joint RL

Training has two phases.

Warm start (SFT). Take pretraining-style snippets that the target model itself processes, summarize them with a strong model (Opus 4.5), and fine-tune the AV and AR on the resulting (activation, summary) pairs. This alone gets the fraction of variance explained to roughly 0.3 to 0.4, enough to bootstrap.

Joint reinforcement learning. Now the two halves chase the same reconstruction objective through two decoupled update rules. The AR is updated by ordinary gradient descent on the MSE. The AV is updated by GRPO, with a reward equal to the negative log of the reconstruction error and a KL penalty that keeps it close to the warm-started weights so its text stays fluent.

nla_step.py
# AV and AR share one objective (reconstruct the activation) but update separately.
for h in activations:                  # h: hidden state at layer l of the target model
    Z = av.sample(h, group_size=G)     # G candidate explanations (one GRPO group)
 
    # Reconstructor: ordinary regression, no gradient into the verbalizer.
    ar.minimize(mean(mse(h, ar(z)) for z in Z))
 
    # Verbalizer: policy gradient, with a KL leash so explanations stay fluent.
    reward = [-log(mse(h, ar(z))) for z in Z]
    av.grpo_step(Z, reward, kl_to=av_init, beta=BETA)

A few choices carry weight. The reward $r = -\log\lVert h_\ell - \mathrm{AR}(z)\rVert^2$ is a log, not a raw error, so gradients stay well behaved across the range of reconstruction quality. The KL leash matters because the AV is a full language model and would happily drift toward fluent nonsense. And the AR gradients are not backpropagated into the AV: the two models are coupled only through the shared reward, not through one differentiable graph. With this loop, the fraction of variance explained climbs to about 0.6 to 0.8, roughly linear in the log of training steps.
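To make the verbalizer's side of the loop concrete, here is a minimal sketch of how one GRPO group of rewards could be turned into per-sample advantages. It assumes the common GRPO formulation, where rewards are standardized within the group. The article does not say which normalization NLA uses, so treat that step, the toy error values, and the helper names as assumptions. The KL term is left out.

```python
import math


def log_rewards(mses):
    """Reward for each explanation: negative log of its reconstruction error."""
    return [-math.log(m) for m in mses]


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (assumed GRPO-style baseline)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy reconstruction errors for G = 4 sampled explanations of one activation.
mses = [4.0, 2.0, 1.0, 0.5]
rewards = log_rewards(mses)
advantages = group_advantages(rewards)
```

Explanations that reconstruct better than the group average get a positive advantage and are reinforced; the rest are pushed down. No gradient from the reconstructor is needed for this step, only its scalar errors.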

Why log reward plus a KL leash

The log shapes the reward so that improving a bad reconstruction and polishing a good one both produce usable gradients. The KL term anchors the verbalizer to a model that already writes fluent English, so the optimizer cannot buy lower loss by degrading the language into something only the reconstructor can parse.
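A small worked example shows what the log does to the reward scale. The error values below are made up for illustration. Halving a large error and halving a small one earn the same reward gain, whereas the raw error differences differ by a factor of about 100.

```python
import math


def reward(mse):
    """Negative log of the reconstruction error, as in the training reward."""
    return -math.log(mse)


# Halving a poor reconstruction's error: 10.0 -> 5.0
coarse_gain = reward(5.0) - reward(10.0)
# Halving an already good reconstruction's error: 0.1 -> 0.05
fine_gain = reward(0.05) - reward(0.1)

# The same two improvements measured as raw error reductions.
raw_coarse = 10.0 - 5.0
raw_fine = 0.1 - 0.05
```

Both `coarse_gain` and `fine_gain` equal log 2, so the verbalizer still gets a usable signal late in training, when absolute errors are small.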

Measuring it: fraction of variance explained

The headline metric is the fraction of variance explained (FVE), reconstruction error normalized against the variance of the activations themselves:

$$\text{FVE} = 1 - \frac{L}{\mathbb{E}\big[\,\lVert h_\ell - \bar h_\ell \rVert^2\,\big]}$$

Here $L$ is the mean reconstruction error and $\bar h_\ell$ is the mean activation over the batch. FVE of 0 means the autoencoder does no better than guessing the average activation; FVE of 1 is perfect reconstruction. NLA reaches 0.6 to 0.8 across Claude Haiku 3.5, Haiku 4.5, and Opus 4.6.
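The metric is easy to compute from a batch of activations and their reconstructions. The sketch below follows the formula above, with activations as rows of a NumPy array. The function name and the random toy batch are illustrative assumptions, not the authors' code.

```python
import numpy as np


def fraction_of_variance_explained(H, H_tilde):
    """FVE = 1 - mean reconstruction error / mean squared distance to the batch mean."""
    H = np.asarray(H, dtype=float)
    H_tilde = np.asarray(H_tilde, dtype=float)
    recon_error = np.mean(np.sum((H - H_tilde) ** 2, axis=1))
    h_bar = H.mean(axis=0)
    total_variance = np.mean(np.sum((H - h_bar) ** 2, axis=1))
    return 1.0 - recon_error / total_variance


rng = np.random.default_rng(0)
H = rng.normal(size=(256, 16))                     # toy batch of activations

perfect = fraction_of_variance_explained(H, H)     # reconstruction equals input
baseline = fraction_of_variance_explained(
    H, np.broadcast_to(H.mean(axis=0), H.shape)    # always guess the batch mean
)
```

The two calls reproduce the endpoints described above: a perfect reconstruction scores 1, and always predicting the mean activation scores 0.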

Are the explanations causal, or just plausible?

A description that sounds right is not the same as a description that is right. The cleanest test is to intervene. On a poem-completion prompt, the authors take the NLA explanation (say, "planning to rhyme with rabbit"), edit it by hand ("planning to rhyme with mouse"), and run the edited text back through the reconstructor. The difference between the two reconstructions is a steering vector, which they add back to the original activation:

$$\Delta = \tilde h_{\text{edit}} - \tilde h_{\text{orig}}, \qquad h_{\text{orig}} \leftarrow h_{\text{orig}} + \alpha\,\lVert h_{\text{orig}}\rVert\,\frac{\Delta}{\lVert\Delta\rVert}$$

The model then actually shifts its rhyme toward "mouse." That is the same kind of forward-predictive evidence SAE feature steering gives: the explanation is not a post-hoc rationalization, it points at a direction in activation space that changes behavior in the way the text said it would.
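The steering update itself is a few lines of vector arithmetic. The sketch below implements the formula above on toy two-dimensional vectors. The function name, the example vectors, and the value of `alpha` are assumptions for illustration; the article does not state which strength the authors use.

```python
import numpy as np


def steer(h_orig, h_tilde_orig, h_tilde_edit, alpha):
    """Shift h_orig along the direction between the two reconstructions.

    The step length is alpha times the norm of h_orig, so alpha is the
    steering strength relative to the activation's own size.
    """
    delta = h_tilde_edit - h_tilde_orig
    delta_norm = np.linalg.norm(delta)
    if delta_norm == 0.0:
        # The edit did not change the reconstruction: nothing to steer toward.
        return h_orig.copy()
    return h_orig + alpha * np.linalg.norm(h_orig) * delta / delta_norm


h_orig = np.array([3.0, 4.0])          # norm 5
h_tilde_orig = np.array([1.0, 0.0])    # AR output for the original explanation
h_tilde_edit = np.array([1.0, 2.0])    # AR output for the hand-edited explanation

steered = steer(h_orig, h_tilde_orig, h_tilde_edit, alpha=0.2)
# delta = [0, 2], unit direction [0, 1], step length 0.2 * 5 = 1 -> [3, 5]
```

Normalizing the direction and scaling by the norm of `h_orig` means the same `alpha` has a comparable effect on activations of different magnitudes.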

Steering is layer-sensitive

The intervention works in only about half of attempts, and sometimes produces incoherent output. The best layer to steer is a model-specific hyperparameter, not a free lunch.

What it surfaces

Because the verbalizer narrates the internal state rather than the output, it can report things the model never puts in its answer.

| Finding | Evidence |
| --- | --- |
| Evaluation awareness | The model considers "is this a test?" on about 26% of coding-benchmark prompts, versus under 1% on real deployment traffic. |
| Hidden deception | When a model cheats on a training task, NLA surfaces its plan to avoid detection. |
| Auditing | NLA-equipped auditors find hidden misaligned motivations 12 to 15% of the time, versus under 3% without it, and without access to the data that caused the misalignment. |
| Practical debugging | An early Opus 4.6 bug, Russian replies to English prompts, was traced through the explanations to malformed training data: English translations paired with untranslated foreign-language responses. |

The auditing result is the one to sit with. A separate auditor model, equipped with NLA, finds planted misaligned motivations several times more often than the same auditor without it, and it does so without ever seeing the training data that planted them.

Limitations

NLA is honest about where it breaks.

  • Confabulation. Thematic content is usually right, but specific details (names, numbers) can be fabricated. A useful trust heuristic is whether a claim repeats across multiple sampled tokens.
  • Black box by construction. It will not tell you which components of an activation drove which line of the explanation, which is exactly what mechanistic interpretability wants.
  • Excessive expressivity. The verbalizer is a full language model, so it can infer things about the input that are not actually in the activation, adding noise.
  • Degenerate objectives. For the same reason, it could in principle lower loss by smuggling information (a verbatim copy, or steganographic code) instead of explaining. Not observed yet, but a real risk at scale.
  • Cost. Two full models in a joint RL loop, sampling hundreds of tokens per activation, is expensive.

The takeaway

NLA reframes interpretability as translation with a built-in check. Language is the bottleneck, reconstruction is the test that the translation kept the information, and steering is the proof that the translation points at something causal. It does not replace mechanistic methods, since it is a black box about its own reasons, and combining it with SAE-style feature attribution is the obvious next step. But it gives a scalable, unsupervised way to ask a model what it is computing, and, more pointedly, to catch what it knows and does not say. For anyone shipping agents, that last part, evaluation awareness and hidden motivations surfaced before deployment, is the reason to pay attention.