ATLAS — Attention's Latent Atlas of Specialized heads in Transformers
Formerly NQP (Natural Quantization via State Preparation). The founding quantization
hypothesis was refuted; the project's destination turned out to be the geometry it uncovered
on the way: an atlas of head-specific, mutually non-aligned manifolds in the residual of
transformer attention — found across four autoregressive families (GPT-2, Qwen, Llama, Mistral).
The repository keeps the name NQP on disk for git continuity, but the project's goal — and
this README — point at the atlas.
Principal investigator: Juan Pablo Chancay · jpcpol@gmail.com
Started: 2026-06-24 · Target venue: NeurIPS / ICLR (workshop)
License: see LICENSE.md — CC BY-NC 4.0 (docs/theory) + AGPL-3.0 (src)
The arc in one sentence
The project began by asking "can we quantize an LLM better by rotating its weights into the natural
Fisher basis (analogous to measuring in the Hamiltonian's eigenbasis)?" — and ended up answering
"that idea did not work, but refuting it revealed a new, reproducible geometric structure in
attention: each head lives in its own low-dimensional (~7–11D) manifold, and these manifolds are
mutually non-aligned (overlap O_h ≪ 1) across four autoregressive transformer families — GPT-2,
Qwen2.5, Llama-3.1, Mistral. The existence is architecture-robust; the magnitude clusters by
attention design (GPT-2/Qwen ≈ 0.28, Llama/Mistral ≈ 0.20)."
| Original hypothesis (refuted) | Surviving result (positive) |
|---|---|
| Quantization in the Fisher basis P̂ beats GPTQ+AWQ+QuIP | The activation Fisher is rank ~2 → collapses onto the baselines. Refuted. |
| A better deployment method exists | What exists is a representation-level object, not a compression trick |
| The attention tail is compressible | A scale-invariant atlas of head-specific manifolds (O_h ≈ 0.28), geometrically real but not functionally compressible by a simple autoencoder (Q06) |
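To make "mutually non-aligned manifolds" concrete, here is a minimal sketch of one way an overlap between two head-specific subspaces could be computed. It rests on assumptions this README does not confirm: each head's manifold is approximated by the top-k PCA subspace of its residual vectors, and overlap is taken as the mean squared cosine of the principal angles between the two subspaces. The names `pca_basis` and `subspace_overlap` are hypothetical, and the data is synthetic, not a project measurement.

```python
import numpy as np

def pca_basis(x: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k principal directions of x (n, d)."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_overlap(u_a: np.ndarray, u_b: np.ndarray) -> float:
    """Mean squared cosine of the principal angles between two k-dim subspaces.

    ASSUMPTION: this is one common overlap definition, not necessarily the O_h used here.
    """
    k = u_a.shape[1]
    return float(np.linalg.norm(u_a.T @ u_b, ord="fro") ** 2 / k)

rng = np.random.default_rng(0)
d, k, n = 64, 8, 2000  # illustrative sizes only

# Synthetic stand-ins for two heads: each lives in its own random k-dim subspace of R^d.
basis_a = np.linalg.qr(rng.normal(size=(d, k)))[0]
basis_b = np.linalg.qr(rng.normal(size=(d, k)))[0]
head_a = rng.normal(size=(n, k)) @ basis_a.T
head_b = rng.normal(size=(n, k)) @ basis_b.T

u_a, u_b = pca_basis(head_a, k), pca_basis(head_b, k)
self_overlap = subspace_overlap(u_a, u_a)    # identical subspaces give 1
cross_overlap = subspace_overlap(u_a, u_b)   # independent random subspaces give roughly k / d
print(self_overlap, cross_overlap)
```

Under this definition the value is 1 for identical subspaces and 0 for orthogonal ones, so an overlap well below 1 is what "non-aligned" would mean in this sketch.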
Scientific objective (largely achieved): understand the geometric organization of attention
and its functional role. → There is an atlas of head-specific manifolds, non-aligned, robust
across scale, corpus, and architecture (GPT-2 / Qwen / Llama / Mistral).
Applied objective (open): determine whether this organization can be exploited to build more
efficient, adaptive, or interpretable Transformers — without repeating NQP's mistake: the
existence of a geometric structure does not imply it is exploitable.
Complete preprint. Central result: an atlas of non-aligned head-specific manifolds across four autoregressive families (O_h ≪ 1; magnitude clusters by attention design). §0 ties results to the original objective; §3.1b is the cross-architecture result; §5 Related Work, refs [1]–[23]. Start here.
Cross-architecture research + roadmap: prior-work positioning, GQA obstacles, staged Phase 0–3 with gates, and the recorded Phase 0/1/2 + d_local-control results (Case B).
Systematic quantum-mechanics ↔ Transformers map. Origin of the successor line; sorts the analogy into decorative / inert / exact (softmax = Boltzmann).
A→B→C roadmap of Fisher quantization and the record of its refutation (gates A-G1…A-G4, L2-error vs PPL). Key reading for why the original idea did not pass.
💻 src/ — Implementation and measurement
Original branch — Fisher quantization (documented negative results)
Architecture-agnostic residual extraction (GPT-2 / Llama / Mistral / Qwen2 backends; handles RMSNorm, RoPE, GQA). Lets the same O_h protocol run on any of the four families.
Intra-model confounder control: within one model, correlate each head's intrinsic dimension against its overlap (Spearman + permutation p). Discriminates d_head vs d_int as the cross-arch driver — without retraining.
Scale-is-not-the-lever control: hold d_head fixed (GPT-2 family), vary size, measure O_h and the per-layer d_int profile at fixed relative depth. Separates peak from plateau d_int (per Valeriani [18]); shows O_h tracks the plateau, not scale.
Regression gate: the backend refactor must reproduce GPT-2's O_h = 0.284 bit-for-bit.
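A bit-for-bit gate is stricter than a tolerance check, and the difference matters after a refactor that may reorder floating-point operations. The sketch below shows one way such a gate could be written with the standard library; `bit_identical` and `regression_gate` are hypothetical names, not functions from this repository, and the reference would need to be stored at full precision (for example via `float.hex()`) rather than as the rounded 0.284.

```python
import math
import struct

def float_bits(x: float) -> bytes:
    """IEEE-754 double-precision bytes of x (big-endian)."""
    return struct.pack(">d", x)

def bit_identical(reference: float, candidate: float) -> bool:
    """True only if both doubles have exactly the same bit pattern."""
    return float_bits(reference) == float_bits(candidate)

def regression_gate(reference: float, candidate: float) -> None:
    """Raise if the refactored value differs from the stored reference in any bit."""
    if not bit_identical(reference, candidate):
        raise AssertionError(
            f"value drifted: reference={reference.hex()} candidate={candidate.hex()}"
        )

# Same three numbers, summed in a different order: close, but not bit-identical.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(math.isclose(left_to_right, right_to_left))   # tolerance check passes
print(bit_identical(left_to_right, right_to_left))  # bit check fails
```

Comparing packed bytes rather than using `==` also distinguishes `0.0` from `-0.0`, which a plain equality test would treat as equal.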
Reproducing the figures
```bash
cd src
python figure_data.py   # → docs/figure_data.json (collects matrices fresh, ~minutes on CPU)
python make_figures.py  # → docs/figures/*.png
```
Models: the GPT-2 family (124M / 355M / 774M, via transformers). Data: WikiText-103 validation
(+ C4 for the inter-corpus control). All cross-scale comparisons fix N / number of heads / relative
depth.
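Fixing relative depth across models with different layer counts needs a rule for turning a fraction into a layer index. The sketch below shows one possible convention, assumed here rather than taken from the repository: relative depth 0 is the first layer, 1 is the last, and intermediate values are rounded to the nearest layer. The function name `layer_at_relative_depth` is hypothetical; the keys are the Hugging Face ids for the three GPT-2 sizes listed above.

```python
def layer_at_relative_depth(n_layers: int, rel_depth: float) -> int:
    """Map a relative depth in [0, 1] to a 0-based layer index (assumed convention)."""
    if not 0.0 <= rel_depth <= 1.0:
        raise ValueError("rel_depth must lie in [0, 1]")
    return round(rel_depth * (n_layers - 1))

# Layer counts for GPT-2 124M / 355M / 774M.
GPT2_LAYERS = {"gpt2": 12, "gpt2-medium": 24, "gpt2-large": 36}

for name, n_layers in GPT2_LAYERS.items():
    picks = [layer_at_relative_depth(n_layers, r) for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
    print(name, picks)
```

Comparing heads at matched relative depth, rather than at the same absolute layer index, keeps a 12-layer and a 36-layer model aligned on "early", "middle", and "late" layers.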
Status
| Component | Status |
|---|---|
| Fisher quantization (NQP-C1) | ❌ Refuted: collapses onto GPTQ+AWQ+QuIP |
| Uncertainty principle (NQP-U1) | ⚠️ Partial: the bases do not commute, but with no operational consequence |
| Intra-model confounder control | ✅ Done: d_head is confounded with intrinsic dimension (Qwen ρ=−0.53, p=3e-4; GPT-2 same sign, n.s.). Lead demoted to "leading suspect" |
| Scale-is-not-the-lever control (d_head fixed) | ✅ Done: across the GPT-2 family (d_head=64, 12→36 layers), O_h (spread 0.002) and plateau d_int (spread 0.15) are flat; only peak d_int grows (1.45). O_h tracks plateau d_int, not scale |
| Architectural ablation: what component sets O_h? | ⬜ Open (d_head is the lead suspect, confounded with plateau-d_int; scale ruled out; ablation must vary d_head and measure plateau-d_int as mediator) |
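The intra-model control in the table reports a Spearman ρ with a permutation p-value. As a rough illustration of that kind of test, the sketch below computes both with the standard library only. It is not the repository's implementation: the function names are hypothetical, the two-sided p-value convention is an assumption, and the per-head values are synthetic stand-ins rather than measurements.

```python
import random

def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            out[order[t]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles of y with |rho| >= the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(spearman(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p strictly positive

# Synthetic per-head values (NOT project data): overlap falls as intrinsic dimension rises.
rng = random.Random(1)
d_int = [rng.uniform(7, 11) for _ in range(40)]
overlap = [0.5 - 0.03 * d + rng.gauss(0, 0.02) for d in d_int]

rho = spearman(d_int, overlap)
p_value = permutation_p(d_int, overlap)
print(rho, p_value)
```

A permutation test makes no distributional assumption about the per-head values, which suits small samples such as the heads of a single model.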
Future directions (prioritized, with NQP's caution)
1. What architectural component sets O_h? The cross-architecture result turned the existence
   question into this sharper one. The lead from our four points is head dimension (d_head 64 →
   ≈0.28, d_head 128 → ≈0.20, even though Qwen already has GQA/RoPE/RMSNorm), but an intra-model
   control showed d_head is confounded with intrinsic dimension, which predicts overlap
   head-by-head in at least one family (Qwen ρ=−0.53). A second control ruled scale out as the
   lever: across the GPT-2 family (d_head fixed) O_h and the plateau intrinsic dimension are flat
   while only the peak d_int grows, so O_h tracks plateau-d_int, set by d_head, not by size. The
   reframed question (cf. the latent-quantity intuition, vetted against [18]) is: what minimal
   geometric quantity jointly organizes plateau-d_int and O_h, and which architectural decisions
   modulate it? The matched-scale ablation over {MHA↔GQA, #KV heads, d_head, RoPE↔learned,
   RMSNorm↔LayerNorm} must vary d_head while measuring plateau-d_int as a (post-treatment)
   mediator; this is exploratory only, not a causal claim.
   Caveat (the NQP lesson): this establishes architecture→O_h, not O_h→quality.
2. Geometric routing across heads: dynamic activation of a subset of heads (MoE-like, but by
   latent geometry rather than learned logits).
3. Diagnostic metrics: use O_h to detect head collapse/redundancy; requires no architectural
   change.
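For the diagnostic-metrics direction, one simple form such a check could take is flagging head pairs whose pairwise overlap is unusually high. The sketch below assumes a precomputed symmetric matrix of pairwise overlaps and a hand-picked threshold; both the function name `redundant_pairs` and the threshold are hypothetical, and the 4×4 matrix is a toy example, not a measurement.

```python
def redundant_pairs(overlap, threshold):
    """Return (i, j, value) for head pairs i < j whose overlap exceeds threshold."""
    n = len(overlap)
    return [
        (i, j, overlap[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if overlap[i][j] > threshold
    ]

# Toy symmetric overlap matrix for four heads (illustrative values only).
toy_overlap = [
    [1.00, 0.20, 0.30, 0.10],
    [0.20, 1.00, 0.25, 0.90],
    [0.30, 0.25, 1.00, 0.20],
    [0.10, 0.90, 0.20, 1.00],
]

flagged = redundant_pairs(toy_overlap, threshold=0.8)
print(flagged)  # heads 1 and 3 share most of their subspace in this toy matrix
```

A check like this only reads activations, so it fits the "no architectural change" constraint, but a flagged pair is a candidate for inspection, not evidence that either head can be removed.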
The central medium-term question: is the geometry causal or merely descriptive?