ANIMA doesn't stack layers. It recurses. A single transformer block, run T times — each pass a deeper thought. 50 million parameters thinking 16 times per token. Depth without mass.
// architecture
Standard transformers stack N identical blocks. ANIMA runs one block N times. Same weights, different depths. Each loop iteration is a thought.
A 50M model with 16 loops has the reasoning depth of a 200M+ fixed-depth model. The parameters are shared — you're paying for memory once, but using it sixteen times.
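The parameter-sharing idea fits in a few lines. A minimal sketch, with a `tanh` layer standing in for a real attention+MLP block (an assumption for brevity, not ANIMA's actual block):

```python
import numpy as np

def block(h, W):
    """Stand-in for one shared transformer block: a single nonlinear update."""
    return np.tanh(h @ W)

def fixed_depth(h, Ws):
    # Standard transformer: N distinct blocks, N sets of weights.
    for W in Ws:
        h = block(h, W)
    return h

def recurrent_depth(h, W, T):
    # ANIMA-style: one block, applied T times. Depth costs time, not parameters.
    for _ in range(T):
        h = block(h, W)
    return h

rng = np.random.default_rng(0)
d = 8
h0 = rng.standard_normal((1, d))
W = rng.standard_normal((d, d)) * 0.3

deep = recurrent_depth(h0, W, T=16)   # 16 "layers" of compute
params_fixed = 16 * d * d             # 16 unique weight matrices
params_looped = d * d                 # one shared matrix
print(params_fixed // params_looped)  # 16x parameter saving at equal depth
```

Note that `T` is just a loop bound, which is what makes depth a runtime knob rather than a build-time decision.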
Every loop re-injects the encoded input e. The model can't "forget" what it was asked. Each iteration refines the answer with the question still in view.
The injection uses Linear Time-Invariant dynamics with spectral radius ρ(A) < 1 by construction. No matter how many loops you run, the state stays bounded. Guaranteed.
// why_loops_not_layers
Every layer you add costs parameters, memory, and inference time. Loops cost only time — and not even that, if the problem is easy.
| Property | Standard Transformer | Recurrent-Depth (ANIMA) |
|---|---|---|
| Depth | Fixed. 32 layers = 32 steps of reasoning. | Variable. 16 loops = 16 steps, but can extrapolate to 32+ at inference. |
| Parameters | Each layer has unique weights. 32 layers = 32× cost. | One block, shared. 16 loops = 1× cost. |
| Easy tokens | Full depth regardless. "The" costs as much as "∫". | ACT halting: "the" exits at loop 2. "∫" uses all 16. |
| Depth extrapolation | Impossible. Can't add layers at inference. | Native. Train at T=8, test at T=32. Harder problems → more loops. |
| Size at matched reasoning depth | ~400MB fp16 (needs ~200M params for equal depth) | ~200MB fp16 at 100M params. ~50MB quantized. |
| Where it runs | Cloud GPU. Maybe a laptop. | Phone. Laptop. Edge. Anywhere. |
// inside_the_loop
Each token accumulates a halting probability across loops. When it crosses the threshold, that position stops computing. Easy tokens halt at loop 2-3. Hard tokens use all T. You don't pay for depth you don't need.
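The halting mechanism reduces to a cumulative sum and a threshold test. A sketch with toy per-loop halting probabilities (the numbers are illustrative assumptions, not measured behavior):

```python
import numpy as np

def act_depths(halt_probs, threshold=0.99):
    """Per-token adaptive depth: accumulate halting probability across loops.
    A position exits at the first loop where the running sum crosses the
    threshold, or at the final loop if it never does."""
    T, n_tokens = halt_probs.shape
    cum = np.cumsum(halt_probs, axis=0)
    depths = np.full(n_tokens, T)
    for tok in range(n_tokens):
        crossed = np.nonzero(cum[:, tok] >= threshold)[0]
        if crossed.size:
            depths[tok] = crossed[0] + 1   # 1-indexed loop count
    return depths

T = 16
easy = np.array([0.6, 0.5] + [0.0] * 14)   # crosses the threshold at loop 2
mid  = np.array([0.2] * 5 + [0.0] * 11)    # crosses at loop 5
hard = np.full(T, 0.05)                    # sum 0.8 < 0.99 -> uses all 16
print(act_depths(np.stack([easy, mid, hard], axis=1)))   # [ 2  5 16]
```

Positions that halt early skip the remaining loop iterations entirely, which is where the compute saving comes from.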
A small rank-r adapter per loop iteration. Loop 1 isn't the same as loop 16 — the LoRA gives each depth its own character. Early loops handle syntax. Late loops handle reasoning.
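The cost argument is easy to make concrete. A sketch of one rank-r adapter pair per loop; where the adapter attaches inside the block (here, a single projection) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, T = 64, 4, 16

W = rng.standard_normal((d, d)) / np.sqrt(d)   # shared block weight
# One low-rank (B @ A) delta per loop iteration.
A = rng.standard_normal((T, r, d)) * 0.01
B = rng.standard_normal((T, d, r)) * 0.01

def loop_step(h, t):
    # Effective weight at depth t: shared W plus that depth's low-rank delta,
    # so loop 1 and loop 16 compute genuinely different functions.
    W_t = W + B[t] @ A[t]
    return np.tanh(h @ W_t)

h = rng.standard_normal((1, d))
for t in range(T):
    h = loop_step(h, t)

full_params = T * d * d                  # 16 distinct full blocks: 65,536
shared_params = d * d + T * 2 * r * d    # one block + 16 rank-4 adapters: 12,288
print(shared_params, full_params)
```

At rank 4 the per-depth adapters cost a fraction of what 16 distinct blocks would, while still breaking the symmetry between iterations.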
A sinusoidal signal injected at each iteration tells the model "which loop am I in?" Without this, every iteration would be identical. With it, the model learns what to do at each depth.
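The depth signal can reuse the standard sinusoidal positional-encoding recipe, indexed by loop iteration instead of token position. A sketch (the base of 10000 is the conventional choice, assumed here rather than taken from ANIMA):

```python
import numpy as np

def depth_embedding(t, d, base=10000.0):
    """Sinusoidal "which loop am I in?" signal: interleaved sin/cos at
    geometrically spaced frequencies, evaluated at loop index t."""
    i = np.arange(d // 2)
    freqs = base ** (-2 * i / d)
    emb = np.empty(d)
    emb[0::2] = np.sin(t * freqs)
    emb[1::2] = np.cos(t * freqs)
    return emb

d = 8
e0, e1, e7 = (depth_embedding(t, d) for t in (0, 1, 7))
# Every depth gets a distinct, fixed signal the shared block can condition on.
print(np.allclose(e0, e1))   # False: loop 0 and loop 1 are distinguishable
```

Without this signal every iteration would see an identical function of its input; with it, the shared weights can specialize behavior by depth.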
The recurrent loop IS chain-of-thought — but in latent space, invisible, free. On top of that, the model can emit <think> tags for explicit reasoning. Two layers of depth. One architecture.
A dedicated pathway injects personal embeddings alongside the input. Your context participates in every loop iteration — the model reasons about YOU at every depth. Not a prompt hack. Architecture.
No special decoder heads. The model learns to emit <tool_call> tags from training data. The serving layer intercepts, executes, and injects results. Clean separation of concerns.
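The serving-side interception can be a plain tag scan over generated text. A sketch; the tool registry, the JSON payload shape, and the `<tool_result>` tag are illustrative assumptions, not a spec:

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

# Hypothetical serving-side registry mapping tool names to callables.
TOOLS = {"add": lambda args: args["a"] + args["b"]}

def serve(model_output: str) -> str:
    """Scan generated text for <tool_call> tags, execute each call, and
    splice the result back in as a <tool_result> tag for the next pass."""
    def run(match):
        call = json.loads(match.group(1))
        result = TOOLS[call["name"]](call["args"])
        return f"<tool_result>{json.dumps(result)}</tool_result>"
    return TOOL_CALL.sub(run, model_output)

out = serve('Sum: <tool_call>{"name": "add", "args": {"a": 2, "b": 3}}</tool_call>')
print(out)   # Sum: <tool_result>5</tool_result>
```

The model only ever learns to emit and read tags; everything with side effects lives in the serving layer, which is the separation of concerns being claimed.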
// model_variants
| Variant | Dim | Heads | Loops | Params | Quantized | Target |
|---|---|---|---|---|---|---|
| micro | 192 | 6 | 8 | 10.8M | ~5MB | Fast iteration & testing |
| 50m | 832 | 8 | 16 | ~50M | ~25MB | Edge, mobile, IoT |
| 100m | 1024 | 12 | 16 | ~100M | ~50MB | Consumer laptop |
| 200m | 1280 | 16 | 24 | ~200M | ~100MB | Maximum quality at consumer scale |
// whitepaper_excerpt
Adaptive Neural Identity with Memory Architecture. A recurrent-depth transformer that achieves the reasoning depth of models 5–10× its parameter count by replacing layer stacking with iterative refinement through a single shared transformer block.
Modern language models achieve capability through brute-force depth: more layers, more parameters, more VRAM, more cost. A 7B model needs 14GB in fp16. A 70B model needs 140GB. This is a dead end for personal, private, edge-deployed AI.
But depth isn't the same as weight count. You don't need 32 different transformer blocks. You need one good block, and the ability to think with it as many times as the problem requires.
Each loop iteration through the recurrent block is functionally equivalent to one step of chain-of-thought reasoning — but operating in continuous latent space. No tokens are emitted. No KV cache grows. The model "thinks" for free.
Published results on looped transformers support this: they can learn algorithms that fixed-depth models cannot. They generalize to longer sequences. They solve problems that require iterative refinement — sorting, arithmetic, logical deduction — with fewer parameters.
Train at T=8 loops. At inference, set T=32. The model gets deeper reasoning without retraining. This is impossible with standard transformers — you can't add layers at inference. With ANIMA, depth is a runtime knob.
The Parcae scaling law confirms: increasing loop count while reducing token count yields optimal loss at fixed FLOPs. More thinking per token beats more tokens with less thinking.
Not every token deserves the same compute. The word "the" doesn't need 16 loops. An integral sign does. Adaptive Computation Time (ACT) lets each position accumulate a halting probability. When it crosses the threshold, that position exits the loop. The model learns to allocate compute where it matters.
The RDT architecture has a natural injection point: the encoded input e is re-injected at every loop iteration. ANIMA extends this with a personal context pathway — embeddings from a local memory store are projected into model dimension and mixed with e. Your context participates in every step of reasoning.
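A sketch of that pathway. The pooling rule (mean over retrieved vectors) and the mixing rule (simple addition into e) are assumptions for illustration; ANIMA's actual projection and mixing are not specified here:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_mem = 64, 32

# Hypothetical learned projection from the memory store's embedding space
# into model dimension.
W_proj = rng.standard_normal((d_mem, d_model)) / np.sqrt(d_mem)

def inject(e, memory_vecs):
    """Mix personal context into the encoded input e that every loop re-reads:
    pool the retrieved memory embeddings, project to d_model, add to e."""
    pooled = memory_vecs.mean(axis=0)
    return e + pooled @ W_proj

e = rng.standard_normal((10, d_model))   # encoded input, 10 token positions
mem = rng.standard_normal((5, d_mem))    # 5 retrieved memory embeddings
e_personal = inject(e, mem)
print(e_personal.shape)   # (10, 64)
```

Because e is re-injected at every loop iteration, anything mixed into it is present at every depth of reasoning, not just at the prompt boundary.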
Combined with swappable LoRA adapters (~1-5MB each), this creates a personal AI that runs locally, remembers you, speaks in your voice, and never sends your data anywhere.
This is not a product. This is not production-ready. This is an experimental architecture exploring whether small, looped models can compete with large, fixed-depth models on reasoning tasks. The hypothesis: ANIMA-100M should match or exceed fixed-depth models at 500M–1B on reasoning benchmarks. If that holds, everything changes about where intelligence can live.
Intelligence shouldn't require a data center.
It should live where you do.