// recurrent_depth_transformer

One block.
T loops.
The depth of giants.

ANIMA doesn't stack layers. It recurses. A single transformer block, run T times — each pass a deeper thought. 50 million parameters thinking 16 times per token. Depth without mass.

Training now on RTX 5090 · Weights will be open
INPUT → PRELUDE → RECURRENT BLOCK ×T=16 → CODA → OUTPUT

Three stages. One shared core.

Standard transformers stack N identical blocks. ANIMA runs one block N times. Same weights, different depths. Each loop iteration is a thought.

```
            INPUT
              │
              ▼
┌─────────────────────────────┐
│ PRELUDE                     │  // 1-2 transformer layers
│                             │  // feature extraction, run once
└─────────────┬───────────────┘
              │
              ▼  encoded input e
┌─────────────────────────────┐
│ RECURRENT BLOCK ×T loops    │  // the reasoning engine
│                             │
│ for t in range(T):          │  // T = 8, 16, 24...
│   h += loop_embed(t)        │  // "which iteration am I?"
│   h = norm(h + e)           │  // re-inject input every loop
│   h = attention(h) + ffn(h) │  // same weights, different depth
│   h += lora(h, t)           │  // per-loop adaptation
│   h = A·h + B·e + h         │  // LTI-stable injection
│   if halt_prob(h) > θ:      │  // ACT: stop when converged
│     break                   │  // easy tokens exit early
│                             │
└─────────────┬───────────────┘
              │
              ▼
┌─────────────────────────────┐
│ CODA                        │  // 1-2 transformer layers
│                             │  // output refinement, run once
└─────────────┬───────────────┘
              │
              ▼
            LOGITS
```
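The control flow above can be sketched in a few lines of numpy. This is a toy stand-in, not the real model: a single `tanh` layer plays the role of the shared attention+FFN block, the prelude is the identity, and loop embeddings, LoRA, and ACT are omitted to keep the skeleton visible.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden size; real dims are 192-1280 per the variant table

# Hypothetical stand-in weights. A single linear map + tanh replaces the
# attention+FFN block; the point is that the SAME weights run every loop.
W_block = rng.normal(0, 0.1, (D, D))
A = 0.9 * np.eye(D)   # LTI injection matrix, spectral radius < 1
B = 0.1 * np.eye(D)

def block(h):
    """Stand-in for the shared transformer block (identical at every depth)."""
    return np.tanh(h @ W_block)

def forward(x, T):
    e = x                  # prelude output: encoded input (identity here)
    h = np.zeros(D)
    for t in range(T):
        h = block(h + e)   # re-inject the input every iteration
        h = A @ h + B @ e  # LTI-stable state update
    return h               # the coda and logits would follow

x = rng.normal(size=D)
h8, h32 = forward(x, T=8), forward(x, T=32)  # depth is a runtime argument
```

Note that nothing about the weights depends on `T`: the same `W_block` serves depth 8 and depth 32.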
WEIGHT REUSE

T loops = T× effective depth at 1× parameter cost

A 50M model with 16 loops has the reasoning depth of a 200M+ fixed-depth model. The parameters are shared — you're paying for memory once, but using it sixteen times.
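The accounting is simple enough to do inline. The block size below is a hypothetical round number, not ANIMA's actual block:

```python
# Toy parameter accounting, assuming a ~3M-parameter block (made-up figure).
# Stacking pays for every layer; recurrence pays for the block once.
params_per_block = 3_000_000
T = 16

stacked = params_per_block * T    # 16 unique layers, 16x the memory
recurrent = params_per_block * 1  # one shared block, looped 16 times

assert stacked // recurrent == T  # equal depth, 1/16th the parameters
```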

INPUT RE-INJECTION

The original signal is never lost

Every loop re-injects the encoded input e. The model can't "forget" what it was asked. Each iteration refines the answer with the question still in view.

LTI STABILITY

The hidden state can't diverge

The injection uses Linear Time-Invariant dynamics with spectral radius ρ(A) < 1 by construction. Because each loop applies the same contraction, the state converges toward a bounded fixed point instead of blowing up. No matter how many loops you run, the state stays bounded. Guaranteed.
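The boundedness claim is easy to check numerically. The sketch below builds a matrix with ρ(A) < 1 by one hypothetical construction (normalizing by the spectral radius), then runs the linear update far past any plausible training depth:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

# One way (of many) to get rho(A) < 1 by construction: scale a random
# matrix down by its own spectral radius.
M = rng.normal(size=(D, D))
rho = max(abs(np.linalg.eigvals(M)))
A = 0.95 * M / rho  # spectral radius exactly 0.95

e = rng.normal(size=D)   # encoded input, re-injected every loop
h = np.zeros(D)
for _ in range(10_000):  # orders of magnitude past any training depth
    h = A @ h + e        # linear time-invariant update

# With rho(A) < 1 the geometric series converges, so ||h|| is bounded by
# roughly ||e|| / (1 - rho(A)) regardless of loop count.
```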

The argument against stacking.

Every layer you add costs parameters, memory, and inference time. Loops cost only time — and not even that, if the problem is easy.

| Property | Standard Transformer | Recurrent-Depth (ANIMA) |
|---|---|---|
| Depth | Fixed. 32 layers = 32 steps of reasoning. | Variable. 16 loops = 16 steps, but can extrapolate to 32+ at inference. |
| Parameters | Each layer has unique weights. 32 layers = 32× cost. | One block, shared. 16 loops = 1× cost. |
| Easy tokens | Full depth regardless. "The" costs as much as "∫". | ACT halting: "the" exits at loop 2. "∫" uses all 16. |
| Depth extrapolation | Impossible. Can't add layers at inference. | Native. Train at T=8, test at T=32. Harder problems → more loops. |
| Model size (100M class) | ~400MB (fp16) | ~200MB (fp16). ~50MB quantized. |
| Where it runs | Cloud GPU. Maybe a laptop. | Phone. Laptop. Edge. Anywhere. |

What happens in each iteration.

ACT HALTING

Adaptive Compute

Each token accumulates a halting probability across loops. When it crosses the threshold, that position stops computing. Easy tokens halt at loop 2-3. Hard tokens use all T. You don't pay for depth you don't need.

DEPTH-WISE LORA

Per-Loop Adaptation

A small rank-r adapter per loop iteration. Loop 1 isn't the same as loop 16 — the LoRA gives each depth its own character. Early loops handle syntax. Late loops handle reasoning.
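A per-loop adapter is just a pair of thin matrices per depth. This toy sketch (made-up sizes, zero-initialized up-projection as is conventional for LoRA) shows the shape of the mechanism and why it's cheap:

```python
import numpy as np

rng = np.random.default_rng(0)
D, r, T = 64, 4, 16  # toy sizes; rank r is tiny relative to D

# One (down, up) pair per loop iteration: hypothetical per-depth adapters.
# up starts at zero so each adapter begins as a no-op.
loras = [(rng.normal(0, 0.02, (D, r)), np.zeros((r, D))) for _ in range(T)]

def lora(h, t):
    """Rank-r correction for loop t; costs 2*D*r parameters per depth."""
    down, up = loras[t]
    return h @ down @ up

full = D * D         # cost of a full per-loop weight matrix
adapter = 2 * D * r  # cost of the rank-r alternative
print(full // adapter)  # → 8
```

At these toy sizes the adapter is 8× smaller than a full per-loop matrix; at real model dims with small r the ratio is far larger.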

LOOP-INDEX EMBEDDING

Positional Awareness

A sinusoidal signal injected at each iteration tells the model "which loop am I in?" Without this, every iteration would be identical. With it, the model learns what to do at each depth.
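A minimal sketch of such a signal, reusing the standard transformer positional-encoding recipe with the loop index in place of the token position (the exact formula ANIMA uses is not specified here, so treat this as an illustration):

```python
import numpy as np

def loop_embedding(t, dim):
    """Sinusoidal encoding of loop index t, transformer-position style."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10_000 ** (2 * i / dim))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

# Each depth receives a distinct signal; without it, the shared block
# could not tell loop 1 from loop 16.
e1, e16 = loop_embedding(1, 64), loop_embedding(16, 64)
print(np.allclose(e1, e16))  # → False
```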

DUAL REASONING

Latent + Explicit Thinking

The recurrent loop IS chain-of-thought — but in latent space, invisible, free. On top of that, the model can emit <think> tags for explicit reasoning. Two layers of depth. One architecture.

PERSONAL CONTEXT

Memory Injection

A dedicated pathway injects personal embeddings alongside the input. Your context participates in every loop iteration — the model reasons about YOU at every depth. Not a prompt hack. Architecture.
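Mechanically, this is a learned projection from the memory store's embedding space into model dimension, mixed into the encoded input e so it rides along with every re-injection. All names and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
D_mem, D_model = 384, 64  # made-up memory-store and model dims

# Stand-in for a learned projection from memory space to model space.
W_proj = rng.normal(0, 0.05, (D_mem, D_model))

def inject(e, personal, alpha=0.1):
    """Project a personal embedding into model space and mix it into e.
    Because e is re-injected every loop, the personal signal is present
    at every reasoning depth."""
    return e + alpha * (personal @ W_proj)

e = rng.normal(size=D_model)
personal = rng.normal(size=D_mem)  # e.g. a retrieved memory embedding
e_mixed = inject(e, personal)
```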

TOOL CALLING

Trained Behavior, Not Architecture

No special decoder heads. The model learns to emit <tool_call> tags from training data. The serving layer intercepts, executes, and injects results. Clean separation of concerns.
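The serving-layer side of this is plain text processing. A minimal sketch, with a dummy dispatcher standing in for real tool execution (a real server would parse a JSON payload and route it):

```python
import re

# Find <tool_call>...</tool_call> spans in raw model output.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def run_tool(payload: str) -> str:
    # Hypothetical stand-in for actually executing the call.
    return f"[result of {payload.strip()}]"

def serve(model_output: str) -> str:
    """Intercept tool-call tags, execute them, splice results back in."""
    return TOOL_RE.sub(lambda m: run_tool(m.group(1)), model_output)

out = serve('The weather is <tool_call>get_weather("Oslo")</tool_call>.')
print(out)  # → The weather is [result of get_weather("Oslo")].
```

The model never knows tools exist as machinery; it only learns the tag vocabulary from data, which is the clean separation the section describes.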

Four scales. One architecture.

| Variant | Dim | Heads | Loops | Params | Quantized | Target |
|---|---|---|---|---|---|---|
| micro | 192 | 6 | 8 | 10.8M | ~5MB | Fast iteration & testing |
| 50m | 832 | 8 | 16 | ~50M | ~25MB | Edge, mobile, IoT |
| 100m | 1024 | 12 | 16 | ~100M | ~50MB | Consumer laptop |
| 200m | 1280 | 16 | 24 | ~200M | ~100MB | Maximum quality at consumer scale |
16× · Effective Depth Multiplier
50MB · 100M Model, Quantized
ρ < 1 · Spectral Radius, By Construction
Depth Extrapolation

The ANIMA Thesis

Adaptive Neural Identity with Memory Architecture. A recurrent-depth transformer that achieves the reasoning depth of models 5–10× its parameter count by replacing layer stacking with iterative refinement through a single shared transformer block.

The problem with scale

Modern language models achieve capability through brute-force depth: more layers, more parameters, more VRAM, more cost. A 7B model needs 14GB in fp16. A 70B model needs 140GB. This is a dead end for personal, private, edge-deployed AI.

But depth isn't the same as weight count. You don't need 32 different transformer blocks. You need one good block, and the ability to think with it as many times as the problem requires.

Recurrent depth = implicit chain-of-thought

Each loop iteration through the recurrent block is functionally equivalent to one step of chain-of-thought reasoning — but operating in continuous latent space. No tokens are emitted. No KV cache grows. The model "thinks" for free.

Research confirms this: looped transformers can learn algorithms that fixed-depth models cannot. They generalize to longer sequences. They solve problems that require iterative refinement — sorting, arithmetic, logical deduction — with fewer parameters.

Depth extrapolation — the superpower

Train at T=8 loops. At inference, set T=32. The model gets deeper reasoning without retraining. This is impossible with standard transformers — you can't add layers at inference. With ANIMA, depth is a runtime knob.
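Because the loop count is a function argument rather than a weight shape, the same checkpoint can be run deeper with no code or parameter changes. A toy illustration (single random matrix standing in for the trained block):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
W = rng.normal(0, 0.1, (D, D))  # the single shared block (stand-in)

def forward(x, T):
    """T is a runtime knob: no weight anywhere has a shape that depends on it."""
    h = np.zeros(D)
    for _ in range(T):
        h = np.tanh((h + x) @ W)  # re-inject input, loop T times
    return h

x = rng.normal(size=D)
h_train = forward(x, T=8)   # the depth used during training
h_test = forward(x, T=32)   # 4x deeper at inference, zero new parameters
```

Whether the extra depth actually improves answers is the empirical bet; what the architecture guarantees is only that running deeper is well-defined and stable.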

The Parcae scaling law supports this: increasing loop count while reducing token count yields optimal loss at fixed FLOPs. More thinking per token beats more tokens with less thinking.

Adaptive compute via ACT

Not every token deserves the same compute. The word "the" doesn't need 16 loops. An integral sign does. Adaptive Computation Time (ACT) lets each position accumulate a halting probability. When it crosses the threshold, that position exits the loop. The model learns to allocate compute where it matters.

Personal context injection

The RDT architecture has a natural injection point: the encoded input e is re-injected at every loop iteration. ANIMA extends this with a personal context pathway — embeddings from a local memory store are projected into model dimension and mixed with e. Your context participates in every step of reasoning.

Combined with swappable LoRA adapters (~1-5MB each), this creates a personal AI that runs locally, remembers you, speaks in your voice, and never sends your data anywhere.

Multi-stage training

What this is not

This is not a product. This is not production-ready. This is an experimental architecture exploring whether small, looped models can compete with large, fixed-depth models on reasoning tasks. The hypothesis: ANIMA-100M should match or exceed fixed-depth models at 500M–1B on reasoning benchmarks. If that holds, everything changes about where intelligence can live.

Intelligence shouldn't require a data center.

It should live where you do.