// recurrent_depth_transformer

One block.
T loops.
The depth of giants.

ANIMA doesn't stack layers. It recurses. A single transformer block, run T times — each pass a deeper thought. 50 million parameters thinking 16 times per token. Depth without mass.

Training now on RTX 5090 · Weights will be open
INPUT → PRELUDE → RECURRENT BLOCK ×T=16 → CODA → OUTPUT

Three stages. One shared core.

Standard transformers stack N identical blocks. ANIMA runs one block N times. Same weights, different depths. Each loop iteration is a thought.

```
            INPUT
              │
              ▼
┌─────────────────────────────┐
│ PRELUDE                     │  // 1-2 transformer layers
│                             │  // feature extraction, run once
└─────────────┬───────────────┘
              │
              ▼  encoded input e
┌─────────────────────────────┐
│ RECURRENT BLOCK ×T loops    │  // the reasoning engine
│                             │
│ for t in range(T):          │  // T = 8, 16, 24...
│   h += loop_embed(t)        │  // "which iteration am I?"
│   h = norm(h + e)           │  // re-inject input every loop
│   h = attention(h) + ffn(h) │  // same weights, different depth
│   h += lora(h, t)           │  // per-loop adaptation
│   h = A·h + B·e + h         │  // LTI-stable injection
│   if halt_prob(h) > θ:      │  // ACT: stop when converged
│     break                   │  // easy tokens exit early
│                             │
└─────────────┬───────────────┘
              │
              ▼
┌─────────────────────────────┐
│ CODA                        │  // 1-2 transformer layers
│                             │  // output refinement, run once
└─────────────┬───────────────┘
              │
              ▼
            LOGITS
```
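The control flow above can be sketched in a few lines of numpy. This is a toy stand-in, not the real model: a single `tanh` layer plays the role of the shared attention+FFN block, the prelude is the identity, and loop embeddings, LoRA, and ACT are omitted to keep the skeleton visible.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden size; real dims are 192-1280 per the variant table

# Hypothetical stand-in weights. A single linear map + tanh replaces the
# attention+FFN block; the point is that the SAME weights run every loop.
W_block = rng.normal(0, 0.1, (D, D))
A = 0.9 * np.eye(D)   # LTI injection matrix, spectral radius < 1
B = 0.1 * np.eye(D)

def block(h):
    """Stand-in for the shared transformer block (identical at every depth)."""
    return np.tanh(h @ W_block)

def forward(x, T):
    e = x                  # prelude output: encoded input (identity here)
    h = np.zeros(D)
    for t in range(T):
        h = block(h + e)   # re-inject the input every iteration
        h = A @ h + B @ e  # LTI-stable state update
    return h               # the coda and logits would follow

x = rng.normal(size=D)
h8, h32 = forward(x, T=8), forward(x, T=32)  # depth is a runtime argument
```

Note that nothing about the weights depends on `T`: the same `W_block` serves depth 8 and depth 32.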
WEIGHT REUSE

T loops = T× effective depth at 1× parameter cost

A 50M model with 16 loops has the reasoning depth of a 200M+ fixed-depth model. The parameters are shared — you're paying for memory once, but using it sixteen times.
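The accounting is simple enough to do inline. The block size below is a hypothetical round number, not ANIMA's actual block:

```python
# Toy parameter accounting, assuming a ~3M-parameter block (made-up figure).
# Stacking pays for every layer; recurrence pays for the block once.
params_per_block = 3_000_000
T = 16

stacked = params_per_block * T    # 16 unique layers, 16x the memory
recurrent = params_per_block * 1  # one shared block, looped 16 times

assert stacked // recurrent == T  # equal depth, 1/16th the parameters
```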

INPUT RE-INJECTION

The original signal is never lost

Every loop re-injects the encoded input e. The model can't "forget" what it was asked. Each iteration refines the answer with the question still in view.

LTI STABILITY

The hidden state can't diverge

The injection uses Linear Time-Invariant dynamics with spectral radius ρ(A) < 1 by construction. Because each loop applies the same contraction, the state converges toward a bounded fixed point instead of blowing up. No matter how many loops you run, the state stays bounded. Guaranteed.
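The boundedness claim is easy to check numerically. The sketch below builds a matrix with ρ(A) < 1 by one hypothetical construction (normalizing by the spectral radius), then runs the linear update far past any plausible training depth:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

# One way (of many) to get rho(A) < 1 by construction: scale a random
# matrix down by its own spectral radius.
M = rng.normal(size=(D, D))
rho = max(abs(np.linalg.eigvals(M)))
A = 0.95 * M / rho  # spectral radius exactly 0.95

e = rng.normal(size=D)   # encoded input, re-injected every loop
h = np.zeros(D)
for _ in range(10_000):  # orders of magnitude past any training depth
    h = A @ h + e        # linear time-invariant update

# With rho(A) < 1 the geometric series converges, so ||h|| is bounded by
# roughly ||e|| / (1 - rho(A)) regardless of loop count.
```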

The argument against stacking.

Every layer you add costs parameters, memory, and inference time. Loops cost only time — and not even that, if the problem is easy.

| Property | Standard Transformer | Recurrent-Depth (ANIMA) |
|---|---|---|
| Depth | Fixed. 32 layers = 32 steps of reasoning. | Variable. 16 loops = 16 steps, but can extrapolate to 32+ at inference. |
| Parameters | Each layer has unique weights. 32 layers = 32× cost. | One block, shared. 16 loops = 1× cost. |
| Easy tokens | Full depth regardless. "The" costs as much as "∫". | ACT halting: "the" exits at loop 2. "∫" uses all 16. |
| Depth extrapolation | Impossible. Can't add layers at inference. | Native. Train at T=8, test at T=32. Harder problems → more loops. |
| Model size (100M class) | ~400MB (fp16) | ~200MB (fp16). ~50MB quantized. |
| Where it runs | Cloud GPU. Maybe a laptop. | Phone. Laptop. Edge. Anywhere. |

What happens in each iteration.

ACT HALTING

Adaptive Compute

Each token accumulates a halting probability across loops. When it crosses the threshold, that position stops computing. Easy tokens halt at loop 2-3. Hard tokens use all T. You don't pay for depth you don't need.

DEPTH-WISE LORA

Per-Loop Adaptation

A small rank-r adapter per loop iteration. Loop 1 isn't the same as loop 16 — the LoRA gives each depth its own character. Early loops handle syntax. Late loops handle reasoning.
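A per-loop adapter is just a pair of thin matrices per depth. This toy sketch (made-up sizes, zero-initialized up-projection as is conventional for LoRA) shows the shape of the mechanism and why it's cheap:

```python
import numpy as np

rng = np.random.default_rng(0)
D, r, T = 64, 4, 16  # toy sizes; rank r is tiny relative to D

# One (down, up) pair per loop iteration: hypothetical per-depth adapters.
# up starts at zero so each adapter begins as a no-op.
loras = [(rng.normal(0, 0.02, (D, r)), np.zeros((r, D))) for _ in range(T)]

def lora(h, t):
    """Rank-r correction for loop t; costs 2*D*r parameters per depth."""
    down, up = loras[t]
    return h @ down @ up

full = D * D         # cost of a full per-loop weight matrix
adapter = 2 * D * r  # cost of the rank-r alternative
print(full // adapter)  # → 8
```

At these toy sizes the adapter is 8× smaller than a full per-loop matrix; at real model dims with small r the ratio is far larger.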

LOOP-INDEX EMBEDDING

Positional Awareness

A sinusoidal signal injected at each iteration tells the model "which loop am I in?" Without this, every iteration would be identical. With it, the model learns what to do at each depth.
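A minimal sketch of such a signal, reusing the standard transformer positional-encoding recipe with the loop index in place of the token position (the exact formula ANIMA uses is not specified here, so treat this as an illustration):

```python
import numpy as np

def loop_embedding(t, dim):
    """Sinusoidal encoding of loop index t, transformer-position style."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10_000 ** (2 * i / dim))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

# Each depth receives a distinct signal; without it, the shared block
# could not tell loop 1 from loop 16.
e1, e16 = loop_embedding(1, 64), loop_embedding(16, 64)
print(np.allclose(e1, e16))  # → False
```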

DUAL REASONING

Latent + Explicit Thinking

The recurrent loop IS chain-of-thought — but in latent space, invisible, free. On top of that, the model can emit <think> tags for explicit reasoning. Two layers of depth. One architecture.

PERSONAL CONTEXT

Memory Injection

A dedicated pathway injects personal embeddings alongside the input. Your context participates in every loop iteration — the model reasons about YOU at every depth. Not a prompt hack. Architecture.
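Mechanically, this is a learned projection from the memory store's embedding space into model dimension, mixed into the encoded input e so it rides along with every re-injection. All names and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
D_mem, D_model = 384, 64  # made-up memory-store and model dims

# Stand-in for a learned projection from memory space to model space.
W_proj = rng.normal(0, 0.05, (D_mem, D_model))

def inject(e, personal, alpha=0.1):
    """Project a personal embedding into model space and mix it into e.
    Because e is re-injected every loop, the personal signal is present
    at every reasoning depth."""
    return e + alpha * (personal @ W_proj)

e = rng.normal(size=D_model)
personal = rng.normal(size=D_mem)  # e.g. a retrieved memory embedding
e_mixed = inject(e, personal)
```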

TOOL CALLING

Trained Behavior, Not Architecture

No special decoder heads. The model learns to emit <tool_call> tags from training data. The serving layer intercepts, executes, and injects results. Clean separation of concerns.
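The serving-layer side of this is plain text processing. A minimal sketch, with a dummy dispatcher standing in for real tool execution (a real server would parse a JSON payload and route it):

```python
import re

# Find <tool_call>...</tool_call> spans in raw model output.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def run_tool(payload: str) -> str:
    # Hypothetical stand-in for actually executing the call.
    return f"[result of {payload.strip()}]"

def serve(model_output: str) -> str:
    """Intercept tool-call tags, execute them, splice results back in."""
    return TOOL_RE.sub(lambda m: run_tool(m.group(1)), model_output)

out = serve('The weather is <tool_call>get_weather("Oslo")</tool_call>.')
print(out)  # → The weather is [result of get_weather("Oslo")].
```

The model never knows tools exist as machinery; it only learns the tag vocabulary from data, which is the clean separation the section describes.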

Four scales. One architecture.

| Variant | Dim | Heads | Loops | Params | Quantized | Target |
|---|---|---|---|---|---|---|
| micro | 192 | 6 | 8 | 10.8M | ~5MB | Fast iteration & testing |
| 50m | 832 | 8 | 16 | ~50M | ~25MB | Edge, mobile, IoT |
| 100m | 1024 | 12 | 16 | ~100M | ~50MB | Consumer laptop |
| 200m | 1280 | 16 | 24 | ~200M | ~100MB | Maximum quality at consumer scale |
16× · Effective Depth Multiplier
50MB · 100M Model, Quantized
ρ < 1 · Spectral Radius, By Construction
Depth Extrapolation

The ANIMA Thesis

Adaptive Neural Identity with Memory Architecture. A recurrent-depth transformer that achieves the reasoning depth of models 5–10× its parameter count by replacing layer stacking with iterative refinement through a single shared transformer block.

The problem with scale

Modern language models achieve capability through brute-force depth: more layers, more parameters, more VRAM, more cost. A 7B model needs 14GB in fp16. A 70B model needs 140GB. This is a dead end for personal, private, edge-deployed AI.

But depth isn't the same as weight count. You don't need 32 different transformer blocks. You need one good block, and the ability to think with it as many times as the problem requires.

Recurrent depth = implicit chain-of-thought

Each loop iteration through the recurrent block is functionally equivalent to one step of chain-of-thought reasoning — but operating in continuous latent space. No tokens are emitted. No KV cache grows. The model "thinks" for free.

Research confirms this: looped transformers can learn algorithms that fixed-depth models cannot. They generalize to longer sequences. They solve problems that require iterative refinement — sorting, arithmetic, logical deduction — with fewer parameters.

Depth extrapolation — the superpower

Train at T=8 loops. At inference, set T=32. The model gets deeper reasoning without retraining. This is impossible with standard transformers — you can't add layers at inference. With ANIMA, depth is a runtime knob.
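Because the loop count is a function argument rather than a weight shape, the same checkpoint can be run deeper with no code or parameter changes. A toy illustration (single random matrix standing in for the trained block):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
W = rng.normal(0, 0.1, (D, D))  # the single shared block (stand-in)

def forward(x, T):
    """T is a runtime knob: no weight anywhere has a shape that depends on it."""
    h = np.zeros(D)
    for _ in range(T):
        h = np.tanh((h + x) @ W)  # re-inject input, loop T times
    return h

x = rng.normal(size=D)
h_train = forward(x, T=8)   # the depth used during training
h_test = forward(x, T=32)   # 4x deeper at inference, zero new parameters
```

Whether the extra depth actually improves answers is the empirical bet; what the architecture guarantees is only that running deeper is well-defined and stable.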

The Parcae scaling law supports this: increasing loop count while reducing token count yields optimal loss at fixed FLOPs. More thinking per token beats more tokens with less thinking.

Adaptive compute via ACT

Not every token deserves the same compute. The word "the" doesn't need 16 loops. An integral sign does. Adaptive Computation Time (ACT) lets each position accumulate a halting probability. When it crosses the threshold, that position exits the loop. The model learns to allocate compute where it matters.

Personal context injection

The RDT architecture has a natural injection point: the encoded input e is re-injected at every loop iteration. ANIMA extends this with a personal context pathway — embeddings from a local memory store are projected into model dimension and mixed with e. Your context participates in every step of reasoning.

Combined with swappable LoRA adapters (~1-5MB each), this creates a personal AI that runs locally, remembers you, speaks in your voice, and never sends your data anywhere.

Multi-stage training

What this is not

This is not a product. This is not production-ready. This is an experimental architecture exploring whether small, looped models can compete with large, fixed-depth models on reasoning tasks. The hypothesis: ANIMA-100M should match or exceed fixed-depth models at 500M–1B on reasoning benchmarks. If that holds, everything changes about where intelligence can live.

Intelligence shouldn't require a data center.

It should live where you do.