A tiny 27M-parameter HRM beat Opus 4 and o3 on the ARC AGI benchmark
That sounds very surprising.
Most of the SOTA models we hear about (Claude, GPT, Gemini, Grok) are hundreds of billions to a few trillion parameters. We can't realistically run those locally. So when Sapient Intelligence published HRM (paper in June, model out in July), everyone was impressed that something a thousand times smaller could outperform its much larger counterparts on the ARC AGI benchmark. But size alone doesn't tell the full story.
HRM scored 32% on ARC AGI 1 and 2% on ARC AGI 2 without pre-training. That's the interesting part, as the current foundation models are trained on almost the entire internet. It suggests that a different approach can work for reasoning with limited data.
For context from recent large models:
- Qwen3-coder (480B parameter model) was trained on 7.5 trillion tokens
- Kimi-k2 (1T parameter model) saw 15.5 trillion tokens
As we understand it, HRM is not a foundation model. It lacks:
1. Broad pre-training data.
2. Transfer learning capability.
So we can't expect HRM to write polished emails or build software apps from scratch. What it can do is solve logical puzzles and show reasoning aptitude, which is exactly what ARC AGI measures.
How does HRM work?
HRM's core novelty is architectural, inspired (loosely) by neuroscience. Instead of one transformer predicting the next token, HRM uses a dual recurrent loop built from two recurrent neural networks operating at different time scales:
- A fast, lower-level module that generates bursts of thinking (rapid, local processing)
- A slower, higher-level module that is more abstract and refines the fast module's outputs
Think of it as fast intuition + slow deliberation. The higher module provides guidance; the lower module produces quick hypotheses that are iteratively refined. This outer refinement loop, feeding fast bursts into a slower, abstract controller, is what gives HRM its compositional reasoning advantage on specific tasks.
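To make the two-timescale idea concrete, here is a minimal PyTorch sketch. This is not the official HRM implementation: the class name, hidden sizes, and step counts are assumptions for illustration, and details from the paper such as adaptive halting are omitted.

```python
import torch
import torch.nn as nn


class TwoTimescaleReasoner(nn.Module):
    """Toy two-timescale recurrent loop in the spirit of HRM (not the official code).

    A fast low-level GRU runs a burst of micro-steps per outer step, conditioned on
    the slow state; a slow high-level GRU then updates once, integrating that burst.
    All names, sizes, and step counts here are illustrative assumptions.
    """

    def __init__(self, input_dim=64, fast_dim=128, slow_dim=128,
                 fast_steps=4, outer_steps=8):
        super().__init__()
        self.fast_cell = nn.GRUCell(input_dim + slow_dim, fast_dim)  # rapid, local processing
        self.slow_cell = nn.GRUCell(fast_dim, slow_dim)              # slower, abstract controller
        self.readout = nn.Linear(slow_dim, input_dim)
        self.fast_steps = fast_steps
        self.outer_steps = outer_steps

    def forward(self, x):
        batch = x.size(0)
        h_fast = x.new_zeros(batch, self.fast_cell.hidden_size)
        h_slow = x.new_zeros(batch, self.slow_cell.hidden_size)

        for _ in range(self.outer_steps):                 # slow, deliberate refinement loop
            for _ in range(self.fast_steps):              # burst of fast "intuition" steps
                fast_in = torch.cat([x, h_slow], dim=-1)  # fast module sees input + slow guidance
                h_fast = self.fast_cell(fast_in, h_fast)
            h_slow = self.slow_cell(h_fast, h_slow)       # slow module refines using the burst
        return self.readout(h_slow)                       # answer after iterative refinement


# Example: two dummy puzzle encodings in, refined outputs out.
model = TwoTimescaleReasoner()
out = model(torch.randn(2, 64))
```

The point of the sketch is the nesting: the inner loop produces cheap, fast hypotheses, and the outer loop lets the slower module correct and redirect them over several passes instead of committing to a single forward prediction.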
Transformers, by contrast, optimize next-token prediction over large contexts and scale extremely well in parallel training. RNNs historically struggled with long-term dependencies and early convergence to suboptimal representations. HRM sidesteps some of those issues by bifurcating timescales and adding that iterative refinement.
Why does this matter?
HRM is not a transformer killer or an immediate replacement for foundation models. It's not a threat to the scale and pre-training paradigm that powers GPT-class models. Instead, it's a refinement in the AGI/reasoning direction: an architecture that shows RNN-style systems can be competitive on reasoning tasks without trillions of tokens.
- RNNs could come back. We've traded a lot of learning quality for speed of scaling. If new architectures learn better, we might not need monstrous models to get strong reasoning.
- A path toward more efficient AGI-style reasoning. Transformer scale has limits: eventual saturation, huge compute, and the impracticality of running locally. HRM hints at designs that might be more compute-friendly for certain classes of reasoning.
- Biology-inspired loops are promising again. The outer refinement loop is simple but powerful: iterate, correct, repeat. That's central to human reasoning, and it now seems useful in neural nets too.
- Pre-training is still a massive advantage. Transformers win because they parallelize well and learn from massive batches. HRM hasn't been scaled with that same pre-training playbook yet.

