In October 2025, Alexia Jolicoeur-Martineau, a senior researcher at Samsung’s Advanced Institute of Technology in Montreal, published a paper titled “Less is More: Recursive Reasoning with Tiny Networks.” The paper described what she called the Tiny Recursive Model, abbreviated as TRM, a 7-million-parameter neural network that achieves 45 percent accuracy on the ARC-AGI-1 benchmark and roughly 8 percent on ARC-AGI-2, outperforming frontier models including Google’s Gemini 2.5 Pro and OpenAI’s o3-mini-high. The size differential is, on first reading, hard to absorb. TRM is roughly one ten-thousandth the size of the smallest frontier LLMs, and on the most demanding generalization benchmarks in current AI research, it beats them. The result has reopened a question the field had treated as settled.
What TRM actually does
The Tiny Recursive Model is not a general-purpose language model. It is a specialized reasoning system designed for structured problem-solving on benchmarks like ARC-AGI, Sudoku-Extreme, and Maze-Hard. The architecture is a single small neural network applied recursively to its own outputs, refining both an internal reasoning state and a candidate answer over multiple passes. The model is also trained to know when to stop iterating: it halts once its confidence in the current answer is high enough.
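The refine-then-halt loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function names, the default pass counts, and the confidence threshold are all assumptions, and hand-written update rules (a Newton-style square-root solver) stand in for the learned network.

```python
from typing import Callable, Tuple

def recursive_refine(
    x: float,
    update_state: Callable[[float, float, float], float],
    update_answer: Callable[[float, float], float],
    confidence: Callable[[float, float], float],
    y0: float = 1.0,
    z0: float = 0.0,
    max_passes: int = 16,     # assumed cap on refinement passes
    inner_steps: int = 3,     # assumed number of state updates per pass
    threshold: float = 0.99,  # assumed halting confidence
) -> Tuple[float, int]:
    """Refine a reasoning state z and a candidate answer y until confident."""
    y, z = y0, z0
    passes = 0
    for passes in range(1, max_passes + 1):
        # Update the internal reasoning state several times...
        for _ in range(inner_steps):
            z = update_state(x, y, z)
        # ...then use it to improve the candidate answer.
        y = update_answer(y, z)
        # Halt as soon as confidence in the current answer is high enough.
        if confidence(x, y) >= threshold:
            break
    return y, passes

# Hypothetical stand-ins for the learned network: solve y*y = x.
def toy_update_state(x: float, y: float, z: float) -> float:
    # The "state" here is just the current residual; this toy update is
    # memoryless, whereas a learned network would also make use of z.
    return x - y * y

def toy_update_answer(y: float, z: float) -> float:
    return y + z / (2.0 * y)  # Newton step driven by the residual

def toy_confidence(x: float, y: float) -> float:
    return 1.0 / (1.0 + abs(x - y * y))  # approaches 1.0 as the residual shrinks

answer, passes_used = recursive_refine(
    2.0, toy_update_state, toy_update_answer, toy_confidence
)
print(f"answer={answer:.4f} after {passes_used} passes")
```

The point of the sketch is the control flow: the same update functions are reused on every pass, so extra reasoning depth costs more iterations rather than more parameters.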
The architectural foundation builds on the Hierarchical Reasoning Model published earlier in 2025, which used two small networks recursing at different frequencies. TRM simplifies the architecture by collapsing the two networks into one, removing the biological-reasoning scaffolding that HRM had used to justify its design, and replacing it with cleaner mathematical formulations. The simplified architecture turned out to be more effective rather than less. With approximately 7 million parameters and substantially less training infrastructure than HRM required, TRM reaches state-of-the-art performance on the benchmarks it targets.
The training cost is the part of the story that makes the implications uncomfortable for the dominant scaling thesis. Training TRM cost under $500 and took approximately two days on four GPUs, which puts reproduction within reach of any research lab with consumer-grade hardware. The deployment footprint is similarly lightweight: 7 million parameters fit on edge devices, the inference latency is in the millisecond range, and the model can run on a single consumer GPU without quantization. The economics of the system are at least an order of magnitude better than those of any comparable LLM on the workloads it handles.
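A quick back-of-the-envelope calculation shows why the footprint claim is plausible. The sketch below counts weight storage only, using the standard 4 bytes per float32 value and 2 bytes per float16 value; activations, optimizer state, and runtime overhead are not included, and the helper name is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_megabytes(num_params: int, dtype: str) -> float:
    """Approximate weight storage in megabytes (10**6 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000

TRM_PARAMS = 7_000_000  # parameter count cited in the article

fp32_mb = weight_megabytes(TRM_PARAMS, "float32")
fp16_mb = weight_megabytes(TRM_PARAMS, "float16")
print(f"float32: {fp32_mb} MB, float16: {fp16_mb} MB")  # 28.0 MB and 14.0 MB
```

At roughly 28 MB in full precision, the weights are small enough that quantization is a convenience rather than a requirement.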
The benchmark results
ARC-AGI-2, documented in our ARC-AGI-2 analysis, is the benchmark designed by François Chollet and collaborators to measure general fluid intelligence rather than memorized skills, calibrated against human performance on tasks that humans find easy and current AI struggles with. The benchmark was deliberately constructed to resist the brute-force pattern-matching approaches that allowed frontier LLMs to score well on ARC-AGI-1. By Chollet’s own framing, ARC-AGI-2 scores below 5 percent are generally not treated as meaningful signal. TRM’s 7.8 percent score clears that threshold, while Gemini 2.5 Pro’s 4.9 percent and the scores of several other frontier models fall below it.
On Sudoku-Extreme, a dataset of difficult Sudoku puzzles with only 1,000 training examples, TRM achieved 87.4 percent accuracy, setting a new state-of-the-art and improving substantially on the 55 percent score of its HRM predecessor. On Maze-Hard, which involves finding paths through complex 30×30 grids, TRM reached 85.3 percent. The pattern is consistent across the structured reasoning benchmarks: a tiny model beating much larger systems on tasks where the larger models’ approach to reasoning, namely extended chain-of-thought generation, does not appear to confer a meaningful advantage.
The model’s strengths and limitations are well defined. TRM excels at problems with clear structure, finite solution spaces, and the kind of multi-step logical deduction that benefits from iterative refinement. It does not generate language, does not handle open-ended creative tasks, and does not perform well on benchmarks that reward broad world knowledge or fluent text generation. The comparison with frontier LLMs is therefore not a like-for-like substitution. TRM and Gemini 2.5 Pro solve different problems. TRM happens to solve a specific class of problems that the field had assumed would require larger architectures.
What the result implies for scaling
The dominant narrative of the LLM era has been that capability follows scale, with larger models trained on more data and more compute producing better outputs. The narrative is empirically supported by the trajectory from GPT-2 to GPT-5, with each generation producing measurable gains that justified the next generation’s compute investment. The narrative is also self-reinforcing in commercial terms: labs that can fund larger training runs gain capability advantages that justify the funding for still-larger training runs.
TRM does not refute the scaling thesis. It complicates it. The result demonstrates that for certain classes of problems, architectural innovation can substitute for scale at ratios that look implausible until they are reproduced. The same pattern has appeared elsewhere in 2025. Alibaba’s Qwen QwQ-32B, documented in our Qwen QwQ analysis, demonstrated that a 32-billion-parameter model with the right RL post-training could match a 671-billion-parameter competitor on reasoning tasks. The pattern surfaces across the broader LLM new models coverage and the State of LLMs 2025 analysis.
The implication for the field is that scaling is one path forward, not the only one. Architectural innovation, particularly around recursive reasoning, parameter reuse, and training objectives that incentivize structured problem-solving rather than next-token prediction, offers an alternative trajectory. The labs that bet exclusively on scale, on the implicit assumption that no architectural innovation will produce comparable returns, are exposed to the risk that TRM-like results compound across more problem classes over the next 24 months.
Where TRM-class systems fit in production
The field now faces a procurement question it did not face two years ago. For problems that fit the structured-reasoning template, namely tasks with verifiable solutions, finite state spaces, and the kind of multi-step deduction that benefits from iterative refinement, specialized small models like TRM may produce better results at a fraction of a percent of the cost of frontier LLMs. For problems requiring language understanding, world knowledge, or creative generation, frontier LLMs remain the appropriate choice.
The procurement question is therefore not “which model” but “which model class for which workload.” Organizations that route their reasoning workloads through frontier LLMs by default are likely overpaying for capabilities they do not need on tasks where a smaller specialized system would deliver equivalent or better results. The patterns documented in our agentic AI report and our enterprise AI governance coverage reflect the broader principle: the model selection layer is becoming as important as the models themselves.
The deployment-engineering implications are real. Building production systems that route between model classes based on task characteristics requires infrastructure that did not exist a year ago. The routing logic, the calibration data, and the fallback behavior when the routing classifier itself is wrong all become engineering work that AI architecture teams now have to do. The cost of building that infrastructure is part of the total cost of running a multi-model strategy, and it is non-trivial.
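The routing-plus-fallback pattern described above can be sketched as follows. Everything here is hypothetical: the `Task` fields, the `route` function, and the toy specialist, generalist, and verifier are stand-ins chosen for illustration, not an existing library or a production design. The key idea is that a cheap verifier catches cases where the routing decision was wrong and hands the task to the general-purpose model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Task:
    payload: str
    looks_structured: bool  # output of a (possibly wrong) routing classifier

def route(
    task: Task,
    specialist: Callable[[str], Optional[str]],
    generalist: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the small specialist first; fall back if its answer fails verification."""
    if task.looks_structured:
        candidate = specialist(task.payload)
        if candidate is not None and verify(task.payload, candidate):
            return "specialist", candidate
        # Fallback path: the classifier or the specialist was wrong.
    return "generalist", generalist(task.payload)

# Toy stand-ins: the "structured" workload is sorting a list of integers.
def toy_specialist(payload: str) -> Optional[str]:
    try:
        return " ".join(str(n) for n in sorted(int(t) for t in payload.split()))
    except ValueError:
        return None  # input was not the structured task it expects

def toy_generalist(payload: str) -> str:
    return f"LLM answer for: {payload}"  # placeholder for a frontier LLM call

def toy_verify(payload: str, candidate: str) -> bool:
    try:
        got = [int(t) for t in candidate.split()]
        want = [int(t) for t in payload.split()]
    except ValueError:
        return False
    # Valid if the candidate is in order and is a permutation of the input.
    return got == sorted(got) and sorted(got) == sorted(want)

print(route(Task("3 1 2", True), toy_specialist, toy_generalist, toy_verify))
print(route(Task("summarize this memo", True), toy_specialist, toy_generalist, toy_verify))
```

The second call shows the misrouting case: the classifier flagged a language task as structured, the specialist could not handle it, and the router fell back to the generalist. In a real system, logging how often that fallback fires is one source of the calibration data the routing layer needs.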
What the next 24 months will test
TRM is one data point. The question is whether the architectural insight generalizes. On the current trajectory, the next 18 to 24 months will produce additional small specialized models that target other problem classes with comparable parameter counts and similar or better performance against frontier LLMs on the targeted tasks. The labs that have noticed the pattern, including Samsung’s SAIT Montreal, Intercom’s research team (which published a Mamba-based variant of TRM), and various academic groups, are actively pursuing the technique.
If the pattern holds, the AI deployment landscape in late 2027 will look meaningfully different. Specialized small models will handle structured reasoning, frontier LLMs will handle language generation, diffusion models (documented in our diffusion models 2025 analysis) will handle image and video, and the routing layer between them will become standard production infrastructure. The labs that monopolize the frontier LLM tier will retain significant value, but the share of total inference compute they capture will shift as smaller specialized systems absorb workloads that previously required their architectures.
The question for AI architecture decision-makers
The TRM result is genuinely uncomfortable for organizations whose AI strategy assumes that frontier LLM capability will dominate every workload going forward. The result does not invalidate frontier LLMs, but it changes the procurement calculus on a meaningful share of the workloads enterprises are currently running through them. The leaders who absorb the implications quickly will reduce their AI infrastructure costs substantially over the next 24 months. The leaders who do not will discover the gap when their competitors start shipping comparable capabilities at meaningfully lower unit economics.
So one question for any AI architecture leader reviewing the model selection layer of a production stack: of the reasoning, classification, and structured problem-solving workloads currently running on frontier LLMs in that environment, what fraction could be handled by a specialized 7-million-parameter model at a fraction of a percent of the cost, and how much work would it take to find out?
