POST 004·RESULT·JUNE 2026

Reasoning Beyond What It Was Taught

Arch Research

ABSTRACT

The deepest question you can ask about an AI model is whether it actually reasons or merely remembers. The two look identical from the outside. A model that has memorized a vast number of examples will answer familiar questions flawlessly, and you might never realize that it has no idea what it is doing, because you keep asking it things it has effectively already seen. The difference only reveals itself when you push the model past the edge of its training, into territory it could not have memorized, and watch what happens.

Most models, pushed past that edge, fall apart. They were never reasoning in the first place; they were recognizing, and once there is nothing left to recognize, there is nothing left at all. We wanted to know whether our approach was different. So we designed a test specifically to push a model into the unknown and measure whether it could think its way through, or only collapse.

The test

We trained a model on a set of reasoning problems up to a certain difficulty, and only up to that difficulty. It had never seen anything harder. Then, at test time, we gave it problems substantially harder than anything in its training, and we ran it two different ways on those harder problems.

In the first condition, we capped the model's thinking at the same limit it had during training. It was allowed only as much effort as it had ever used before. In the second condition, we let it think longer, giving it the additional rounds of effort that the harder problems genuinely required. Everything else was identical. The only difference was whether the model was allowed to spend more thought on a problem that was bigger than anything it had been taught to handle.

The gap between those two conditions is the whole experiment. If the extra thinking did nothing, then the model was just a lookup table, and more time would not help, because there was no real reasoning to extend. But if the extra thinking actually solved the harder problems, then the model had learned something general, a way of reasoning that composes to lengths it was never explicitly shown.
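
To make the two conditions concrete, here is a minimal sketch of the evaluation protocol in Python. It is a toy stand-in under stated assumptions, not the actual model or benchmark: the "problem" is a hypothetical pointer chain, one "round of thinking" is a single hop along it, and the names make_problem, step, run, accuracy, and TRAIN_LIMIT are all invented for illustration. The point is only the shape of the comparison: the same step function, the same problems, and a different round budget.

```python
import random

TRAIN_LIMIT = 8  # hypothetical: the most rounds ever used during training


def make_problem(length, rng):
    """Build a chain of `length` hops with shuffled node labels.

    The last node points to itself, so extra rounds past the end are harmless.
    Returns (graph, start node, correct answer).
    """
    labels = list(range(length + 1))
    rng.shuffle(labels)
    graph = {labels[i]: labels[i + 1] for i in range(length)}
    graph[labels[-1]] = labels[-1]  # terminal node is a fixed point
    return graph, labels[0], labels[-1]


def step(graph, state):
    """One round of 'thinking': follow a single pointer."""
    return graph[state]


def run(graph, start, max_rounds):
    """Apply the same step repeatedly, up to the round budget."""
    state = start
    for _ in range(max_rounds):
        state = step(graph, state)
    return state


def accuracy(problems, max_rounds):
    """Fraction of problems whose final state is the correct answer."""
    correct = sum(run(g, s, max_rounds) == answer for g, s, answer in problems)
    return correct / len(problems)


rng = random.Random(0)
# Test problems are more than twice as long as the training limit.
hard = [
    make_problem(rng.randint(2 * TRAIN_LIMIT + 1, 3 * TRAIN_LIMIT), rng)
    for _ in range(200)
]

capped = accuracy(hard, TRAIN_LIMIT)        # condition 1: training-time budget
extended = accuracy(hard, 3 * TRAIN_LIMIT)  # condition 2: allowed to think longer
print(f"capped: {capped:.0%}, extended: {extended:.0%}")
```

In this toy the capped run scores 0% and the extended run scores 100%, because the step is hand-written and exactly correct. Those figures say nothing about a learned model; the sketch only shows why a gap between the two budgets is evidence that each round is doing real work.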

The result

The result was about as clean as a result gets.

Capped at its training limit: about 11% correct, barely above chance

Allowed to think longer: essentially perfect

On problems more than twice as long as anything it had trained on, the model, when held to its old thinking limit, scored around eleven percent, which is to say it was guessing. The very same model, on the very same problems, scored essentially one hundred percent when allowed to think longer. Nothing about the model changed between those two numbers. We did not retrain it, did not add data, did not adjust a single one of its internal values. We simply let it spend more thought, and the harder problems went from nearly impossible to solved.

The same model, on the same unfamiliar problems, went from guessing to solving them, with no retraining at all. The only thing that changed was that we let it think longer.

Why this is the result we are proudest of

This experiment gets at the heart of what makes our approach different, and it does so in a way that a conventional model fundamentally cannot match. A standard fixed-effort model has its depth of thinking frozen when it is built. It literally cannot choose to think longer on a harder problem, because the amount of thinking is not something it controls. Our approach can, and this experiment shows that the extra thinking is not idle churning. It is real computation that converts directly into solving problems the model was never taught to solve.
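
The contrast between frozen depth and variable depth can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the task (repeatedly summing digits until one digit remains), the function names, and especially the stopping rule, since this post does not describe how the system decides when it has thought long enough.

```python
def step(n):
    """One round of computation: replace a number with the sum of its digits."""
    return sum(int(d) for d in str(n))


def fixed_depth(n, depth=2):
    """Frozen depth: always exactly `depth` rounds, however hard the input is."""
    for _ in range(depth):
        n = step(n)
    return n


def adaptive_depth(n, max_rounds=100):
    """Variable depth: keep applying the same step until the answer is final.

    The halting test (a single digit remains) is a hypothetical stand-in
    for whatever stopping rule a real system would use.
    """
    rounds = 0
    while n >= 10 and rounds < max_rounds:
        n = step(n)
        rounds += 1
    return n, rounds


print(fixed_depth(38))      # two rounds are enough here: 38 -> 11 -> 2
print(fixed_depth(199))     # two rounds are not enough: 199 -> 19 -> 10
print(adaptive_depth(199))  # a third round finishes the job: (1, 3)
```

The step function is identical in both cases. The only difference is who controls the number of rounds: in fixed_depth it is baked in, while in adaptive_depth it depends on the input.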

That is the difference between a system that reasons and a system that recalls. The reasoning step the model learned was general enough to extend itself, on its own, to difficulty it had never encountered, simply by being given the time. It learned how to think, not merely what to answer.

For us, this is the strongest single piece of evidence that we are building something real. It is one thing to be efficient, and another to be reliable, and we are both. But this is the result that says the underlying idea is not a trick or an optimization. It is a model that genuinely reasons, and whose reasoning grows with the effort it is allowed to spend. That property is exactly what you would want at the foundation of a system meant to get smarter over time, and it is measured, repeatable, and ours.

CITE

Arch Research (2026). Reasoning Beyond What It Was Taught. Arch Research.