ARCH RESEARCH
POST 002·METHOD·JUNE 2026

How We Test Whether an AI Is Actually Reasoning

Arch Research
ABSTRACT

When an AI answers a question correctly, two very different things could be happening. It might be reasoning: working through the problem step by step. Or it might be remembering: having seen something close enough during training to recognize the answer. From the outside, these look identical, because the model says the right thing either way. But they are not the same, and the difference decides whether the model can be trusted on problems it has not seen before.

A model that memorizes is a very expensive lookup table. It works beautifully on questions like the ones it studied, and falls apart on anything genuinely new. A model that reasons can handle problems it has never seen, because it learned how to solve, not what the answers were. If you want AI you can trust on real problems, you need to know which one you actually have.

The test that tells them apart

The trick is simple, and we use it on everything we build. You train the model on problems up to a certain difficulty — say, reasoning chains four steps long. Then you test it on problems that are harder than anything it trained on — chains six, eight, ten steps long.

A memorizer cannot fake this. It never saw the harder problems, so it has nothing to look up. If accuracy holds as the chains get longer, the model has learned a procedure it can extend, and that is what we mean by reasoning. If accuracy collapses the moment the chains exceed the training length, it was memorizing all along. We call this held-out generalization, and it is the most honest test we know of whether a model can think.
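The test above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our actual evaluation harness: the toy task (adding digits mod 10), the names Memorizer, StepwiseSolver, and accuracy_by_depth, and the sample sizes are all made up for the example. The two stand-in models are deliberate extremes; a real model would land somewhere between them.

```python
import random
from collections import defaultdict

# Toy task (an assumption for illustration): a problem is a start digit plus
# a chain of digits to add, and the answer is the running sum mod 10.
# The chain length stands in for "number of reasoning steps".

def make_problem(rng, depth):
    start = rng.randrange(10)
    steps = tuple(rng.randrange(10) for _ in range(depth))
    return (start, steps)

def solve(problem):
    """Ground-truth answer, computed one step at a time."""
    start, steps = problem
    total = start
    for s in steps:
        total = (total + s) % 10
    return total

class Memorizer:
    """Stores training answers verbatim; has nothing to look up otherwise."""
    def __init__(self, train):
        self.table = {p: solve(p) for p in train}

    def __call__(self, problem):
        return self.table.get(problem)  # None when the problem is unseen

class StepwiseSolver:
    """Applies the rule step by step, so chain length does not matter."""
    def __call__(self, problem):
        return solve(problem)

def accuracy_by_depth(model, problems):
    """Fraction of correct answers, grouped by chain length."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in problems:
        depth = len(p[1])
        total[depth] += 1
        correct[depth] += int(model(p) == solve(p))
    return {d: correct[d] / total[d] for d in sorted(total)}

rng = random.Random(0)
# Train on chains of 1 to 4 steps; hold out chains of 6, 8, and 10 steps.
train = [make_problem(rng, d) for d in (1, 2, 3, 4) for _ in range(200)]
held_out = [make_problem(rng, d) for d in (6, 8, 10) for _ in range(200)]

memorizer = Memorizer(train)
solver = StepwiseSolver()

print("memorizer, training depths:", accuracy_by_depth(memorizer, train))
print("memorizer, held-out depths:", accuracy_by_depth(memorizer, held_out))
print("solver, held-out depths:   ", accuracy_by_depth(solver, held_out))
```

On this toy setup the memorizer scores 1.0 at every training depth and 0.0 at depths 6, 8, and 10, while the step-wise solver stays at 1.0. The per-depth breakdown is the point: a single averaged score would hide exactly where the collapse happens.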

Train on the easy cases, test on the hard ones. Memorization cannot survive that.

Why this matters more than a high score

It is easy to make a model look smart. Train it on enough examples, test it on similar examples, and it will post an impressive number. Plenty of AI demos work exactly this way — and quietly fall over when faced with something truly new.

We refuse to report numbers that way. Every capability claim we make is measured on problems the model has never seen, harder than the ones it trained on. When we say a model reasons, it means we watched it solve problems it had no way to memorize. When it fails that test, we say so — because a true claim that survives a hard test is worth more than an impressive claim that doesn't.
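The two conditions in that paragraph, never seen and harder than training, can be checked mechanically before any score is reported. The sketch below is one hypothetical way to do it, assuming each problem is a hashable value and that difficulty can be read off with a caller-supplied depth_of function; the function name check_held_out and the example data are illustrative, not part of any existing tool.

```python
def check_held_out(train, test, depth_of):
    """Return reasons the test set is not a valid held-out set.

    An empty list means every test problem is absent from training and
    strictly deeper than the deepest training problem.
    """
    if not train:
        raise ValueError("training set is empty; nothing to compare against")

    issues = []

    # Condition 1: no test problem appears verbatim in the training data.
    seen = set(train)
    leaked = [p for p in test if p in seen]
    if leaked:
        issues.append(f"{len(leaked)} test problem(s) also appear in training")

    # Condition 2: every test problem is harder than anything trained on.
    max_train = max(depth_of(p) for p in train)
    too_easy = [p for p in test if depth_of(p) <= max_train]
    if too_easy:
        issues.append(
            f"{len(too_easy)} test problem(s) are no harder than "
            f"training depth {max_train}"
        )
    return issues

# Hypothetical data: each problem is a tuple of steps, depth is its length.
train = [(1, 2), (3, 4, 5), (2, 2, 2, 2)]
good_test = [(1, 2, 3, 4, 5, 6), (9,) * 8]
bad_test = [(1, 2), (7, 7, 7), (1,) * 6]

print(check_held_out(train, good_test, len))
print(check_held_out(train, bad_test, len))
```

Note that an exact-match check like this only catches verbatim duplicates. Near-duplicates, such as a reworded version of a training problem, would need some form of fuzzy matching, which this sketch does not attempt.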

The honest version

This is also why we are careful about what we don't claim. Held-out testing keeps us honest with ourselves. It is very satisfying to see a model hit a high score; it is much less satisfying to watch that score evaporate when you make the problems harder. But that second number is the real one. It is the difference between an AI that performs and an AI that understands, and we would rather tell you the truth about the harder number than sell you the flattering one.

CITE
Arch Research (2026). How We Test Whether an AI Is Actually Reasoning. Arch Research.