POST 001 · RESULT · JULY 2026

A Model That Designs, and Knows When to Stop

Arch Research

ABSTRACT

We trained a small model to design load-bearing structures from written requirements. Its designs are scored one way only: a physics solver either confirms the structure stands within every stated limit, or fails it. Across eight independent training runs, the model produced passing structures for 88 percent of requirements it had never seen, where random guessing passes 2 percent. It also showed the behavior our company was built around: it spends effort in proportion to how hard the job is, and it decides for itself when the design is done.

The task

Each requirement reads like an order from an engineer: reach this far from a wall, carry this load at the tip, stay under this weight. The model answers with a complete structure: the positions of its joints, which beams connect them, and how thick each beam is. A structural solver then applies the load and checks every beam against the strength of steel, and the structure as a whole against the allowed sag and the weight budget.

This scoring cannot be argued with. There is no benchmark to memorize and no grader to persuade. The structure stands or it does not.
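The all-or-nothing check described above can be sketched in a few lines. This is a minimal illustration, not the solver used for the results: the names SolverResult, Limits, and passes are hypothetical, the solver's output is assumed to reduce to per-beam stresses, a tip deflection, and a total mass, and the numbers in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SolverResult:
    """Hypothetical summary of a structural solver's output for one design."""
    beam_stresses: List[float]  # peak stress in each beam, Pa (sign = tension/compression)
    tip_deflection: float       # sag at the loaded tip, m
    total_mass: float           # mass of the whole structure, kg


@dataclass
class Limits:
    """Hypothetical limits taken from one written requirement."""
    max_stress: float      # allowable stress of the steel, Pa
    max_deflection: float  # allowed sag, m
    max_mass: float        # weight budget, kg


def passes(result: SolverResult, limits: Limits) -> bool:
    """Return True only if every stated limit is met; there is no partial credit."""
    stress_ok = all(abs(s) <= limits.max_stress for s in result.beam_stresses)
    sag_ok = abs(result.tip_deflection) <= limits.max_deflection
    mass_ok = result.total_mass <= limits.max_mass
    return stress_ok and sag_ok and mass_ok


# Illustrative placeholder values, not taken from any logged run.
limits = Limits(max_stress=250e6, max_deflection=0.01, max_mass=40.0)
over_budget = SolverResult(
    beam_stresses=[120e6, -180e6], tip_deflection=0.004, total_mass=40.5
)
print(passes(over_budget, limits))  # False: strong and stiff enough, but over the weight budget
```

A single boolean keeps the score binary: a design that is strong and stiff but half a kilogram too heavy fails exactly as a collapsing one does.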

Results

Measure	Value
Passing designs on unseen requirements, mean of 8 runs	0.882
Spread across the 8 runs	0.019
Weakest single run	0.843
Random designs of the same size, for comparison	0.023

Every run cleared the pre-registered bar. The residual failures are dominated by narrow misses on the weight budget, which the task sets deliberately tight.

Effort that matches the work

The result we consider most important is not the pass rate. When the model designs step by step, the number of thinking rounds it chooses tracks how much design work the requirement actually needs: the correlation between the two falls between 0.985 and 0.991 in each of the eight runs.

Requirement	Thinking rounds
Light load, short reach	1
Long reach	3
Heavy load	8
Longest reach in the set	12

A trivial order gets one thought. The hardest order in the evaluation gets twelve. Nobody told the model which orders were hard; the physics teacher did.
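For readers who want to see what such a correlation measures, here is a minimal sketch. It assumes the Pearson coefficient, which this post does not specify, and the function name pearson is ours. The round counts are the four examples above; the work_needed values are invented placeholders, since the post does not publish its difficulty measure.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Rounds chosen by the model for the four example requirements above.
rounds = [1, 3, 8, 12]
# Placeholder difficulty scores, invented for illustration only.
work_needed = [1.0, 3.5, 7.5, 12.5]
print(round(pearson(rounds, work_needed), 3))
```

A value near 1 means that requirements needing more design work reliably receive more rounds; it says nothing by itself about whether the extra rounds improve the design.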

Knowing when to stop turned out to be part of being right.

The stopping decision is not cosmetic. When we forced the model to keep refining past the point where it had declared the design done, the designs got worse on every one of the eight runs, in the worst cases catastrophically. The same effect appears in our reasoning models, where running to maximum depth collapses accuracy. A model that cannot stop is not merely wasteful; it is less correct.
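The forced-refinement comparison can be pictured with a small harness. This is a sketch under stated assumptions, not the code behind the results: refine, step, and extra_rounds are hypothetical names, and the model is assumed to return a revised design together with a flag saying whether it considers the design done.

```python
from typing import Callable, Tuple, TypeVar

Design = TypeVar("Design")


def refine(
    initial: Design,
    step: Callable[[Design], Tuple[Design, bool]],
    max_rounds: int,
    extra_rounds: int = 0,
) -> Tuple[Design, int]:
    """Refine a design until the model declares it done or max_rounds is hit.

    `step` is a hypothetical stand-in for one thinking round: it returns the
    revised design and a flag that is True when the model considers it done.
    With extra_rounds > 0, the done flag is then ignored and the model is
    pushed through that many further rounds, as in a forced-refinement test.
    """
    design = initial
    rounds = 0
    done = False
    while not done and rounds < max_rounds:
        design, done = step(design)
        rounds += 1
    for _ in range(extra_rounds):
        design, _ignored = step(design)  # the stop signal is deliberately discarded
        rounds += 1
    return design, rounds
```

Running the same requirement once with extra_rounds=0 and once with a positive value, then scoring both final designs with the same physics check, gives the paired comparison the paragraph above describes.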

The same method holds for electronics

We applied the same discipline to a second domain: simple filter circuits designed to a stated frequency response and verified in simulation. Across eight runs the model met 96.2 percent of unseen specifications, with a spread of 0.2 percentage points, against a random floor of 2.6 percent.

What this does not show

We state the limits as plainly as the results. The structures belong to one family of truss designs, and the model chooses dimensions and thicknesses within that family; it does not invent new layouts. Everything runs at small scale on a single consumer graphics card, and verification is simulation, not fabrication. Most importantly, when a requirement demands loads well beyond anything in training, pass rates fall sharply. We measured that failure carefully, ruled out the easy explanations one by one, and consider it the central open problem of this line of work. Progress on it will be reported the same way as everything above, as a number against a fair baseline.

How to check us

Every figure in this writeup comes from a logged run with fixed seeds. Under a confidentiality agreement, a technical reviewer can rerun the training, the evaluation, and the physics checks on their own hardware and reproduce each number.

CITE

Arch Research (2026). A Model That Designs, and Knows When to Stop. Arch Research.