We trained a small model to design load-bearing structures from written requirements. Its designs are scored one way only: a physics solver either confirms the structure stands within every stated limit, or fails it. Across eight independent training runs, the model produced passing structures for 88 percent of requirements it had never seen, against about 2 percent for random designs of the same size. It also showed the behavior our company was built around: it spends effort in proportion to how hard the job is, and it decides for itself when the design is done.
Each requirement reads like an order from an engineer: reach this far from a wall, carry this load at the tip, stay under this weight. The model answers with a complete structure: the positions of its joints, which beams connect them, and how thick each beam is. A structural solver then applies the load and checks every beam against the strength of steel, the allowed sag, and the weight budget.
This scoring cannot be argued with. There is no benchmark to memorize and no grader to persuade. The structure stands or it does not.
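The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the solver used in these runs. It assumes the solver has already produced the axial force in each beam and the sag at the tip. The names `Beam`, `Limits`, and `verdict` are hypothetical, and the steel constants are typical textbook values rather than the ones used in training. Checks a real solver might add, such as buckling, are left out.

```python
from dataclasses import dataclass
from typing import List

# Assumed material constants (typical mild structural steel), for illustration only.
STEEL_YIELD_PA = 250e6          # yield strength in pascals
STEEL_DENSITY_KG_M3 = 7850.0    # density in kg per cubic metre


@dataclass
class Beam:
    length_m: float
    area_m2: float          # cross-sectional area, set by the chosen thickness
    axial_force_n: float    # taken from the solver; the sign marks tension or compression


@dataclass
class Limits:
    max_sag_m: float        # allowed deflection at the loaded tip
    max_mass_kg: float      # weight budget


def verdict(beams: List[Beam], tip_sag_m: float, limits: Limits) -> bool:
    """Return True only if every check passes; there is no partial credit."""
    stress_ok = all(
        abs(b.axial_force_n) / b.area_m2 <= STEEL_YIELD_PA for b in beams
    )
    sag_ok = abs(tip_sag_m) <= limits.max_sag_m
    mass_kg = sum(b.length_m * b.area_m2 * STEEL_DENSITY_KG_M3 for b in beams)
    mass_ok = mass_kg <= limits.max_mass_kg
    return stress_ok and sag_ok and mass_ok
```

A design that passes on stress and sag but exceeds the weight budget by any margin still scores as a failure, which is why a tight budget produces narrow misses.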
| Measure | Value |
|---|---|
| Passing designs on unseen requirements, mean of 8 runs | 0.882 |
| Spread across the 8 runs | 0.019 |
| Weakest single run | 0.843 |
| Random designs of the same size, for comparison | 0.023 |
Every run cleared the pre-registered bar. The residual failures are dominated by narrow misses on the weight budget, which the task sets deliberately tight.
The result we consider most important is not the pass rate. When the model designs step by step, the number of thinking rounds it chooses tracks how much design work the requirement actually needs, with a correlation between 0.985 and 0.991 in each of the eight runs.
A trivial order gets one thought. The hardest order in the evaluation gets twelve. Nobody told the model which orders were hard; it learned that from the physics solver's verdicts alone.
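For readers who want to compute the same kind of statistic on their own logs, here is a minimal sketch. It assumes a Pearson correlation, although the writeup does not say which correlation measure was used or how the required design work is quantified. The function name `pearson` and the sample lists are hypothetical placeholders, not data from these runs.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only: one entry per requirement.
work_needed = [1, 2, 4, 7, 12]      # hypothetical measure of required design work
rounds_chosen = [1, 2, 5, 7, 12]    # hypothetical thinking rounds the model took
r = pearson(work_needed, rounds_chosen)
```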
Knowing when to stop turned out to be part of being right.
The stopping decision is not cosmetic. When we forced the model to keep refining past the point where it had declared the design done, the designs got worse on every one of the eight runs, in the worst cases catastrophically. The same effect appears in our reasoning models, where running to maximum depth collapses accuracy. A model that cannot stop is not merely wasteful; it is less correct.
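The difference between free stopping and forced refinement can be pictured as a single loop with one switch. The sketch below is an assumption about the shape of such a loop, not the training or evaluation code. `run_design`, `step`, `forced_rounds`, and the cap of 12 rounds are hypothetical names and values, and `toy_step` is a stand-in that only counts rounds.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def run_design(
    step: Callable[[S], Tuple[S, bool]],
    state: S,
    max_rounds: int = 12,     # hypothetical cap on thinking rounds
    forced_rounds: int = 0,   # 0 means the model's own "done" is honored
) -> Tuple[S, int]:
    """Run refinement rounds until the model declares it is done.

    If forced_rounds > 0, the "done" signal is ignored until at least that
    many rounds have run, which mimics forcing refinement past the stop.
    """
    rounds = 0
    while rounds < max_rounds:
        state, done = step(state)
        rounds += 1
        if done and rounds >= forced_rounds:
            break
    return state, rounds


def toy_step(n: int) -> Tuple[int, bool]:
    """Stand-in for one refinement round: declares itself done at round 3."""
    n += 1
    return n, n >= 3


free_state, free_rounds = run_design(toy_step, 0)
forced_state, forced_count = run_design(toy_step, 0, forced_rounds=5)
```

With the toy step, the free run stops after three rounds and the forced run continues to five. The toy says nothing about quality; it only shows where the stop decision sits in the loop.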
We ran the identical discipline on a second domain: simple filter circuits designed to a stated frequency response and verified in simulation. Across eight runs the model met 96.2 percent of unseen specifications, with a spread of 0.2 percentage points, against a 2.6 percent random floor.
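To make "verified against a stated frequency response" concrete, here is a minimal sketch for one assumed circuit type, a first-order RC low-pass filter. The writeup does not say which filter topologies or tolerances were used, so the circuit choice, the tolerance format, and the names `gain_db` and `meets_spec` are illustrative assumptions. The gain formula is the standard first-order magnitude response, |H(f)| = 1 / sqrt(1 + (2πfRC)²).

```python
import math
from typing import Iterable, Tuple


def gain_db(freq_hz: float, r_ohm: float, c_farad: float) -> float:
    """Magnitude response of a first-order RC low-pass filter, in decibels."""
    ratio = 2.0 * math.pi * freq_hz * r_ohm * c_farad   # equals f / f_cutoff
    return -10.0 * math.log10(1.0 + ratio * ratio)


def meets_spec(
    r_ohm: float,
    c_farad: float,
    spec: Iterable[Tuple[float, float, float]],
) -> bool:
    """Binary check: every (frequency, target dB, tolerance dB) point must hold."""
    return all(
        abs(gain_db(f, r_ohm, c_farad) - target_db) <= tol_db
        for f, target_db, tol_db in spec
    )


# Hypothetical component values and specification points.
R, C = 1_000.0, 1e-6
cutoff_hz = 1.0 / (2.0 * math.pi * R * C)
spec_points = [(10.0, 0.0, 0.5), (cutoff_hz, -3.01, 0.5)]
ok = meets_spec(R, C, spec_points)
```

As with the structures, the check is all-or-nothing: a single specification point outside its tolerance fails the design.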
We state the limits as plainly as the results. The structures belong to one family of truss designs, and the model chooses dimensions and thicknesses within that family; it does not invent new layouts. Everything runs at small scale on a single consumer graphics card, and verification is simulation, not fabrication. Most importantly, when a requirement demands loads well beyond anything in training, pass rates fall sharply. We measured that failure carefully, ruled out the easy explanations one by one, and consider it the central open problem of this line of work. Progress on it will be reported the same way as everything above, as a number against a fair baseline.
Every figure in this writeup comes from a logged run with fixed seeds. Under a confidentiality agreement, a technical reviewer can rerun the training, the evaluation, and the physics checks on their own hardware and reproduce each number.