The cost of building frontier AI is rising faster than almost any technology in history. The compute cost of the largest training runs is doubling roughly every eight months — about 2.4x per year. The leading labs already run training jobs that cost on the order of a billion dollars, and industry leaders openly project ten- to hundred-billion-dollar models within a few years, with total sector spending approaching a trillion dollars by 2027.
That trajectory is decided by capital, and it is not a race we can win by joining it. Arch is built on a different premise: efficiency per parameter. Rather than buying capability with ever-larger models and ever-larger budgets, we extract more capability from each parameter and each unit of compute. This document presents what we have measured — on a single consumer GPU, for under $50 — and what a grant would let us prove at scale.
We do not claim to rival a frontier model. We claim something more fundable: a measured, reproducible efficiency advantage, with every result produced by a named script and re-runnable by a reviewer. Negative results are retained, not discarded.
The frontier developers scale by adding parameters, data, and dollars. We pursue a different lever — getting more out of each parameter that already exists.
Our architecture, RDA (Recurrent Depth-Adaptive), lets a model reason in iterative rounds and decide for itself when to stop, allocating more computation to difficult inputs and less to easy ones. The result is higher capability and lower runtime cost per parameter.
| Approach | The lever | The result |
|---|---|---|
| The contenders | More parameters + more data + more dollars | Capability |
| Arch | A better mechanism per parameter | Capability per dollar |
Every result below was produced on a consumer RTX 4060 for between $0 and $5, and is reproducible from the underlying scripts.
| Claim | Measured result | Method |
|---|---|---|
| Adapts computation to difficulty | Spearman correlation of 0.99 | Problem difficulty vs. rounds used |
| Cheaper than fixed-depth, equal accuracy | 42% average compute saved (up to 75% on easy inputs), at full accuracy | Adaptive halting, holds 5M–20M params |
| Outperforms the standard published method | 1.000 ± 0.000 vs. the baseline's 0.977 ± 0.058 | 8 seeds each, identical training |
| Reasons beyond its training distribution | Solves chains 2.5x longer than trained, by thinking longer | 3 seeds |
| Capability rises with scale | Held-out reasoning 0.33 to 0.43 (2.2M to 14.6M params) | 3 seeds each, non-overlapping error bars |
| Generalizes across task types | One model handles arithmetic and transitive logic, both generalizing | Single model, balanced data |
| Deterministic and reliable | 8 of 8 seeds converge; byte-identical reruns | Determinism stress test |
RDA spends computation in proportion to difficulty. Measured against a fixed-depth model that always runs at full depth, on identical problems at full accuracy:
On a realistic mixed workload, where most queries are easy, RDA uses roughly half the computation of a fixed-depth model for the same output. In commercial terms, that is a direct reduction in cost per query — and because it compounds at inference for the life of every deployment, it is a margin advantage that scaling parameters alone cannot produce. This matters precisely because the industry's own leaders now warn that inference, not training, is where total spending heads toward a trillion dollars.
This is the central result for an investor. We trained the same recipe at increasing model sizes and measured accuracy on reasoning chains longer than any seen during training — a test of genuine generalization, not memorization.
A 6.6x increase in parameters produced a measured gain of +0.10 in held-out reasoning accuracy, with non-overlapping error bars across three seeds — a real effect, not noise. The model's reasoning also became better-calibrated at the larger size. This is the concrete evidence that capital converts to capability: more parameters yield measurably more reasoning, on a task with genuine headroom.
We state the limit plainly: this is a two-point trend, measured locally on a structured reasoning task, and real-world scaling exhibits diminishing returns. We do not extrapolate it to a straight line. Extending this curve with real data at full scale is precisely the experiment a grant funds — and the point at which our per-parameter efficiency becomes the decisive advantage.
Even the small open models released by major labs are products of enormous data budgets — compact models in the one-to-two-billion-parameter range are routinely trained on eleven to eighteen trillion tokens. A new entrant cannot replicate that data and compute scale, and should not try. The opening is not to out-spend the incumbents on scale, but to out-engineer them on efficiency: a mechanism that delivers more capability per parameter is worth more, per dollar, at every size — and that advantage only grows as compute becomes the binding constraint on the entire industry.
We allocate capital the way the thesis demands: for maximum result per dollar.
| Allocation | Share | Purpose | Rationale |
|---|---|---|---|
| Training data | 35% | Quality tokens at scale | The proven bottleneck is data, not architecture |
| Compute | 35% | GPU hours; scale RDA from 100M to 1–3B across many runs | Each run extends the scaling curve above |
| Senior research hire | 20% | Approximately one researcher-year | Compounds output beyond any single GPU |
| Runway and operations | 10% | Entity, legal, 12–18 month buffer | Removes pressure to raise on poor terms |
Funds are spent across many small, measured runs rather than a single large gamble — every result de-risks the next dollar.
- The hard problems are already solved cheaply. Efficiency, generalization, the scaling lift, and reliability were established for near-zero cost on consumer hardware. That capital discipline is how we would steward grant funds. - Every claim is a number against a fair baseline. No hype, no claims of general intelligence, negative results logged. Any reviewer can reproduce any result from the underlying script. - The next milestone is legible and cheap. The decisive scale test costs single-digit dollars, not millions. A grant funds a clearly defined experiment with a measurable pass/fail criterion, not an open-ended bet.
We state our open questions directly, because credibility depends on it.
| Open question | Why it matters | Cost to answer |
|---|---|---|
| Does the efficiency edge hold at 100M+ parameters on real language? | The thesis at scale | A funded run on real data |
| Does a small RDA match a model several times its size? | The core intelligence-per-parameter claim | Scale run plus a fair larger baseline |
| Does the approach extend to a general assistant? | The product direction | A larger run on real data |
The mechanism is proven. Funding proves it at scale. That is the bet — and it is a measured one.
All Arch figures are reproducible from the project's research scripts and findings. Total spent to date: under $50, on a consumer GPU.