The Physical Stack of AI · The inference economy
You can pick up any 2026 LLM pricing page and tell — in 30 seconds — what an answer will cost, where the hidden discounts live, and how that vendor positions against the rest of the market.
A pricing page looks simple: a model name, a dollar number for input, a dollar number for output, both per million tokens. Behind those four numbers sit four design choices that determine whether your product is viable.
The first is the ratio. Output costs roughly 3–8× input across every major lab in 2026: Claude Opus 4.7 at \$5/\$25, Sonnet 4.6 at \$3/\$15, Haiku 4.5 at \$1/\$5; GPT-5.5 at \$5/\$30; GPT-5.2 Pro at an outlier \$21/\$168. Long answers, long chains-of-thought, agent loops with many tool calls — these are the bills that get scary.
The second is the tier shape. Every major lab now ships a Haiku/Sonnet/Opus-style ladder, with the cheapest tier 15–30× cheaper than the flagship. Routing easy traffic to the small model is the single biggest cost lever most products have.
The third is the discount stack. Anthropic's pricing docs list three: prompt caching (up to 90% off repeated context), batch APIs (50% off both sides for tolerant workloads), and context-window discounts on long inputs. Stack them and the published \$25/Mtok output can land below \$10 in production.
This chapter walks all of that, in four short lessons.
Type: multi-choice
Chapter contains 4 lessons.