The Physical Stack of AI · The inference economy

Neoclouds vs hyperscalers

After this chapter, you can explain what a "neocloud" is, name the major players, and tell when raw-GPU pricing beats hyperscaler bundling and when it doesn't.

There are now, broadly, two places to rent an Nvidia GPU. The first is a hyperscaler (AWS, Azure, GCP, Oracle), where the GPU sits inside a portfolio of managed services: identity, networking, databases, model APIs, procurement, and SLAs that lawyers will sign. The second is a neocloud (CoreWeave, Lambda, Runpod, Together AI, Modal, Crusoe, Nebius, Spheron, Vast.ai), whose pitch is the opposite: GPU access first, with less of the traditional cloud bundle attached.

The price comparison is not one stable number, because each provider packages GPU time differently. Lambda prices AI-cloud systems by GPU family, count, and commitment window. Runpod splits its offering into Pods, Serverless, and Clusters. CoreWeave lists on-demand, spot, and inference-oriented pricing by accelerator. Modal prices serverless compute for spiky workloads. Together AI mixes token-priced inference, dedicated endpoints, fine-tuning, and GPU clusters. AWS Capacity Blocks price reserved accelerator windows differently again. A serious comparison therefore holds GPU class, GPU count, region, reservation length, and interconnect constant, and accounts for storage, egress, support, and utilization.
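One way to make that normalization concrete is to reduce every quote to a cost per useful GPU-hour. The sketch below is a minimal illustration under stated assumptions, not any provider's billing model: the `Quote` fields, the function name, and every number in it are hypothetical, and it assumes both quotes already describe the same GPU class, count, region, and interconnect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """One provider quote for the same GPU class, count, and region (hypothetical fields)."""
    name: str
    gpu_hourly_usd: float      # list price per GPU per hour
    gpu_count: int             # GPUs reserved or rented
    hours: float               # billed hours in the comparison window
    utilization: float         # fraction of billed GPU-hours doing useful work, in (0, 1]
    storage_usd: float = 0.0   # storage cost over the window
    egress_usd: float = 0.0    # data-transfer-out cost over the window
    support_usd: float = 0.0   # support plan cost over the window


def cost_per_useful_gpu_hour(q: Quote) -> float:
    """Total cost of the window divided by the GPU-hours that did useful work."""
    if not 0 < q.utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    billed_gpu_hours = q.gpu_count * q.hours
    total_usd = (
        q.gpu_hourly_usd * billed_gpu_hours
        + q.storage_usd
        + q.egress_usd
        + q.support_usd
    )
    return total_usd / (billed_gpu_hours * q.utilization)


# Made-up numbers for illustration only; these are not real prices.
raw_gpu = Quote("raw-gpu", gpu_hourly_usd=2.00, gpu_count=8, hours=100,
                utilization=0.5, storage_usd=100.0, egress_usd=100.0)
bundled = Quote("bundled", gpu_hourly_usd=3.00, gpu_count=8, hours=100,
                utilization=0.8, support_usd=120.0)

for quote in (raw_gpu, bundled):
    print(f"{quote.name}: ${cost_per_useful_gpu_hour(quote):.2f} per useful GPU-hour")
```

With these invented inputs the quote with the lower list price ends up more expensive per useful GPU-hour, because half of its billed time sits idle and it pays separately for storage and egress. Swap in real quotes and measured utilization before drawing any conclusion about a specific provider.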

The trade is real on both sides. Neoclouds can win when the workload is mostly raw accelerator time, especially for open-model training, bursty inference, and experiments that do not need a large enterprise cloud footprint. Hyperscalers bring everything else — IAM, audit logs, regulated-industry compliance, proprietary accelerator stacks, and managed model endpoints like Bedrock, Azure AI Foundry, and Google's agent platform.

This chapter contains three lessons.