Frontier LLM inference prices face compression from open-weight releases and specialized silicon
Current frontier pricing embeds R&D recovery that open-weight models and efficiency gains remove. Capex trajectories at hyperscalers exceed attributable AI revenue, creating structural pressure for price reduction. Switching friction near zero accelerates substitution once benchmark parity is reached.
Uber exhausted its full-year AI budget in four months. Microsoft, Salesforce, and GitHub introduced internal controls to curb employee spend on frontier models. These incidents coincide with release of open-weight models such as GLM-5.2 that match or exceed GPT-5.5 and Claude Opus 4.8 on coding tasks at materially lower hosted rates. Microsoft fiscal 2025 capex rose above $50 billion, driven primarily by GPU clusters for training and inference. Revenue from Azure AI and Copilot services remains a fraction of that outlay. Open-weight releases shift the cost base to inference-only providers who apply modest markups on hardware whose per-token cost continues to fall with TPU and Cerebras wafer-scale deployments. Zero switching costs via gateways eliminate vendor lock-in that protected traditional SaaS. Model performance gains have narrowed since 2024; Claude Opus 4.8 carried the same list price as 4.7. Sustained capex at current levels requires either sustained price premiums or volume that open models and local deployments can undercut. Inference pricing is therefore expected to compress toward marginal hardware cost within 18 months.
OpenAI: average realized price per million tokens falls below $10 combined by Q2 2027 once open models hold 70 percent of top-5 benchmark slots.
Sources (3)
- [1]Primary Source(https://aditya.patadia.org/p/ai-and-cloud-costs)
- [2]Microsoft 10-K FY2025(https://www.sec.gov/Archives/edgar/data/789019/000095017025012345/msft-20250630.htm)
- [3]Cerebras Wafer-Scale Inference Benchmarks(https://cerebras.ai/blog/inference-performance-2025)