Model selection is a game of trade-offs, anchored by the Intelligence–Cost–Latency triangle: you cannot maximize all three.
Define intelligence, latency (average vs. tail, time to first token (TTFT) vs. full completion), and cost (tokens, outliers, people time) for your use case before debating vendors.
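A minimal sketch of such a latency profile, assuming you already log per-request time to first token and total completion time (the function and field names are illustrative, not from the chapter):

```python
import statistics

def latency_profile(ttft_ms: list[float], total_ms: list[float]) -> dict[str, float]:
    """Report averages and tails side by side: averages hide the slow
    requests users actually notice.

    ttft_ms:  time to first token per request (what interactive users feel)
    total_ms: full completion time per request (what batch pipelines feel)
    """
    def p95(xs: list[float]) -> float:
        # quantiles(n=20) yields 19 cut points at 5% steps; the last one
        # approximates the 95th percentile
        return statistics.quantiles(xs, n=20)[-1]

    return {
        "ttft_avg": statistics.mean(ttft_ms),
        "ttft_p95": p95(ttft_ms),
        "total_avg": statistics.mean(total_ms),
        "total_p95": p95(total_ms),
    }
```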
Plot candidates on a model performance frontier (Pareto-efficient set) and choose among non-dominated options with product leadership.
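A sketch of that frontier computation, assuming each candidate has already been scored on the three axes (the Candidate fields and the sample numbers are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    intelligence: float  # eval or benchmark score, higher is better
    cost: float          # $ per 1M tokens, lower is better
    latency: float       # p95 ms, lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if a is at least as good on every axis
    and strictly better on at least one."""
    at_least_as_good = (a.intelligence >= b.intelligence
                        and a.cost <= b.cost
                        and a.latency <= b.latency)
    strictly_better = (a.intelligence > b.intelligence
                       or a.cost < b.cost
                       or a.latency < b.latency)
    return at_least_as_good and strictly_better

def pareto_frontier(cands: list[Candidate]) -> list[Candidate]:
    """Keep only the non-dominated options; these are the ones worth
    debating with product leadership."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]

models = [
    Candidate("big",   0.92, 15.0, 2400),
    Candidate("mid",   0.85,  3.0,  900),
    Candidate("small", 0.70,  0.4,  300),
    Candidate("stale", 0.68,  0.6,  350),  # dominated by "small" on all axes
]
print([c.name for c in pareto_frontier(models)])  # -> ['big', 'mid', 'small']
```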
Architecture and UX can “cheat” the triangle with routing, cascades, parallel specialists, and escalation to larger models when needed.
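As a sketch of the cascade pattern, assuming a provider client that returns an answer plus some confidence signal (a self-reported score, a verifier model, or a logprob heuristic; the ModelFn callable is hypothetical):

```python
from typing import Callable

# (model name, prompt) -> (answer, confidence in [0, 1])
ModelFn = Callable[[str, str], tuple[str, float]]

def cascade(prompt: str, call: ModelFn,
            tiers: tuple[str, ...] = ("small", "large"),
            threshold: float = 0.8) -> str:
    """Try cheap tiers first; escalate only when confidence is low."""
    for model in tiers[:-1]:
        answer, confidence = call(model, prompt)
        if confidence >= threshold:
            return answer            # cheap tier sufficed: lower cost and latency
    answer, _ = call(tiers[-1], prompt)  # final tier: pay for intelligence
    return answer
```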
Provider strategy should be deliberate: ad hoc per-project provider sprawl does not scale.
Self-host only when ROI clears the overhead; advanced stacks are technical debt you accept for specific gains.
Most value still comes from context engineering, modest tuning, caching, and simple routing. Review custom stacks quarterly as baselines move.
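A sketch of that simple end of the spectrum: an exact-match response cache in front of a length-based router, assuming a call(model, prompt) client (all names here are illustrative; semantic caching and provider-side prompt caching are more sophisticated variants):

```python
import hashlib
import json
from typing import Callable

ModelFn = Callable[[str, str], str]  # (model name, prompt) -> completion

_cache: dict[str, str] = {}

def route(prompt: str) -> str:
    """Crude heuristic router: long prompts go to the larger model."""
    return "large" if len(prompt) > 2000 else "small"

def complete(prompt: str, call: ModelFn) -> str:
    """Exact-match cache keyed on (model, prompt): trades possible
    staleness for zero-cost, zero-latency repeats."""
    model = route(prompt)
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call(model, prompt)
    return _cache[key]
```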
Principles from the chapter
The Intelligence–Cost–Latency Triangle: You can buy raw intelligence, you can buy speed, and you can buy at bargain-bin prices… but you can’t buy all three at once.
Depending on the specifics of the use case, your teams can often cheat the Intelligence–Cost–Latency trade-off through architectural patterns and user-experience design.
Allowing teams to pick AI providers ad hoc for each project does not scale; it leaves projects brittle, capped in performance, and costly to maintain.
Self-hosting can become a trap when teams pursue it for engineering prestige rather than technical necessity. Verify the return on investment justifies the overhead.
Complexity compounds obligation. Advanced configurations should be treated as technical debt you willingly accept in exchange for critical performance gains.
Read the chapter for…
Sample frontier tables, modality and compliance checklists, provider landscape comparisons, compression and routing patterns, and the multi-model travel-planner example.