Short answer
The cheapest useful LLM system is not the one with the lowest per-token price. It is the one that routes the right task to the right model, caches predictable work, measures quality, and stops expensive requests before they sprawl.
The four controls that matter most
1. Task-based routing
Do not send every request to the biggest model. Separate tasks into:
- simple classification
- summarization
- search or retrieval
- tool use
- long-form writing
Then route each task to the lightest model that still meets the quality bar.
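The routing idea above can be sketched as a lookup table. This is a minimal illustration; the task labels and model names ("small-model", "mid-model", "large-model") are placeholders, not real model identifiers.

```python
# Map each task type to the lightest model that still meets the quality bar.
# Model names here are illustrative placeholders.
ROUTES = {
    "classification": "small-model",
    "summarization": "small-model",
    "retrieval": "small-model",
    "tool_use": "mid-model",
    "long_form": "large-model",
}

def route(task_type: str) -> str:
    """Return the model for a task; unknown tasks fall back to the
    largest model so quality degrades safely, not silently."""
    return ROUTES.get(task_type, "large-model")
```

The safe-by-default fallback matters: a misclassified task should cost more, not produce worse output.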
2. Caching
Good systems usually combine:
- response caching for repeated prompts
- semantic caching for near-duplicate intent
- retrieval caching where the source corpus changes slowly
Caching is often the fastest cost win available.
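A minimal sketch of the first layer, exact-match response caching with a TTL. Semantic caching would replace the hash key with an embedding-similarity lookup; this example deliberately stays at the simpler exact-match level.

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache keyed on a hash of the prompt.
    Entries expire after ttl_seconds so stale answers age out."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (timestamp, response)

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        entry = self.store.get(self._key(prompt))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, prompt: str, response: str):
        self.store[self._key(prompt)] = (time.time(), response)
```

Even this crude layer pays off when the same prompts recur, which is common for classification and templated summarization traffic.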
3. Prompt discipline
Verbose prompts, unnecessarily large context windows, and weak tool boundaries drive bills up quickly. Prompt design should aim for:
- tight instructions
- bounded context
- explicit output formats
- minimal token waste
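"Bounded context" can be enforced mechanically. The sketch below trims retrieved chunks to a rough token budget; the characters-divided-by-four estimate is a crude heuristic standing in for a real tokenizer, and the budget value is illustrative.

```python
def bound_context(chunks, max_tokens=1500, est_tokens=lambda s: len(s) // 4):
    """Keep the highest-priority chunks that fit a rough token budget.

    chunks is assumed to be pre-sorted by relevance, best first.
    est_tokens (chars / 4) is a placeholder, not a real tokenizer.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = est_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop before the budget is exceeded
        kept.append(chunk)
        used += cost
    return kept
```

The point is that the cap is applied before the request is sent, so no single retrieval result can silently double the cost of a call.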
4. Observability and budgets
You need visibility at the request level:
- cost per feature
- cost per user action
- latency per model
- failure rate
- prompt version used
Without this, finance sees a bill but product teams cannot explain which feature created it.
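Per-feature cost tracking plus budget enforcement can be as small as the sketch below. Feature names and budget figures are hypothetical; a production version would persist the ledger and add the latency, failure-rate, and prompt-version fields listed above.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class CostLedger:
    """Track spend per feature and refuse requests once a feature's
    budget is exhausted. Budgets are in dollars and illustrative."""
    budgets: dict
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, feature: str, cost_usd: float) -> None:
        self.spend[feature] += cost_usd

    def allow(self, feature: str) -> bool:
        # Features with no configured budget are denied by default.
        return self.spend[feature] < self.budgets.get(feature, 0.0)
```

Because every request is attributed to a feature at record time, the "finance sees a bill nobody can explain" problem disappears: the ledger is the explanation.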
What buyers should ask vendors
- Which requests hit which model and why?
- What is cached and what is not?
- How are budgets enforced?
- What does a successful request cost today?
- How is quality checked when cheaper routing is introduced?