Model Routing: Using the Cheapest Model That Actually Solves the Task
Autrace Engineering

Not every LLM call needs GPT-4o. If you're doing keyword classification, entity extraction, or simple Q&A, you're paying for a model that's 10-40x more expensive than necessary on every call.
The cost case
Approximate input/output costs as of early 2026:
- GPT-4o: ~$2.50 / $10.00 per million tokens
- Claude 3.5 Haiku: ~$0.80 / $4.00 per million tokens
- Gemini 1.5 Flash: ~$0.075 / $0.30 per million tokens
- Llama 3.1 8B (self-hosted): ~$0.05 / $0.05 per million tokens
For workloads where 70% of requests are simple classification or extraction, routing those to Flash or Haiku while keeping complex reasoning on GPT-4o can cut LLM spend by 60-80% with no quality degradation on complex tasks.
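A quick sanity check on that range: the arithmetic below blends the list prices above under an assumed 70/30 traffic split. The request volume and per-request token counts are illustrative assumptions, not measurements from a real workload.

```python
# Back-of-envelope savings check. Traffic mix and token counts are
# illustrative assumptions; prices are the early-2026 list prices above.
PRICES = {  # (input, output) in USD per million tokens
    "gpt-4o": (2.50, 10.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assume 1M requests/month at ~1,000 input and ~200 output tokens each.
N = 1_000_000
all_gpt4o = N * request_cost("gpt-4o", 1_000, 200)
routed = (0.7 * N * request_cost("gemini-1.5-flash", 1_000, 200)
          + 0.3 * N * request_cost("gpt-4o", 1_000, 200))

print(f"all GPT-4o:   ${all_gpt4o:,.0f}/month")
print(f"with routing: ${routed:,.0f}/month")
print(f"savings:      {1 - routed / all_gpt4o:.0%}")  # falls inside the 60-80% range
```

Even with 30% of traffic still on GPT-4o, the blended rate drops by roughly two thirds, because the expensive model dominates the bill.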
Routing rules in Autrace
```yaml
# autrace-rules.yaml
routing:
  - id: route-classification
    match:
      metadata.task_type: ["classify", "extract", "summarize"]
      estimated_tokens: { max: 2000 }
    route_to:
      model: "gemini/gemini-1.5-flash"
  - id: route-complex-reasoning
    match:
      metadata.task_type: ["reason", "code", "analyze"]
    route_to:
      model: "openai/gpt-4o"
  - id: route-default
    match: "*"
    route_to:
      model: "anthropic/claude-3-haiku-20240307"
```

Fallback routing
```yaml
routing:
  - id: ha-route
    match: "*"
    route_to:
      primary: "openai/gpt-4o"
      fallback:
        - "anthropic/claude-3-5-sonnet-20241022"
        - "google/gemini-1.5-pro"
      on_error: ["rate_limit", "timeout", "server_error"]
```

Cost attribution
Every proxied request is logged with the actual model used, input token count, output token count, and estimated cost. Export to your data warehouse via the audit log export API to attribute LLM spend per team, feature, or user cohort.
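Once the export lands in your warehouse, attribution is a group-by over per-request records. A minimal sketch, assuming a simplified record shape (the `team`, `model`, and `cost_usd` fields here are illustrative; map them to the actual export schema):

```python
# Sketch: aggregate estimated LLM spend by an attribution key.
# The record fields are assumptions standing in for the real export schema.
from collections import defaultdict

records = [
    {"team": "search",  "model": "gemini/gemini-1.5-flash", "cost_usd": 0.00014},
    {"team": "search",  "model": "openai/gpt-4o",           "cost_usd": 0.00450},
    {"team": "support", "model": "gemini/gemini-1.5-flash", "cost_usd": 0.00013},
]

def spend_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum estimated cost grouped by any attribution key (team, model, ...)."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)

print(spend_by(records, "team"))   # spend per team
print(spend_by(records, "model"))  # spend per model
```

In practice this would be a SQL `GROUP BY` in the warehouse rather than application code, but the aggregation is the same: one row per proxied request, summed over whichever dimension you want to bill against.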