Model Routing: Using the Cheapest Model That Actually Solves the Task
Autrace Engineering

Not every LLM call needs GPT-4o. If you're doing keyword classification, entity extraction, or simple Q&A, you're paying for a model that's 10-40x more expensive than necessary on every call.
The cost case
Approximate input/output costs as of early 2026:
- GPT-4o: ~$2.50 / $10.00 per million tokens
- Claude 3.5 Haiku: ~$0.80 / $4.00 per million tokens
- Gemini 1.5 Flash: ~$0.075 / $0.30 per million tokens
- Llama 3.1 8B (self-hosted): ~$0.05 / $0.05 per million tokens
For workloads where 70% of requests are simple classification or extraction, routing those to Flash or Haiku while keeping complex reasoning on GPT-4o can cut LLM spend by 60-80% with no quality degradation on complex tasks.
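A quick sanity check on that range: the arithmetic below blends the list prices above under an assumed 70/30 traffic split. The request volume and per-request token counts are illustrative assumptions, not measurements from a real workload.

```python
# Back-of-envelope savings check. Traffic mix and token counts are
# illustrative assumptions; prices are the early-2026 list prices above.
PRICES = {  # (input, output) in USD per million tokens
    "gpt-4o": (2.50, 10.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assume 1M requests/month at ~1,000 input and ~200 output tokens each.
N = 1_000_000
all_gpt4o = N * request_cost("gpt-4o", 1_000, 200)
routed = (0.7 * N * request_cost("gemini-1.5-flash", 1_000, 200)
          + 0.3 * N * request_cost("gpt-4o", 1_000, 200))

print(f"all GPT-4o:   ${all_gpt4o:,.0f}/month")
print(f"with routing: ${routed:,.0f}/month")
print(f"savings:      {1 - routed / all_gpt4o:.0%}")  # falls inside the 60-80% range
```

Even with 30% of traffic still on GPT-4o, the blended rate drops by roughly two thirds, because the expensive model dominates the bill.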
Routing rules in Autrace
```yaml
# autrace-rules.yaml
routing:
  - id: route-classification
    match:
      metadata.task_type: ["classify", "extract", "summarize"]
      estimated_tokens: { max: 2000 }
    route_to:
      model: "gemini/gemini-1.5-flash"
  - id: route-complex-reasoning
    match:
      metadata.task_type: ["reason", "code", "analyze"]
    route_to:
      model: "openai/gpt-4o"
  - id: route-default
    match: "*"
    route_to:
      model: "anthropic/claude-3-haiku-20240307"
```

Fallback routing
```yaml
routing:
  - id: ha-route
    match: "*"
    route_to:
      primary: "openai/gpt-4o"
      fallback:
        - "anthropic/claude-3-5-sonnet-20241022"
        - "google/gemini-1.5-pro"
      on_error: ["rate_limit", "timeout", "server_error"]
```

Cost attribution
Every proxied request is logged with the actual model used, input token count, output token count, and estimated cost. Export to your data warehouse via the audit log export API to attribute LLM spend per team, feature, or user cohort.
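Once the export lands in your warehouse, attribution is a group-by over per-request records. A minimal sketch, assuming a simplified record shape (the `team`, `model`, and `cost_usd` fields here are illustrative; map them to the actual export schema):

```python
# Sketch: aggregate estimated LLM spend by an attribution key.
# The record fields are assumptions standing in for the real export schema.
from collections import defaultdict

records = [
    {"team": "search",  "model": "gemini/gemini-1.5-flash", "cost_usd": 0.00014},
    {"team": "search",  "model": "openai/gpt-4o",           "cost_usd": 0.00450},
    {"team": "support", "model": "gemini/gemini-1.5-flash", "cost_usd": 0.00013},
]

def spend_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum estimated cost grouped by any attribution key (team, model, ...)."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)

print(spend_by(records, "team"))   # spend per team
print(spend_by(records, "model"))  # spend per model
```

In practice this would be a SQL `GROUP BY` in the warehouse rather than application code, but the aggregation is the same: one row per proxied request, summed over whichever dimension you want to bill against.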