llm-gateway/docs/adr/0002-tier-assignment-strategy.md

ADR-0002: Tier Assignment Strategy for Model Selection

Date: 2026-04-19
Status: accepted
Deciders: René Fichtmüller

Context

The Gateway must decide which Ollama model to use for each request. The current routing system defines three tiers:

  • fast: qwen2.5:3b (3B parameters, 5-10 tokens/sec, <500ms latency)
  • medium: qwen2.5:14b (14B parameters, 1-5 tokens/sec, 1-3s latency)
  • large: llama3.3:70b (70B parameters, 0.2-1 tokens/sec, 5-10s latency)

Decision factors:

  • Task complexity (learned from historical confidence scores)
  • Input length (longer inputs → need more capable models)
  • Request latency SLA (e.g., PeerCortex needs <8s response)
  • Fallback availability (if medium fails, should we degrade or escalate?)
  • Cost optimization (fast is 3x cheaper than medium, 10x cheaper than large)

Decision

Implement cost-first tier assignment with confidence-based escalation:

  1. Default: Start with fast tier (qwen2.5:3b)
  2. Escalate if needed:
    • Task type unknown → classify first, then assign tier
    • Historical confidence <5 on fast → try medium
    • Input length >2000 tokens → start with medium
    • Task requires_fact_check=true → use large
  3. Fallback chain per tier:
    • fast fails → try medium
    • medium fails → try large
    • large fails → external provider fallback (Cerebras → Groq → Mistral → NVIDIA → CF)
  4. Learning feedback loop:
    • Track confidence per (task_type, model) pair
    • Every 6h: if fast confidence <4.5 on task X, escalate task X to medium default
    • Every 12h: if medium confidence <3 on task X, escalate to large default
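
The rules in steps 1 to 3 can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the actual router.ts code: the `RoutingInput` shape, the `learnedDefault` field, and the `"external"` placeholder for the provider chain are hypothetical names, and the "classify first" branch for unknown task types is left out.

```typescript
type ModelTier = "fast" | "medium" | "large";

// Hypothetical request shape; only requires_fact_check is named in the ADR.
interface RoutingInput {
  inputTokens: number;               // token count of the prompt
  requiresFactCheck: boolean;        // mirrors requires_fact_check
  historicalFastConfidence?: number; // undefined when no history exists yet
  learnedDefault?: ModelTier;        // default promoted by the learning loop, if any
}

const TIER_ORDER: ModelTier[] = ["fast", "medium", "large"];

// Return the more capable of two tiers.
function maxTier(a: ModelTier, b: ModelTier): ModelTier {
  return TIER_ORDER.indexOf(a) >= TIER_ORDER.indexOf(b) ? a : b;
}

function selectInitialTier(req: RoutingInput): ModelTier {
  // Step 1: start cheap unless the learning loop already promoted this task.
  let tier: ModelTier = req.learnedDefault ?? "fast";
  // Step 2: escalation rules, applied so the tier can only move upward.
  if (req.historicalFastConfidence !== undefined && req.historicalFastConfidence < 5) {
    tier = maxTier(tier, "medium");
  }
  if (req.inputTokens > 2000) tier = maxTier(tier, "medium");
  if (req.requiresFactCheck) tier = "large";
  return tier;
}

// Step 3: what to try after the selected tier fails, ending with the
// external provider chain (represented here by a single placeholder).
function fallbackChain(start: ModelTier): string[] {
  const local = TIER_ORDER.slice(TIER_ORDER.indexOf(start) + 1);
  return [...local, "external"];
}
```

For example, a 2,500-token request with no history would start on medium and fall back to large, then to the external chain.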

Alternatives Considered

Alternative 1: Complexity-First (classify every request)

  • Pros: More accurate initial tier selection
  • Cons: 500-1000ms pre-classification overhead, many classifications are themselves uncertain
  • Why not: Cost of classification offsets benefit; learning loop converges faster if we start cheap

Alternative 2: Confidence-Only (no task-specific routing)

  • Pros: Simpler rules, single model per confidence threshold
  • Cons: One task X might work well on fast but another task Y needs large; no specialization
  • Why not: Tasks have vastly different complexity profiles; generic threshold wastes cost

Alternative 3: Per-Caller Profiles (each caller gets a static tier)

  • Pros: Predictable cost per caller, easy billing
  • Cons: Ignores actual task complexity; TIP scraping requests are diverse and should not all pay for the large tier
  • Why not: Multi-tenant fairness requires per-task routing

Consequences

Positive

  • Cost optimized: ~70% of requests handled by fast tier
  • Quality maintained: Learning loop escalates low-confidence tasks automatically
  • Responsive: Most requests complete in <1s
  • Fair: Each task gets the cheapest tier that works

Negative

  • Cold-start problem: New task types default to fast even if they need large
    • Mitigate: Classify task type on first few requests, then learn
  • Noisy escalation: If the confidence threshold is miscalibrated, task types can thrash between tiers
    • Mitigate: Threshold tuning in ADR-0003
  • Fallback chain latency: Escalating from fast → medium → large → external can add 10-30s
    • Mitigate: Parallel fallback execution (fire medium request while fast is running)
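
To make the fallback-latency concern concrete, the worst case can be worked out with simple arithmetic. The sketch below assumes per-attempt timeouts equal to the upper latency figures from the Context section (500 ms, 3 s, 10 s); these are illustrative values, not measured or configured timeouts, and the 250 ms stagger is likewise an assumption.

```typescript
// Assumed per-attempt timeouts in ms (fast, medium, large), taken from the
// upper latency figures in the Context section. Not measured values.
const ATTEMPT_TIMEOUTS_MS = [500, 3000, 10000];

// Sequential fallback: each tier starts only after the previous one timed out,
// so the worst case is the sum of all timeouts.
function sequentialWorstCaseMs(timeouts: number[]): number {
  return timeouts.reduce((sum, t) => sum + t, 0);
}

// Staggered parallel fallback: attempt i starts i * staggerMs after the first,
// so the worst case is the latest-finishing attempt.
function staggeredWorstCaseMs(timeouts: number[], staggerMs: number): number {
  return timeouts.reduce((worst, t, i) => Math.max(worst, i * staggerMs + t), 0);
}

console.log(sequentialWorstCaseMs(ATTEMPT_TIMEOUTS_MS));     // 13500
console.log(staggeredWorstCaseMs(ATTEMPT_TIMEOUTS_MS, 250)); // 10500
```

Under these assumptions the local tiers alone account for 13.5 s before any external provider is tried. Staggered execution shortens the worst case but runs several tiers at once, so it trades latency against the cost savings the cost-first strategy is meant to deliver.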

Risks

  • Cascading failures: If the fast tier becomes a bottleneck, escalated requests can flood the medium tier
    • Mitigate: Per-tier rate limiting + circuit breaker pattern
  • User experience cliff: Fast tier latency jitter (100-1000ms variance) makes response times unpredictable
    • Mitigate: Enforce SLA targets; if P95 latency exceeds the SLA, escalate the whole task type
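
The circuit breaker mitigation could look roughly like the following. This is a minimal sketch of the general pattern, not code from the Gateway: the class name, the consecutive-failure threshold, and the cooldown are all assumptions, and the clock is passed in explicitly to keep the example deterministic.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Minimal per-tier circuit breaker. Thresholds are illustrative assumptions.
class TierCircuitBreaker {
  private failureThreshold: number; // consecutive failures before opening
  private cooldownMs: number;       // wait time before trial requests are allowed
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Should a request be sent to this tier at time nowMs?
  allow(nowMs: number): boolean {
    if (this.state === "open" && nowMs - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: let trial requests through until one succeeds or fails.
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    // A failed trial request, or too many consecutive failures, opens the breaker.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = nowMs;
    }
  }
}
```

With one breaker per tier, the router would skip a tier whose breaker is open and move to the next entry in the fallback chain, rather than queueing more escalated requests behind an overloaded model.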

Implementation Notes

  1. Tier mapping in router.ts:

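    // Assumed definition; reuse the existing type if router.ts already declares it.
    type ModelTier = "fast" | "medium" | "large";

    interface TierAssignment {
      task_type: string;             // task category used as the routing key
      tier: ModelTier;               // default tier for this task type
      fallback_chain: string[];      // tiers/providers to try in order on failure
      escalation_condition?: string; // e.g. "confidence < 4.5"
    }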
    
  2. Learning engine (6h cycle):

    • Query audit_log for (task_type, model) confidence distribution
    • If P50 confidence on fast:task_X < 4.5 for >20 requests, mark task_X → medium
    • If P50 confidence on medium:task_X < 3 for >20 requests, mark task_X → large (evaluated every 12h, per the Decision section)
  3. Metrics to track:

    • Per-tier latency distribution (P50, P95, P99)
    • Per-task escalation rate (% fast→medium, medium→large)
    • Cost-per-task across tiers
  4. UI/Dashboard:

    • Show tier assignment rules (current state + learning history)
    • Plot task_type confidence distributions (binned by model)
    • Show cost savings vs "all-large" baseline
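
The learning-engine rules in note 2 can be sketched as a pure function over audit rows. This is an illustration under assumptions: the `AuditRow` shape is hypothetical (the audit_log schema is not specified here), and P50 is computed as a plain median.

```typescript
type ModelTier = "fast" | "medium" | "large";

// Hypothetical audit_log row; the real schema is not specified in this ADR.
interface AuditRow {
  taskType: string;
  tier: ModelTier;
  confidence: number;
}

// Promotion rules from the ADR: promote when P50 confidence falls below minP50.
const PROMOTION_RULES: { from: ModelTier; to: ModelTier; minP50: number }[] = [
  { from: "fast", to: "medium", minP50: 4.5 },
  { from: "medium", to: "large", minP50: 3 },
];
const MIN_SAMPLES = 20; // a rule applies only with more than 20 requests

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Returns the new default tier for each task type that should be promoted.
function computePromotions(rows: AuditRow[]): Record<string, ModelTier> {
  const scoresByKey: Record<string, number[]> = {};
  const taskTypes: string[] = [];
  for (const row of rows) {
    const key = `${row.taskType}|${row.tier}`;
    if (!scoresByKey[key]) scoresByKey[key] = [];
    scoresByKey[key].push(row.confidence);
    if (taskTypes.indexOf(row.taskType) === -1) taskTypes.push(row.taskType);
  }
  const promotions: Record<string, ModelTier> = {};
  for (const taskType of taskTypes) {
    for (const rule of PROMOTION_RULES) {
      const scores = scoresByKey[`${taskType}|${rule.from}`] || [];
      if (scores.length > MIN_SAMPLES && median(scores) < rule.minP50) {
        promotions[taskType] = rule.to; // a later matching rule overrides an earlier one
      }
    }
  }
  return promotions;
}
```

Keeping the rule table separate from the evaluation loop means the thresholds tuned in ADR-0003 could be changed without touching the aggregation logic.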

Related ADRs

  • ADR-0001: Multi-Agent Coworking Architecture
  • ADR-0003: Confidence gate thresholds & learning cycle intervals
  • ADR-0004: External provider fallback chain ordering