# ADR-0002: Tier Assignment Strategy for Model Selection

**Date**: 2026-04-19
**Status**: accepted
**Deciders**: René Fichtmüller

## Context

The Gateway must decide which Ollama model to use for each request. The current routing system defines three tiers:
- **fast**: qwen2.5:3b (3B parameters, 5-10 tokens/sec, <500ms latency)
- **medium**: qwen2.5:14b (14B parameters, 1-5 tokens/sec, 1-3s latency)
- **large**: llama3.3:70b (70B parameters, 0.2-1 tokens/sec, 5-10s latency)

**Decision factors:**
- Task complexity (learned from historical confidence scores)
- Input length (longer inputs need more capable models)
- Request latency SLA (e.g., PeerCortex needs a response in <8s)
- Fallback availability (if medium fails, should we degrade or escalate?)
- Cost optimization (fast is 3x cheaper than medium, 10x cheaper than large)

## Decision

Implement **cost-first tier assignment with confidence-based escalation**:

1. **Default**: Start with the **fast** tier (qwen2.5:3b)
2. **Escalate if needed**:
   - Task type unknown → classify first, then assign tier
   - Historical confidence <5 on fast → try medium
   - Input length >2000 tokens → start with medium
   - Task `requires_fact_check=true` → use large
3. **Fallback chain per tier**:
   - fast fails → try medium
   - medium fails → try large
   - large fails → external provider fallback (Cerebras → Groq → Mistral → NVIDIA → CF)
4. **Learning feedback loop**:
   - Track confidence per (task_type, model) pair
   - Every 6h: if fast confidence <4.5 on task X, escalate task X to medium default
   - Every 12h: if medium confidence <3 on task X, escalate task X to large default

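Rules 1 to 3 above can be sketched as two small functions. This is only an illustration under stated assumptions: the `RoutingRequest` shape and its field names are hypothetical, the precedence between rules (fact-check first, then input length, then historical confidence) is not specified by this ADR, and classification of unknown task types is left to the caller.

```typescript
type ModelTier = "fast" | "medium" | "large";

// Hypothetical request shape; field names are illustrative, not from router.ts.
interface RoutingRequest {
  taskType: string;
  inputTokens: number;
  requiresFactCheck: boolean;
  // Historical confidence of the fast tier on this task type, if recorded.
  fastConfidence?: number;
}

const TIER_ORDER: ModelTier[] = ["fast", "medium", "large"];

// Rules 1 and 2: start on the cheapest tier and escalate only when a rule fires.
// The order of the checks is an assumption.
function selectInitialTier(req: RoutingRequest): ModelTier {
  if (req.requiresFactCheck) return "large";
  if (req.inputTokens > 2000) return "medium";
  if (req.fastConfidence !== undefined && req.fastConfidence < 5) return "medium";
  return "fast";
}

// Rule 3: tiers to try in order, starting at the selected tier.
// "external" stands in for the external provider chain.
function fallbackOrder(start: ModelTier): (ModelTier | "external")[] {
  const idx = TIER_ORDER.indexOf(start);
  return [...TIER_ORDER.slice(idx), "external"];
}
```

Under these assumptions, a short request with no recorded history starts on fast and can fall through medium and large before reaching the external providers, while a fact-check request starts on large and skips the cheaper tiers entirely.
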
## Alternatives Considered

### Alternative 1: Complexity-First (classify every request)
- **Pros**: More accurate initial tier selection
- **Cons**: 500-1000ms pre-classification overhead; many classifications are themselves uncertain
- **Why not**: The cost of classification offsets the benefit; the learning loop converges faster if we start cheap

### Alternative 2: Confidence-Only (no task-specific routing)
- **Pros**: Simpler rules, single model per confidence threshold
- **Cons**: Task X might work well on fast while task Y needs large; no specialization
- **Why not**: Tasks have vastly different complexity profiles; a generic threshold wastes cost

### Alternative 3: Per-Caller Profiles (each caller gets a static tier)
- **Pros**: Predictable cost per caller, easy billing
- **Cons**: Ignores actual task complexity; TIP scraping is diverse and shouldn't pay for all-large
- **Why not**: Multi-tenant fairness requires per-task routing

## Consequences

### Positive
- **Cost optimized**: ~70% of requests handled by fast tier
- **Quality maintained**: Learning loop escalates low-confidence tasks automatically
- **Responsive**: Most requests complete in <1s
- **Fair**: Each task gets the cheapest tier that works

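The cost claim can be checked with simple arithmetic once a tier mix is known. The sketch below normalises cost to fast = 1 using the ratios from the Context section (medium = 3, large = 10). The example mix is hypothetical: only the ~70% fast share comes from this ADR, the 20/10 split between medium and large is assumed, and the cost of failed attempts before escalation is ignored.

```typescript
type ModelTier = "fast" | "medium" | "large";

// Relative cost per request, normalised so that fast = 1
// ("fast is 3x cheaper than medium, 10x cheaper than large").
const RELATIVE_COST: Record<ModelTier, number> = { fast: 1, medium: 3, large: 10 };

// Fraction of requests served by each tier; the fractions must sum to 1.
type TierMix = Record<ModelTier, number>;

// Average relative cost per request for a given tier mix.
function blendedCost(mix: TierMix): number {
  const total = mix.fast + mix.medium + mix.large;
  if (Math.abs(total - 1) > 1e-9) throw new Error("tier mix must sum to 1");
  return (
    mix.fast * RELATIVE_COST.fast +
    mix.medium * RELATIVE_COST.medium +
    mix.large * RELATIVE_COST.large
  );
}

// Fractional saving compared with sending every request to the large tier.
function savingsVsAllLarge(mix: TierMix): number {
  return 1 - blendedCost(mix) / RELATIVE_COST.large;
}

// Hypothetical mix: 70% fast is from the ADR, the rest is an assumption.
const exampleMix: TierMix = { fast: 0.7, medium: 0.2, large: 0.1 };
```

With this assumed mix the blended cost is about 2.3 units per request, roughly 77% below an all-large baseline of 10 units. Real numbers depend on the measured mix and on how often a request pays for more than one tier.
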
### Negative
- **Cold-start problem**: New task types default to fast even if they need large
  - Mitigate: Classify task type on first few requests, then learn
- **Noisy escalation**: If the confidence threshold is wrong, task types thrash between tiers
  - Mitigate: Threshold tuning in ADR-0003
- **Fallback chain latency**: Escalating from fast → medium → large → external can add 10-30s
  - Mitigate: Parallel fallback execution (fire medium request while fast is running)

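The parallel fallback mitigation could look roughly like the hedged call below. It is a sketch, not the Gateway's implementation: `ModelCall` is a hypothetical wrapper around a tier request, and the `hedgeDelayMs` parameter is an added assumption (a value of 0 fires both tiers at once, as the mitigation describes). The first successful answer wins and both abort signals are triggered so the losing call can stop.

```typescript
// Hypothetical tier call: resolves with the completion text, rejects on failure.
// The signal lets the caller cancel a request that is no longer needed.
type ModelCall = (signal: AbortSignal) => Promise<string>;

// Start `primary`; start `backup` after `hedgeDelayMs`, or immediately if
// `primary` fails first. Resolve with the first success, reject if both fail.
function hedgedCall(primary: ModelCall, backup: ModelCall, hedgeDelayMs: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const primaryCtl = new AbortController();
    const backupCtl = new AbortController();
    let backupStarted = false;
    let settled = false;
    let failures = 0;

    const succeed = (text: string) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      primaryCtl.abort(); // cancel whichever call is still in flight
      backupCtl.abort();
      resolve(text);
    };

    const fail = (err: unknown) => {
      failures += 1;
      if (failures === 2 && !settled) {
        settled = true;
        reject(err); // both tiers failed
      } else {
        startBackup(); // primary failed early: do not wait for the timer
      }
    };

    const startBackup = () => {
      if (backupStarted || settled) return;
      backupStarted = true;
      clearTimeout(timer);
      backup(backupCtl.signal).then(succeed, fail);
    };

    const timer = setTimeout(startBackup, hedgeDelayMs);
    primary(primaryCtl.signal).then(succeed, fail);
  });
}
```

The trade-off is cost: every hedged request that reaches the backup pays for two tiers, so a non-zero delay near the fast tier's typical latency may be preferable to firing both immediately.
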
### Risks
- **Cascading failures**: If the fast tier becomes a bottleneck, escalation floods medium
  - Mitigate: Per-tier rate limiting + circuit breaker pattern
- **User experience cliff**: Fast tier latency jitter (100-1000ms variance) is unpredictable
  - Mitigate: SLA target enforcement; if P95 latency >SLA, escalate whole task type

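A minimal sketch of the circuit breaker half of that mitigation follows (rate limiting is not shown). The class name, thresholds, and injected clock are all assumptions for illustration. The breaker opens after a run of consecutive failures, rejects requests during a cooldown, then lets traffic through again in a half-open state where a single failure reopens it.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical per-tier circuit breaker; thresholds are illustrative only.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, which keeps the class easy to test
  }

  // True if a request may be sent to this tier right now.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: probe the tier again
    }
    return this.state !== "open";
  }

  // A successful call closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failed probe, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

One breaker per tier would let the router skip an unhealthy medium tier and go straight to large or an external provider, instead of queueing escalated requests behind it.
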
## Implementation Notes

1. **Tier mapping in router.ts**:
   ```typescript
   // Assumed shape of ModelTier, shown here so the snippet is self-contained.
   type ModelTier = "fast" | "medium" | "large";

   interface TierAssignment {
     task_type: string;               // task category used as the routing key
     tier: ModelTier;                 // current default tier for this task type
     fallback_chain: string[];        // tried in order when the tier fails
     escalation_condition?: string;   // e.g. "confidence < 4.5"
   }
   ```

2. **Learning engine (6h cycle)**:
   - Query audit_log for (task_type, model) confidence distribution
   - If P50 confidence on fast:task_X < 4.5 for >20 requests, mark task_X → medium
   - If P50 confidence on medium:task_X < 3 for >20 requests, mark task_X → large (checked every 12h, per the Decision section)

3. **Metrics to track**:
   - Per-tier latency distribution (P50, P95, P99)
   - Per-task escalation rate (% fast→medium, medium→large)
   - Cost-per-task across tiers

4. **UI/Dashboard**:
   - Show tier assignment rules (current state + learning history)
   - Plot task_type confidence distributions (binned by model)
   - Show cost savings vs "all-large" baseline

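The learning engine rules in item 2 can be sketched as a pure function over audit rows. The `AuditRow` shape and its column names are assumptions about `audit_log`, not its real schema. The sketch also runs both rules in one pass and does not check a task type's current default tier, which a real implementation would need to do.

```typescript
type ModelTier = "fast" | "medium" | "large";

// Hypothetical audit_log row; column names are assumptions.
interface AuditRow {
  task_type: string;
  tier: ModelTier;
  confidence: number;
}

// Escalation rules from the ADR: P50 threshold per source tier.
const RULES: { from: ModelTier; to: ModelTier; threshold: number }[] = [
  { from: "fast", to: "medium", threshold: 4.5 },
  { from: "medium", to: "large", threshold: 3 },
];
const MIN_SAMPLES = 20; // a rule only fires with more than 20 requests

// Median (P50) of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Returns task_type -> proposed new default tier for every rule that fires.
function proposeEscalations(rows: AuditRow[]): Record<string, ModelTier> {
  // Group confidence scores by tier, then by task type.
  const scores: Record<ModelTier, Record<string, number[]>> = { fast: {}, medium: {}, large: {} };
  for (const row of rows) {
    const byTask = scores[row.tier];
    if (!byTask[row.task_type]) byTask[row.task_type] = [];
    byTask[row.task_type].push(row.confidence);
  }

  const proposals: Record<string, ModelTier> = {};
  for (const rule of RULES) {
    for (const taskType of Object.keys(scores[rule.from])) {
      const samples = scores[rule.from][taskType];
      if (samples.length > MIN_SAMPLES && median(samples) < rule.threshold) {
        proposals[taskType] = rule.to; // later rules override earlier ones
      }
    }
  }
  return proposals;
}
```

Keeping the function free of I/O means the same logic can be run against a fixture of audit rows in tests and against the real `audit_log` query result on each cycle.
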
## Related Decisions
- ADR-0001: Multi-Agent Coworking Architecture
- ADR-0003: Confidence gate thresholds & learning cycle intervals
- ADR-0004: External provider fallback chain ordering