llm-gateway/docs/adr/0003-confidence-gate-thresholds.md


# ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals
**Date**: 2026-04-19
**Status**: accepted
**Deciders**: René Fichtmüller
## Context
The confidence gate (Stage 8 in the completion pipeline) determines whether an output is approved for direct delivery or queued for human review. The gate is critical for:
- Ensuring quality (prevent hallucinations, factual errors)
- Cost control (human review is expensive, so the threshold must be precise)
- User trust (false positives destroy credibility)
**Current scoring factors** (23 dimensions per Stage 8):
- Validator pass/fail (facts, schema, grammar, tone, length, security, etc.)
- Input clarity (5–10 dimensions)
- Output coherence (5 dimensions)
- Task complexity (model match, context relevance)
- Historical accuracy (task_type-model pair performance)
**Learning cycles** must tune these weights:
- Short cycles (6h) → reactive, catch fresh degradation
- Medium cycles (12h) → strategic, smooth seasonal trends
- Long cycles (24h) → historical, guide model selection
## Decision
Implement **three-tier confidence gating with autonomous learning cycles**:
1. **Gate thresholds** (0–10 scale):
- **0–4**: pending_review (queue for human approval + learning)
- **4–7**: warning (deliver with confidence metadata, flag in dashboard)
- **7–10**: approved (deliver directly, log for metrics)
2. **Confidence scoring formula**:
```
base_score = (validators_passed / total_validators) × 8 + input_clarity × 0.5 + coherence × 1.5
final_score = base_score × (1 + task_accuracy_bonus - hallucination_penalty)
```
Where:
- `task_accuracy_bonus`: 0.5 if historical accuracy >90%, 0.2 if >75%, otherwise 0
- `hallucination_penalty`: 0.5 if the task sets requires_fact_check and the validator failed, otherwise 0 (the formula already subtracts it, so the value is positive)
3. **Learning cycles**:
- **6h**: Adjust validator weights (which validators are most predictive?)
- **12h**: Recalibrate thresholds (is 4.0 still the right boundary for review?)
- **24h**: Assess model-tier assignments (should task_X move from fast → medium?)
4. **Feedback loop**:
- Human reviewers mark outputs as: approved, modified, rejected
- Learning engine correlates human decisions with confidence scores
- Thresholds drift automatically (e.g., if 80% of outputs scoring 5.0–5.5 are approved, raise the threshold)
5. **Cold-start handling**:
- New task types → default to pending_review until 20+ human reviews collected
- New model combinations → higher threshold until historical data accumulates
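
The scoring formula, the three bands, and the cold-start rule can be sketched together as follows. This is a minimal illustration, not the code in confidence-gate.ts: the type and function names are hypothetical, `inputClarity` and `coherence` are assumed to be normalized to 0..1 (so the base score peaks at 10), band boundaries are assumed to be lower-inclusive, and the final clamp to 0..10 is an assumption because the ADR does not say what happens when the bonus pushes a score past 10.

```typescript
// Sketch only: names, input scales, boundary handling, and the clamp are assumptions.
type Gate = "pending_review" | "warning" | "approved";

interface ScoreInputs {
  validatorsPassed: number;
  totalValidators: number;
  inputClarity: number;        // assumed normalized to 0..1
  coherence: number;           // assumed normalized to 0..1
  historicalAccuracy: number;  // 0..1 for the task_type-model pair
  requiresFactCheck: boolean;
  validatorFailed: boolean;    // the validator relevant to fact checking failed
}

// Cold-start rule from the decision: 20+ human reviews before normal gating.
const COLD_START_MIN_REVIEWS = 20;

function confidenceScore(s: ScoreInputs): number {
  const ratio = s.totalValidators > 0 ? s.validatorsPassed / s.totalValidators : 0;
  const base = ratio * 8 + s.inputClarity * 0.5 + s.coherence * 1.5;
  const bonus = s.historicalAccuracy > 0.9 ? 0.5 : s.historicalAccuracy > 0.75 ? 0.2 : 0;
  const penalty = s.requiresFactCheck && s.validatorFailed ? 0.5 : 0;
  const final = base * (1 + bonus - penalty);
  return Math.min(10, Math.max(0, final)); // clamp is an assumption, see lead-in
}

function gate(score: number, humanReviewCount: number): Gate {
  if (humanReviewCount < COLD_START_MIN_REVIEWS) return "pending_review";
  if (score < 4) return "pending_review";
  if (score < 7) return "warning";
  return "approved";
}
```

For example, an output that passes half of its validators with mid-range clarity and coherence scores 5.0 and lands in the warning band; the same output with a failed fact check on a fact-check task drops to 2.5 and is queued for review.
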
## Alternatives Considered
### Alternative 1: Static Thresholds (no learning)
- **Pros**: Predictable, easy to reason about
- **Cons**: Doesn't adapt to model improvements, domain shifts, or drifting validator accuracy
- **Why not**: Gateway exists to learn; static thresholds waste the learning engine
### Alternative 2: Single-Threshold (0–5 review, 5–10 approve)
- **Pros**: Simpler rules, fewer parameters
- **Cons**: Loses nuance; "warning" bucket is valuable for monitoring edge cases
- **Why not**: Warning zone catches systematic issues (e.g., "all medium-tier outputs on fact-check are slightly off")
### Alternative 3: Per-Caller Thresholds (TIP uses 6.0, EO uses 4.0)
- **Pros**: Each caller can tune tolerance
- **Cons**: Inconsistent quality, hard to debug when results vary
- **Why not**: Gateway quality is uniform; thresholds should reflect task complexity, not caller whim
## Consequences
### Positive
- **Automatic adaptation**: Confidence thresholds self-tune to model quality over time
- **Learning visibility**: Dashboard shows why outputs were gated (which validator failed, etc.)
- **Cost optimized**: As models improve, fewer outputs queue for human review
- **Quality feedback loop**: Human reviews train the confidence scorer iteratively
### Negative
- **Delayed convergence**: Takes 1-2 weeks for learning cycles to stabilize thresholds
- Mitigate: Seed with domain expert estimates; learning adjusts from there
- **Threshold oscillation**: If human review feedback is noisy, thresholds drift erratically
- Mitigate: Smoothing filter (move thresholds by max ±0.1 per cycle)
- **Review queue backlog**: If 30% of requests land in pending_review, the human review team becomes the bottleneck
- Mitigate: Escalate high-confidence 4.0–5.0 to 6.0 after 1 week of queue backup
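
The smoothing filter named in the oscillation mitigation can be sketched as a clamped step. The function name, the 0..10 bounds, and the rounding to two decimals are assumptions made for illustration; only the ±0.1 limit per cycle comes from this ADR.

```typescript
// Sketch only: smoothedThreshold is a hypothetical helper, not part of learning-engine.ts.
const MAX_STEP_PER_CYCLE = 0.1; // from the mitigation above

function smoothedThreshold(current: number, proposed: number, min = 0, max = 10): number {
  const delta = proposed - current;
  // Limit the move to at most ±0.1, whatever the learning cycle proposed.
  const step = Math.max(-MAX_STEP_PER_CYCLE, Math.min(MAX_STEP_PER_CYCLE, delta));
  const next = Math.min(max, Math.max(min, current + step));
  // Rounding (an assumption) keeps floating-point residue from accumulating over many cycles.
  return Math.round(next * 100) / 100;
}
```

With this shape, a noisy cycle that proposes moving the review threshold from 4.0 to 5.0 only moves it to 4.1, so a full point of drift takes at least ten consecutive cycles that agree on the direction.
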
### Risks
- **Feedback bias**: If certain human reviewers are harsher/more lenient, learning is skewed
- Mitigate: Track reviewer agreement (Cohen's kappa) and weight accordingly
- **Confidence score gaming**: If models learn to game the scoring formula, thresholds creep up
- Mitigate: Periodic validator audit; ensure validators measure actual quality
- **Cascading threshold drift**: If thresholds move in response to poor model tier assignment, learning mixes causes
- Mitigate: Separate tier learning (24h) from threshold learning (12h) with distinct signals
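
For the feedback-bias mitigation, Cohen's kappa between two reviewers who labeled the same outputs can be computed as below. This is a sketch under stated assumptions: the `Verdict` type mirrors the three review outcomes listed in the feedback loop, the function name is hypothetical, and returning 1 when both reviewers used a single identical label throughout is a convention chosen here to avoid dividing by zero.

```typescript
// Sketch only: cohensKappa is a hypothetical helper for the reviewer-agreement metric.
type Verdict = "approved" | "modified" | "rejected";

function cohensKappa(a: Verdict[], b: Verdict[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Both reviewers must label the same non-empty set of outputs");
  }
  const n = a.length;
  const labels: Verdict[] = ["approved", "modified", "rejected"];
  const countA = new Map<Verdict, number>();
  const countB = new Map<Verdict, number>();
  let agree = 0;
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agree++;
    countA.set(a[i], (countA.get(a[i]) ?? 0) + 1);
    countB.set(b[i], (countB.get(b[i]) ?? 0) + 1);
  }
  const observed = agree / n; // p_o: share of outputs where both gave the same verdict
  let expected = 0;           // p_e: agreement expected by chance from label frequencies
  for (const label of labels) {
    expected += ((countA.get(label) ?? 0) / n) * ((countB.get(label) ?? 0) / n);
  }
  if (expected === 1) return 1; // convention, see lead-in
  return (observed - expected) / (1 - expected);
}
```

A kappa near 1 means the reviewers agree far more often than chance would predict, and a value near 0 means their agreement is no better than chance. How kappa maps to a reviewer weight is left open by this ADR.
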
## Implementation Notes
1. **Scoring implementation** (existing in confidence-gate.ts):
- Track all 23 dimensions during completion
- Compute base_score from validators
- Apply bonuses/penalties based on historical task accuracy
- Return confidence + base_score + impacts (for UI debugging)
2. **Learning cycles** (learning-engine.ts):
```typescript
// 6h: Reweight validators by their measured precision
const validator_accuracy = await queryValidatorAccuracy(taskType);
weights.grammar = validator_accuracy['grammar'].precision;

// 12h: Adjust thresholds from the last 48 hours of human reviews
const human_reviews = await queryHumanReviews(taskType, 48); // second argument assumed to be the lookback in hours
const current_threshold = thresholds[taskType].review;
if (human_reviews.length > 0) { // skip the cycle rather than divide by zero
  const true_positive_rate =
    human_reviews.filter(r => r.human_approved && r.confidence > current_threshold).length /
    human_reviews.length;
  if (true_positive_rate > 0.9) thresholds[taskType].review += 0.1;
}

// 24h: Assess model assignments
const perf = await queryModelPerformance(taskType);
if (perf.fast_confidence < 4.5) changeDefaultTier(taskType, 'fast', 'medium');
```
3. **Dashboard metrics**:
- Confidence score distribution (histogram)
- Review queue size and age
- Validator contribution to score (Shapley values or feature importance)
- Threshold history over time (chart showing drift)
- Human reviewer agreement rate
4. **Monitoring thresholds**:
- Alert if review queue >24h backlog
- Alert if any threshold drifts >0.5 in 24h (possible feedback bias)
- Alert if validator accuracy drops >10% (model degradation signal)
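
The three alert rules above could be evaluated in one pass over a health snapshot, as in this sketch. The `GateHealth` shape, the alert names, and the function name are hypothetical. Two readings are assumed because the ADR does not pin them down: "backlog >24h" is taken as the age of the oldest queued item, and "drops >10%" is taken as 10 percentage points against a baseline.

```typescript
// Sketch only: GateHealth and evaluateAlerts are hypothetical names for illustration.
interface GateHealth {
  oldestQueuedAgeHours: number;      // age of the oldest item in the review queue
  thresholdNow: number;
  threshold24hAgo: number;
  validatorAccuracyNow: number;      // 0..1
  validatorAccuracyBaseline: number; // 0..1
}

function evaluateAlerts(h: GateHealth): string[] {
  const alerts: string[] = [];
  if (h.oldestQueuedAgeHours > 24) alerts.push("review_queue_backlog");
  // Drift in either direction counts, since both can indicate feedback bias.
  if (Math.abs(h.thresholdNow - h.threshold24hAgo) > 0.5) alerts.push("threshold_drift");
  if (h.validatorAccuracyBaseline - h.validatorAccuracyNow > 0.1) {
    alerts.push("validator_accuracy_drop");
  }
  return alerts;
}
```

Note that with the ±0.1 per-cycle smoothing described under Consequences, a single threshold can only drift more than 0.5 within 24h if it is adjusted more often than the 12h cycle, so the drift alert mainly guards against a misconfigured or bypassed smoothing step.
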
## Related Decisions
- ADR-0001: Multi-Agent Coworking Architecture
- ADR-0002: Tier assignment strategy
- ADR-0004: External provider fallback chain ordering