ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals
Date: 2026-04-19
Status: accepted
Deciders: René Fichtmüller
Context
The confidence gate (Stage 8 in the completion pipeline) determines whether output is approved for direct delivery or queued for human review. The gate is critical for:
- Ensuring quality (preventing hallucinations and factual errors)
- Cost control (human review is expensive, so the threshold must be precise)
- User trust (false positives destroy credibility)
Current scoring factors (23 dimensions per Stage 8):
- Validator pass/fail (facts, schema, grammar, tone, length, security, etc.)
- Input clarity (5–10 dimensions)
- Output coherence (5 dimensions)
- Task complexity (model match, context relevance)
- Historical accuracy (task_type-model pair performance)
Learning cycles must tune these weights:
- Short cycles (6h) → reactive, catch fresh degradation
- Medium cycles (12h) → strategic, smooth seasonal trends
- Long cycles (24h) → historical, guide model selection
Decision
Implement three-tier confidence gating with autonomous learning cycles:
- Gate thresholds (0–10 scale):
  - 0–4: pending_review (queue for human approval + learning)
  - 4–7: warning (deliver with confidence metadata, flag in dashboard)
  - 7–10: approved (deliver directly, log for metrics)
- Confidence scoring formula:
    base_score = (validators_passed / total_validators) × 8 + input_clarity × 0.5 + coherence × 1.5
    final_score = base_score × (1 + task_accuracy_bonus - hallucination_penalty)
  Where:
  - task_accuracy_bonus: 0.5 if historical accuracy >90%, 0.2 if >75%, otherwise 0
  - hallucination_penalty: 0.5 if the task requires_fact_check and the validator failed, otherwise 0 (the formula subtracts it, so the value itself is positive)
- Learning cycles:
  - 6h: Adjust validator weights (which validators are most predictive?)
  - 12h: Recalibrate thresholds (is the pending_review boundary still in the right place?)
  - 24h: Assess model-tier assignments (should task_X move from fast → medium?)
- Feedback loop:
  - Human reviewers mark outputs as: approved, modified, rejected
  - Learning engine correlates human decisions with confidence scores
  - Thresholds drift automatically (e.g., if 80% of 5.0–5.5 are approved, raise threshold)
- Cold-start handling:
  - New task types → default to pending_review until 20+ human reviews are collected
  - New model combinations → higher threshold until historical data accumulates
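The scoring formula, the gate bands, and the cold-start rule can be sketched together as follows. This is a minimal illustration, not the contents of confidence-gate.ts: the type and function names are hypothetical, input_clarity and coherence are assumed to be normalized to [0, 1] (so that base_score tops out at 10), the final score is assumed to be clamped to the 0–10 scale, "validator failed" is read as the fact validator failing, and scores of exactly 4 and 7 are assumed to fall into the higher band.

```typescript
// Hypothetical names throughout; only the formula constants, the gate bands,
// and the 20-review cold-start rule come from the ADR.
type GateStatus = "pending_review" | "warning" | "approved";

interface ScoreInputs {
  validatorsPassed: number;
  totalValidators: number;
  inputClarity: number;        // assumption: normalized to [0, 1]
  coherence: number;           // assumption: normalized to [0, 1]
  historicalAccuracy: number;  // fraction in [0, 1] for the task_type-model pair
  requiresFactCheck: boolean;
  factValidatorFailed: boolean; // assumption: "validator failed" means the fact validator
}

function baseScore(s: ScoreInputs): number {
  // Avoid dividing by zero when no validators ran.
  const ratio = s.totalValidators > 0 ? s.validatorsPassed / s.totalValidators : 0;
  return ratio * 8 + s.inputClarity * 0.5 + s.coherence * 1.5;
}

function finalScore(s: ScoreInputs): number {
  const bonus = s.historicalAccuracy > 0.9 ? 0.5 : s.historicalAccuracy > 0.75 ? 0.2 : 0;
  const penalty = s.requiresFactCheck && s.factValidatorFailed ? 0.5 : 0;
  const raw = baseScore(s) * (1 + bonus - penalty);
  // Assumption: clamp to the 0-10 scale, since the bonus can push raw above 10.
  return Math.min(10, Math.max(0, raw));
}

function gate(score: number, humanReviewCount: number, minReviews = 20): GateStatus {
  // Cold start: new task types stay in review until enough human reviews exist.
  if (humanReviewCount < minReviews) return "pending_review";
  if (score < 4) return "pending_review";
  if (score < 7) return "warning";
  return "approved";
}
```

For example, 5 of 10 validators passing with full clarity and coherence gives a base score of 6 (warning); the same output with a failed fact check on a fact-check task drops to 3 and is queued for review.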
Alternatives Considered
Alternative 1: Static Thresholds (no learning)
- Pros: Predictable, easy to reason about
- Cons: Doesn't adapt to model improvements, domain shifts, or drifting validator accuracy
- Why not: Gateway exists to learn; static thresholds waste the learning engine
Alternative 2: Single-Threshold (0–5 review, 5–10 approve)
- Pros: Simpler rules, fewer parameters
- Cons: Loses nuance; "warning" bucket is valuable for monitoring edge cases
- Why not: Warning zone catches systematic issues (e.g., "all medium-tier outputs on fact-check are slightly off")
Alternative 3: Per-Caller Thresholds (TIP uses 6.0, EO uses 4.0)
- Pros: Each caller can tune tolerance
- Cons: Inconsistent quality, hard to debug when results vary
- Why not: Gateway quality is uniform; thresholds should reflect task complexity, not caller whim
Consequences
Positive
- Automatic adaptation: Confidence thresholds self-tune to model quality over time
- Learning visibility: Dashboard shows why outputs were gated (which validator failed, etc.)
- Cost optimized: As models improve, fewer outputs queue for human review
- Quality feedback loop: Human reviews train the confidence scorer iteratively
Negative
- Delayed convergence: Takes 1–2 weeks for learning cycles to stabilize thresholds
  - Mitigate: Seed with domain expert estimates; learning adjusts from there
- Threshold oscillation: If human review feedback is noisy, thresholds drift erratically
  - Mitigate: Smoothing filter (move thresholds by at most ±0.1 per cycle)
- Review queue backlog: If 30% of requests land in pending_review, the human team becomes the bottleneck
  - Mitigate: Escalate high-confidence 4.0–5.0 to 6.0 after 1 week of queue backup
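The smoothing filter mentioned above could look like the following sketch. The function name and the two-decimal rounding are assumptions made for illustration; only the ±0.1 per-cycle limit and the 0–10 scale come from this ADR.

```typescript
// Maximum movement per learning cycle, as stated in the mitigation.
const MAX_STEP_PER_CYCLE = 0.1;

// Hypothetical helper: move `current` toward `proposed`, but never by more
// than `maxStep` in a single cycle.
function smoothThreshold(
  current: number,
  proposed: number,
  maxStep: number = MAX_STEP_PER_CYCLE
): number {
  const delta = proposed - current;
  const bounded = Math.max(-maxStep, Math.min(maxStep, delta));
  // Assumption: two decimals are enough; rounding keeps floating-point noise
  // from accumulating across many cycles.
  const next = Math.round((current + bounded) * 100) / 100;
  // Keep the threshold on the 0-10 gate scale.
  return Math.min(10, Math.max(0, next));
}
```

With this limit, a noisy cycle that proposes jumping from 4.0 to 5.0 only moves the threshold to 4.1, so a single bad batch of reviews needs several consecutive cycles to have a large effect.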
Risks
- Feedback bias: If certain human reviewers are harsher or more lenient, learning is skewed
  - Mitigate: Track reviewer agreement (Cohen's kappa) and weight accordingly
- Confidence score gaming: If models learn to game the scoring formula, thresholds creep up
  - Mitigate: Periodic validator audit; ensure validators measure actual quality
- Cascading threshold drift: If thresholds move in response to poor model tier assignment, learning mixes causes
  - Mitigate: Separate tier learning (24h) from threshold learning (12h) with distinct signals
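Reviewer agreement via Cohen's kappa can be computed as in the sketch below, using the three review verdicts from the feedback loop. The function name is hypothetical, and returning 1 when expected agreement is already 1 (both reviewers used one identical label throughout, where kappa is formally undefined) is an assumption. How the resulting kappa is turned into a reviewer weight is left open here, as it is in this ADR.

```typescript
type Verdict = "approved" | "modified" | "rejected";

// Cohen's kappa for two reviewers who rated the same outputs in the same order:
// kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
// p_e is the agreement expected by chance from each reviewer's label frequencies.
function cohensKappa(a: Verdict[], b: Verdict[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Both reviewers must rate the same non-empty set of outputs");
  }
  const labels: Verdict[] = ["approved", "modified", "rejected"];
  const n = a.length;

  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agreements++;
  }
  const observed = agreements / n;

  let expected = 0;
  for (const label of labels) {
    const shareA = a.filter(v => v === label).length / n;
    const shareB = b.filter(v => v === label).length / n;
    expected += shareA * shareB;
  }

  // Assumption: treat the degenerate single-label case as perfect agreement.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

A kappa near 1 indicates the two reviewers agree far more than chance would predict; a value near 0 means their agreement is no better than chance, which is a signal that their feedback should not move thresholds on its own.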
Implementation Notes
- Scoring implementation (existing in confidence-gate.ts):
  - Track all 23 dimensions during completion
  - Compute base_score from validators
  - Apply bonuses/penalties based on historical task accuracy
  - Return confidence + base_score + impacts (for UI debugging)
- Learning cycles (learning-engine.ts):
    // 6h: Reweight validators by their measured precision
    const validator_accuracy = await queryValidatorAccuracy(taskType);
    weights.grammar = validator_accuracy['grammar'].precision;

    // 12h: Adjust thresholds using the last 48 hours of human reviews
    const human_reviews = await queryHumanReviews(taskType, 48 /* hours */);
    const current_threshold = thresholds[taskType].review;
    const true_positives = human_reviews.filter(
      r => r.human_approved && r.confidence > current_threshold
    ).length;
    // Guard against an empty review window to avoid dividing by zero
    const true_positive_rate =
      human_reviews.length > 0 ? true_positives / human_reviews.length : 0;
    if (true_positive_rate > 0.9) thresholds[taskType].review += 0.1;

    // 24h: Assess model assignments
    const perf = await queryModelPerformance(taskType);
    if (perf.fast_confidence < 4.5) changeDefaultTier(taskType, 'fast', 'medium');
- Dashboard metrics:
  - Confidence score distribution (histogram)
  - Review queue size and age
  - Validator contribution to score (Shapley values or feature importance)
  - Threshold history over time (chart showing drift)
  - Human reviewer agreement rate
- Monitoring thresholds:
  - Alert if review queue >24h backlog
  - Alert if any threshold drifts >0.5 in 24h (possible feedback bias)
  - Alert if validator accuracy drops >10% (model degradation signal)
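The three monitoring rules above could be evaluated against a periodic snapshot roughly as follows. Everything about the snapshot shape is hypothetical. The sketch also assumes that a ">24h backlog" means the oldest queued item is more than 24 hours old, and that a ">10%" accuracy drop is measured in absolute percentage points; both readings are interpretations, not statements from this ADR.

```typescript
// Hypothetical snapshot shape; the gateway's real metrics store may differ.
interface ThresholdDrift { taskType: string; delta24h: number }
interface ValidatorTrend { validator: string; accuracyDrop: number } // 0.12 = 12 points

interface MonitoringSnapshot {
  oldestQueuedAgeHours: number;
  thresholdDrifts: ThresholdDrift[];
  validatorTrends: ValidatorTrend[];
}

function evaluateAlerts(s: MonitoringSnapshot): string[] {
  const alerts: string[] = [];

  // Rule 1: review queue backlog older than 24h (assumed: age of oldest item).
  if (s.oldestQueuedAgeHours > 24) {
    alerts.push(`review queue backlog: oldest item is ${s.oldestQueuedAgeHours}h old`);
  }

  // Rule 2: any threshold moved more than 0.5 in 24h, in either direction.
  for (const d of s.thresholdDrifts) {
    if (Math.abs(d.delta24h) > 0.5) {
      alerts.push(`threshold drift on ${d.taskType}: ${d.delta24h} in 24h`);
    }
  }

  // Rule 3: validator accuracy dropped by more than 10 points (assumed absolute).
  for (const v of s.validatorTrends) {
    if (v.accuracyDrop > 0.1) {
      alerts.push(`validator accuracy drop on ${v.validator}: ${v.accuracyDrop}`);
    }
  }

  return alerts;
}
```

Returning plain strings keeps the sketch independent of any particular alerting channel; a real implementation would likely emit structured events instead.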
Related Decisions
- ADR-0001: Multi-Agent Coworking Architecture
- ADR-0002: Tier assignment strategy
- ADR-0004: External provider fallback chain ordering