# ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals

**Date**: 2026-04-19
**Status**: accepted
**Deciders**: René Fichtmüller

## Context

The confidence gate (Stage 8 in the completion pipeline) determines whether output is approved for direct delivery or queued for human review. The gate is critical for:

- Ensuring quality (prevent hallucinations, factual errors)
- Cost control (human review is expensive, so the threshold must be precise)
- User trust (false positives destroy credibility)

**Current scoring factors** (23 dimensions per Stage 8):

- Validator pass/fail (facts, schema, grammar, tone, length, security, etc.)
- Input clarity (5–10 dimensions)
- Output coherence (5 dimensions)
- Task complexity (model match, context relevance)
- Historical accuracy (task_type-model pair performance)

**Learning cycles** must tune these weights:

- Short cycles (6h) → reactive, catch fresh degradation
- Medium cycles (12h) → strategic, smooth seasonal trends
- Long cycles (24h) → historical, guide model selection

## Decision

Implement **three-tier confidence gating with autonomous learning cycles**:

1. **Gate thresholds** (0–10 scale):
   - **0–4**: pending_review (queue for human approval + learning)
   - **4–7**: warning (deliver with confidence metadata, flag in dashboard)
   - **7–10**: approved (deliver directly, log for metrics)

2. **Confidence scoring formula**:
   ```
   base_score = (validators_passed / total_validators) × 8
              + input_clarity × 0.5
              + coherence × 1.5

   final_score = base_score × (1 + task_accuracy_bonus - hallucination_penalty)
   ```
   Where:
   - `task_accuracy_bonus`: 0.5 if historical accuracy >90%, 0.2 if >75%, otherwise 0
   - `hallucination_penalty`: 0.5 if the task is marked `requires_fact_check` and the validator failed, otherwise 0 (the formula subtracts it, so the value itself is positive)

3. **Learning cycles**:
   - **6h**: Adjust validator weights (which validators are most predictive?)
   - **12h**: Recalibrate thresholds (is 5.0 still the right boundary for review?)
   - **24h**: Assess model-tier assignments (should task_X move from fast → medium?)

4. **Feedback loop**:
   - Human reviewers mark outputs as: approved, modified, rejected
   - Learning engine correlates human decisions with confidence scores
   - Thresholds drift automatically (e.g., if 80% of outputs scoring 5.0–5.5 are approved, raise the threshold)

5. **Cold-start handling**:
   - New task types → default to pending_review until 20+ human reviews are collected
   - New model combinations → higher threshold until historical data accumulates

## Alternatives Considered

### Alternative 1: Static Thresholds (no learning)

- **Pros**: Predictable, easy to reason about
- **Cons**: Doesn't adapt to model improvements, domain shifts, or drifting validator accuracy
- **Why not**: Gateway exists to learn; static thresholds waste the learning engine

### Alternative 2: Single-Threshold (0–5 review, 5–10 approve)

- **Pros**: Simpler rules, fewer parameters
- **Cons**: Loses nuance; "warning" bucket is valuable for monitoring edge cases
- **Why not**: Warning zone catches systematic issues (e.g., "all medium-tier outputs on fact-check are slightly off")

### Alternative 3: Per-Caller Thresholds (TIP uses 6.0, EO uses 4.0)

- **Pros**: Each caller can tune tolerance
- **Cons**: Inconsistent quality, hard to debug when results vary
- **Why not**: Gateway quality is uniform; thresholds should reflect task complexity, not caller whim

## Consequences

### Positive

- **Automatic adaptation**: Confidence thresholds self-tune to model quality over time
- **Learning visibility**: Dashboard shows why outputs were gated (which validator failed, etc.)
- **Cost optimized**: As models improve, fewer outputs queue for human review
- **Quality feedback loop**: Human reviews train the confidence scorer iteratively

### Negative

- **Delayed convergence**: Takes 1–2 weeks for learning cycles to stabilize thresholds
  - Mitigate: Seed with domain expert estimates; learning adjusts from there
- **Threshold oscillation**: If human review feedback is noisy, thresholds drift erratically
  - Mitigate: Smoothing filter (move thresholds by max ±0.1 per cycle)
- **Review queue backlog**: If 30% of requests are pending_review, the human team becomes the bottleneck
  - Mitigate: Escalate high-confidence 4.0–5.0 to 6.0 after 1 week of queue backup

### Risks

- **Feedback bias**: If certain human reviewers are harsher or more lenient, learning is skewed
  - Mitigate: Track reviewer agreement (Cohen's kappa) and weight accordingly
- **Confidence score gaming**: If models learn to game the scoring formula, thresholds creep up
  - Mitigate: Periodic validator audit; ensure validators measure actual quality
- **Cascading threshold drift**: If thresholds move in response to poor model tier assignment, learning mixes causes
  - Mitigate: Separate tier learning (24h) from threshold learning (12h) with distinct signals

## Implementation Notes

1. **Scoring implementation** (existing in confidence-gate.ts):
   - Track all 23 dimensions during completion
   - Compute base_score from validators
   - Apply bonuses/penalties based on historical task accuracy
   - Return confidence + base_score + impacts (for UI debugging)

2. **Learning cycles** (learning-engine.ts):
   ```typescript
   // 6h: Reweight validators by their measured precision
   const validator_accuracy = await queryValidatorAccuracy(taskType);
   weights.grammar = validator_accuracy['grammar'].precision;

   // 12h: Adjust thresholds from the last 48 hours of human reviews
   const human_reviews = await queryHumanReviews(taskType, 48 /* hours */);
   const current_threshold = thresholds[taskType].review;
   // Share of all reviews that were human-approved and scored above the threshold
   const approved_above = human_reviews.filter(
     r => r.human_approved && r.confidence > current_threshold
   ).length;
   // Guard against an empty review window to avoid dividing by zero
   const true_positive_rate =
     human_reviews.length > 0 ? approved_above / human_reviews.length : 0;
   if (true_positive_rate > 0.9) thresholds[taskType].review += 0.1;

   // 24h: Assess model assignments
   const perf = await queryModelPerformance(taskType);
   if (perf.fast_confidence < 4.5) changeDefaultTier(taskType, 'fast', 'medium');
   ```

3. **Dashboard metrics**:
   - Confidence score distribution (histogram)
   - Review queue size and age
   - Validator contribution to score (Shapley values or feature importance)
   - Threshold history over time (chart showing drift)
   - Human reviewer agreement rate

4. **Monitoring thresholds**:
   - Alert if review queue >24h backlog
   - Alert if any threshold drifts >0.5 in 24h (possible feedback bias)
   - Alert if validator accuracy drops >10% (model degradation signal)

## Related Decisions

- ADR-0001: Multi-Agent Coworking Architecture
- ADR-0002: Tier assignment strategy
- ADR-0004: External provider fallback chain ordering
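
## Appendix: Scoring and Gating Sketch

The scoring formula, the three gate bands, and the ±0.1 smoothing filter described in this ADR can be sketched as follows. This is a minimal illustration, not the code in confidence-gate.ts. It rests on several assumptions: the `ScoreInputs` shape and all function names are hypothetical, `inputClarity` and `coherence` are taken to be normalized to 0..1 (which makes a perfect `base_score` equal 10), the final score is clamped to the 0–10 scale, and each band is treated as inclusive of its lower bound.

```typescript
// Sketch only: names, 0..1 normalization, clamping, and boundary handling are assumptions.
type Gate = "pending_review" | "warning" | "approved";

interface ScoreInputs {
  validatorsPassed: number;
  totalValidators: number;
  inputClarity: number;       // assumed normalized to 0..1
  coherence: number;          // assumed normalized to 0..1
  historicalAccuracy: number; // 0..1 for the task_type-model pair
  requiresFactCheck: boolean;
  validatorFailed: boolean;   // whether the relevant validator failed
}

function confidenceScore(s: ScoreInputs): number {
  // Avoid dividing by zero when no validators ran.
  const passRatio =
    s.totalValidators > 0 ? s.validatorsPassed / s.totalValidators : 0;
  const baseScore = passRatio * 8 + s.inputClarity * 0.5 + s.coherence * 1.5;

  // Bonus tiers from the ADR: >90% gives 0.5, >75% gives 0.2.
  const bonus =
    s.historicalAccuracy > 0.9 ? 0.5 : s.historicalAccuracy > 0.75 ? 0.2 : 0;
  // The penalty is a positive value that the formula subtracts.
  const penalty = s.requiresFactCheck && s.validatorFailed ? 0.5 : 0;

  const finalScore = baseScore * (1 + bonus - penalty);
  // Assumption: keep the result on the 0-10 gate scale.
  return Math.min(10, Math.max(0, finalScore));
}

function gate(score: number): Gate {
  // Assumption: a score exactly on a boundary falls into the higher band.
  if (score < 4) return "pending_review";
  if (score < 7) return "warning";
  return "approved";
}

// Smoothing filter: move toward the proposed threshold by at most maxStep per cycle.
function smoothThreshold(current: number, proposed: number, maxStep = 0.1): number {
  const delta = Math.max(-maxStep, Math.min(maxStep, proposed - current));
  return current + delta;
}
```

As a worked case under these assumptions: 6 of 8 validators passing, `inputClarity` 0.8, `coherence` 0.6, historical accuracy 0.8, and a failed validator on a fact-check task give a base score of 7.3 and a multiplier of 0.7, so the final score is about 5.11 and the output lands in the warning band.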