- ADR-0001: Multi-Agent Coworking Architecture with LLM Gateway Orchestrator - ADR-0002: Tier Assignment Strategy for Model Selection (cost-first escalation) - ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals (6h/12h/24h cycles) - ADR-0004: External Provider Fallback Chain Ordering (Cerebras → Groq → Mistral) - Enhanced client SDK: Offline Ollama fallback, health checks, exponential backoff retry - Integration tests: claude-code-integration.test.ts (14 test cases) - PHASE_2F_DEPLOYMENT.md: Pre-deployment checklist, automated deploy, rollback plan - Post-deployment verification procedures for health, client fallback, metrics
140 lines
6.4 KiB
Markdown
140 lines
6.4 KiB
Markdown
# ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals
|
||
|
||
**Date**: 2026-04-19
|
||
**Status**: accepted
|
||
**Deciders**: René Fichtmüller
|
||
|
||
## Context
|
||
|
||
The confidence gate (Stage 8 in completion pipeline) determines whether output is approved for direct delivery or queued for human review. The gate is critical for:
|
||
- Ensuring quality (prevent hallucinations, factual errors)
|
||
- Cost control (human review is expensive, so threshold must be precise)
|
||
- User trust (false positives destroy credibility)
|
||
|
||
**Current scoring factors** (23 dimensions per Stage 8):
|
||
- Validator pass/fail (facts, schema, grammar, tone, length, security, etc.)
|
||
- Input clarity (5–10 dimensions)
|
||
- Output coherence (5 dimensions)
|
||
- Task complexity (model match, context relevance)
|
||
- Historical accuracy (task_type-model pair performance)
|
||
|
||
**Learning cycles** must tune these weights:
|
||
- Short cycles (6h) → reactive, catch fresh degradation
|
||
- Medium cycles (12h) → strategic, smooth seasonal trends
|
||
- Long cycles (24h) → historical, guide model selection
|
||
|
||
## Decision
|
||
|
||
Implement **three-tier confidence gating with autonomous learning cycles**:
|
||
|
||
1. **Gate thresholds** (0–10 scale):
|
||
- **0–4**: pending_review (queue for human approval + learning)
|
||
- **4–7**: warning (deliver with confidence metadata, flag in dashboard)
|
||
- **7–10**: approved (deliver directly, log for metrics)
|
||
|
||
2. **Confidence scoring formula**:
|
||
```
|
||
base_score = (validators_passed / total_validators) × 8 + input_clarity × 0.5 + coherence × 1.5
|
||
final_score = base_score × (1 + task_accuracy_bonus - hallucination_penalty)
|
||
```
|
||
Where:
|
||
- `task_accuracy_bonus`: +0.5 if historical accuracy >90%, +0.2 if >75%
|
||
- `hallucination_penalty`: -0.5 if task requires_fact_check and validator failed
|
||
|
||
3. **Learning cycles**:
|
||
- **6h**: Adjust validator weights (which validators are most predictive?)
|
||
- **12h**: Recalibrate thresholds (is 5.0 still the right boundary for review?)
|
||
- **24h**: Assess model-tier assignments (should task_X move from fast → medium?)
|
||
|
||
4. **Feedback loop**:
|
||
- Human reviewers mark outputs as: approved, modified, rejected
|
||
- Learning engine correlates human decisions with confidence scores
|
||
- Thresholds drift automatically (e.g., if 80% of 5.0–5.5 are approved, raise threshold)
|
||
|
||
5. **Cold-start handling**:
|
||
- New task types → default to pending_review until 20+ human reviews collected
|
||
- New model combinations → higher threshold until historical data accumulates
|
||
|
||
## Alternatives Considered
|
||
|
||
### Alternative 1: Static Thresholds (no learning)
|
||
- **Pros**: Predictable, easy to reason about
|
||
- **Cons**: Doesn't adapt to model improvements, domain shifts, or drifting validator accuracy
|
||
- **Why not**: Gateway exists to learn; static thresholds waste the learning engine
|
||
|
||
### Alternative 2: Single-Threshold (0–5 review, 5–10 approve)
|
||
- **Pros**: Simpler rules, fewer parameters
|
||
- **Cons**: Loses nuance; "warning" bucket is valuable for monitoring edge cases
|
||
- **Why not**: Warning zone catches systematic issues (e.g., "all medium-tier outputs on fact-check are slightly off")
|
||
|
||
### Alternative 3: Per-Caller Thresholds (TIP uses 6.0, EO uses 4.0)
|
||
- **Pros**: Each caller can tune tolerance
|
||
- **Cons**: Inconsistent quality, hard to debug when results vary
|
||
- **Why not**: Gateway quality is uniform; thresholds should reflect task complexity, not caller whim
|
||
|
||
## Consequences
|
||
|
||
### Positive
|
||
- **Automatic adaptation**: Confidence thresholds self-tune to model quality over time
|
||
- **Learning visibility**: Dashboard shows why outputs were gated (which validator failed, etc.)
|
||
- **Cost optimized**: As models improve, fewer outputs queue for human review
|
||
- **Quality feedback loop**: Human reviews train the confidence scorer iteratively
|
||
|
||
### Negative
|
||
- **Delayed convergence**: Takes 1-2 weeks for learning cycles to stabilize thresholds
|
||
- Mitigate: Seed with domain expert estimates; learning adjusts from there
|
||
- **Threshold oscillation**: If human review feedback is noisy, thresholds drift erratically
|
||
- Mitigate: Smoothing filter (move thresholds by max ±0.1 per cycle)
|
||
- **Review queue backlog**: If 30% of requests pending_review, human team is bottleneck
|
||
- Mitigate: Escalate high-confidence 4.0–5.0 to 6.0 after 1 week of queue backup
|
||
|
||
### Risks
|
||
- **Feedback bias**: If certain human reviewers are harsher/more lenient, learning is skewed
|
||
- Mitigate: Track reviewer agreement (Cohen's kappa) and weight accordingly
|
||
- **Confidence score gaming**: If models learn to game the scoring formula, thresholds creep up
|
||
- Mitigate: Periodic validator audit; ensure validators measure actual quality
|
||
- **Cascading threshold drift**: If thresholds move in response to poor model tier assignment, learning mixes causes
|
||
- Mitigate: Separate tier learning (12h) from threshold learning (12h) with distinct signals
|
||
|
||
## Implementation Notes
|
||
|
||
1. **Scoring implementation** (existing in confidence-gate.ts):
|
||
- Track all 23 dimensions during completion
|
||
- Compute base_score from validators
|
||
- Apply bonuses/penalties based on historical task accuracy
|
||
- Return confidence + base_score + impacts (for UI debugging)
|
||
|
||
2. **Learning cycles** (learning-engine.ts):
|
||
```typescript
|
||
// 6h: Reweight validators
|
||
const validator_accuracy = await queryValidatorAccuracy(taskType);
|
||
weights.grammar = validator_accuracy['grammar'].precision;
|
||
|
||
// 12h: Adjust thresholds
|
||
const human_reviews = await queryHumanReviews(taskType, hours=48);
|
||
const current_threshold = thresholds[taskType].review;
|
||
const true_positive_rate = human_reviews.filter(r => r.human_approved && r.confidence > current_threshold).length / human_reviews.length;
|
||
if (true_positive_rate > 0.9) thresholds[taskType].review += 0.1;
|
||
|
||
// 24h: Assess model assignments
|
||
const perf = await queryModelPerformance(taskType);
|
||
if (perf.fast_confidence < 4.5) changeDefaultTier(taskType, 'fast', 'medium');
|
||
```
|
||
|
||
3. **Dashboard metrics**:
|
||
- Confidence score distribution (histogram)
|
||
- Review queue size and age
|
||
- Validator contribution to score (Shapley values or feature importance)
|
||
- Threshold history over time (chart showing drift)
|
||
- Human reviewer agreement rate
|
||
|
||
4. **Monitoring thresholds**:
|
||
- Alert if review queue >24h backlog
|
||
- Alert if any threshold drifts >0.5 in 24h (possible feedback bias)
|
||
- Alert if validator accuracy drops >10% (model degradation signal)
|
||
|
||
## Related Decisions
|
||
- ADR-0001: Multi-Agent Coworking Architecture
|
||
- ADR-0002: Tier assignment strategy
|
||
- ADR-0004: External provider fallback chain ordering
|