Rene Fichtmueller 2ca77d0aee feat: Phase 2F — Multi-Agent Integration (ADRs + Client Fallback + Tests)

- ADR-0001: Multi-Agent Coworking Architecture with LLM Gateway Orchestrator
- ADR-0002: Tier Assignment Strategy for Model Selection (cost-first escalation)
- ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals (6h/12h/24h cycles)
- ADR-0004: External Provider Fallback Chain Ordering (Cerebras → Groq → Mistral)
- Enhanced client SDK: Offline Ollama fallback, health checks, exponential backoff retry
- Integration tests: claude-code-integration.test.ts (14 test cases)
- PHASE_2F_DEPLOYMENT.md: Pre-deployment checklist, automated deploy, rollback plan
- Post-deployment verification procedures for health, client fallback, metrics

2026-04-19 21:39:44 +02:00


ADR-0003: Confidence Gate Thresholds & Learning Cycle Intervals

Date: 2026-04-19
Status: accepted
Deciders: René Fichtmüller

Context

The confidence gate (Stage 8 in the completion pipeline) determines whether an output is approved for direct delivery or queued for human review. The gate is critical for:

Ensuring quality (preventing hallucinations and factual errors)
Cost control (human review is expensive, so the threshold must be precise)
User trust (false positives, i.e. flawed outputs that pass the gate, destroy credibility)

Current scoring factors (23 dimensions per Stage 8):

Validator pass/fail (facts, schema, grammar, tone, length, security, etc.)
Input clarity (5–10 dimensions)
Output coherence (5 dimensions)
Task complexity (model match, context relevance)
Historical accuracy (task_type-model pair performance)

Learning cycles must tune these weights:

Short cycles (6h) → reactive, catch fresh degradation
Medium cycles (12h) → strategic, smooth seasonal trends
Long cycles (24h) → historical, guide model selection

Decision

Implement three-tier confidence gating with autonomous learning cycles:

Gate thresholds (0–10 scale):
- 0–4: pending_review (queue for human approval + learning)
- 4–7: warning (deliver with confidence metadata, flag in dashboard)
- 7–10: approved (deliver directly, log for metrics)
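
The three buckets above can be sketched as a small classifier. This is a minimal illustration, not the code in confidence-gate.ts: the names `GateDecision`, `GateThresholds`, and `classifyConfidence` are hypothetical, and because the ranges above share the values 4 and 7, the sketch assumes a score exactly on a boundary falls into the higher bucket.

```typescript
// Hypothetical names for illustration; not taken from confidence-gate.ts.
type GateDecision = "pending_review" | "warning" | "approved";

interface GateThresholds {
  review: number;  // scores below this are queued for human review
  approve: number; // scores at or above this are delivered directly
}

// Defaults taken from the gate thresholds listed in this ADR.
const DEFAULT_THRESHOLDS: GateThresholds = { review: 4, approve: 7 };

function classifyConfidence(
  score: number,
  thresholds: GateThresholds = DEFAULT_THRESHOLDS,
): GateDecision {
  if (!Number.isFinite(score) || score < 0 || score > 10) {
    throw new RangeError(`confidence score out of range: ${score}`);
  }
  // Assumption: boundary values (exactly 4 or 7) go to the higher bucket.
  if (score < thresholds.review) return "pending_review";
  if (score < thresholds.approve) return "warning";
  return "approved";
}
```

Passing the thresholds as a parameter keeps the classifier compatible with per-task-type thresholds that the learning cycles adjust over time.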

Confidence scoring formula:

base_score = (validators_passed / total_validators) × 8 + input_clarity × 0.5 + coherence × 1.5
final_score = base_score × (1 + task_accuracy_bonus - hallucination_penalty)

Where:

task_accuracy_bonus: 0.5 if historical accuracy >90%, 0.2 if >75%, otherwise 0
hallucination_penalty: 0.5 if the task has requires_fact_check set and the validator failed, otherwise 0 (the formula subtracts it)
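
A worked sketch of the formula may help when checking numbers by hand. It rests on assumptions the ADR does not state: `input_clarity` and `coherence` are taken to be normalized to 0..1 (so that the base score tops out at 8 + 0.5 + 1.5 = 10), the penalty is a positive 0.5 that the formula subtracts, and the final score is clamped to the 0–10 scale. All identifiers (`ScoreInputs`, `computeConfidence`, `factValidatorFailed`) are hypothetical.

```typescript
// Hypothetical input shape; field ranges are assumptions, see lead-in.
interface ScoreInputs {
  validatorsPassed: number;
  totalValidators: number;
  inputClarity: number;       // assumed 0..1
  coherence: number;          // assumed 0..1
  historicalAccuracy: number; // 0..1 for the task_type-model pair
  requiresFactCheck: boolean;
  factValidatorFailed: boolean;
}

function computeConfidence(i: ScoreInputs): { baseScore: number; finalScore: number } {
  if (i.totalValidators <= 0) {
    throw new RangeError("totalValidators must be positive");
  }
  const baseScore =
    (i.validatorsPassed / i.totalValidators) * 8 +
    i.inputClarity * 0.5 +
    i.coherence * 1.5;

  // Bonus tiers from the ADR: >90% gives 0.5, >75% gives 0.2, otherwise 0.
  const bonus = i.historicalAccuracy > 0.9 ? 0.5 : i.historicalAccuracy > 0.75 ? 0.2 : 0;
  // Penalty applies only when a fact check is required and it failed.
  const penalty = i.requiresFactCheck && i.factValidatorFailed ? 0.5 : 0;

  const raw = baseScore * (1 + bonus - penalty);
  // Assumption: clamp to the 0–10 gate scale, since the bonus can push past 10.
  const finalScore = Math.min(10, Math.max(0, raw));
  return { baseScore, finalScore };
}
```

For example, with 6 of 10 validators passing, clarity 0.8, coherence 0.6, and 80% historical accuracy, the base score is about 6.1 and the final score about 7.32. The same output with no bonus and a failed required fact check drops to about 3.05.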

Learning cycles:
- 6h: Adjust validator weights (which validators are most predictive?)
- 12h: Recalibrate thresholds (is 5.0 still the right boundary for review?)
- 24h: Assess model-tier assignments (should task_X move from fast → medium?)

Feedback loop:
- Human reviewers mark outputs as: approved, modified, rejected
- Learning engine correlates human decisions with confidence scores
- Thresholds drift automatically (e.g., if reviewers approve 80% of outputs scoring 5.0–5.5, raise the threshold)

Cold-start handling:
- New task types → default to pending_review until 20+ human reviews are collected
- New model combinations → higher threshold until historical data accumulates
Alternatives Considered

Alternative 1: Static Thresholds (no learning)

Pros: Predictable, easy to reason about
Cons: Doesn't adapt to model improvements, domain shifts, or drifting validator accuracy
Why not: Gateway exists to learn; static thresholds waste the learning engine

Alternative 2: Single-Threshold (0–5 review, 5–10 approve)

Pros: Simpler rules, fewer parameters
Cons: Loses nuance; "warning" bucket is valuable for monitoring edge cases
Why not: Warning zone catches systematic issues (e.g., "all medium-tier outputs on fact-check are slightly off")

Alternative 3: Per-Caller Thresholds (TIP uses 6.0, EO uses 4.0)

Pros: Each caller can tune tolerance
Cons: Inconsistent quality, hard to debug when results vary
Why not: Gateway quality is uniform; thresholds should reflect task complexity, not caller whim

Consequences

Positive

Automatic adaptation: Confidence thresholds self-tune to model quality over time
Learning visibility: Dashboard shows why outputs were gated (which validator failed, etc.)
Cost optimized: As models improve, fewer outputs queue for human review
Quality feedback loop: Human reviews train the confidence scorer iteratively

Negative

Delayed convergence: Takes 1-2 weeks for learning cycles to stabilize thresholds
- Mitigate: Seed with domain expert estimates; learning adjusts from there
Threshold oscillation: If human review feedback is noisy, thresholds drift erratically
- Mitigate: Smoothing filter (move thresholds by max ±0.1 per cycle)
Review queue backlog: If 30% of requests land in pending_review, the human team becomes the bottleneck
- Mitigate: Escalate high-confidence 4.0–5.0 to 6.0 after 1 week of queue backup
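
The smoothing filter mentioned for threshold oscillation amounts to clamping each cycle's change. A minimal sketch, assuming the learning engine proposes a new threshold and the filter limits how far the stored value may move; `smoothThreshold` is a hypothetical name.

```typescript
// Maximum movement per learning cycle, from the mitigation in this ADR.
const MAX_STEP_PER_CYCLE = 0.1;

function smoothThreshold(
  current: number,
  proposed: number,
  maxStep: number = MAX_STEP_PER_CYCLE,
): number {
  // Limit the change to the range [-maxStep, +maxStep].
  const delta = Math.max(-maxStep, Math.min(maxStep, proposed - current));
  // Keep the result on the 0–10 gate scale.
  const next = Math.min(10, Math.max(0, current + delta));
  // Round to two decimals so repeated cycles do not accumulate float noise.
  return Math.round(next * 100) / 100;
}
```

A noisy proposal such as moving from 4.0 to 5.2 is applied as a single step to 4.1, so a sustained signal still gets there over several cycles while a one-off spike has little effect.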

Risks

Feedback bias: If certain human reviewers are harsher/more lenient, learning is skewed
- Mitigate: Track reviewer agreement (Cohen's kappa) and weight accordingly
Confidence score gaming: If models learn to game the scoring formula, thresholds creep up
- Mitigate: Periodic validator audit; ensure validators measure actual quality
Cascading threshold drift: If thresholds move in response to poor model-tier assignment, the learning engine conflates two separate causes
- Mitigate: Separate tier learning (24h) from threshold learning (12h) with distinct signals
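
For the reviewer-agreement mitigation, Cohen's kappa compares observed agreement p_o with the agreement expected by chance p_e: kappa = (p_o − p_e) / (1 − p_e). The sketch below computes it for two reviewers who rated the same outputs with the three verdicts used in the feedback loop. `Verdict` and `cohensKappa` are hypothetical names, and how the resulting value feeds into reviewer weighting is left open, since the ADR does not specify it.

```typescript
type Verdict = "approved" | "modified" | "rejected";

const VERDICTS: Verdict[] = ["approved", "modified", "rejected"];

// Cohen's kappa for two reviewers rating the same outputs in the same order.
function cohensKappa(a: Verdict[], b: Verdict[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("both reviewers must rate the same non-empty set of outputs");
  }
  const n = a.length;
  const countA: Record<Verdict, number> = { approved: 0, modified: 0, rejected: 0 };
  const countB: Record<Verdict, number> = { approved: 0, modified: 0, rejected: 0 };
  let agreements = 0;

  for (let i = 0; i < n; i++) {
    const va = a[i] as Verdict;
    const vb = b[i] as Verdict;
    if (va === vb) agreements++;
    countA[va] += 1;
    countB[vb] += 1;
  }

  const observed = agreements / n; // p_o
  let expected = 0;                // p_e: chance agreement from marginal rates
  for (const v of VERDICTS) {
    expected += (countA[v] / n) * (countB[v] / n);
  }
  // Kappa is undefined when both reviewers always give the same single verdict;
  // treating that as full agreement is an assumption of this sketch.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

Values near 1 indicate reviewers who agree well beyond chance; values near 0 suggest agreement no better than chance, which is a signal to down-weight or investigate that reviewer pair before their feedback moves thresholds.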

Implementation Notes

Scoring implementation (existing in confidence-gate.ts):
- Track all 23 dimensions during completion
- Compute base_score from validators
- Apply bonuses/penalties based on historical task accuracy
- Return confidence + base_score + impacts (for UI debugging)

Learning cycles (learning-engine.ts):

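// 6h: Reweight validators by their measured precision
const validator_accuracy = await queryValidatorAccuracy(taskType);
weights.grammar = validator_accuracy['grammar'].precision;

// 12h: Adjust thresholds from the last 48 hours of human reviews
const human_reviews = await queryHumanReviews(taskType, 48); // second argument: look-back window in hours
const current_threshold = thresholds[taskType].review;
// Share of reviewed outputs that scored above the threshold and were approved by a human
const approved_above = human_reviews.filter(r => r.human_approved && r.confidence > current_threshold).length;
// Guard against an empty review window to avoid dividing by zero
const true_positive_rate = human_reviews.length > 0 ? approved_above / human_reviews.length : 0;
if (true_positive_rate > 0.9) thresholds[taskType].review += 0.1;

// 24h: Assess model assignments
const perf = await queryModelPerformance(taskType);
if (perf.fast_confidence < 4.5) changeDefaultTier(taskType, 'fast', 'medium');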

Dashboard metrics:
- Confidence score distribution (histogram)
- Review queue size and age
- Validator contribution to score (Shapley values or feature importance)
- Threshold history over time (chart showing drift)
- Human reviewer agreement rate

Monitoring thresholds:
- Alert if review queue >24h backlog
- Alert if any threshold drifts >0.5 in 24h (possible feedback bias)
- Alert if validator accuracy drops >10% (model degradation signal)
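
The three alert conditions can be sketched as a pure function over a metrics snapshot. The names (`MonitoringSnapshot`, `evaluateAlerts`, the alert identifiers) are hypothetical, and two readings are assumptions: "review queue >24h backlog" is taken to mean the oldest queued item is older than 24 hours, and a ">10%" accuracy drop is taken as 10 percentage points against a baseline.

```typescript
// Hypothetical snapshot of the metrics the alerts depend on.
interface MonitoringSnapshot {
  oldestQueuedItemAgeHours: number;  // age of the oldest pending_review item
  thresholdNow: number;              // current threshold for a task type
  threshold24hAgo: number;           // same threshold 24 hours earlier
  validatorAccuracyNow: number;      // 0..1
  validatorAccuracyBaseline: number; // 0..1, reference value to compare against
}

type Alert = "review_queue_backlog" | "threshold_drift" | "validator_accuracy_drop";

function evaluateAlerts(s: MonitoringSnapshot): Alert[] {
  const alerts: Alert[] = [];
  // Assumption: backlog is measured by the age of the oldest queued item.
  if (s.oldestQueuedItemAgeHours > 24) alerts.push("review_queue_backlog");
  // Drift in either direction by more than 0.5 within 24h (possible feedback bias).
  if (Math.abs(s.thresholdNow - s.threshold24hAgo) > 0.5) alerts.push("threshold_drift");
  // Assumption: ">10%" means more than 10 percentage points below baseline.
  if (s.validatorAccuracyBaseline - s.validatorAccuracyNow > 0.1) {
    alerts.push("validator_accuracy_drop");
  }
  return alerts;
}
```

Keeping the checks free of I/O makes them easy to unit test; the caller would be responsible for gathering the snapshot and routing any returned alerts.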

Related

ADR-0001: Multi-Agent Coworking Architecture
ADR-0002: Tier assignment strategy
ADR-0004: External provider fallback chain ordering
