llm-gateway/docs/adr/0006-learning-system-integration.md
Rene Fichtmueller a04c1d67f2 feat: Complete LightRAG Sidecar Phase 2 — Hybrid Retrieval Implementation
Delivers production-ready knowledge graph sidecar with hybrid BM25+vector search.

COMPONENTS:
- RetrievalService: Hybrid BM25 + Qdrant vector search with RRF fusion (k=60, 0.4/0.6 weights)
- IngestionService: Document pipeline with Ollama entity extraction, entity linking, bge-m3 embeddings
- EvaluationService: Precision@K, Recall@K, MRR@K, NDCG@K metrics with FTS baseline comparison
- Database schema: Entity, Relation, Document, QueryLog, EvaluationResult ORM models
- API routes: /api/kg/query, /api/kg/ingest, /api/kg/eval, /api/kg/health

INFRASTRUCTURE:
- FastAPI 0.104 async server on port 3140
- PostgreSQL 17 + pgvector for knowledge graph storage
- Qdrant 2.7 vector database with COSINE distance (384-dim bge-m3)
- Ollama qwen2.5:14b for entity extraction via JSON-structured prompts
- PM2 ecosystem configuration for Erik production deployment

TESTING & DEPLOYMENT:
- TESTING.md: 5-phase local testing workflow with examples
- DEPLOYMENT_CHECKLIST.md: Step-by-step Erik deployment guide
- eval-transceiver-50qa.json: 50 Q&A evaluation pairs for transceiver domain
- populate_eval_set.py: Interactive script to populate ground truth document IDs
- READINESS_CHECKLIST.md: Pre-deployment verification checklist
- bootstrap_tip_data.py: Load TIP blog documents via API

PERFORMANCE TARGETS:
 Query latency p95: <500ms
 Recall@10: ≥85% (vs 72% FTS baseline)
 Entity extraction accuracy: ≥90%
 Ingestion throughput: ≥100 docs/sec
 Memory usage: <1GB

Ready for Phase 3: E2E testing, TypeScript client, multi-domain support.
2026-04-25 05:47:18 +02:00

192 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-0006: Learning System Integration & Per-Agent Metrics
**Date**: 2026-04-19
**Status**: accepted
**Deciders**: Rene Fichtmueller
## Context
The multi-agent architecture (ADR-0005) connects heterogeneous clients (Claude Code, Codex, ChatGPT, Ollama) to a shared LLM Gateway with independent adapter layers. Each agent has different:
- Request patterns (IDE completions vs full conversations)
- Model preferences (Claude Code needs fast inference, ChatGPT clients expect GPT models)
- Success criteria (IDE: response latency + relevance, ChatGPT: token count + completion quality)
- Failure tolerance (IDE: silent fallback acceptable, ChatGPT: explicit error required)
The learning engine (Phase 2D) currently optimizes globally across all traffic. This creates a mismatch: optimizations for ChatGPT streaming may degrade IDE completions, and per-agent feedback is lost in aggregation.
**Forces:**
- Learning efficiency requires per-agent signal isolation (what helps Claude Code may hurt ChatGPT)
- Agents have distinct success metrics — cannot optimize for all simultaneously
- Fallback chains should be tuned per agent (IDE tolerates Ollama, ChatGPT may reject it)
- Cost attribution: multi-tenant billing requires knowing which agent consumed tokens
## Decision
Extend the learning system to track per-agent metrics in parallel with global optimization:
**1. Per-Agent Metric Collection**
- Agent-scoped request log: `gateway_request_log``agent_id` + `model` + `latency_ms` + `tokens_{in,out}` + `confidence` + `fallback_used`
- Agent request registry: track request volume by agent and model tier (fast/medium/large)
- Agent-specific latency targets: Claude Code ≤100ms, ChatGPT ≤500ms (streaming chunk), Ollama-based adapters ≤2s
**2. Agent-Scoped Learning Metrics**
- **Confidence evolution**: Per-agent score tracks "how well does model X work for agent Y"
- Initialized from global baseline (ADR-0003)
- Updated on every agent request based on observed outcome (success/fallback)
- Separate from global confidence — agent-specific signal only
- **Accuracy tracking**: Agent-specific success rate (model X + agent Y combination)
- IDE: detected via code compilation success or test pass/fail
- ChatGPT: explicit feedback via client signal (thumbs up/down in UI)
- Ollama adapter: tracked via request completion time
- **Cost per agent**: Monthly token consumption × model cost + compute time
- Agent cost reports generated on UTC 00:00 daily
- Used for cost attribution and budgeting decisions
**3. Adaptive Per-Agent Routing**
- Agent-specific confidence gate (ADR-0003, threshold T) overrides global gate
- Claude Code: T=0.65 (low latency trumps perfect accuracy)
- ChatGPT: T=0.75 (accuracy critical, users expect quality)
- Codex: T=0.70 (balanced)
- Per-agent fallback chain priority
- Claude Code: Ollama → external (Mistral, Groq) if latency acceptable
- ChatGPT: External → Ollama only if gateway unavailable
- Codex LSP: Gateway only (no fallback)
- Agent-specific model tier selection
- Request scoring (ADR-0002 enhanced): add agent context to dimension set
- Dimensions now include: `agent_id`, `context_tokens`, `user_language`, etc.
- Score computation per-agent lookup table (learned over time)
**4. Integration with Learning Engine**
- Feedback loop: agent adapter → gateway metrics → learning engine
- Agent ID propagated in every request (header `X-Agent-ID` + request body)
- Response includes agent-specific confidence and model choice rationale
- Learning job phases (30min/1h/6h/12h, ADR-0003):
- Phase 1: Aggregate global metrics (existing)
- Phase 2: Compute per-agent slices (new)
- Phase 3: Update per-agent confidence scores (new)
- Phase 4: Regenerate per-agent routing rules (new)
- Phase 5: A/B test on 10% of traffic, measure per-agent impact
- Conflict resolution: if global and agent scores diverge
- Agent confidence takes precedence (local signal > global)
- Log divergence for human review (may indicate model degradation or agent change)
**5. Agent Feedback Integration**
- API endpoint: `POST /agents/{agent-id}/feedback`
- Payload: `{ request_id, outcome, metadata }`
- Outcomes: `success`, `fallback`, `timeout`, `error`, `user_rejected`
- Metadata: completion_quality (0-10), latency_ms, token_count
- Asynchronous feedback processing
- Feedback ingested into agent request log (backfill for requests without explicit feedback)
- Used to update per-agent confidence on next learning cycle
- User feedback from ChatGPT UI
- Thumbs up/down on completion → agent feedback signal
- Aggregated into `user_satisfaction` metric per model/agent pair
## Alternatives Considered
### Alternative 1: Global Learning Only
- **Pros**: Simpler implementation, unified signal, fewer moving parts
- **Cons**: Cannot optimize for heterogeneous agents, per-agent feedback lost, cost attribution unclear
- **Why not**: Agents have fundamentally different success criteria (IDE latency ≠ ChatGPT quality)
### Alternative 2: Separate Learning Engines Per Agent
- **Pros**: Complete isolation, agent-specific optimization, no cross-agent interference
- **Cons**: Massive duplication, learning curves 5x longer (fewer samples per agent), no knowledge sharing
- **Why not**: Claude Code and ChatGPT both benefit from qwen models — throwing away cross-agent signal is wasteful
### Alternative 3: Callback-Based Feedback (No Agent Context)
- **Pros**: Minimal changes to learning engine, compatible with existing code
- **Cons**: Cannot attribute feedback to specific agent, routing decisions remain global
- **Why not**: Feedback without agent context is noise — we would not know which agent benefited from routing change
### Alternative 4: Agent Context in Request ID (Ephemeral)
- **Pros**: No new fields, agent context derived from request ID structure
- **Cons**: Fragile (if request ID format changes, tracing breaks), no standardization
- **Why not**: Tight coupling to request ID generation; agent metadata should be explicit
## Consequences
### Positive
- **Per-agent cost attribution**: Identify which agents are expensive (e.g., ChatGPT streaming uses 3x tokens)
- **Latency SLOs per agent**: Claude Code gets optimized for <100ms, ChatGPT for <500ms/chunk
- **Agent-specific routing**: Can prefer qwen2.5:3b for IDE, :32b for ChatGPT without global harm
- **Learning efficiency**: Signal isolation prevents "optimal for ChatGPT" from breaking IDE responsiveness
- **Fallback diversity**: Claude Code can use Ollama, ChatGPT uses external only no one-size-fits-all risk
- **Early detection of agent issues**: If Claude Code confidence drops 20% in 1h, alert (possible adapter bug)
### Negative
- **Increased storage**: Per-agent metrics = ~10x request logs compared to aggregated global (50GB 500GB annually)
- **Learning complexity**: Logic for per-agent confidence updates, conflict resolution, feedback ingestion
- **Operational overhead**: Monthly cost reports per agent, per-agent SLO dashboards, alerting rules
- **Agent coupling**: Changes to agent (e.g., ChatGPT client SDK upgrade) may shift confidence requires relearning
- **Feedback dependency**: Learning quality degrades if agents don't send feedback (must have fallback)
### Risks
- **Stale per-agent data**: If ChatGPT adapter goes offline for 6h, historical confidence becomes misleading Mitigation: decay confidence over time (10% per day)
- **Contradictory scores**: Global says "model X is bad", agent says "model X works great for me" Mitigation: log divergence, human review before policy change
- **Cost explosion**: Per-agent metrics + request logs could 10x storage costs Mitigation: retention policy (30 days hot, 90 days warm, 1yr cold archive)
- **Privacy**: Agent IDs in logs could enable tracking "which agent requested what" Mitigation: agent_id anonymized (hash), explicit opt-out for sensitive agents
## Implementation Plan
### Phase 2G.4.1: Per-Agent Request Logging (Week 1)
- Add `agent_id` field to `gateway_request_log` table
- Modify client SDK / adapters to inject `X-Agent-ID` header
- Backfill historical requests with agent ID from source IP heuristics (fallback)
- Test with Claude Code + Codex adapters
### Phase 2G.4.2: Per-Agent Confidence Scoring (Week 2)
- Create `agent_confidence_scores` table: `(agent_id, model, score, updated_at)`
- Update learning engine Phase 3 to compute per-agent slices from request log
- Implement per-agent confidence gate in router (override global gate if agent score available)
- A/B test: 10% of traffic uses per-agent routing, 90% uses global (measure impact)
### Phase 2G.4.3: Per-Agent Feedback Loop (Week 2)
- Implement `POST /agents/{agent-id}/feedback` endpoint
- Adapter SDKs: send feedback after each completion (success/fallback/error)
- ChatGPT UI: wire feedback buttons to feedback endpoint
- Asynchronously ingest feedback into learning engine
### Phase 2G.4.4: Cost Attribution & Reporting (Week 3)
- Dashboard: per-agent token consumption, monthly cost, cost per request
- Daily cost report: `daily_agent_costs.csv` (agent_id, tokens_in, tokens_out, cost_usd)
- Alert: if agent cost > historical avg + 2σ (detect runaway requests)
### Phase 2G.4.5: Per-Agent SLO Monitoring (Week 3)
- Latency SLOs: Claude Code ≤100ms p99, ChatGPT ≤500ms p95 (streaming chunk)
- Alert: SLO breach (e.g., IDE completions suddenly >200ms) → investigate model issue
- Dashboard: per-agent latency heatmap (hourly p50/p95/p99)
### Phase 2G.4.6: Documentation & Runbook (Week 4)
- ADR-0006 (this document)
- Runbook: "Agent Confidence Divergence" (what to do if global ≠ agent scores)
- Runbook: "Cost Spike Investigation" (how to debug high-cost agent)
## Open Questions
1. **Feedback Mechanism**: Should adapters automatically send feedback, or require explicit client instrumentation?
- Current decision: Automatic (adapters track success/fallback)
- Open: How to detect IDE compilation success without IDE instrumentation?
2. **Confidence Decay**: How aggressively should per-agent confidence decay over time?
- Current decision: 10% per day (reaches 50% confidence after ~7 days of inactivity)
- Open: Should decay be different per agent (IDE less decay than ChatGPT)?
3. **Fallback Privacy**: Should fallback usage be logged per agent (privacy concern)?
- Current decision: Yes, with anonymized agent_id
- Open: Do sensitive agents need to opt out of logging?
4. **Conflict Resolution**: If global says "model X bad" but agent says "X works great", which wins?
- Current decision: Agent wins (local > global)
- Open: Should conflicts trigger human review before policy change?
5. **Cross-Agent Learning**: Can agent A learn from agent B's feedback?
- Current decision: Yes (global learning phase pools all agent signals)
- Open: Should some agents be "first-class" (their feedback weighs more)?
## Related ADRs
- [ADR-0001](0001-multi-agent-coworking-architecture.md) — Multi-agent architecture
- [ADR-0002](0002-tier-assignment-strategy.md) — Tier assignment (now per-agent)
- [ADR-0003](0003-confidence-gate-thresholds.md) — Confidence gate (now per-agent override)
- [ADR-0005](0005-agent-integration-protocol.md) — Agent integration protocol (feedback extension)