llm-gateway/docs/adr/0006-learning-system-integration.md
Rene Fichtmueller a04c1d67f2 feat: Complete LightRAG Sidecar Phase 2 — Hybrid Retrieval Implementation
Delivers production-ready knowledge graph sidecar with hybrid BM25+vector search.

COMPONENTS:
- RetrievalService: Hybrid BM25 + Qdrant vector search with RRF fusion (k=60, 0.4/0.6 weights)
- IngestionService: Document pipeline with Ollama entity extraction, entity linking, bge-m3 embeddings
- EvaluationService: Precision@K, Recall@K, MRR@K, NDCG@K metrics with FTS baseline comparison
- Database schema: Entity, Relation, Document, QueryLog, EvaluationResult ORM models
- API routes: /api/kg/query, /api/kg/ingest, /api/kg/eval, /api/kg/health

INFRASTRUCTURE:
- FastAPI 0.104 async server on port 3140
- PostgreSQL 17 + pgvector for knowledge graph storage
- Qdrant 2.7 vector database with COSINE distance (384-dim bge-m3)
- Ollama qwen2.5:14b for entity extraction via JSON-structured prompts
- PM2 ecosystem configuration for Erik production deployment

TESTING & DEPLOYMENT:
- TESTING.md: 5-phase local testing workflow with examples
- DEPLOYMENT_CHECKLIST.md: Step-by-step Erik deployment guide
- eval-transceiver-50qa.json: 50 Q&A evaluation pairs for transceiver domain
- populate_eval_set.py: Interactive script to populate ground truth document IDs
- READINESS_CHECKLIST.md: Pre-deployment verification checklist
- bootstrap_tip_data.py: Load TIP blog documents via API

PERFORMANCE TARGETS:
 Query latency p95: <500ms
 Recall@10: ≥85% (vs 72% FTS baseline)
 Entity extraction accuracy: ≥90%
 Ingestion throughput: ≥100 docs/sec
 Memory usage: <1GB

Ready for Phase 3: E2E testing, TypeScript client, multi-domain support.
2026-04-25 05:47:18 +02:00

11 KiB
Raw Permalink Blame History

ADR-0006: Learning System Integration & Per-Agent Metrics

Date: 2026-04-19 Status: accepted Deciders: Rene Fichtmueller

Context

The multi-agent architecture (ADR-0005) connects heterogeneous clients (Claude Code, Codex, ChatGPT, Ollama) to a shared LLM Gateway with independent adapter layers. Each agent has different:

  • Request patterns (IDE completions vs full conversations)
  • Model preferences (Claude Code needs fast inference, ChatGPT clients expect GPT models)
  • Success criteria (IDE: response latency + relevance, ChatGPT: token count + completion quality)
  • Failure tolerance (IDE: silent fallback acceptable, ChatGPT: explicit error required)

The learning engine (Phase 2D) currently optimizes globally across all traffic. This creates a mismatch: optimizations for ChatGPT streaming may degrade IDE completions, and per-agent feedback is lost in aggregation.

Forces:

  • Learning efficiency requires per-agent signal isolation (what helps Claude Code may hurt ChatGPT)
  • Agents have distinct success metrics — cannot optimize for all simultaneously
  • Fallback chains should be tuned per agent (IDE tolerates Ollama, ChatGPT may reject it)
  • Cost attribution: multi-tenant billing requires knowing which agent consumed tokens

Decision

Extend the learning system to track per-agent metrics in parallel with global optimization:

1. Per-Agent Metric Collection

  • Agent-scoped request log: gateway_request_logagent_id + model + latency_ms + tokens_{in,out} + confidence + fallback_used
  • Agent request registry: track request volume by agent and model tier (fast/medium/large)
  • Agent-specific latency targets: Claude Code ≤100ms, ChatGPT ≤500ms (streaming chunk), Ollama-based adapters ≤2s

2. Agent-Scoped Learning Metrics

  • Confidence evolution: Per-agent score tracks "how well does model X work for agent Y"
    • Initialized from global baseline (ADR-0003)
    • Updated on every agent request based on observed outcome (success/fallback)
    • Separate from global confidence — agent-specific signal only
  • Accuracy tracking: Agent-specific success rate (model X + agent Y combination)
    • IDE: detected via code compilation success or test pass/fail
    • ChatGPT: explicit feedback via client signal (thumbs up/down in UI)
    • Ollama adapter: tracked via request completion time
  • Cost per agent: Monthly token consumption × model cost + compute time
    • Agent cost reports generated on UTC 00:00 daily
    • Used for cost attribution and budgeting decisions

3. Adaptive Per-Agent Routing

  • Agent-specific confidence gate (ADR-0003, threshold T) overrides global gate
    • Claude Code: T=0.65 (low latency trumps perfect accuracy)
    • ChatGPT: T=0.75 (accuracy critical, users expect quality)
    • Codex: T=0.70 (balanced)
  • Per-agent fallback chain priority
    • Claude Code: Ollama → external (Mistral, Groq) if latency acceptable
    • ChatGPT: External → Ollama only if gateway unavailable
    • Codex LSP: Gateway only (no fallback)
  • Agent-specific model tier selection
    • Request scoring (ADR-0002 enhanced): add agent context to dimension set
    • Dimensions now include: agent_id, context_tokens, user_language, etc.
    • Score computation per-agent lookup table (learned over time)

4. Integration with Learning Engine

  • Feedback loop: agent adapter → gateway metrics → learning engine
    • Agent ID propagated in every request (header X-Agent-ID + request body)
    • Response includes agent-specific confidence and model choice rationale
  • Learning job phases (30min/1h/6h/12h, ADR-0003):
    • Phase 1: Aggregate global metrics (existing)
    • Phase 2: Compute per-agent slices (new)
    • Phase 3: Update per-agent confidence scores (new)
    • Phase 4: Regenerate per-agent routing rules (new)
    • Phase 5: A/B test on 10% of traffic, measure per-agent impact
  • Conflict resolution: if global and agent scores diverge
    • Agent confidence takes precedence (local signal > global)
    • Log divergence for human review (may indicate model degradation or agent change)

5. Agent Feedback Integration

  • API endpoint: POST /agents/{agent-id}/feedback
    • Payload: { request_id, outcome, metadata }
    • Outcomes: success, fallback, timeout, error, user_rejected
    • Metadata: completion_quality (0-10), latency_ms, token_count
  • Asynchronous feedback processing
    • Feedback ingested into agent request log (backfill for requests without explicit feedback)
    • Used to update per-agent confidence on next learning cycle
  • User feedback from ChatGPT UI
    • Thumbs up/down on completion → agent feedback signal
    • Aggregated into user_satisfaction metric per model/agent pair

Alternatives Considered

Alternative 1: Global Learning Only

  • Pros: Simpler implementation, unified signal, fewer moving parts
  • Cons: Cannot optimize for heterogeneous agents, per-agent feedback lost, cost attribution unclear
  • Why not: Agents have fundamentally different success criteria (IDE latency ≠ ChatGPT quality)

Alternative 2: Separate Learning Engines Per Agent

  • Pros: Complete isolation, agent-specific optimization, no cross-agent interference
  • Cons: Massive duplication, learning curves 5x longer (fewer samples per agent), no knowledge sharing
  • Why not: Claude Code and ChatGPT both benefit from qwen models — throwing away cross-agent signal is wasteful

Alternative 3: Callback-Based Feedback (No Agent Context)

  • Pros: Minimal changes to learning engine, compatible with existing code
  • Cons: Cannot attribute feedback to specific agent, routing decisions remain global
  • Why not: Feedback without agent context is noise — we would not know which agent benefited from routing change

Alternative 4: Agent Context in Request ID (Ephemeral)

  • Pros: No new fields, agent context derived from request ID structure
  • Cons: Fragile (if request ID format changes, tracing breaks), no standardization
  • Why not: Tight coupling to request ID generation; agent metadata should be explicit

Consequences

Positive

  • Per-agent cost attribution: Identify which agents are expensive (e.g., ChatGPT streaming uses 3x tokens)
  • Latency SLOs per agent: Claude Code gets optimized for <100ms, ChatGPT for <500ms/chunk
  • Agent-specific routing: Can prefer qwen2.5:3b for IDE, :32b for ChatGPT without global harm
  • Learning efficiency: Signal isolation prevents "optimal for ChatGPT" from breaking IDE responsiveness
  • Fallback diversity: Claude Code can use Ollama, ChatGPT uses external only — no one-size-fits-all risk
  • Early detection of agent issues: If Claude Code confidence drops 20% in 1h, alert (possible adapter bug)

Negative

  • Increased storage: Per-agent metrics = ~10x request logs compared to aggregated global (50GB → 500GB annually)
  • Learning complexity: Logic for per-agent confidence updates, conflict resolution, feedback ingestion
  • Operational overhead: Monthly cost reports per agent, per-agent SLO dashboards, alerting rules
  • Agent coupling: Changes to agent (e.g., ChatGPT client SDK upgrade) may shift confidence — requires relearning
  • Feedback dependency: Learning quality degrades if agents don't send feedback (must have fallback)

Risks

  • Stale per-agent data: If ChatGPT adapter goes offline for 6h, historical confidence becomes misleading → Mitigation: decay confidence over time (10% per day)
  • Contradictory scores: Global says "model X is bad", agent says "model X works great for me" → Mitigation: log divergence, human review before policy change
  • Cost explosion: Per-agent metrics + request logs could 10x storage costs → Mitigation: retention policy (30 days hot, 90 days warm, 1yr cold archive)
  • Privacy: Agent IDs in logs could enable tracking "which agent requested what" → Mitigation: agent_id anonymized (hash), explicit opt-out for sensitive agents

Implementation Plan

Phase 2G.4.1: Per-Agent Request Logging (Week 1)

  • Add agent_id field to gateway_request_log table
  • Modify client SDK / adapters to inject X-Agent-ID header
  • Backfill historical requests with agent ID from source IP heuristics (fallback)
  • Test with Claude Code + Codex adapters

Phase 2G.4.2: Per-Agent Confidence Scoring (Week 2)

  • Create agent_confidence_scores table: (agent_id, model, score, updated_at)
  • Update learning engine Phase 3 to compute per-agent slices from request log
  • Implement per-agent confidence gate in router (override global gate if agent score available)
  • A/B test: 10% of traffic uses per-agent routing, 90% uses global (measure impact)

Phase 2G.4.3: Per-Agent Feedback Loop (Week 2)

  • Implement POST /agents/{agent-id}/feedback endpoint
  • Adapter SDKs: send feedback after each completion (success/fallback/error)
  • ChatGPT UI: wire feedback buttons to feedback endpoint
  • Asynchronously ingest feedback into learning engine

Phase 2G.4.4: Cost Attribution & Reporting (Week 3)

  • Dashboard: per-agent token consumption, monthly cost, cost per request
  • Daily cost report: daily_agent_costs.csv (agent_id, tokens_in, tokens_out, cost_usd)
  • Alert: if agent cost > historical avg + 2σ (detect runaway requests)

Phase 2G.4.5: Per-Agent SLO Monitoring (Week 3)

  • Latency SLOs: Claude Code ≤100ms p99, ChatGPT ≤500ms p95 (streaming chunk)
  • Alert: SLO breach (e.g., IDE completions suddenly >200ms) → investigate model issue
  • Dashboard: per-agent latency heatmap (hourly p50/p95/p99)

Phase 2G.4.6: Documentation & Runbook (Week 4)

  • ADR-0006 (this document)
  • Runbook: "Agent Confidence Divergence" (what to do if global ≠ agent scores)
  • Runbook: "Cost Spike Investigation" (how to debug high-cost agent)

Open Questions

  1. Feedback Mechanism: Should adapters automatically send feedback, or require explicit client instrumentation?

    • Current decision: Automatic (adapters track success/fallback)
    • Open: How to detect IDE compilation success without IDE instrumentation?
  2. Confidence Decay: How aggressively should per-agent confidence decay over time?

    • Current decision: 10% per day (reaches 50% confidence after ~7 days of inactivity)
    • Open: Should decay be different per agent (IDE less decay than ChatGPT)?
  3. Fallback Privacy: Should fallback usage be logged per agent (privacy concern)?

    • Current decision: Yes, with anonymized agent_id
    • Open: Do sensitive agents need to opt out of logging?
  4. Conflict Resolution: If global says "model X bad" but agent says "X works great", which wins?

    • Current decision: Agent wins (local > global)
    • Open: Should conflicts trigger human review before policy change?
  5. Cross-Agent Learning: Can agent A learn from agent B's feedback?

    • Current decision: Yes (global learning phase pools all agent signals)
    • Open: Should some agents be "first-class" (their feedback weighs more)?
  • ADR-0001 — Multi-agent architecture
  • ADR-0002 — Tier assignment (now per-agent)
  • ADR-0003 — Confidence gate (now per-agent override)
  • ADR-0005 — Agent integration protocol (feedback extension)