# AI Control Plane System Design
## 1. Purpose
LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.
It routes requests from clients to the right model, provider, agent, or tool based on:
- policy
- cost
- availability
- context
- memory
- trust level
- historical route success
It also provides:
- full observability through immutable receipts
- reproducible AI runs
- shared memory persistence
- route memory
- token and cost optimization
## 2. High-Level Architecture
```text
Input Layer
clients, APIs, MCP, internal connectors
|
v
Control Plane
trust routing, policy, memory, compression, provider routing
|
v
Execution Layer
local models, external providers, tools, services
|
v
Output
response to caller
|
v
Receipts + Memory Update

Side System:
Memory Layer
global memory, project memory, route memory, semantic cache
```
## 3. Components
### 3.1 Client Entry
Clients connect via API, MCP, OpenAI-compatible endpoints, or internal connectors.
Supported client targets:
- Codex
- Claude Code
- ChatGPT
- Cursor
- VS Code and Continue-style IDEs
- automation pipelines
- n8n
- internal services
Each request should include:
- payload: prompt, input, files, tool call, or task
- metadata: user, project, agent, task type
- optional routing hints
- optional policy hints
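
As a rough illustration of the request shape listed above, the envelope could be modeled as a pair of frozen dataclasses. The class names `GatewayRequest` and `RequestMetadata`, the hint dictionaries, and the sample values are assumptions made for this sketch, not a defined wire format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class RequestMetadata:
    """Who is asking and in what context (hypothetical field names)."""
    user: str
    project: str
    agent: str
    task_type: str


@dataclass(frozen=True)
class GatewayRequest:
    """One request as it enters the gateway (illustrative shape only)."""
    payload: Any                                                  # prompt, input, files, tool call, or task
    metadata: RequestMetadata
    routing_hints: dict[str, str] = field(default_factory=dict)   # optional
    policy_hints: dict[str, str] = field(default_factory=dict)    # optional


request = GatewayRequest(
    payload="Summarize the deployment log",
    metadata=RequestMetadata(
        user="alice", project="infra-docs", agent="cli", task_type="summarize"
    ),
    routing_hints={"prefer": "local"},
)
```

Keeping the hints optional and defaulting them to empty dictionaries lets simple clients send only a payload and metadata.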
### 3.2 Trust Router
The Trust Router is the first decision point.
Responsibilities:
- validate client identity
- assign trust level
- classify request type
- classify data sensitivity
- apply initial routing hints
- attach enriched request context
Example classification labels:
- code
- infra
- legal
- security
- general
- document
- automation
Output:
- enriched request context
- trust score
- sensitivity label
- classification label
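
The Trust Router's output can be sketched with a simple keyword classifier. Everything concrete here is an assumption for illustration: the keyword table, the client trust scores, the `restricted`/`internal` sensitivity labels, and the `classify` function name. A real deployment would likely use stronger identity checks and a better classifier.

```python
from dataclasses import dataclass

# Hypothetical keyword table; the labels come from the classification list above.
KEYWORDS = {
    "code": ("traceback", "function", "compile"),
    "infra": ("kubernetes", "terraform", "dns"),
    "legal": ("contract", "nda", "liability"),
    "security": ("cve", "exploit", "password"),
}

# Hypothetical per-client trust scores; unknown clients get a low default.
TRUSTED_CLIENTS = {"internal-service": 0.9, "ide-plugin": 0.7}


@dataclass(frozen=True)
class TrustDecision:
    trust_score: float
    sensitivity: str
    classification: str


def classify(client_id: str, text: str) -> TrustDecision:
    """Assign a classification label, a sensitivity label, and a trust score."""
    lowered = text.lower()
    classification = "general"
    for label, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            classification = label
            break  # first matching label wins, in table order
    sensitivity = "restricted" if classification in ("legal", "security") else "internal"
    trust_score = TRUSTED_CLIENTS.get(client_id, 0.1)
    return TrustDecision(trust_score, sensitivity, classification)
```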
### 3.3 Policy Engine
The Policy Engine is the core decision system.
It evaluates:
- data sensitivity
- allowed providers
- allowed models
- allowed tools
- cost constraints
- project rules
- compliance rules
- offline/simulation/live mode
Example policies:
- never send legal data to public APIs
- prefer local models for internal code
- use external models only if local-model confidence is below a threshold
- block requests containing secrets
- require admin override for production deployment tools
Output:
- allowed routes
- blocked routes
- required redactions
- execution constraints
- policy decision log
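
A minimal sketch of policy evaluation, assuming two of the example policies above (legal data stays off public APIs, requests containing secrets are blocked). The `Route` and `PolicyDecision` types and the `evaluate` signature are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    name: str
    is_local: bool


@dataclass
class PolicyDecision:
    allowed: list[str] = field(default_factory=list)
    blocked: dict[str, str] = field(default_factory=dict)  # route name -> reason
    log: list[str] = field(default_factory=list)           # policy decision log


def evaluate(classification: str, contains_secret: bool, routes: list[Route]) -> PolicyDecision:
    """Split candidate routes into allowed and blocked, logging each decision."""
    decision = PolicyDecision()
    for route in routes:
        if contains_secret:
            reason = "request contains secrets"
        elif classification == "legal" and not route.is_local:
            reason = "legal data may not go to public APIs"
        else:
            reason = ""
        if reason:
            decision.blocked[route.name] = reason
            decision.log.append(f"BLOCK {route.name}: {reason}")
        else:
            decision.allowed.append(route.name)
            decision.log.append(f"ALLOW {route.name}")
    return decision
```

Recording a reason for every blocked route gives the Receipt Engine something concrete to store under "blocked routes".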
### 3.4 Memory Query
Memory is queried before compression and execution.
Memory sources:
- project memory
- global memory
- route memory
- semantic cache
- handoffs
- receipts
- reproducible runs
Output:
- relevant memory context
- prior decisions
- route hints
- cache candidates
### 3.5 Compression Engine
The Compression Engine optimizes request and memory context before execution.
Functions:
- token reduction
- context deduplication
- semantic summarization
- cache lookup
- prompt/context packaging
- token budget enforcement
Input:
- raw request
- policy constraints
- memory context
- target model context budget
Output:
- compressed payload
- token metrics before and after
- cache hit or miss
- compression receipt data
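
Two of the functions above, context deduplication and token budget enforcement, can be sketched as follows. The whitespace-based token estimate is a deliberate simplification, and `compress` and its return keys are hypothetical names. Real token counts depend on the target model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def compress(chunks: list[str], budget: int) -> dict:
    """Drop duplicate chunks, then keep chunks in order while they fit the budget."""
    tokens_before = sum(estimate_tokens(chunk) for chunk in chunks)
    seen: set[str] = set()
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        key = " ".join(chunk.lower().split())  # normalize case and whitespace
        if key in seen:
            continue                           # duplicate context, skip it
        seen.add(key)
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue                           # would exceed the token budget
        kept.append(chunk)
        used += cost
    return {"payload": kept, "tokens_before": tokens_before, "tokens_after": used}
```

The before and after counts feed directly into the token metrics and the compression receipt data.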
### 3.6 Provider Router
The Provider Router makes the final execution decision.
It selects:
- local model
- external provider
- AI client/agent
- tool execution
- fallback route
Criteria:
- policy constraints
- trust level
- cost
- latency
- availability
- model capability
- route memory
- benchmark results
- agent reputation
Output:
- selected execution target
- fallback routes
- route explanation
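
One way to combine several of the criteria above is a weighted score over the routes the Policy Engine allowed. The weights, the `Candidate` fields, and the linear scoring formula are assumptions chosen for illustration. Ties are broken by name so the same inputs always yield the same route.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float  # arbitrary currency units
    latency_ms: float
    success_rate: float        # from route memory, 0.0 to 1.0
    available: bool


def score(c: Candidate, w_cost: float = 1.0, w_latency: float = 0.001,
          w_success: float = 2.0) -> float:
    """Higher is better: reward past success, penalize cost and latency."""
    return w_success * c.success_rate - w_cost * c.cost_per_1k_tokens - w_latency * c.latency_ms


def select_route(candidates: list[Candidate], allowed: set[str]) -> dict:
    """Pick the best available, policy-allowed candidate plus an ordered fallback chain."""
    eligible = [c for c in candidates if c.available and c.name in allowed]
    if not eligible:
        raise LookupError("no eligible route")
    ranked = sorted(eligible, key=lambda c: (-score(c), c.name))  # name breaks ties
    best = ranked[0]
    return {
        "selected": best.name,
        "fallbacks": [c.name for c in ranked[1:]],
        "explanation": f"{best.name} scored {score(best):.2f} among {len(ranked)} eligible routes",
    }
```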
### 3.7 Execution Layer
The Execution Layer handles actual processing.
Execution target types:
- local models such as Ollama, LM Studio, LocalAI, llama.cpp, vLLM
- external APIs such as OpenAI, Anthropic, Mistral, Groq, OpenRouter
- AI clients such as Claude Code, Codex, Cursor, ChatGPT adapters
- tools, scripts, workflows, and internal services
Execution returns:
- raw response
- latency
- token usage
- provider metadata
- errors
- tool call results
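
The sketch below shows how an execution step might walk a fallback chain and collect the fields listed above. Targets are modeled as plain callables, which is an assumption. Real adapters for local models or external APIs would sit behind the same interface.

```python
import time
from typing import Callable


def execute_with_fallback(chain: list[tuple[str, Callable[[str], str]]], payload: str) -> dict:
    """Try each (name, call) target in order; return the first successful result."""
    errors: list[str] = []
    for name, call in chain:
        start = time.perf_counter()
        try:
            response = call(payload)
        except Exception as exc:  # record the failure and move to the next target
            errors.append(f"{name}: {exc}")
            continue
        latency_ms = (time.perf_counter() - start) * 1000
        return {"target": name, "response": response,
                "latency_ms": latency_ms, "errors": errors}
    # Every target failed; the caller still gets the collected errors.
    return {"target": None, "response": None, "latency_ms": None, "errors": errors}
```

Errors from failed targets are kept even on success, so the receipt can record the full fallback chain.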
### 3.8 Receipt Engine
The Receipt Engine creates an immutable trace for each request.
Receipts include:
- request id
- input summary or redacted input
- trust decisions
- policy decisions
- memory refs
- compression results
- selected model/provider/tool
- fallback chain
- response summary or full response depending on policy
- token usage
- cost estimate
- timestamps
- errors
- blocked routes
Receipts are immutable once written and are persisted in the receipts database (see section 7).
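
One possible way to make tampering detectable is to chain receipts by hash, so each receipt commits to the one before it. The hash chain is an assumption of this sketch, not something this design prescribes. The `Receipt` fields shown are a small subset of the list above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Receipt:
    request_id: str
    route: str
    tokens: int
    cost_estimate: float
    prev_hash: str  # digest of the previous receipt, "" for the first one

    def digest(self) -> str:
        """SHA-256 over a canonical JSON encoding of the receipt."""
        body = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(body).hexdigest()


def verify_chain(receipts: list[Receipt]) -> bool:
    """Check that every receipt points at the digest of its predecessor."""
    prev = ""
    for receipt in receipts:
        if receipt.prev_hash != prev:
            return False
        prev = receipt.digest()
    return True
```

With this layout, changing any stored receipt changes its digest and breaks verification for every receipt after it.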
### 3.9 Memory Layer
Memory is separate from execution but connected to routing and compression.
Memory types:
1. Project memory
- task history
- decisions
- context
- handoffs
2. Global memory
- shared knowledge
- user/team preferences
- reusable runbooks
3. Route memory
- routing decisions
- success and failure patterns
- optimization feedback
4. Semantic cache
- previous responses
- embedding lookup
- prompt/result reuse
Memory is:
- append-only by default
- queryable
- versioned where possible
- used during routing and compression
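
The append-only and versioned properties can be sketched with an in-memory log. The `AppendOnlyMemory` class and its method names are hypothetical. A real backend would persist the log in one of the stores described in the Storage section.

```python
from typing import Optional


class AppendOnlyMemory:
    """Entries are only ever appended; an update is a new version of the same key."""

    def __init__(self) -> None:
        self._log: list[tuple[str, int, str]] = []  # (key, version, value)

    def append(self, key: str, value: str) -> int:
        """Store a new version for key and return its version number (starting at 1)."""
        version = 1 + sum(1 for k, _, _ in self._log if k == key)
        self._log.append((key, version, value))
        return version

    def latest(self, key: str) -> Optional[str]:
        """Return the most recent value for key, or None if it was never written."""
        for k, _, value in reversed(self._log):
            if k == key:
                return value
        return None

    def history(self, key: str) -> list[tuple[int, str]]:
        """Return every (version, value) pair for key, oldest first."""
        return [(version, value) for k, version, value in self._log if k == key]
```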
### 3.10 Route Reflector Memory
Route Reflector Memory is specialized route memory inspired by BGP route reflectors.
Functions:
- learns optimal AI routes
- shares routing knowledge across clients
- improves future routing decisions
- records fallback success and failures
- contributes to Provider Router decisions
Examples:
- code debugging works best through Codex plus local validation
- private infra diagnostics should route to local models
- long-form reasoning performs better on selected external models
- JSON extraction for project X has best success on model Y
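
A minimal sketch of how route knowledge like the examples above could be learned: count successes per task type and route, then recommend the route with the best observed success rate. The `RouteMemory` class, the `min_samples` threshold, and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict
from typing import Optional


class RouteMemory:
    """Tracks success counts per (task_type, route) pair."""

    def __init__(self, min_samples: int = 3) -> None:
        # (task_type, route) -> [successes, attempts]
        self._stats: dict = defaultdict(lambda: [0, 0])
        self.min_samples = min_samples

    def record(self, task_type: str, route: str, success: bool) -> None:
        entry = self._stats[(task_type, route)]
        entry[0] += int(success)
        entry[1] += 1

    def best_route(self, task_type: str) -> Optional[str]:
        """Return the route with the highest success rate, or None without enough data."""
        rates = {
            route: successes / attempts
            for (task, route), (successes, attempts) in self._stats.items()
            if task == task_type and attempts >= self.min_samples
        }
        if not rates:
            return None
        # Iterating in sorted order makes ties resolve alphabetically.
        return max(sorted(rates), key=rates.get)
```

Requiring a minimum number of samples keeps one lucky or unlucky run from steering the Provider Router.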
## 4. Data Flow
1. Client sends request.
2. Trust Router classifies request and assigns trust.
3. Policy Engine filters allowed routes.
4. Memory Layer is queried for context and prior route knowledge.
5. Compression Engine optimizes payload.
6. Provider Router selects execution target and fallback chain.
7. Execution Layer processes request.
8. Response is returned to client.
9. Receipt Engine generates immutable receipt.
10. Memory Layer is updated with outcome.
11. Route Reflector Memory updates routing knowledge.
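
The flow above can be read as a fixed, ordered pipeline in which each stage receives a context and returns an enriched copy. The sketch below uses placeholder stages that only record their name. The stage names and the `handle` function are hypothetical stand-ins for the real components.

```python
from typing import Callable

Context = dict
Stage = Callable[[Context], Context]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that appends its name to the trace."""
    def stage(ctx: Context) -> Context:
        return {**ctx, "trace": ctx["trace"] + [name]}
    return stage


# Steps 2 to 11 of the data flow, in order.
STAGE_NAMES = (
    "trust_router",
    "policy_engine",
    "memory_query",
    "compression_engine",
    "provider_router",
    "execution_layer",
    "return_response",
    "receipt_engine",
    "memory_update",
    "route_reflector_update",
)
PIPELINE: list[Stage] = [make_stage(name) for name in STAGE_NAMES]


def handle(request: str) -> Context:
    """Run a request through every stage in the fixed order."""
    ctx: Context = {"request": request, "trace": []}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx
```

Because the stage order is fixed data rather than branching logic, the trace for a request is reproducible and easy to store in a receipt.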
## 5. Modes Of Operation
### Live Mode
- real execution
- full routing active
- receipts stored
- memory updated
### Simulation Mode
- no real execution
- shows trust decisions
- shows policy decisions
- shows selected route and fallbacks
- estimates cost and tokens
- useful for testing policies
### Offline Mode
- only local models allowed
- no external provider calls
- remote sync disabled unless explicitly allowed
- receipts marked as offline
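
The execution rule implied by the three modes can be written as a small gate. The `Mode` enum and the `may_execute` function are hypothetical names for this sketch.

```python
from enum import Enum


class Mode(Enum):
    LIVE = "live"
    SIMULATION = "simulation"
    OFFLINE = "offline"


def may_execute(mode: Mode, route_is_local: bool) -> bool:
    """Decide whether a selected route may actually be executed in this mode."""
    if mode is Mode.SIMULATION:
        return False           # decisions are shown, nothing is executed
    if mode is Mode.OFFLINE:
        return route_is_local  # only local models, no external provider calls
    return True                # live mode: full routing active
```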
## 6. Control Functions
The system supports:
- trace request
- replay request
- force route
- override policy as admin
- inspect receipts
- inspect memory
- simulate routing
- compare routes
- inspect provider availability
- inspect route memory
## 7. Storage
Required storage components:
- receipts database: immutable logs
- memory database: structured + vector
- policy definitions
- routing history
- route reflector memory
- semantic cache
- reproducible run artifacts
Recommended default:
- SQLite for personal mode
- Postgres plus pgvector for team/server mode
- Git/Gitea as durable memory sync and audit transport
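
For the SQLite personal mode, immutability of receipts could be enforced in the database itself with triggers that reject updates and deletes. The table layout, column names, and `open_store` helper are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE receipts (
    request_id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL,
    body_json  TEXT NOT NULL
);
CREATE TRIGGER receipts_no_update BEFORE UPDATE ON receipts
BEGIN
    SELECT RAISE(ABORT, 'receipts are immutable');
END;
CREATE TRIGGER receipts_no_delete BEFORE DELETE ON receipts
BEGIN
    SELECT RAISE(ABORT, 'receipts are immutable');
END;
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a receipts store and install the immutability triggers."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = open_store()
conn.execute(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    ("req-1", "2024-01-01T00:00:00Z", "{}"),
)
```

Inserts succeed as usual, while any later `UPDATE` or `DELETE` on the table is aborted by the triggers.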
## 8. Metrics
The system tracks:
- token usage
- compression ratio
- cache hit rate
- latency per provider
- cost per request
- routing success rate
- fallback rate
- trust level distribution
- blocked route count
- policy override count
- agent reputation
- benchmark scores
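
Several of these metrics are simple ratios. The definitions below are assumptions, since the exact formulas are not fixed here: compression ratio is taken as tokens before divided by tokens after, and rates such as cache hit rate or fallback rate as events divided by total requests.

```python
def compression_ratio(tokens_before: int, tokens_after: int) -> float:
    """Assumed definition: how many times smaller the payload became."""
    if tokens_after <= 0:
        raise ValueError("tokens_after must be positive")
    return tokens_before / tokens_after


def rate(events: int, total: int) -> float:
    """Share of requests with a given outcome (cache hit, fallback, success)."""
    return events / total if total else 0.0
```

For example, a payload reduced from 1200 to 400 tokens has a compression ratio of 3.0 under this definition, and 25 cache hits in 100 requests give a cache hit rate of 0.25.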
## 9. Security Model
- strict policy enforcement before external calls
- data classification at entry
- local-first routing is supported
- sensitive data never leaves the system when policy blocks the route
- no secret sync to memory
- audit trail via receipts
- consent ledger for tool, memory, and provider permissions
- safe config writer for external tool setup
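
Keeping secrets out of external calls and out of memory implies some form of secret detection at entry. The sketch below uses a few regular expressions. The pattern list is illustrative and far from complete, and `redact` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; a real scanner would need a much broader set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID format
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]


def redact(text: str) -> tuple[str, int]:
    """Replace every match with a placeholder and return the match count."""
    count = 0
    for pattern in SECRET_PATTERNS:
        text, replaced = pattern.subn("[REDACTED]", text)
        count += replaced
    return text, count
```

A non-zero count can be used either to block the request outright or to attach required redactions to the policy decision.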
## 10. Extensibility
The system supports:
- new providers
- new local models
- new tools
- new MCP resources
- new policy rules
- custom routing logic
- custom memory backends
- custom benchmarks
- custom data source connectors
## 11. Core Idea
LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.