# AI Control Plane System Design

## 1. Purpose

LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.

It routes requests from clients to the right model, provider, agent, or tool based on:

- policy
- cost
- availability
- context
- memory
- trust level
- historical route success

It also provides:

- full observability through immutable receipts
- reproducible AI runs
- shared memory persistence
- route memory
- token and cost optimization

## 2. High-Level Architecture

```text
Input Layer
  clients, APIs, MCP, internal connectors
        |
        v
Control Plane
  trust routing, policy, compression, memory, provider routing
        |
        v
Execution Layer
  local models, external providers, tools, services
        |
        v
Output
  response to caller
        |
        v
Receipts + Memory Update

Side System:
Memory Layer
  global memory, project memory, route memory, semantic cache
```

## 3. Components

### 3.1 Client Entry

Clients connect via API, MCP, OpenAI-compatible endpoints, or internal connectors.

Supported client targets:

- Codex
- Claude Code
- ChatGPT
- Cursor
- VS Code and Continue-style IDEs
- automation pipelines
- n8n
- internal services

Each request should include:

- payload: prompt, input, files, tool call, or task
- metadata: user, project, agent, task type
- optional routing hints
- optional policy hints

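The request contents listed above could be carried in a small envelope type. This is a minimal sketch, not a fixed schema: the class names `RequestMetadata` and `GatewayRequest`, the field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RequestMetadata:
    # The four metadata fields named in the list above.
    user: str
    project: str
    agent: str
    task_type: str


@dataclass
class GatewayRequest:
    # Hypothetical envelope; field names are illustrative only.
    payload: dict[str, Any]  # prompt, input, files, tool call, or task
    metadata: RequestMetadata
    routing_hints: dict[str, Any] = field(default_factory=dict)  # optional
    policy_hints: dict[str, Any] = field(default_factory=dict)   # optional


request = GatewayRequest(
    payload={"prompt": "Summarize the deployment runbook"},
    metadata=RequestMetadata(
        user="alice", project="infra-docs", agent="cli", task_type="document"
    ),
    routing_hints={"prefer": "local"},
)
```

Keeping hints in separate optional fields lets the Trust Router and Policy Engine treat them as suggestions rather than as part of the payload.
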
### 3.2 Trust Router

The Trust Router is the first decision point.

Responsibilities:

- validate client identity
- assign trust level
- classify request type
- classify data sensitivity
- apply initial routing hints
- attach enriched request context

Example classification labels:

- code
- infra
- legal
- security
- general
- document
- automation

Output:

- enriched request context
- trust score
- sensitivity label
- classification label

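As a rough illustration of the Trust Router output, the sketch below classifies a prompt with keyword rules and returns a trust score, sensitivity label, and classification label. The keyword table, the two sensitivity values, and the fixed trust scores are assumptions made for the example; a real classifier would likely be much richer.

```python
from dataclasses import dataclass

# Hypothetical keyword rules, checked in insertion order.
LABEL_KEYWORDS = {
    "legal": ("contract", "nda", "liability"),
    "security": ("vulnerability", "exploit", "cve"),
    "infra": ("kubernetes", "terraform", "dns"),
    "code": ("function", "stack trace", "refactor"),
}
SENSITIVE_LABELS = {"legal", "security"}


@dataclass(frozen=True)
class TrustDecision:
    trust_score: float
    sensitivity: str
    classification: str


def classify(prompt: str, client_is_verified: bool) -> TrustDecision:
    text = prompt.lower()
    # First matching label wins; "general" is the fallback label.
    label = next(
        (
            name
            for name, words in LABEL_KEYWORDS.items()
            if any(word in text for word in words)
        ),
        "general",
    )
    sensitivity = "restricted" if label in SENSITIVE_LABELS else "internal"
    # Assumed scoring: verified clients start high, unverified clients start low.
    trust_score = 0.9 if client_is_verified else 0.3
    return TrustDecision(trust_score, sensitivity, label)


decision = classify("Review this NDA for liability clauses", client_is_verified=True)
```

The returned `TrustDecision` would be attached to the enriched request context before the Policy Engine runs.
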
### 3.3 Policy Engine

The Policy Engine is the core decision system.

It evaluates:

- data sensitivity
- allowed providers
- allowed models
- allowed tools
- cost constraints
- project rules
- compliance rules
- offline/simulation/live mode

Example policies:

- never send legal data to public APIs
- prefer local models for internal code
- use external models only if confidence is below a threshold
- block requests containing secrets
- require admin override for production deployment tools

Output:

- allowed routes
- blocked routes
- required redactions
- execution constraints
- policy decision log

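A few of the example policies can be expressed as a route filter. The sketch below is one possible shape, assuming hypothetical `Route` and `RequestContext` types; it covers only three rules (secrets, legal data, offline mode) and returns allowed routes, blocked routes, and a decision log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    is_local: bool


@dataclass(frozen=True)
class RequestContext:
    classification: str  # label from the Trust Router, e.g. "legal"
    contains_secrets: bool
    mode: str  # "live", "simulation", or "offline"


def evaluate(ctx: RequestContext, routes: list[Route]) -> dict:
    allowed, blocked, log = [], [], []
    for route in routes:
        reason = None
        if ctx.contains_secrets:
            reason = "request contains secrets"
        elif ctx.classification == "legal" and not route.is_local:
            reason = "legal data may not go to public APIs"
        elif ctx.mode == "offline" and not route.is_local:
            reason = "offline mode allows local routes only"
        if reason is None:
            allowed.append(route.name)
        else:
            blocked.append(route.name)
        # Every decision is logged, including the reason for a block.
        log.append((route.name, reason or "allowed"))
    return {"allowed": allowed, "blocked": blocked, "log": log}


routes = [Route("local-llm", is_local=True), Route("public-api", is_local=False)]
result = evaluate(RequestContext("legal", False, "live"), routes)
```

Here `result["allowed"]` contains only `local-llm`, and the log records why `public-api` was blocked, which is the kind of data a receipt would later reference.
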
### 3.4 Memory Query

Memory is queried before compression and execution.

Memory sources:

- project memory
- global memory
- route memory
- semantic cache
- handoffs
- receipts
- reproducible runs

Output:

- relevant memory context
- prior decisions
- route hints
- cache candidates

### 3.5 Compression Engine

The Compression Engine optimizes request and memory context before execution.

Functions:

- token reduction
- context deduplication
- semantic summarization
- cache lookup
- prompt/context packaging
- token budget enforcement

Input:

- raw request
- policy constraints
- memory context
- target model context budget

Output:

- compressed payload
- token metrics before and after
- cache hit or miss
- compression receipt data

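Two of the functions above, context deduplication and token budget enforcement, can be sketched in a few lines. The whitespace-based `count_tokens` is an assumption standing in for a real model tokenizer, and the greedy packing order is only one possible choice.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude stand-in for a model tokenizer.
    return len(text.split())


def compress(request: str, memory_chunks: list[str], token_budget: int) -> dict:
    tokens_before = count_tokens(request) + sum(count_tokens(c) for c in memory_chunks)

    # Context deduplication: drop chunks that repeat after case and whitespace
    # normalization.
    seen, unique = set(), []
    for chunk in memory_chunks:
        key = " ".join(chunk.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)

    # Token budget enforcement: the request always goes in, then chunks are
    # added in order while they still fit.
    used = count_tokens(request)
    kept = []
    for chunk in unique:
        cost = count_tokens(chunk)
        if used + cost <= token_budget:
            kept.append(chunk)
            used += cost

    return {
        "payload": [*kept, request],
        "tokens_before": tokens_before,
        "tokens_after": used,
    }


out = compress(
    "Why did the deploy fail?",
    [
        "Deploy uses blue-green rollout.",
        "deploy uses  blue-green rollout.",
        "Team prefers Go.",
    ],
    token_budget=10,
)
```

With these inputs the duplicate chunk is dropped and the third chunk no longer fits, so the token count falls from 16 to 9. The before and after counts are the values a compression receipt would record.
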
### 3.6 Provider Router

The Provider Router makes the final execution decision.

It selects:

- local model
- external provider
- AI client/agent
- tool execution
- fallback route

Criteria:

- policy constraints
- trust level
- cost
- latency
- availability
- model capability
- route memory
- benchmark results
- agent reputation

Output:

- selected execution target
- fallback routes
- route explanation

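One way to combine these criteria is to apply hard filters first (policy, availability) and then rank the remaining candidates by a weighted score. The sketch below assumes a hypothetical `Candidate` type and purely illustrative weights; it uses only cost, latency, availability, and a success rate taken from route memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float  # assumed unit: USD
    latency_ms: float
    success_rate: float  # from route memory, in [0, 1]
    available: bool


def rank(candidates: list[Candidate], allowed: set[str]) -> list[str]:
    # Hard filters first: policy-allowed and currently available.
    eligible = [c for c in candidates if c.name in allowed and c.available]

    def score(c: Candidate) -> float:
        # Illustrative weights only: reward past success, penalize cost and latency.
        return 1.0 * c.success_rate - 0.5 * c.cost_per_1k_tokens - 0.0002 * c.latency_ms

    return [c.name for c in sorted(eligible, key=score, reverse=True)]


candidates = [
    Candidate("local-llm", 0.0, 900.0, 0.80, True),
    Candidate("provider-a", 0.6, 400.0, 0.95, True),
    Candidate("provider-b", 0.2, 300.0, 0.90, False),
]
ranking = rank(candidates, allowed={"local-llm", "provider-a", "provider-b"})
selected, fallbacks = ranking[0], ranking[1:]
```

The first entry becomes the selected execution target and the rest form the fallback chain. A caller would need to handle the case where no candidate survives the filters and `ranking` is empty.
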
### 3.7 Execution Layer

The Execution Layer handles actual processing.

Execution target types:

- local models such as Ollama, LM Studio, LocalAI, llama.cpp, vLLM
- external APIs such as OpenAI, Anthropic, Mistral, Groq, OpenRouter
- AI clients such as Claude Code, Codex, Cursor, ChatGPT adapters
- tools, scripts, workflows, and internal services

Execution returns:

- raw response
- latency
- token usage
- provider metadata
- errors
- tool call results

### 3.8 Receipt Engine

The Receipt Engine creates an immutable trace for each request.

Receipts include:

- request id
- input summary or redacted input
- trust decisions
- policy decisions
- memory refs
- compression results
- selected model/provider/tool
- fallback chain
- response summary or full response depending on policy
- token usage
- cost estimate
- timestamps
- errors
- blocked routes

Once written, receipts are stored and never modified.

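Immutability can be approximated in application code with a frozen record plus a content digest that detects later tampering. This is a sketch under assumptions: the `Receipt` type carries only a subset of the fields listed above, and the SHA-256 digest over canonical JSON is an illustrative choice, not something the design prescribes.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class Receipt:
    request_id: str
    selected_route: str
    fallback_chain: tuple[str, ...]
    tokens_used: int
    cost_estimate: float
    timestamp: str

    def digest(self) -> str:
        # Canonical JSON (sorted keys) so the same receipt always hashes the same way.
        body = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


receipt = Receipt(
    "req-001", "local-llm", ("provider-a",), 412, 0.0, "2024-01-01T00:00:00Z"
)
stored_digest = receipt.digest()

try:
    receipt.tokens_used = 0  # any in-place change is rejected
except FrozenInstanceError:
    tampered = False
else:
    tampered = True
```

Storing `stored_digest` next to the receipt allows a later audit to recompute the hash and confirm that the stored body is unchanged.
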
### 3.9 Memory Layer

Memory is separate from execution but connected to routing and compression.

Memory types:

1. Project memory
   - task history
   - decisions
   - context
   - handoffs

2. Global memory
   - shared knowledge
   - user/team preferences
   - reusable runbooks

3. Route memory
   - routing decisions
   - success and failure patterns
   - optimization feedback

4. Semantic cache
   - previous responses
   - embedding lookup
   - prompt/result reuse

Memory is:

- append-only by default
- queryable
- versioned where possible
- used during routing and compression

### 3.10 Route Reflector Memory

Route Reflector Memory is specialized route memory inspired by BGP route reflectors.

Functions:

- learns optimal AI routes
- shares routing knowledge across clients
- improves future routing decisions
- records fallback success and failures
- contributes to Provider Router decisions

Examples:

- code debugging works best through Codex plus local validation
- private infra diagnostics should route to local models
- long-form reasoning performs better on selected external models
- JSON extraction for project X has best success on model Y

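The learning behaviour described above can be approximated by tracking success counts per task type and route. The `RouteMemory` class below is a hypothetical in-memory sketch; the Laplace smoothing is an assumed design choice so that routes with little history are not scored at exactly 0 or 1.

```python
from collections import defaultdict


class RouteMemory:
    """Hypothetical in-memory store of per-task route outcomes."""

    def __init__(self) -> None:
        # (task_type, route) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, task_type: str, route: str, success: bool) -> None:
        stats = self._stats[(task_type, route)]
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, task_type: str, route: str) -> float:
        successes, attempts = self._stats.get((task_type, route), (0, 0))
        # Laplace smoothing: an unseen route scores 0.5 rather than 0.
        return (successes + 1) / (attempts + 2)

    def best_route(self, task_type: str, routes: list[str]) -> str:
        return max(routes, key=lambda r: self.success_rate(task_type, r))


memory = RouteMemory()
for ok in (True, True, False):
    memory.record("json-extraction", "model-y", ok)
memory.record("json-extraction", "model-z", False)
best = memory.best_route("json-extraction", ["model-y", "model-z"])
```

The smoothed success rate is the kind of signal the Provider Router could consume as its route memory criterion.
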
## 4. Data Flow

1. Client sends request.
2. Trust Router classifies request and assigns trust.
3. Policy Engine filters allowed routes.
4. Memory Layer is queried for context and prior route knowledge.
5. Compression Engine optimizes payload.
6. Provider Router selects execution target and fallback chain.
7. Execution Layer processes request.
8. Response is returned to client.
9. Receipt Engine generates immutable receipt.
10. Memory Layer is updated with outcome.
11. Route Reflector Memory updates routing knowledge.

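Steps 2 through 11 can be read as an ordered pipeline of stages that each take and return a request context. The sketch below only demonstrates the ordering: the stage names are hypothetical and each stage is a placeholder that records that it ran.

```python
from typing import Callable

Context = dict
Stage = Callable[[Context], Context]


def make_stage(name: str) -> Stage:
    # Placeholder stage: appends its name to the trace instead of doing real work.
    def stage(ctx: Context) -> Context:
        return {**ctx, "trace": [*ctx.get("trace", []), name]}

    return stage


# Order mirrors steps 2 to 11 of the data flow.
STAGE_NAMES = (
    "trust_router",
    "policy_engine",
    "memory_query",
    "compression_engine",
    "provider_router",
    "execution_layer",
    "respond",
    "receipt_engine",
    "memory_update",
    "route_reflector_update",
)
PIPELINE: list[Stage] = [make_stage(name) for name in STAGE_NAMES]


def handle(request: dict) -> Context:
    ctx: Context = {"request": request}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx


final = handle({"prompt": "hello"})
```

Passing an explicit context between stages keeps each decision visible, which fits the goal of recording every step in a receipt.
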
## 5. Modes Of Operation

### Live Mode

- real execution
- full routing active
- receipts stored
- memory updated

### Simulation Mode

- no real execution
- shows trust decisions
- shows policy decisions
- shows selected route and fallbacks
- estimates cost and tokens
- useful for testing policies

### Offline Mode

- only local models allowed
- no external provider calls
- remote sync disabled unless explicitly allowed
- receipts marked as offline

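The three modes mainly differ in whether a selected route may actually execute. A minimal sketch, assuming a hypothetical `execution_plan` helper: only the offline receipt marking comes from the list above, while the `simulation` and `live` receipt tags are added for symmetry and are assumptions.

```python
from enum import Enum


class Mode(Enum):
    LIVE = "live"
    SIMULATION = "simulation"
    OFFLINE = "offline"


def execution_plan(mode: Mode, route_is_local: bool) -> dict:
    if mode is Mode.SIMULATION:
        # Decisions are computed and shown, but nothing is executed.
        return {"execute": False, "receipt_tag": "simulation"}
    if mode is Mode.OFFLINE:
        # Only local routes may run; external provider calls are refused.
        return {"execute": route_is_local, "receipt_tag": "offline"}
    return {"execute": True, "receipt_tag": "live"}


offline_external = execution_plan(Mode.OFFLINE, route_is_local=False)
```

In practice the Policy Engine would already have removed external routes in offline mode, so this check acts as a second guard just before execution.
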
## 6. Control Functions

The system supports:

- trace request
- replay request
- force route
- override policy as admin
- inspect receipts
- inspect memory
- simulate routing
- compare routes
- inspect provider availability
- inspect route memory

## 7. Storage

Required storage components:

- receipts database: immutable logs
- memory database: structured + vector
- policy definitions
- routing history
- route reflector memory
- semantic cache
- reproducible run artifacts

Recommended default:

- SQLite for personal mode
- Postgres plus pgvector for team/server mode
- Git/Gitea as durable memory sync and audit transport

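For the SQLite personal mode, one way to approximate an immutable receipts log is a table guarded by triggers that reject updates and deletes. This is a sketch only: the `receipts` table layout and the trigger names are hypothetical, and an in-memory database is used so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # personal mode would point at a file instead
conn.executescript(
    """
    CREATE TABLE receipts (
        request_id TEXT PRIMARY KEY,
        created_at TEXT NOT NULL,
        body_json  TEXT NOT NULL
    );
    -- Reject any attempt to rewrite or remove a stored receipt.
    CREATE TRIGGER receipts_no_update BEFORE UPDATE ON receipts
    BEGIN SELECT RAISE(ABORT, 'receipts are immutable'); END;
    CREATE TRIGGER receipts_no_delete BEFORE DELETE ON receipts
    BEGIN SELECT RAISE(ABORT, 'receipts are immutable'); END;
    """
)

conn.execute(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    ("req-001", "2024-01-01T00:00:00Z", '{"route": "local-llm"}'),
)

try:
    conn.execute("UPDATE receipts SET body_json = '{}' WHERE request_id = 'req-001'")
    update_rejected = False
except sqlite3.DatabaseError:
    update_rejected = True

stored = conn.execute("SELECT body_json FROM receipts").fetchone()[0]
```

Triggers enforce the rule inside the database itself, so it still holds if another tool opens the same file. Anyone with direct write access to the file can still bypass it, so this complements rather than replaces an external audit trail.
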
## 8. Metrics

The system tracks:

- token usage
- compression ratio
- cache hit rate
- latency per provider
- cost per request
- routing success rate
- fallback rate
- trust level distribution
- blocked route count
- policy override count
- agent reputation
- benchmark scores

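Several of these metrics are simple aggregates over per-request records. The sketch below assumes a hypothetical `RequestRecord` type and defines compression ratio as tokens before divided by tokens after, which is one common convention and not something the design fixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    tokens_before: int
    tokens_after: int
    cache_hit: bool
    used_fallback: bool
    cost: float


def summarize(records: list[RequestRecord]) -> dict:
    if not records:
        raise ValueError("at least one record is required")
    n = len(records)
    before = sum(r.tokens_before for r in records)
    after = sum(r.tokens_after for r in records)
    if after == 0:
        raise ValueError("total tokens after compression must be positive")
    return {
        # Assumed convention: values above 1.0 mean the payload shrank.
        "compression_ratio": before / after,
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
        "fallback_rate": sum(r.used_fallback for r in records) / n,
        "cost_per_request": sum(r.cost for r in records) / n,
    }


summary = summarize(
    [
        RequestRecord(1000, 400, False, False, 0.02),
        RequestRecord(500, 100, True, False, 0.0),
        RequestRecord(500, 500, False, True, 0.04),
    ]
)
```

For these three records the compression ratio is 2.0 (2000 tokens reduced to 1000), and both the cache hit rate and the fallback rate are one in three.
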
## 9. Security Model

- strict policy enforcement before external calls
- data classification at entry
- local-first routing is possible
- no sensitive data leaves the system when policy blocks it
- no secrets are synced to memory
- audit trail via receipts
- consent ledger for tool, memory, and provider permissions
- safe config writer for external tool setup

## 10. Extensibility

The system supports:

- new providers
- new local models
- new tools
- new MCP resources
- new policy rules
- custom routing logic
- custom memory backends
- custom benchmarks
- custom data source connectors

## 11. Core Idea

LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.