AI Control Plane System Design
1. Purpose
LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.
It routes requests from clients to the right model, provider, agent, or tool based on:
- policy
- cost
- availability
- context
- memory
- trust level
- historical route success
It also provides:
- full observability through immutable receipts
- reproducible AI runs
- shared memory persistence
- route memory
- token and cost optimization
2. High-Level Architecture
Input Layer
clients, APIs, MCP, internal connectors
|
v
Control Plane
trust routing, policy, compression, memory, provider routing
|
v
Execution Layer
local models, external providers, tools, services
|
v
Output
response to caller
|
v
Receipts + Memory Update
Side System:
Memory Layer
global memory, project memory, route memory, semantic cache
3. Components
3.1 Client Entry
Clients connect via API, MCP, OpenAI-compatible endpoints, or internal connectors.
Supported client targets:
- Codex
- Claude Code
- ChatGPT
- Cursor
- VS Code and Continue-style IDEs
- automation pipelines
- n8n
- internal services
Each request should include:
- payload: prompt, input, files, tool call, or task
- metadata: user, project, agent, task type
- optional routing hints
- optional policy hints
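The request fields listed above could be modelled as a small typed envelope. This is a minimal sketch: the class names `GatewayRequest` and `RequestMetadata`, the field names, and the sample values are illustrative assumptions, not a schema defined by this design.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class RequestMetadata:
    """Who is asking and in what context (hypothetical field names)."""
    user: str
    project: str
    agent: str
    task_type: str


@dataclass(frozen=True)
class GatewayRequest:
    """One inbound request as seen by the control plane (assumed shape)."""
    payload: Any  # prompt, input, files, tool call, or task
    metadata: RequestMetadata
    routing_hints: dict[str, str] = field(default_factory=dict)  # optional
    policy_hints: dict[str, str] = field(default_factory=dict)   # optional


request = GatewayRequest(
    payload="Summarize the open incidents for this week.",
    metadata=RequestMetadata(user="alice", project="ops", agent="cli", task_type="general"),
    routing_hints={"prefer": "local"},
)
```

Freezing the dataclasses keeps the original request unchanged as it passes through later stages, which fits the goal of reproducible runs.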
3.2 Trust Router
The Trust Router is the first decision point.
Responsibilities:
- validate client identity
- assign trust level
- classify request type
- classify data sensitivity
- apply initial routing hints
- attach enriched request context
Example classification labels:
- code
- infra
- legal
- security
- general
- document
- automation
Output:
- enriched request context
- trust score
- sensitivity label
- classification label
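To make the Trust Router output concrete, here is a minimal sketch that derives a trust score, sensitivity label, and classification label from a client id and request text. The keyword rules, the `KNOWN_CLIENTS` table, the score values, and the `restricted`/`normal` labels are all assumptions for illustration; a real classifier would likely be more robust than substring matching.

```python
from dataclasses import dataclass

# Hypothetical keyword rules; only some of the example labels are covered here.
CLASSIFICATION_KEYWORDS = {
    "code": ("traceback", "function", "compile"),
    "infra": ("kubernetes", "terraform", "dns"),
    "legal": ("contract", "nda", "liability"),
    "security": ("vulnerability", "exploit", "cve"),
}
SENSITIVE_LABELS = {"legal", "security"}                    # assumed sensitive classes
KNOWN_CLIENTS = {"internal-service": 0.9, "cursor": 0.6}    # assumed trust scores


@dataclass(frozen=True)
class TrustDecision:
    trust_score: float
    sensitivity: str
    classification: str


def classify(client_id: str, text: str) -> TrustDecision:
    """Assign a classification, sensitivity label, and trust score."""
    lowered = text.lower()
    classification = "general"  # default when no rule matches
    for label, keywords in CLASSIFICATION_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            classification = label
            break
    sensitivity = "restricted" if classification in SENSITIVE_LABELS else "normal"
    trust_score = KNOWN_CLIENTS.get(client_id, 0.0)  # unknown clients get no trust
    return TrustDecision(trust_score, sensitivity, classification)


decision = classify("internal-service", "Please review this NDA contract")
```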
3.3 Policy Engine
The Policy Engine is the core decision system.
It evaluates:
- data sensitivity
- allowed providers
- allowed models
- allowed tools
- cost constraints
- project rules
- compliance rules
- offline/simulation/live mode
Example policies:
- never send legal data to public APIs
- prefer local models for internal code
- use external models only if the local model's confidence is below a threshold
- block requests containing secrets
- require admin override for production deployment tools
Output:
- allowed routes
- blocked routes
- required redactions
- execution constraints
- policy decision log
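The following sketch shows how two of the example policies (no legal data to public APIs, block requests containing secrets) might split candidate routes into allowed and blocked sets. The `Route` and `PolicyDecision` types, the `evaluate` function, and the route names are hypothetical; the design does not prescribe a policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    is_external: bool


@dataclass
class PolicyDecision:
    allowed: list[str]
    blocked: dict[str, str]  # route name -> reason, doubles as a decision log


def evaluate(routes: list[Route], classification: str, contains_secret: bool) -> PolicyDecision:
    """Apply two example rules to every candidate route."""
    decision = PolicyDecision(allowed=[], blocked={})
    for route in routes:
        if contains_secret:
            decision.blocked[route.name] = "request contains secrets"
        elif classification == "legal" and route.is_external:
            decision.blocked[route.name] = "legal data may not go to public APIs"
        else:
            decision.allowed.append(route.name)
    return decision


routes = [Route("local-llm", is_external=False), Route("public-api", is_external=True)]
legal_decision = evaluate(routes, classification="legal", contains_secret=False)
```

Recording a reason for every blocked route gives the Receipt Engine something to store under policy decisions and blocked routes.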
3.4 Memory Query
Memory is queried before compression and execution.
Memory sources:
- project memory
- global memory
- route memory
- semantic cache
- handoffs
- receipts
- reproducible runs
Output:
- relevant memory context
- prior decisions
- route hints
- cache candidates
3.5 Compression Engine
The Compression Engine optimizes request and memory context before execution.
Functions:
- token reduction
- context deduplication
- semantic summarization
- cache lookup
- prompt/context packaging
- token budget enforcement
Input:
- raw request
- policy constraints
- memory context
- target model context budget
Output:
- compressed payload
- token metrics before and after
- cache hit or miss
- compression receipt data
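Two of the functions above, context deduplication and token budget enforcement, can be sketched as follows. Token counting here is a whitespace word count, which is an assumption standing in for a real tokenizer, and `pack_context` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: a whitespace word count approximates the model's tokenizer.
    return len(text.split())


def pack_context(request: str, memory_chunks: list[str], token_budget: int) -> dict:
    """Drop duplicate memory chunks, then keep chunks until the budget is reached."""
    seen = set()
    unique = []
    for chunk in memory_chunks:
        key = " ".join(chunk.lower().split())  # case- and whitespace-insensitive key
        if key not in seen:
            seen.add(key)
            unique.append(chunk)

    tokens_before = estimate_tokens(request) + sum(estimate_tokens(c) for c in memory_chunks)
    used = estimate_tokens(request)  # the request itself is always kept
    kept = []
    for chunk in unique:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget
        kept.append(chunk)
        used += cost

    return {
        "payload": "\n".join(kept + [request]),
        "tokens_before": tokens_before,
        "tokens_after": used,
    }


packed = pack_context(
    "fix the bug",
    ["Use Python 3.11", "use python 3.11", "The service runs on port 8080 behind nginx"],
    token_budget=8,
)
```

In this sketch the request is never truncated, so a request larger than the budget still passes through; a real engine would need a rule for that case.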
3.6 Provider Router
The Provider Router makes the final execution decision.
It selects:
- local model
- external provider
- AI client/agent
- tool execution
- fallback route
Criteria:
- policy constraints
- trust level
- cost
- latency
- availability
- model capability
- route memory
- benchmark results
- agent reputation
Output:
- selected execution target
- fallback routes
- route explanation
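One way to combine several of these criteria is a weighted score over the routes the Policy Engine allowed. The sketch below is illustrative only: the `Candidate` fields, the weights, and the route names are assumptions, and the design does not specify a scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float
    latency_ms: float
    success_rate: float  # taken from route memory, 0.0 to 1.0
    available: bool


def rank(candidates: list[Candidate], allowed: set[str]) -> list[str]:
    """Return eligible route names, best first; the tail is the fallback chain."""
    def score(c: Candidate) -> float:
        # Hypothetical weights: reward success, penalize cost and latency.
        return c.success_rate - 0.1 * c.cost_per_1k_tokens - 0.0001 * c.latency_ms

    eligible = [c for c in candidates if c.available and c.name in allowed]
    return [c.name for c in sorted(eligible, key=score, reverse=True)]


candidates = [
    Candidate("local", cost_per_1k_tokens=0.0, latency_ms=900, success_rate=0.80, available=True),
    Candidate("ext-a", cost_per_1k_tokens=1.0, latency_ms=300, success_rate=0.95, available=True),
    Candidate("ext-b", cost_per_1k_tokens=0.5, latency_ms=200, success_rate=0.90, available=False),
]
ranking = rank(candidates, allowed={"local", "ext-a", "ext-b"})
```

The first entry is the selected execution target and the remaining entries form the fallback routes. Policy is applied as a hard filter rather than a score term, so a blocked route can never win on cost or latency.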
3.7 Execution Layer
The Execution Layer handles actual processing.
Execution target types:
- local models such as Ollama, LM Studio, LocalAI, llama.cpp, vLLM
- external APIs such as OpenAI, Anthropic, Mistral, Groq, OpenRouter
- AI clients such as Claude Code, Codex, Cursor, ChatGPT adapters
- tools, scripts, workflows, and internal services
Execution returns:
- raw response
- latency
- token usage
- provider metadata
- errors
- tool call results
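A fallback chain can be executed by trying each target in order and collecting the errors along the way. This is a minimal sketch under stated assumptions: targets are plain callables, any exception counts as a failure, and the result keys are hypothetical.

```python
import time
from typing import Callable


def execute_with_fallback(chain: list[tuple[str, Callable[[str], str]]], payload: str) -> dict:
    """Try each (name, callable) target in order until one succeeds."""
    errors = []
    for name, call in chain:
        started = time.perf_counter()
        try:
            response = call(payload)
        except Exception as exc:  # assumption: every exception triggers fallback
            errors.append({"target": name, "error": str(exc)})
            continue
        return {
            "target": name,
            "response": response,
            "latency_s": time.perf_counter() - started,
            "errors": errors,
        }
    # Every target failed; the caller still gets the full error list.
    return {"target": None, "response": None, "latency_s": None, "errors": errors}


def failing_target(payload: str) -> str:
    raise RuntimeError("provider unavailable")


def echo_target(payload: str) -> str:
    return f"echo: {payload}"


result = execute_with_fallback([("primary", failing_target), ("backup", echo_target)], "ping")
```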
3.8 Receipt Engine
The Receipt Engine creates an immutable trace for each request.
Receipts include:
- request id
- input summary or redacted input
- trust decisions
- policy decisions
- memory refs
- compression results
- selected model/provider/tool
- fallback chain
- response summary or full response depending on policy
- token usage
- cost estimate
- timestamps
- errors
- blocked routes
Receipts are written once to the receipts database and are never modified afterwards.
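The design does not say how immutability is enforced. One common technique, shown here purely as an assumption, is to chain receipts by hash so that any later modification becomes detectable. The function names and receipt fields are hypothetical.

```python
import hashlib
import json


def make_receipt(previous_hash: str, body: dict) -> dict:
    """Build a receipt whose hash covers its body and the previous receipt's hash."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((previous_hash + canonical).encode("utf-8")).hexdigest()
    return {"previous_hash": previous_hash, "body": body, "hash": digest}


def verify_chain(receipts: list[dict]) -> bool:
    """Recompute every hash in order; any edited or reordered receipt fails."""
    previous_hash = ""
    for receipt in receipts:
        expected = make_receipt(previous_hash, receipt["body"])["hash"]
        if receipt["previous_hash"] != previous_hash or receipt["hash"] != expected:
            return False
        previous_hash = receipt["hash"]
    return True


first = make_receipt("", {"request_id": "req-1", "route": "local"})
second = make_receipt(first["hash"], {"request_id": "req-2", "route": "external"})
chain_ok = verify_chain([first, second])
```

Hash chaining makes tampering evident but does not prevent it; write protection at the storage layer would still be needed.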
3.9 Memory Layer
Memory is separate from execution but connected to routing and compression.
Memory types:
- Project memory
  - task history
  - decisions
  - context
  - handoffs
- Global memory
  - shared knowledge
  - user/team preferences
  - reusable runbooks
- Route memory
  - routing decisions
  - success and failure patterns
  - optimization feedback
- Semantic cache
  - previous responses
  - embedding lookup
  - prompt/result reuse
Memory is:
- append-only by default
- queryable
- versioned where possible
- used during routing and compression
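The append-only and versioned properties can be illustrated with a small in-memory store in which a write never overwrites an earlier entry and each key accumulates numbered versions. `AppendOnlyMemory`, `MemoryEntry`, and the scope/key naming are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MemoryEntry:
    scope: str    # e.g. "project", "global", "route" (assumed scope names)
    key: str
    version: int
    value: Any


class AppendOnlyMemory:
    """In-memory stand-in for a memory backend; entries are only ever added."""

    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def append(self, scope: str, key: str, value: Any) -> MemoryEntry:
        version = 1 + len(self.history(scope, key))  # next version for this key
        entry = MemoryEntry(scope, key, version, value)
        self._entries.append(entry)
        return entry

    def history(self, scope: str, key: str) -> list[MemoryEntry]:
        return [e for e in self._entries if e.scope == scope and e.key == key]

    def latest(self, scope: str, key: str) -> Optional[MemoryEntry]:
        versions = self.history(scope, key)
        return versions[-1] if versions else None


memory = AppendOnlyMemory()
memory.append("project", "deploy-target", "staging")
memory.append("project", "deploy-target", "production")
```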
3.10 Route Reflector Memory
Route Reflector Memory is specialized route memory inspired by BGP route reflectors.
Functions:
- learns optimal AI routes
- shares routing knowledge across clients
- improves future routing decisions
- records fallback success and failures
- contributes to Provider Router decisions
Examples:
- code debugging works best through Codex plus local validation
- private infra diagnostics should route to local models
- long-form reasoning performs better on selected external models
- JSON extraction for project X has best success on model Y
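A simple way to learn statements like the examples above is to count successes per task type and route, then recommend the route with the best observed success rate. The sketch below assumes that approach; `RouteReflector`, `min_attempts`, and the task and route names are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class RouteReflector:
    """Tracks [successes, attempts] per (task type, route) pair."""

    def __init__(self) -> None:
        self._stats: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

    def record(self, task_type: str, route: str, success: bool) -> None:
        stats = self._stats[(task_type, route)]
        stats[0] += int(success)
        stats[1] += 1

    def best_route(self, task_type: str, min_attempts: int = 2) -> Optional[str]:
        """Return the route with the highest success rate, or None if unknown."""
        best_route, best_rate = None, -1.0
        for (seen_type, route), (successes, attempts) in self._stats.items():
            if seen_type != task_type or attempts < min_attempts:
                continue  # ignore other task types and thinly sampled routes
            rate = successes / attempts
            if rate > best_rate:
                best_route, best_rate = route, rate
        return best_route


reflector = RouteReflector()
for outcome in (True, True):
    reflector.record("json-extraction", "model-y", outcome)
for outcome in (True, False):
    reflector.record("json-extraction", "model-z", outcome)
```

The `min_attempts` threshold keeps a single lucky run from dominating the recommendation.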
4. Data Flow
- Client sends request.
- Trust Router classifies request and assigns trust.
- Policy Engine filters allowed routes.
- Memory Layer is queried for context and prior route knowledge.
- Compression Engine optimizes payload.
- Provider Router selects execution target and fallback chain.
- Execution Layer processes request.
- Response is returned to client.
- Receipt Engine generates immutable receipt.
- Memory Layer is updated with outcome.
- Route Reflector Memory updates routing knowledge.
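The flow above can be read as a pipeline of stages that each take and return a shared context. The sketch below shows only that ordering idea with stub stages; stopping early when a stage sets `blocked` is an assumption, and all names are hypothetical.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def run_pipeline(context: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order, recording which ones ran; stop if one blocks the request."""
    trace = []
    for name, stage in stages:
        context = stage(context)
        trace.append(name)
        if context.get("blocked"):
            break  # assumption: a blocked request skips the remaining stages
    context["trace"] = trace
    return context


def trust_router(ctx: dict) -> dict:
    return {**ctx, "classification": "legal"}


def policy_engine(ctx: dict) -> dict:
    # Stub rule: legal requests with no allowed routes are blocked.
    return {**ctx, "blocked": ctx["classification"] == "legal" and not ctx["local_available"]}


def execution_layer(ctx: dict) -> dict:
    return {**ctx, "response": "done"}


stages = [("trust", trust_router), ("policy", policy_engine), ("execute", execution_layer)]
blocked_run = run_pipeline({"local_available": False}, stages)
allowed_run = run_pipeline({"local_available": True}, stages)
```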
5. Modes Of Operation
Live Mode
- real execution
- full routing active
- receipts stored
- memory updated
Simulation Mode
- no real execution
- shows trust decisions
- shows policy decisions
- shows selected route and fallbacks
- estimates cost and tokens
- useful for testing policies
Offline Mode
- only local models allowed
- no external provider calls
- remote sync disabled unless explicitly allowed
- receipts marked as offline
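The three modes mainly change which routes are usable and whether anything is actually executed. A minimal sketch of that decision, assuming routes are given as `(name, is_local)` pairs and that receipts carry a mode tag:

```python
from enum import Enum


class Mode(Enum):
    LIVE = "live"
    SIMULATION = "simulation"
    OFFLINE = "offline"


def plan(mode: Mode, routes: list[tuple[str, bool]]) -> dict:
    """Decide usable routes and whether to execute, given (name, is_local) pairs."""
    if mode is Mode.OFFLINE:
        usable = [name for name, is_local in routes if is_local]  # local models only
    else:
        usable = [name for name, _ in routes]
    return {
        "routes": usable,
        "execute": mode is not Mode.SIMULATION,  # simulation never calls a target
        "receipt_tag": mode.value,               # hypothetical receipt field
    }


routes = [("ollama", True), ("openai", False)]
offline_plan = plan(Mode.OFFLINE, routes)
simulation_plan = plan(Mode.SIMULATION, routes)
```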
6. Control Functions
The system supports:
- trace request
- replay request
- force route
- override policy as admin
- inspect receipts
- inspect memory
- simulate routing
- compare routes
- inspect provider availability
- inspect route memory
7. Storage
Required storage components:
- receipts database: immutable logs
- memory database: structured + vector
- policy definitions
- routing history
- route reflector memory
- semantic cache
- reproducible run artifacts
Recommended default:
- SQLite for personal mode
- Postgres plus pgvector for team/server mode
- Git/Gitea as durable memory sync and audit transport
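For the SQLite personal mode, immutability of the receipts table could be enforced with triggers that reject updates and deletes. The table layout, column names, and sample row below are assumptions for illustration, not a schema defined by this design.

```python
import sqlite3

# Hypothetical schema: one row per receipt, with triggers that forbid changes.
SCHEMA = """
CREATE TABLE receipts (
    request_id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL,
    body_json  TEXT NOT NULL
);
CREATE TRIGGER receipts_no_update BEFORE UPDATE ON receipts
BEGIN
    SELECT RAISE(ABORT, 'receipts are immutable');
END;
CREATE TRIGGER receipts_no_delete BEFORE DELETE ON receipts
BEGIN
    SELECT RAISE(ABORT, 'receipts are immutable');
END;
"""


def try_statement(connection: sqlite3.Connection, sql: str) -> bool:
    """Return True if the statement ran, False if the database rejected it."""
    try:
        connection.execute(sql)
    except sqlite3.DatabaseError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    ("req-1", "2024-01-01T00:00:00Z", "{}"),
)
update_allowed = try_statement(conn, "UPDATE receipts SET body_json = '[]' WHERE request_id = 'req-1'")
delete_allowed = try_statement(conn, "DELETE FROM receipts WHERE request_id = 'req-1'")
```

Triggers guard against accidental writes through the application; anyone with direct file access can still drop them, so this complements rather than replaces an external audit copy.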
8. Metrics
System tracks:
- token usage
- compression ratio
- cache hit rate
- latency per provider
- cost per request
- routing success rate
- fallback rate
- trust level distribution
- blocked route count
- policy override count
- agent reputation
- benchmark scores
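Several of these metrics can be derived directly from stored receipts. The sketch below assumes each receipt exposes `tokens_before`, `tokens_after`, `cache_hit`, and `used_fallback` fields, and it defines compression ratio as tokens after divided by tokens before; both the field names and that definition are assumptions.

```python
def summarize(receipts: list[dict]) -> dict:
    """Aggregate a few metrics over a batch of receipts."""
    total = len(receipts)
    before = sum(r["tokens_before"] for r in receipts)
    after = sum(r["tokens_after"] for r in receipts)
    return {
        # Assumed definition: lower means more compression; 1.0 means none.
        "compression_ratio": after / before if before else 1.0,
        "cache_hit_rate": sum(r["cache_hit"] for r in receipts) / total if total else 0.0,
        "fallback_rate": sum(r["used_fallback"] for r in receipts) / total if total else 0.0,
    }


summary = summarize([
    {"tokens_before": 1000, "tokens_after": 500, "cache_hit": True, "used_fallback": False},
    {"tokens_before": 1000, "tokens_after": 500, "cache_hit": False, "used_fallback": False},
])
```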
9. Security Model
- strict policy enforcement before external calls
- data classification at entry
- local-first routing is supported
- sensitive data never leaves the system when policy blocks the route
- no secret sync to memory
- audit trail via receipts
- consent ledger for tool, memory, and provider permissions
- safe config writer for external tool setup
10. Extensibility
The system supports:
- new providers
- new local models
- new tools
- new MCP resources
- new policy rules
- custom routing logic
- custom memory backends
- custom benchmarks
- custom data source connectors
11. Core Idea
LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.