llm-gateway/AI_CONTROL_PLANE_SYSTEM_DESIGN.md

# AI Control Plane System Design

## 1. Purpose

LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.

It routes requests from clients to the right model, provider, agent, or tool based on:

- policy
- cost
- availability
- context
- memory
- trust level
- historical route success

It also provides:

- full observability through immutable receipts
- reproducible AI runs
- shared memory persistence
- route memory
- token and cost optimization

## 2. High-Level Architecture

```text
Input Layer
  clients, APIs, MCP, internal connectors
      |
      v
Control Plane
  trust routing, policy, memory, compression, provider routing
      |
      v
Execution Layer
  local models, external providers, tools, services
      |
      v
Output
  response to caller
      |
      v
Receipts + Memory Update

Side System:
  Memory Layer
    global memory, project memory, route memory, semantic cache
```

## 3. Components

### 3.1 Client Entry

Clients connect via API, MCP, OpenAI-compatible endpoints, or internal connectors.

Supported client targets:

- Codex
- Claude Code
- ChatGPT
- Cursor
- VS Code and Continue-style IDEs
- automation pipelines
- n8n
- internal services

Each request should include:

- payload: prompt, input, files, tool call, or task
- metadata: user, project, agent, task type
- optional routing hints
- optional policy hints
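
As a minimal sketch, the request shape listed above could be modeled as a pair of immutable dataclasses. The class and field names (`GatewayRequest`, `RequestMetadata`, `routing_hints`, `policy_hints`) are illustrative assumptions, not a defined wire format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class RequestMetadata:
    user: str
    project: str
    agent: str
    task_type: str


@dataclass(frozen=True)
class GatewayRequest:
    # Hypothetical envelope; field names are assumptions for illustration.
    payload: Any  # prompt, input, files, tool call, or task
    metadata: RequestMetadata
    routing_hints: dict[str, str] = field(default_factory=dict)  # optional
    policy_hints: dict[str, str] = field(default_factory=dict)   # optional


req = GatewayRequest(
    payload="Summarize the failing test output",
    metadata=RequestMetadata(user="alice", project="demo", agent="cli", task_type="code"),
    routing_hints={"prefer": "local"},
)
```

Keeping hints optional and defaulting them to empty mappings lets simple clients send only a payload and metadata.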

### 3.2 Trust Router

The Trust Router is the first decision point.

Responsibilities:

- validate client identity
- assign trust level
- classify request type
- classify data sensitivity
- apply initial routing hints
- attach enriched request context

Example classification labels:

- code
- infra
- legal
- security
- general
- document
- automation

Output:

- enriched request context
- trust score
- sensitivity label
- classification label
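
The sketch below shows one way the Trust Router outputs could be produced. The keyword table, the per-client trust scores, and the sensitivity labels (`restricted`, `internal`) are hypothetical placeholders; a real classifier would likely be more sophisticated than substring matching.

```python
from dataclasses import dataclass

# Hypothetical keyword table used only for illustration.
KEYWORDS = {
    "code": ("traceback", "function", "compile"),
    "infra": ("kubernetes", "terraform", "dns"),
    "legal": ("contract", "nda", "liability"),
    "security": ("cve", "exploit", "vulnerability"),
}

# Hypothetical trust scores for known clients; unknown clients get a low default.
CLIENT_TRUST = {"internal-service": 0.9, "ide-plugin": 0.7}


@dataclass(frozen=True)
class TrustDecision:
    trust_score: float
    sensitivity: str
    classification: str


def classify(client_id: str, text: str) -> TrustDecision:
    lowered = text.lower()
    label = "general"
    for candidate, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            label = candidate
            break
    # Assumed rule: legal and security requests are treated as restricted.
    sensitivity = "restricted" if label in ("legal", "security") else "internal"
    return TrustDecision(CLIENT_TRUST.get(client_id, 0.1), sensitivity, label)


decision = classify("ide-plugin", "Review this NDA contract")
```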

### 3.3 Policy Engine

The Policy Engine is the core decision system.

It evaluates:

- data sensitivity
- allowed providers
- allowed models
- allowed tools
- cost constraints
- project rules
- compliance rules
- operating mode (live, simulation, or offline)

Example policies:

- never send legal data to public APIs
- prefer local models for internal code
- use external models only if local-model confidence falls below a configured threshold
- block requests containing secrets
- require admin override for production deployment tools

Output:

- allowed routes
- blocked routes
- required redactions
- execution constraints
- policy decision log
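
To make the input-to-output mapping concrete, here is a small sketch that applies two of the example policies (block requests containing secrets, never send legal data to public APIs) and returns allowed routes, blocked routes, and a decision log. The `Route` and `PolicyDecision` types and the `evaluate` function are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    name: str
    is_local: bool


@dataclass
class PolicyDecision:
    allowed: list[str] = field(default_factory=list)
    blocked: dict[str, str] = field(default_factory=dict)  # route name -> reason
    log: list[str] = field(default_factory=list)


def evaluate(classification: str, contains_secret: bool, routes: list[Route]) -> PolicyDecision:
    decision = PolicyDecision()
    for route in routes:
        if contains_secret:
            reason = "request contains secrets"
        elif classification == "legal" and not route.is_local:
            reason = "legal data may not go to public APIs"
        else:
            reason = ""
        if reason:
            decision.blocked[route.name] = reason
            decision.log.append(f"block {route.name}: {reason}")
        else:
            decision.allowed.append(route.name)
            decision.log.append(f"allow {route.name}")
    return decision


routes = [Route("local-llm", True), Route("public-api", False)]
decision = evaluate("legal", contains_secret=False, routes=routes)
```

Recording a reason for every blocked route gives the Receipt Engine something to store without re-running the evaluation.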

### 3.4 Memory Query

Memory is queried before compression and execution.

Memory sources:

- project memory
- global memory
- route memory
- semantic cache
- handoffs
- receipts
- reproducible runs

Output:

- relevant memory context
- prior decisions
- route hints
- cache candidates

### 3.5 Compression Engine

The Compression Engine optimizes request and memory context before execution.

Functions:

- token reduction
- context deduplication
- semantic summarization
- cache lookup
- prompt/context packaging
- token budget enforcement

Input:

- raw request
- policy constraints
- memory context
- target model context budget

Output:

- compressed payload
- token metrics before and after
- cache hit or miss
- compression receipt data
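
A minimal sketch of context deduplication plus token budget enforcement follows. It assumes a crude one-token-per-word estimate and a hypothetical `compress` function; a real implementation would use the target model's tokenizer and could add summarization.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


@dataclass(frozen=True)
class CompressionResult:
    chunks: list[str]
    tokens_before: int
    tokens_after: int


def compress(request: str, memory_chunks: list[str], budget: int) -> CompressionResult:
    before = estimate_tokens(request) + sum(estimate_tokens(c) for c in memory_chunks)
    kept: list[str] = []
    seen: set[str] = set()
    used = estimate_tokens(request)
    for chunk in memory_chunks:
        key = " ".join(chunk.lower().split())  # normalize case and spacing for deduplication
        cost = estimate_tokens(chunk)
        if key in seen or used + cost > budget:
            continue  # drop duplicates and anything that would exceed the budget
        seen.add(key)
        kept.append(chunk)
        used += cost
    return CompressionResult(kept, before, used)


result = compress(
    "fix the login bug",
    ["Login uses OAuth", "login uses  oauth", "Deploys run nightly on the staging cluster"],
    budget=8,
)
```

Here the second chunk is dropped as a duplicate and the third is dropped because it would exceed the budget, so the estimate falls from 17 tokens to 7.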

### 3.6 Provider Router

The Provider Router makes the final execution decision.

It selects:

- local model
- external provider
- AI client/agent
- tool execution
- fallback route

Criteria:

- policy constraints
- trust level
- cost
- latency
- availability
- model capability
- route memory
- benchmark results
- agent reputation

Output:

- selected execution target
- fallback routes
- route explanation
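
One possible way to combine these criteria is a weighted score over the routes that policy allows. The weights, the `Candidate` fields, and the `select` function below are hypothetical; the sketch covers only cost, latency, availability, and route-memory success rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_usd: float       # estimated cost per request
    latency_ms: float
    success_rate: float   # taken from route memory, 0.0 to 1.0
    available: bool


def score(c: Candidate) -> float:
    # Hypothetical weights: reward past success, penalize cost and latency.
    return c.success_rate - 10.0 * c.cost_usd - c.latency_ms / 10_000.0


def select(candidates: list[Candidate], allowed: set[str]) -> tuple[str, list[str], str]:
    eligible = [c for c in candidates if c.available and c.name in allowed]
    if not eligible:
        raise LookupError("no allowed route is available")
    ranked = sorted(eligible, key=score, reverse=True)
    best = ranked[0]
    explanation = f"{best.name} scored {score(best):.3f} among {len(ranked)} eligible routes"
    return best.name, [c.name for c in ranked[1:]], explanation


candidates = [
    Candidate("local-llm", 0.0, 900.0, 0.80, True),
    Candidate("provider-a", 0.01, 400.0, 0.95, True),
    Candidate("provider-b", 0.002, 300.0, 0.90, False),
]
target, fallbacks, why = select(candidates, allowed={"local-llm", "provider-a", "provider-b"})
```

Sorting with a fixed scoring function over the same inputs keeps the selection deterministic, and the remaining ranked routes double as the fallback chain.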

### 3.7 Execution Layer

The Execution Layer handles actual processing.

Execution target types:

- local models such as Ollama, LM Studio, LocalAI, llama.cpp, vLLM
- external APIs such as OpenAI, Anthropic, Mistral, Groq, OpenRouter
- AI clients such as Claude Code, Codex, Cursor, ChatGPT adapters
- tools, scripts, workflows, and internal services

Execution returns:

- raw response
- latency
- token usage
- provider metadata
- errors
- tool call results
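
The sketch below illustrates how an execution step might walk a fallback chain and collect latency and errors along the way. The provider functions are stand-ins, and `execute_with_fallback` is a hypothetical helper, not an existing API.

```python
import time
from typing import Callable


def execute_with_fallback(payload: str, chain: list[tuple[str, Callable[[str], str]]]) -> dict:
    errors = []
    for name, call in chain:
        started = time.perf_counter()
        try:
            response = call(payload)
        except Exception as exc:  # record the failure and try the next route
            errors.append({"target": name, "error": str(exc)})
            continue
        return {
            "target": name,
            "response": response,
            "latency_s": time.perf_counter() - started,
            "errors": errors,
        }
    # Every route failed; the caller still receives the collected errors.
    return {"target": None, "response": None, "latency_s": None, "errors": errors}


def flaky_provider(payload: str) -> str:
    raise TimeoutError("provider timed out")


def local_model(payload: str) -> str:
    return f"echo: {payload}"


result = execute_with_fallback("ping", [("provider-a", flaky_provider), ("local-llm", local_model)])
```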

### 3.8 Receipt Engine

The Receipt Engine creates an immutable trace for each request.

Receipts include:

- request id
- input summary or redacted input
- trust decisions
- policy decisions
- memory refs
- compression results
- selected model/provider/tool
- fallback chain
- response summary or full response depending on policy
- token usage
- cost estimate
- timestamps
- errors
- blocked routes

Receipts are written once to the receipts database and are never modified afterward.
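
One way to make tampering detectable is to chain receipts by hash, so that each receipt stores the digest of the previous one. This design does not prescribe hash chaining; the sketch below is an assumption about how immutability could be verified, with a reduced, hypothetical set of receipt fields.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

GENESIS = "0" * 64  # assumed placeholder hash for the first receipt


@dataclass(frozen=True)
class Receipt:
    request_id: str
    route: str
    token_usage: int
    prev_hash: str

    def digest(self) -> str:
        body = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(body).hexdigest()


def verify_chain(receipts: list[Receipt]) -> bool:
    prev = GENESIS
    for receipt in receipts:
        if receipt.prev_hash != prev:
            return False
        prev = receipt.digest()
    return True


first = Receipt("req-1", "local-llm", 120, GENESIS)
second = Receipt("req-2", "provider-a", 300, first.digest())
tampered = Receipt("req-1", "local-llm", 999, GENESIS)
```

With this layout, `verify_chain([first, second])` succeeds, while substituting `tampered` for `first` breaks the link to `second`.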

### 3.9 Memory Layer

Memory is separate from execution but connected to routing and compression.

Memory types:

1. Project memory
   - task history
   - decisions
   - context
   - handoffs

2. Global memory
   - shared knowledge
   - user/team preferences
   - reusable runbooks

3. Route memory
   - routing decisions
   - success and failure patterns
   - optimization feedback

4. Semantic cache
   - previous responses
   - embedding lookup
   - prompt/result reuse

Memory is:

- append-only by default
- queryable
- versioned where possible
- used during routing and compression
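
A minimal in-memory sketch of the append-only, versioned, queryable properties is shown below. The `AppendOnlyMemory` class and its method names are hypothetical; a real backend would persist entries to the storage described in Section 7.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryEntry:
    scope: str   # "project", "global", "route", or "cache"
    key: str
    value: str
    version: int


class AppendOnlyMemory:
    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def append(self, scope: str, key: str, value: str) -> MemoryEntry:
        # New writes never overwrite; they add a higher version for the same key.
        version = 1 + sum(1 for e in self._entries if (e.scope, e.key) == (scope, key))
        entry = MemoryEntry(scope, key, value, version)
        self._entries.append(entry)
        return entry

    def latest(self, scope: str, key: str) -> Optional[MemoryEntry]:
        for entry in reversed(self._entries):
            if (entry.scope, entry.key) == (scope, key):
                return entry
        return None

    def history(self, scope: str, key: str) -> list[MemoryEntry]:
        return [e for e in self._entries if (e.scope, e.key) == (scope, key)]


memory = AppendOnlyMemory()
memory.append("project", "db-choice", "SQLite")
memory.append("project", "db-choice", "Postgres plus pgvector")
```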

### 3.10 Route Reflector Memory

Route Reflector Memory is specialized route memory inspired by BGP route reflectors.

Functions:

- learns optimal AI routes
- shares routing knowledge across clients
- improves future routing decisions
- records fallback success and failures
- contributes to Provider Router decisions

Examples:

- code debugging works best through Codex plus local validation
- private infra diagnostics should route to local models
- long-form reasoning performs better on selected external models
- JSON extraction for project X has best success on model Y
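
A sketch of how route outcomes might be recorded and reused follows. It assumes a simple success counter per task type and route, with Laplace smoothing so unseen routes start at 0.5; the `RouteMemory` class and the names `model-y` and `model-z` are illustrative.

```python
from collections import defaultdict


class RouteMemory:
    def __init__(self) -> None:
        # (task_type, route) -> [successes, attempts]
        self._stats: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

    def record(self, task_type: str, route: str, success: bool) -> None:
        stats = self._stats[(task_type, route)]
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, task_type: str, route: str) -> float:
        successes, attempts = self._stats.get((task_type, route), (0, 0))
        # Laplace smoothing: an unseen route scores 0.5 instead of 0 or 1.
        return (successes + 1) / (attempts + 2)

    def best_route(self, task_type: str, routes: list[str]) -> str:
        return max(routes, key=lambda r: self.success_rate(task_type, r))


route_memory = RouteMemory()
for outcome in (True, True, True):
    route_memory.record("json-extraction", "model-y", outcome)
for outcome in (True, False):
    route_memory.record("json-extraction", "model-z", outcome)
```

The Provider Router could then feed `success_rate` into its ranking as the route memory criterion.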

## 4. Data Flow

1. Client sends request.
2. Trust Router classifies request and assigns trust.
3. Policy Engine filters allowed routes.
4. Memory Layer is queried for context and prior route knowledge.
5. Compression Engine optimizes payload.
6. Provider Router selects execution target and fallback chain.
7. Execution Layer processes request.
8. Response is returned to client.
9. Receipt Engine generates immutable receipt.
10. Memory Layer is updated with outcome.
11. Route Reflector Memory updates routing knowledge.
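
Steps 2 through 7 can be sketched as an ordered list of stage functions that each enrich a shared context. The stage bodies below are placeholders with hard-coded results, and steps 9 to 11 (receipt and memory updates) are omitted for brevity; all names are assumptions.

```python
def trust_router(ctx: dict) -> dict:
    return {**ctx, "trust_level": "high", "classification": "code"}


def policy_engine(ctx: dict) -> dict:
    return {**ctx, "allowed_routes": ["local-llm"]}


def memory_query(ctx: dict) -> dict:
    return {**ctx, "memory_context": []}


def compression_engine(ctx: dict) -> dict:
    return {**ctx, "compressed_payload": ctx["payload"].strip()}


def provider_router(ctx: dict) -> dict:
    return {**ctx, "target": ctx["allowed_routes"][0]}


def execution_layer(ctx: dict) -> dict:
    return {**ctx, "response": f"[{ctx['target']}] handled: {ctx['compressed_payload']}"}


# The order mirrors steps 2 to 7 of the data flow.
STAGES = [trust_router, policy_engine, memory_query, compression_engine, provider_router, execution_layer]


def handle(payload: str) -> dict:
    ctx = {"payload": payload, "trace": []}
    for stage in STAGES:
        ctx = stage(ctx)
        ctx["trace"].append(stage.__name__)  # record which stages ran, in order
    return ctx


out = handle("  hello ")
```

The recorded `trace` gives a cheap starting point for the trace and replay control functions.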

## 5. Modes of Operation

### Live Mode

- real execution
- full routing active
- receipts stored
- memory updated

### Simulation Mode

- no real execution
- shows trust decisions
- shows policy decisions
- shows selected route and fallbacks
- estimates cost and tokens
- useful for testing policies

### Offline Mode

- only local models allowed
- no external provider calls
- remote sync disabled unless explicitly allowed
- receipts marked as offline
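
The three modes can be expressed as a small gate in front of the Execution Layer. In this sketch the `plan_execution` function, the `receipt_tag` field, and the `live` tag are assumptions; the document only states that offline receipts are marked as offline.

```python
from enum import Enum


class Mode(Enum):
    LIVE = "live"
    SIMULATION = "simulation"
    OFFLINE = "offline"


def plan_execution(mode: Mode, route_is_local: bool) -> dict:
    if mode is Mode.SIMULATION:
        # Decisions are computed and shown, but nothing is executed.
        return {"execute": False, "receipt_tag": None, "reason": "simulation mode never executes"}
    if mode is Mode.OFFLINE and not route_is_local:
        return {"execute": False, "receipt_tag": "offline", "reason": "offline mode allows local models only"}
    tag = "offline" if mode is Mode.OFFLINE else "live"
    return {"execute": True, "receipt_tag": tag, "reason": "allowed"}


plan = plan_execution(Mode.OFFLINE, route_is_local=False)
```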

## 6. Control Functions

The system supports:

- trace request
- replay request
- force route
- override policy as admin
- inspect receipts
- inspect memory
- simulate routing
- compare routes
- inspect provider availability
- inspect route memory

## 7. Storage

Required storage components:

- receipts database: immutable logs
- memory database: structured + vector
- policy definitions
- routing history
- route reflector memory
- semantic cache
- reproducible run artifacts

Recommended default:

- SQLite for personal mode
- Postgres plus pgvector for team/server mode
- Git/Gitea as durable memory sync and audit transport
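
For the SQLite personal mode, one way to enforce receipt immutability at the storage level is a pair of triggers that abort updates and deletes. The table layout and column names below are hypothetical; only the standard-library `sqlite3` module is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE receipts (
        request_id TEXT PRIMARY KEY,
        route      TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        body_json  TEXT NOT NULL
    );
    CREATE TRIGGER receipts_no_update BEFORE UPDATE ON receipts
    BEGIN SELECT RAISE(ABORT, 'receipts are immutable'); END;
    CREATE TRIGGER receipts_no_delete BEFORE DELETE ON receipts
    BEGIN SELECT RAISE(ABORT, 'receipts are immutable'); END;
    """
)
conn.execute(
    "INSERT INTO receipts (request_id, route, body_json) VALUES (?, ?, ?)",
    ("req-1", "local-llm", "{}"),
)

try:
    conn.execute("UPDATE receipts SET route = 'provider-a' WHERE request_id = 'req-1'")
    update_blocked = False
except sqlite3.DatabaseError:
    # The trigger aborts the statement, so the stored row is unchanged.
    update_blocked = True
```

Triggers protect against accidental writes through the application; they do not stop someone with direct file access, which is where an external audit transport such as Git can help.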

## 8. Metrics

The system tracks:

- token usage
- compression ratio
- cache hit rate
- latency per provider
- cost per request
- routing success rate
- fallback rate
- trust level distribution
- blocked route count
- policy override count
- agent reputation
- benchmark scores
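
Several of these metrics are simple aggregates over per-request records. The sketch below assumes a hypothetical `RequestRecord` shape and defines compression ratio as tokens after divided by tokens before, which is one of several possible conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    tokens_before: int
    tokens_after: int
    cache_hit: bool
    used_fallback: bool
    cost_usd: float


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    n = len(records)
    if n == 0:
        return {}
    before = sum(r.tokens_before for r in records)
    after = sum(r.tokens_after for r in records)
    return {
        # Assumed convention: lower is better, 1.0 means no compression.
        "compression_ratio": after / before if before else 1.0,
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
        "fallback_rate": sum(r.used_fallback for r in records) / n,
        "cost_per_request": sum(r.cost_usd for r in records) / n,
    }


summary = summarize([
    RequestRecord(1000, 400, True, False, 0.0),
    RequestRecord(1000, 600, False, True, 0.02),
])
```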

## 9. Security Model

- strict policy enforcement before external calls
- data classification at entry
- local-first routing is supported
- sensitive data never leaves the system when policy blocks the route
- no secret sync to memory
- audit trail via receipts
- consent ledger for tool, memory, and provider permissions
- safe config writer for external tool setup

## 10. Extensibility

The system supports:

- new providers
- new local models
- new tools
- new MCP resources
- new policy rules
- custom routing logic
- custom memory backends
- custom benchmarks
- custom data source connectors

## 11. Core Idea

LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.