rene/llm-gateway

Rene Fichtmueller 060b846d9b feat: publish llm gateway v2 dashboard alongside restored workbench

2026-05-01 17:43:32 +02:00

AI Control Plane System Design

1. Purpose

LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.

It routes requests from clients to the right model, provider, agent, or tool based on:

policy
cost
availability
context
memory
trust level
historical route success

It also provides:

full observability through immutable receipts
reproducible AI runs
shared memory persistence
route memory
token and cost optimization

2. High-Level Architecture

Input Layer
  clients, APIs, MCP, internal connectors
      |
      v
Control Plane
  trust routing, policy, compression, memory, provider routing
      |
      v
Execution Layer
  local models, external providers, tools, services
      |
      v
Output
  response to caller
      |
      v
Receipts + Memory Update

Side System:
  Memory Layer
    global memory, project memory, route memory, semantic cache

3. Components

3.1 Client Entry

Clients connect via API, MCP, OpenAI-compatible endpoints, or internal connectors.

Supported client targets:

Codex
Claude Code
ChatGPT
Cursor
VS Code and Continue-style IDEs
automation pipelines
n8n
internal services

Each request should include:

payload: prompt, input, files, tool call, or task
metadata: user, project, agent, task type
optional routing hints
optional policy hints
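
As a rough illustration of the request shape listed above, the envelope could be modeled as a pair of immutable dataclasses. The class and field names (`GatewayRequest`, `RequestMetadata`, `routing_hints`, `policy_hints`) are assumptions made for this sketch, not names defined by the gateway.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class RequestMetadata:
    """Who is asking and in what context (hypothetical field names)."""
    user: str
    project: str
    agent: str
    task_type: str


@dataclass(frozen=True)
class GatewayRequest:
    """One inbound request: payload, metadata, and optional hints."""
    payload: Any  # prompt, input, files, tool call, or task
    metadata: RequestMetadata
    routing_hints: dict[str, str] = field(default_factory=dict)
    policy_hints: dict[str, str] = field(default_factory=dict)


request = GatewayRequest(
    payload="Summarize the open incidents for this week.",
    metadata=RequestMetadata(
        user="alice", project="ops", agent="cli", task_type="general"
    ),
    routing_hints={"prefer": "local"},
)
```

Both hint fields default to empty dictionaries, so a client that sends no hints still produces a valid request.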

3.2 Trust Router

The Trust Router is the first decision point.

Responsibilities:

validate client identity
assign trust level
classify request type
classify data sensitivity
apply initial routing hints
attach enriched request context

Example classification labels:

code
infra
legal
security
general
document
automation

Output:

enriched request context
trust score
sensitivity label
classification label
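
The Trust Router outputs above can be sketched with a deliberately naive keyword classifier. Everything here is an assumption for illustration: the keyword table, the `KNOWN_CLIENTS` trust scores, and the rule that legal, security, and infra requests count as restricted. A real deployment would likely use a stronger classifier and a real identity check.

```python
from dataclasses import dataclass

# Hypothetical keyword table; checked in insertion order.
CLASSIFICATION_KEYWORDS = {
    "code": ("traceback", "function", "compile"),
    "infra": ("kubernetes", "dns", "firewall"),
    "legal": ("contract", "nda", "liability"),
    "security": ("vulnerability", "exploit", "cve"),
}
SENSITIVE_LABELS = {"legal", "security", "infra"}
# Hypothetical client registry mapping client id to a trust score.
KNOWN_CLIENTS = {"ci-pipeline": 0.9, "ide-plugin": 0.7}


@dataclass(frozen=True)
class TrustDecision:
    trust_score: float
    sensitivity: str
    classification: str


def classify(client_id: str, prompt: str) -> TrustDecision:
    """Assign a trust score, sensitivity label, and classification label."""
    text = prompt.lower()
    label = "general"
    for candidate, keywords in CLASSIFICATION_KEYWORDS.items():
        if any(word in text for word in keywords):
            label = candidate
            break
    sensitivity = "restricted" if label in SENSITIVE_LABELS else "normal"
    # Unknown clients get a low default trust score.
    trust = KNOWN_CLIENTS.get(client_id, 0.1)
    return TrustDecision(trust, sensitivity, label)
```

Substring matching is crude (a keyword can match inside an unrelated word), which is one reason this should be read as a placeholder for the classification step rather than a recommended method.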

3.3 Policy Engine

The Policy Engine is the core decision system.

It evaluates:

data sensitivity
allowed providers
allowed models
allowed tools
cost constraints
project rules
compliance rules
operating mode (live, simulation, or offline)

Example policies:

never send legal data to public APIs
prefer local models for internal code
use external models only if confidence is below a threshold
block requests containing secrets
require admin override for production deployment tools

Output:

allowed routes
blocked routes
required redactions
execution constraints
policy decision log
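
A minimal sketch of how two of the example policies might be evaluated, producing allowed routes and blocked routes with a reason for each. The `Route` and `PolicyDecision` types and the `evaluate` function are hypothetical; the gateway's actual policy format is not specified here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    name: str
    is_external: bool


@dataclass
class PolicyDecision:
    allowed: list[str] = field(default_factory=list)
    blocked: dict[str, str] = field(default_factory=dict)  # route -> reason


def evaluate(
    classification: str, contains_secret: bool, routes: list[Route]
) -> PolicyDecision:
    """Apply two example policies and record why each route was blocked."""
    decision = PolicyDecision()
    for route in routes:
        if contains_secret:
            # "block requests containing secrets"
            decision.blocked[route.name] = "request contains secrets"
        elif classification == "legal" and route.is_external:
            # "never send legal data to public APIs"
            decision.blocked[route.name] = "legal data may not go to public APIs"
        else:
            decision.allowed.append(route.name)
    return decision


ROUTES = [Route("local-llm", False), Route("public-api", True)]
legal_decision = evaluate("legal", contains_secret=False, routes=ROUTES)
```

Keeping the reason string next to each blocked route gives the policy decision log something concrete to record.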

3.4 Memory Query

Memory is queried before compression and execution.

Memory sources:

project memory
global memory
route memory
semantic cache
handoffs
receipts
reproducible runs

Output:

relevant memory context
prior decisions
route hints
cache candidates

3.5 Compression Engine

The Compression Engine optimizes request and memory context before execution.

Functions:

token reduction
context deduplication
semantic summarization
cache lookup
prompt/context packaging
token budget enforcement

Input:

raw request
policy constraints
memory context
target model context budget

Output:

compressed payload
token metrics before and after
cache hit or miss
compression receipt data
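
To make the deduplication and token budget steps concrete, here is a small sketch. It assumes a whitespace word count as a stand-in for a real tokenizer, and the function names (`estimate_tokens`, `compress`) are illustrative only. Semantic summarization and cache lookup are left out.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def compress(request: str, memory_chunks: list[str], token_budget: int) -> dict:
    """Drop duplicate memory chunks, then keep chunks until the budget is hit."""
    seen = set()
    unique = []
    for chunk in memory_chunks:
        # Normalize case and whitespace so near-identical chunks collapse.
        key = " ".join(chunk.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)

    tokens_before = estimate_tokens(request) + sum(
        estimate_tokens(c) for c in memory_chunks
    )

    kept = []
    used = estimate_tokens(request)  # the request itself is never truncated
    for chunk in unique:
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            break
        kept.append(chunk)
        used += cost

    return {
        "payload": "\n".join(kept + [request]),
        "tokens_before": tokens_before,
        "tokens_after": used,
    }
```

The before and after counts are returned together so they can be copied into the compression receipt data.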

3.6 Provider Router

The Provider Router makes the final execution decision.

It selects:

local model
external provider
AI client/agent
tool execution
fallback route

Criteria:

policy constraints
trust level
cost
latency
availability
model capability
route memory
benchmark results
agent reputation

Output:

selected execution target
fallback routes
route explanation
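
One possible way to combine several of these criteria is a weighted score over the routes that policy allows and that are currently available. The weights, the `Candidate` fields, and the `select` function below are assumptions chosen for illustration; the document does not prescribe a scoring formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tokens: float
    latency_ms: float
    success_rate: float  # e.g. taken from route memory
    available: bool


def score(c: Candidate) -> float:
    """Higher is better. Weights are arbitrary placeholders."""
    return c.success_rate - 0.5 * c.cost_per_1k_tokens - 0.0001 * c.latency_ms


def select(
    candidates: list[Candidate], allowed: set[str]
) -> tuple[Optional[str], list[str]]:
    """Return (selected target, fallback chain) among allowed, available routes."""
    eligible = [c for c in candidates if c.available and c.name in allowed]
    ranked = [c.name for c in sorted(eligible, key=score, reverse=True)]
    if not ranked:
        return None, []
    return ranked[0], ranked[1:]


CANDIDATES = [
    Candidate("local", 0.0, 800.0, 0.80, True),
    Candidate("hosted-a", 0.2, 400.0, 0.95, True),
    Candidate("hosted-b", 0.1, 300.0, 0.90, False),
]
target, fallbacks = select(CANDIDATES, {"local", "hosted-a", "hosted-b"})
```

Because policy filtering happens before scoring, a cheap or fast route can never be selected if the Policy Engine blocked it.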

3.7 Execution Layer

The Execution Layer handles actual processing.

Execution target types:

local models such as Ollama, LM Studio, LocalAI, llama.cpp, vLLM
external APIs such as OpenAI, Anthropic, Mistral, Groq, OpenRouter
AI clients such as Claude Code, Codex, Cursor, ChatGPT adapters
tools, scripts, workflows, and internal services

Execution returns:

raw response
latency
token usage
provider metadata
errors
tool call results

3.8 Receipt Engine

The Receipt Engine creates an immutable trace for each request.

Receipts include:

request id
input summary or redacted input
trust decisions
policy decisions
memory refs
compression results
selected model/provider/tool
fallback chain
response summary or full response depending on policy
token usage
cost estimate
timestamps
errors
blocked routes

Receipts are written once to the receipts database and are never modified afterward.
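
One common way to make tampering with stored receipts detectable is to chain them by hash, so each receipt commits to the one before it. The sketch below assumes that approach; the document does not state how immutability is enforced, and `make_receipt`, `verify_chain`, and `GENESIS_HASH` are hypothetical names.

```python
import hashlib
import json
from types import MappingProxyType

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first receipt


def _digest(record: dict) -> str:
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_receipt(request_id: str, body: dict, previous_hash: str):
    """Build a read-only receipt whose hash covers its content and its parent."""
    record = {"request_id": request_id, "previous_hash": previous_hash, **body}
    record["hash"] = _digest(record)  # computed before the hash key is added
    return MappingProxyType(record)


def verify_chain(receipts) -> bool:
    """Check every hash and every link back to the previous receipt."""
    previous = GENESIS_HASH
    for receipt in receipts:
        content = {k: v for k, v in receipt.items() if k != "hash"}
        if receipt["previous_hash"] != previous or _digest(content) != receipt["hash"]:
            return False
        previous = receipt["hash"]
    return True


first = make_receipt("req-1", {"route": "local-llm"}, GENESIS_HASH)
second = make_receipt("req-2", {"route": "public-api"}, first["hash"])
```

`MappingProxyType` only prevents accidental in-process mutation; the hash chain is what lets an auditor detect edits made to stored receipts.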

3.9 Memory Layer

Memory is separate from execution but connected to routing and compression.

Memory types:

Project memory
- task history
- decisions
- context
- handoffs
Global memory
- shared knowledge
- user/team preferences
- reusable runbooks
Route memory
- routing decisions
- success and failure patterns
- optimization feedback
Semantic cache
- previous responses
- embedding lookup
- prompt/result reuse

Memory is:

append-only by default
queryable
versioned where possible
used during routing and compression
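
The append-only and versioned properties can be illustrated with a small in-memory store in which a write never overwrites an earlier entry. The class `AppendOnlyMemory` and its methods are assumptions for this sketch and say nothing about the real memory backend.

```python
from typing import Any, Optional


class AppendOnlyMemory:
    """In-memory store: entries are only ever appended, never edited."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, int, Any]] = []  # (key, version, value)

    def append(self, key: str, value: Any) -> int:
        """Add a new version for key and return its version number."""
        version = sum(1 for k, _, _ in self._entries if k == key) + 1
        self._entries.append((key, version, value))
        return version

    def latest(self, key: str) -> Optional[Any]:
        """Return the newest value for key, or None if the key is unknown."""
        for k, _, value in reversed(self._entries):
            if k == key:
                return value
        return None

    def history(self, key: str) -> list[tuple[int, Any]]:
        """Return every (version, value) pair for key, oldest first."""
        return [(v, value) for k, v, value in self._entries if k == key]


memory = AppendOnlyMemory()
memory.append("deploy-window", "friday")
memory.append("deploy-window", "monday")
```

Reading `latest` gives routing and compression the current value, while `history` preserves the earlier decisions for audit or replay.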

3.10 Route Reflector Memory

Route Reflector Memory is specialized route memory inspired by BGP route reflectors.

Functions:

learns optimal AI routes
shares routing knowledge across clients
improves future routing decisions
records fallback success and failures
contributes to Provider Router decisions

Examples:

code debugging works best through Codex plus local validation
private infra diagnostics should route to local models
long-form reasoning performs better on selected external models
JSON extraction for project X has best success on model Y
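
A minimal sketch of how such route knowledge might be accumulated: count successes and attempts per (task type, route) pair and suggest the route with the best observed success rate. The `RouteMemory` class and the `min_attempts` threshold are assumptions, not part of the design as written.

```python
from collections import defaultdict
from typing import Optional


class RouteMemory:
    """Tracks success counts per (task type, route) pair."""

    def __init__(self) -> None:
        # (task_type, route) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, task_type: str, route: str, success: bool) -> None:
        stats = self._stats[(task_type, route)]
        stats[0] += int(success)
        stats[1] += 1

    def best_route(self, task_type: str, min_attempts: int = 2) -> Optional[str]:
        """Return the route with the highest success rate, if enough data exists."""
        best, best_rate = None, -1.0
        for (seen_type, route), (successes, attempts) in self._stats.items():
            if seen_type != task_type or attempts < min_attempts:
                continue
            rate = successes / attempts
            if rate > best_rate:
                best, best_rate = route, rate
        return best


routes = RouteMemory()
for outcome in (True, True, True):
    routes.record("json-extraction", "model-y", outcome)
for outcome in (True, False):
    routes.record("json-extraction", "model-z", outcome)
```

The `min_attempts` guard keeps a single lucky result from steering every later request to an untested route.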

4. Data Flow

1. The client sends a request.
2. The Trust Router classifies the request and assigns a trust level.
3. The Policy Engine filters the allowed routes.
4. The Memory Layer is queried for context and prior route knowledge.
5. The Compression Engine optimizes the payload.
6. The Provider Router selects an execution target and a fallback chain.
7. The Execution Layer processes the request.
8. The response is returned to the client.
9. The Receipt Engine generates an immutable receipt.
10. The Memory Layer is updated with the outcome.
11. Route Reflector Memory updates its routing knowledge.
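
The ordering of the pre-response stages can be sketched as a list of functions applied in sequence, with a trace that a receipt could later record. Every stage body below is a stub returning fixed values, and all names are hypothetical; the point is only the order of the calls.

```python
from typing import Callable

Context = dict
Stage = Callable[[Context], Context]


def trust_router(ctx: Context) -> Context:
    return {**ctx, "trust": 0.7, "classification": "general"}


def policy_engine(ctx: Context) -> Context:
    return {**ctx, "allowed_routes": ["local-llm"]}


def memory_query(ctx: Context) -> Context:
    return {**ctx, "memory": []}


def compression(ctx: Context) -> Context:
    return {**ctx, "payload": ctx["request"].strip()}


def provider_router(ctx: Context) -> Context:
    return {**ctx, "target": ctx["allowed_routes"][0]}


def execution(ctx: Context) -> Context:
    return {**ctx, "response": f"[{ctx['target']}] echo: {ctx['payload']}"}


STAGES: list[tuple[str, Stage]] = [
    ("trust_router", trust_router),
    ("policy_engine", policy_engine),
    ("memory_query", memory_query),
    ("compression", compression),
    ("provider_router", provider_router),
    ("execution", execution),
]


def handle(request: str) -> Context:
    """Run each stage in order and record which stages ran."""
    ctx: Context = {"request": request, "trace": []}
    for name, stage in STAGES:
        ctx = stage(ctx)
        ctx["trace"].append(name)
    return ctx
```

Receipt generation and the memory updates would run after `handle` returns, using the accumulated context and trace.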

5. Modes Of Operation

Live Mode

real execution
full routing active
receipts stored
memory updated

Simulation Mode

no real execution
shows trust decisions
shows policy decisions
shows selected route and fallbacks
estimates cost and tokens
useful for testing policies

Offline Mode

only local models allowed
no external provider calls
remote sync disabled unless explicitly allowed
receipts marked as offline
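
The three modes amount to a gate in front of the Execution Layer. A minimal sketch, assuming a hypothetical `Mode` enum and a `plan_execution` helper that returns one of three string outcomes:

```python
from enum import Enum


class Mode(Enum):
    LIVE = "live"
    SIMULATION = "simulation"
    OFFLINE = "offline"


def plan_execution(mode: Mode, target_is_external: bool) -> str:
    """Decide what happens to a routed request under the given mode."""
    if mode is Mode.SIMULATION:
        # Routing decisions are shown, but nothing is executed.
        return "dry-run"
    if mode is Mode.OFFLINE and target_is_external:
        # Offline mode never calls external providers.
        return "blocked"
    return "execute"
```

For example, `plan_execution(Mode.OFFLINE, target_is_external=False)` returns `"execute"`, because local models remain usable offline.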

6. Control Functions

The system supports:

trace request
replay request
force route
override policy as admin
inspect receipts
inspect memory
simulate routing
compare routes
inspect provider availability
inspect route memory

7. Storage

Required storage components:

receipts database: immutable logs
memory database: structured records plus vector embeddings
policy definitions
routing history
route reflector memory
semantic cache
reproducible run artifacts

Recommended default:

SQLite for personal mode
Postgres plus pgvector for team/server mode
Git/Gitea as durable memory sync and audit transport
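
For the SQLite personal mode, one way to approximate an immutable receipts table is to reject updates and deletes with triggers. The table layout and trigger names below are assumptions for this sketch; the project's actual schema is not given in this document.

```python
import sqlite3

SCHEMA = """
CREATE TABLE receipts (
    request_id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL,
    body       TEXT NOT NULL
);

-- Reject any attempt to change or remove a stored receipt.
CREATE TRIGGER receipts_no_update BEFORE UPDATE ON receipts
BEGIN
    SELECT RAISE(ABORT, 'receipts are immutable');
END;

CREATE TRIGGER receipts_no_delete BEFORE DELETE ON receipts
BEGIN
    SELECT RAISE(ABORT, 'receipts are immutable');
END;
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    ("req-1", "2026-01-01T00:00:00Z", '{"route": "local-llm"}'),
)
conn.commit()
```

Triggers guard against accidental writes through normal SQL, but anyone with file access can still replace the database, so they complement an external audit trail rather than substitute for one.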

8. Metrics

The system tracks:

token usage
compression ratio
cache hit rate
latency per provider
cost per request
routing success rate
fallback rate
trust level distribution
blocked route count
policy override count
agent reputation
benchmark scores
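
Several of these metrics are simple ratios. The definitions below are assumptions, since the document does not define them: compression ratio is taken here as the fraction of tokens saved, and the other two as plain proportions.

```python
def compression_ratio(tokens_before: int, tokens_after: int) -> float:
    """Fraction of tokens removed by compression (0.0 means no savings)."""
    if tokens_before == 0:
        return 0.0
    return 1 - tokens_after / tokens_before


def cache_hit_rate(hits: int, misses: int) -> float:
    """Share of cache lookups that were served from the semantic cache."""
    total = hits + misses
    return hits / total if total else 0.0


def fallback_rate(requests_total: int, requests_with_fallback: int) -> float:
    """Share of requests that needed at least one fallback route."""
    return requests_with_fallback / requests_total if requests_total else 0.0
```

Each function returns 0.0 when there is no data, which avoids division by zero on a freshly started gateway.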

9. Security Model

strict policy enforcement before external calls
data classification at entry
local-first routing is supported
no sensitive data leaves the system when policy blocks it
no secret sync to memory
audit trail via receipts
consent ledger for tool, memory, and provider permissions
safe config writer for external tool setup

10. Extensibility

The system supports:

new providers
new local models
new tools
new MCP resources
new policy rules
custom routing logic
custom memory backends
custom benchmarks
custom data source connectors

11. Core Idea

LLM Gateway is a deterministic, observable, policy-driven routing layer for AI execution with memory and cost control.
