Adds a second defense layer between Layer-1 regex (62 patterns) and the
existing Layer-3 llm_judge. Calls a FastAPI sidecar running on the Mac
Studio (port 9091, MPS) that wraps protectai/deberta-v3-base-prompt-
injection-v2 — public model, no auth needed, ~50-400ms inference.
modules/prompt-guard-client.ts:
- callPromptGuard(input) opportunistic, never throws
- isPromptGuardConfigured() true if PROMPT_GUARD_URL is set
- getPromptGuardThreshold() default 0.85
- getPromptGuardMinLen() default 16 chars (skip tiny inputs)
routes/completion.ts:
- New Layer-2 block between regex scan and llm_judge: when Layer-1
didn't detect and input is long enough, ask the sidecar. If sidecar
returns INJECTION with score >= threshold, return HTTP 422 with
error.prompt_guard payload (score + latency).
- Fail-open: sidecar timeout/error logs a warning and the request
falls through to llm_judge / cache / model — never blocks legitimate
traffic due to sidecar issues.
Env (set in ecosystem.config.js):
PROMPT_GUARD_URL http://192.168.178.213:9091
PROMPT_GUARD_THRESHOLD 0.70 (lowered from 0.85 after empirical testing)
PROMPT_GUARD_TIMEOUT 1500 ms
Sidecar code lives at:
~/magatama-llm/prompt-guard-sidecar/server.py (Mac Studio)
launched via ~/Library/LaunchAgents/org.fichtmueller.prompt-guard-sidecar.plist
Smoke tests after deploy:
Layer-1 caught: German "ignoriere..." -> HTTP 422
Layer-2 caught: English "pretend no restrict.."-> HTTP 422 (pg_score 0.9999)
Layer-2 caught: Bangla-romanized -> HTTP 422 (Layer-1 actually)
Benign: "Explain DNS in 2 sentences" -> HTTP 200
Closes the multilingual bypass gap. Previously covered EN/DE/FR/ES/IT/RU/ZH/JA.
Now also: Bangla, Hindi, Arabic, Hebrew, Persian, Turkish, Vietnamese, Thai,
Korean, Polish, Dutch, Indonesian, Tagalog, Swahili.
Plus a universal non-Latin-script soft-flag pattern (severity=medium) that
catches ≥20 chars of Arabic/Bengali/Devanagari/Hebrew/Thai/Hangul/Han/
Hiragana/Katakana/Cyrillic/Tamil/Telugu/Gujarati/Gurmukhi/Myanmar/Khmer/
Lao/Tibetan/Georgian/Armenian/Sinhala — surfaces in scan result without
auto-blocking, so legitimate non-Latin prompts pass while the operator
can route them to llm_judge for deep inspection.
Pattern-engineering notes:
- Devanagari / Bengali / Hebrew need optional matra/suffix tolerance
- Turkish needs \p{L} instead of \w because ı/ş/ç fall outside ASCII \w
- Persian (SOV) needs both VSO and SOV order alternation
- Hebrew needs מ/ב/כ/ל preposition prefix tolerance
- Tagalog needs optional ang/sa article between verb and noun
Smoke-tested 14/14 languages → all HTTP 422 blocked.
Negative-tested 3 benign non-Latin prompts (jp-weather, ar-greeting,
th-thanks) → all HTTP 200 pass. Zero false positives.
Total active patterns: 62 across 6 categories.