Rene Fichtmueller 6f5dd81d7a sec(gateway): +15 languages + non-Latin script detector (62 patterns total)
Closes the multilingual bypass gap. Previously covered EN/DE/FR/ES/IT/RU/ZH/JA.
Now also: Bangla, Hindi, Arabic, Hebrew, Persian, Turkish, Vietnamese, Thai,
Korean, Polish, Dutch, Indonesian, Tagalog, Swahili.

Plus a universal non-Latin-script soft-flag pattern (severity=medium) that
catches ≥20 chars of Arabic/Bengali/Devanagari/Hebrew/Thai/Hangul/Han/
Hiragana/Katakana/Cyrillic/Tamil/Telugu/Gujarati/Gurmukhi/Myanmar/Khmer/
Lao/Tibetan/Georgian/Armenian/Sinhala — surfaces in scan result without
auto-blocking, so legitimate non-Latin prompts pass while the operator
can route them to llm_judge for deep inspection.
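
A minimal sketch of how such a soft-flag check could work, for illustration
only. The function name, the result fields, and the choice to count matching
characters across the whole prompt (rather than one contiguous run) are
assumptions, not the gateway's actual code. The ranges are plain Unicode
blocks for the scripts listed above, which only approximates true script
membership.

```python
import re

# Unicode block ranges approximating the scripts named in the commit message.
# Assumption: block-level ranges are close enough for a soft flag.
NON_LATIN_RANGES = [
    (0x0400, 0x04FF),  # Cyrillic
    (0x0530, 0x058F),  # Armenian
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0900, 0x097F),  # Devanagari
    (0x0980, 0x09FF),  # Bengali
    (0x0A00, 0x0A7F),  # Gurmukhi
    (0x0A80, 0x0AFF),  # Gujarati
    (0x0B80, 0x0BFF),  # Tamil
    (0x0C00, 0x0C7F),  # Telugu
    (0x0D80, 0x0DFF),  # Sinhala
    (0x0E00, 0x0E7F),  # Thai
    (0x0E80, 0x0EFF),  # Lao
    (0x0F00, 0x0FFF),  # Tibetan
    (0x1000, 0x109F),  # Myanmar
    (0x10A0, 0x10FF),  # Georgian
    (0x1780, 0x17FF),  # Khmer
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # Han (CJK Unified Ideographs)
    (0xAC00, 0xD7AF),  # Hangul syllables
]

# Build one character class covering every range above.
_CHAR_CLASS = "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in NON_LATIN_RANGES)
NON_LATIN_RE = re.compile(f"[{_CHAR_CLASS}]")


def scan_non_latin(text, threshold=20):
    """Return a soft-flag finding if text has >= threshold non-Latin chars.

    The finding is advisory: it never blocks by itself (hypothetical shape).
    """
    count = len(NON_LATIN_RE.findall(text))
    if count >= threshold:
        return {
            "pattern": "non_latin_script",
            "severity": "medium",
            "action": "flag",  # surfaced in the scan result, not auto-blocked
            "count": count,
        }
    return None
```

Under these assumptions a short greeting stays below the threshold and
produces no finding, while a longer non-Latin prompt is flagged but still
allowed through for the operator to route onward.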

Pattern-engineering notes:
  - Devanagari / Bengali / Hebrew need optional matra/suffix tolerance
  - Turkish needs \p{L} instead of \w because ı/ş/ç fall outside ASCII \w
  - Persian (SOV) needs alternation covering both VSO and SOV word order
  - Hebrew needs מ/ב/כ/ל preposition prefix tolerance
  - Tagalog needs optional ang/sa article between verb and noun
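
The Turkish note can be illustrated with a small sketch. The gateway's regex
engine is not stated here, so this uses Python's standard `re` module as a
stand-in: `re.ASCII` reproduces the ASCII-only `\w` behaviour, and because the
stdlib has no `\p{L}`, the class `[^\W\d_]` is used as an approximate
equivalent. The helper name and sample word are illustrative only.

```python
import re

# Approximation of \p{L}+ in Python's stdlib `re`:
# Unicode word characters minus digits and underscore.
LETTERS = r"[^\W\d_]+"


def matches_whole_word(pattern, word, ascii_only=False):
    """True if `pattern` matches the entire word."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(pattern, word, flags) is not None


word = "talimatları"  # Turkish word containing the dotless ı

# ASCII-only \w stops at ı, so the whole word does not match.
ascii_w = matches_whole_word(r"\w+", word, ascii_only=True)

# A Unicode letter class covers ı/ş/ç, so the whole word matches.
unicode_letters = matches_whole_word(LETTERS, word)
```

The same reasoning applies to any pattern that anchors on word boundaries or
word characters: with ASCII-only semantics a Turkish keyword is silently cut
at the first non-ASCII letter and the pattern never fires.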

Smoke-tested 14/14 languages → all HTTP 422 blocked.
Negative-tested 3 benign non-Latin prompts (jp-weather, ar-greeting,
th-thanks) → all HTTP 200 pass. Zero false positives.

Total active patterns: 62 across 6 categories.
2026-05-16 23:02:01 +02:00