Closes the multilingual bypass gap. Previously covered EN/DE/FR/ES/IT/RU/ZH/JA.
Now also: Bangla, Hindi, Arabic, Hebrew, Persian, Turkish, Vietnamese, Thai,
Korean, Polish, Dutch, Indonesian, Tagalog, Swahili.
Plus a universal non-Latin-script soft-flag pattern (severity=medium) that
catches ≥20 chars of Arabic/Bengali/Devanagari/Hebrew/Thai/Hangul/Han/
Hiragana/Katakana/Cyrillic/Tamil/Telugu/Gujarati/Gurmukhi/Myanmar/Khmer/
Lao/Tibetan/Georgian/Armenian/Sinhala — surfaces in scan result without
auto-blocking, so legitimate non-Latin prompts pass while the operator
can route them to llm_judge for deep inspection.
Pattern-engineering notes:
- Devanagari / Bengali / Hebrew need optional matra/suffix tolerance
- Turkish needs \p{L} instead of \w because ı/ş/ç fall outside ASCII \w
- Persian (SOV) needs both VSO and SOV order alternation
- Hebrew needs מ/ב/כ/ל preposition prefix tolerance
- Tagalog needs optional ang/sa article between verb and noun
Smoke-tested 14/14 languages → all HTTP 422 blocked.
Negative-tested 3 benign non-Latin prompts (jp-weather, ar-greeting,
th-thanks) → all HTTP 200 pass. Zero false positives.
Total active patterns: 62 across 6 categories.