Implements hardening based on sarendis56/Jailbreak_Detection_RCS (arXiv:2512.12069) and the Awesome-LVLM-Attack/Safety survey series.
- L0, CipherDecoder: FlipAttack, ROT13, Caesar (all 25 shifts), Morse, Leet speak, Pig Latin, and ASCII art detection with suspicion scoring.
- L2, SemanticContrastiveScanner: RCS-style harmful/benign bucket comparison via EmbeddingStore, 20 canonical jailbreak seeds, and a BoW embedding fallback for offline use.
- L6, ConversationTracker: multi-turn escalation patterns added for Crescendo (+0.35), Foot-in-the-Door (+0.40), and Jigsaw Puzzle (+0.45).
292/294 tests passing (the 2 failures are pre-existing ATLASMapper failures, unrelated to this change).
sarendis56 Jailbreak Research Reference
Cloned: 2026-04-04
Sources: github.com/sarendis56/{Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety}
Purpose: Map external LLM security research to ShieldX's 10-layer defense pipeline.
1. Jailbreak_Detection_RCS — Detection Approach
Paper: "Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring" arXiv: 2512.12069 | WashU + Texas A&M | Dec 2025
Core Method: Representational Contrastive Scoring (RCS)
The method operates on internal hidden-state representations of vision-language models rather than on surface-level text patterns. Two primary algorithms are implemented:
| Script | Method | Description |
|---|---|---|
| code/kcd.py | KCD (Key-layer Contrastive Difference) | Extracts hidden states at key layers and computes a contrastive score between safe and harmful representations |
| code/mcd.py | MCD (Multi-layer Contrastive Difference) | Aggregates contrastive signals across multiple transformer layers |
| code/hidden_detect_*.py | HiddenDetect baseline | Replication of ACL 2025 HiddenDetect; uses hidden-state monitoring with layer-selection heuristics |
| code/baseline_flava.py | FLAVA baseline | Facebook multimodal model used as an embedding-space comparison baseline |
Key Technical Insights
- Layer selection matters: Not all transformer layers carry equal jailbreak signal. KCD/MCD use heuristics to identify "safety-critical" layers (separate from token prediction layers).
- Contrastive scoring: Instead of classifying a single embedding, the method scores the distance between a prompt's representation and a reference set of known-safe vs. known-harmful examples. Higher contrast = higher jailbreak probability.
- Model-agnostic structure: Supports LLaVA-v1.6, Qwen2.5-VL (3B/7B), and InternVL3-8B; the feature extractor is swappable (feature_extractor*.py).
- Feature caching: feature_cache.py avoids redundant forward passes, which is critical for production latency.
- Multi-run aggregation: run_multiple_experiments.py runs experiments N times and aggregates the results, which reduces statistical variance in detection scores.
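The contrastive-scoring idea above can be illustrated with a minimal TypeScript sketch. It is not the RCS implementation: RCS works on hidden states extracted from the model, whereas this example assumes the representations are already available as plain number arrays (for instance, vectors held in an embedding store). The names `cosine`, `meanSimilarity`, and `contrastiveScore`, and the use of mean cosine similarity as the distance measure, are assumptions made for illustration.

```typescript
// Hypothetical sketch: names and the scoring formula are illustrative,
// not taken from the Jailbreak_Detection_RCS repository.
type Vec = readonly number[];

// Cosine similarity between two vectors; returns 0 if either has zero norm.
function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Average similarity of the query to a reference set (0 for an empty set).
function meanSimilarity(query: Vec, refs: readonly Vec[]): number {
  if (refs.length === 0) return 0;
  let total = 0;
  for (const ref of refs) total += cosine(query, ref);
  return total / refs.length;
}

// Positive score: the query sits closer to the harmful references.
// Negative score: the query sits closer to the benign references.
function contrastiveScore(
  query: Vec,
  harmful: readonly Vec[],
  benign: readonly Vec[],
): number {
  return meanSimilarity(query, harmful) - meanSimilarity(query, benign);
}
```

A caller would compare the returned score against a threshold tuned on held-out data. The score is relative rather than absolute, so a prompt that is far from both reference sets scores near zero instead of being forced into one class.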
Datasets Used for Evaluation
- JailbreakV-28K (requires form request)
- Standard LVLM safety benchmarks
ShieldX Integration Opportunity
This approach is directly applicable to ShieldX's L1 (Rule Engine + Entropy Scanner) layer for LLM self-evaluation and to a future L2 (Semantic/Embedding Layer) if ShieldX adds vision-language guard capabilities. The contrastive scoring logic could feed into EmbeddingStore.ts and PatternEvolver.ts in the learning module.
2. Awesome-LVLM-Attack — Key Attack Vectors
Paper: "A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends" arXiv: 2407.07403 | IEEE TNNLS 2025
Attack Taxonomy (4 Primary Categories)
2.1 Adversarial Attacks (Gradient-based, Pixel-level)
- Goal: Craft imperceptible image perturbations that cause model misbehavior
- Key methods: GCG-visual, VLATTACK, InstructTA, OT-Attack, AnyAttack
- Mechanism: Optimize pixel deltas using cross-prompt transferability (CroPA approach — one perturbation works across many prompts)
- ShieldX L0 relevance: CompressedPayloadDetector.ts and UnicodeNormalizer.ts address text-space analogues; a vision layer would need pixel-space anomaly detection
2.2 Jailbreak Attacks (Prompt-level, Semantic)
- Typographic attacks (FigStep): Embed harmful text inside images using typography — bypasses text-only filters since the content is visual, not textual
- Role-playing via images (Visual-RolePlay): Use images that depict personas/roles to bypass refusal
- Bi-modal adversarial prompts (BAP): Simultaneously attack image and text modalities
- IDEATOR: Uses the LVLM itself to generate jailbreak variations — self-attacking loop
- Safe+Safe=Unsafe: Compose multiple individually safe images to produce harmful output jointly
- ImgTrojan: Fine-tune model with a single poisoned image to create persistent backdoor
2.3 Prompt Injection (Cross-modal)
- Indirect instruction injection via image/audio: Embed instructions in images that override system prompts (Bagdasaryan et al., Cornell Tech)
- Cross-modal prompt injection (2025): Use one modality to inject into another's attention pathway
- Image Hijacks: Adversarial images that control generative model behavior at inference
2.4 Data Poisoning / Backdoor
- Shadowcast: Stealthy data poisoning against VLMs — poisons training data to insert backdoor
- TrojVLM, VL-Trojan, BadToken: Backdoor via trigger tokens in multimodal inputs
- Agent Smith: A single poisoned image can jailbreak one million multimodal agents exponentially fast (viral spreading via multi-agent memory)
- Physical backdoor: Real-world triggers (e.g. in autonomous driving scenarios)
ShieldX Layer Mapping — Attack Vectors
| Attack Category | Specific Technique | ShieldX Layer | Module |
|---|---|---|---|
| Adversarial image | CroPA cross-prompt transfer | L0 Preprocessing | CompressedPayloadDetector.ts |
| Typographic injection | FigStep, text-in-image | L1 Detection | RuleEngine.ts (pattern rules) |
| Role-play bypass | Visual-RolePlay, IDEATOR | L6 Behavioral | IntentMonitor.ts, ConversationTracker.ts |
| Bi-modal jailbreak | BAP | L1 + L6 | RuleEngine.ts + ContextIntegrity.ts |
| Prompt injection (indirect) | Image Hijacks, cross-modal | L7 MCP Guard | ToolPoisonDetector.ts, PrivilegeChecker.ts |
| Data poisoning/backdoor | Shadowcast, TrojVLM | L9 Supply Chain | SupplyChainVerifier.ts, ModelProvenanceChecker.ts |
| Multi-agent viral spread | Agent Smith | L7 MCP Guard | ToolChainGuard.ts, ResourceGovernor.ts |
| Resource exhaustion | Verbose Images (high-latency) | L7 MCP Guard | ResourceGovernor.ts |
| Jailbreak via composition | Safe+Safe=Unsafe | L6 Behavioral | ContextIntegrity.ts |
3. Awesome-Jailbreak-on-LLMs — Key Attack Vectors (Text LLMs)
Papers: GuardReasoner (arXiv 2501.18492), FlipAttack (ICML'25), GuardReasoner-VL (NeurIPS'25)
Attack Taxonomy (Text-only LLMs)
3.1 Black-box Attacks
- FlipAttack (ICML'25): Flip character order / words to bypass safety filters — trivially breaks keyword-based detection
- StructTransform: Convert queries to structured formats (JSON, tables, code) to bypass alignment
- ArtPrompt (ACL'24): ASCII art encoding of harmful content — bypasses text filters entirely
- DAN / AutoDAN: Role-play as "DAN" (Do Anything Now) — persistent persona override
- Many-shot jailbreaking (Anthropic, 2024): Provide many few-shot examples of compliance to override refusal
- Crescendo: Multi-turn escalation — starts benign, slowly escalates to harmful request
- PAIR (NeurIPS'24): LLM-generated jailbreak prompts in 20 queries via automated red teaming
- CodeAttack (ACL'24): Embed requests in code completion context
- Virtual Context: Special token injection to manipulate context window
- Emoji Attack (ICML'25): Use emojis to confuse classifier/judge LLMs
- SQL Injection Jailbreak: Structural attack exploiting SQL-like parsing in prompts
- DeepInception (EMNLP'24): Nested fictional scenarios ("you are in a story where...")
- Cipher-based (CipherChat): Encode harmful requests in ROT13, Base64, Morse, etc.
- Low-resource language attacks: Use obscure languages that have weaker safety alignment
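Several of the black-box attacks above (FlipAttack, cipher-based encodings) hide a request behind a reversible transformation, so one possible preprocessing step is to generate candidate decodings and run the usual keyword checks on each. The sketch below is a minimal illustration under stated assumptions: `caesarShift`, `candidateDecodings`, and `matchingDecodings` are hypothetical helpers, not ShieldX APIs, and only character flipping, word-order flipping, and the 25 Caesar shifts (ROT13 is shift 13) are covered.

```typescript
// Hypothetical helpers for illustration; not part of ShieldX.

// Rotate ASCII letters by `shift` positions, leaving other characters alone.
function caesarShift(text: string, shift: number): string {
  return text.replace(/[A-Za-z]/g, (ch: string): string => {
    const base = ch <= "Z" ? 65 : 97;
    const rotated = (((ch.charCodeAt(0) - base + shift) % 26) + 26) % 26;
    return String.fromCharCode(base + rotated);
  });
}

interface Decoding {
  name: string;
  text: string;
}

// Produce the reversible transformations we want to try on an input.
function candidateDecodings(text: string): Decoding[] {
  const out: Decoding[] = [
    { name: "flip-chars", text: text.split("").reverse().join("") },
    { name: "flip-words", text: text.split(" ").reverse().join(" ") },
  ];
  for (let shift = 1; shift <= 25; shift++) {
    out.push({ name: `caesar-${shift}`, text: caesarShift(text, shift) });
  }
  return out;
}

// Return the names of decodings whose output contains any keyword.
function matchingDecodings(text: string, keywords: readonly string[]): string[] {
  return candidateDecodings(text)
    .filter((d) => {
      const lower = d.text.toLowerCase();
      return keywords.some((k) => lower.indexOf(k.toLowerCase()) !== -1);
    })
    .map((d) => d.name);
}
```

For example, `matchingDecodings("snoitcurtsni suoiverp erongi", ["ignore previous"])` reports the `flip-chars` decoding. Keyword matching on decoded text is only a first filter; a real preprocessor would more likely pass each candidate decoding back through the full rule pipeline.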
3.2 White-box Attacks
- GCG (Universal and Transferable Adversarial Attacks): Gradient-based suffix optimization — finds adversarial suffixes that transfer across models
- AutoDAN (ICLR'24): Stealthy GCG — generates human-readable jailbreak suffixes
- Refusal Direction (arXiv'24): "Refusal in LLMs is mediated by a single direction" — ablate that direction in activation space to disable refusal
3.3 Multi-turn Attacks
- Foot-in-the-Door: Start with small compliant request, escalate gradually
- Jigsaw Puzzles: Split harmful question across multiple turns so no single turn triggers detection
- Crescendo (Microsoft): Multi-turn escalation via seeming-harmless steps
- Attention Shifting: Multi-turn manipulation of model attention to suppress refusal
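Multi-turn attacks such as Crescendo are invisible to per-message checks because each turn looks acceptable on its own; the signal is the trend across turns. The following sketch shows one way such a trend could be scored. It assumes a per-turn risk value in [0, 1] is already produced by some upstream scanner, and the names `isEscalating` and `sessionSuspicion`, the 0.6 rising-fraction cutoff, and the 0.3 minimum rise are all hypothetical. The 0.35 default mirrors the Crescendo increment listed in the change summary at the top of this document; how ConversationTracker actually applies that increment is not shown in the source.

```typescript
// Hypothetical sketch; thresholds are illustrative, not tuned values.

// True when risk mostly rises turn over turn and ends well above its start.
function isEscalating(turnRisks: readonly number[], minRise = 0.3): boolean {
  if (turnRisks.length < 3) return false;
  let rises = 0;
  for (let i = 1; i < turnRisks.length; i++) {
    if ((turnRisks[i] ?? 0) > (turnRisks[i - 1] ?? 0)) rises++;
  }
  const first = turnRisks[0] ?? 0;
  const last = turnRisks[turnRisks.length - 1] ?? 0;
  const mostlyRising = rises / (turnRisks.length - 1) >= 0.6;
  return mostlyRising && last - first >= minRise;
}

// Session score: the peak per-turn risk, plus a bonus if the session escalates.
function sessionSuspicion(
  turnRisks: readonly number[],
  crescendoBonus = 0.35,
): number {
  const peak = turnRisks.reduce((max, risk) => Math.max(max, risk), 0);
  const bonus = isEscalating(turnRisks) ? crescendoBonus : 0;
  return Math.min(1, peak + bonus);
}
```

With risks `[0.1, 0.2, 0.3, 0.5]` no single turn is alarming, yet the session score rises to about 0.85 because the trajectory escalates. An oscillating session such as `[0.5, 0.1, 0.5, 0.1]` receives no bonus.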
3.4 RAG-based Attacks
- Pandora: Poison retrieval database to inject adversarial context into RAG responses
- UnleashingWorms: Escalate RAG poisoning to extract data and spread to other agents
3.5 Defense Methods Catalogued
- GuardReasoner (ICLR Workshop'25): Reasoning-based safeguards — chain-of-thought for safety decisions
- LLaMA Guard 3, ShieldGemma, WildGuard: Guard model approaches (dedicated classifier LLMs)
- SMOOTHLLM: Randomized smoothing — perturb input N times, aggregate decisions
- Hidden State Filtering (HSF): Monitor hidden states to detect anomalies before generation
- GradSafe (ACL'24): Safety-critical gradient analysis to detect unsafe prompts
- SafeDecoding (ACL'24): Safety-aware decoding — bias token generation toward safe tokens
- Backtranslation defense: Translate to another language and back to disrupt adversarial suffixes
- PARDEN (ICML'24): Repetition-based defense — ask model to repeat the query, check consistency
- Intention Analysis (IA): Classify intent before responding
- Self-Reminder: System prompt self-reminder about safety guidelines
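The randomized-smoothing idea listed above (perturb the input N times, aggregate the decisions) can be sketched in a few lines. This is a simplified illustration and not the SMOOTHLLM reference code: the `Judge` callback stands in for whatever model or classifier produces an unsafe/safe verdict, and `makeRng`, `perturb`, and `smoothedVerdict`, along with the default of 9 copies and a 10% character swap rate, are assumptions chosen for the example. A small seeded generator is used so that runs are reproducible.

```typescript
// Hypothetical sketch of perturb-and-vote; not the SMOOTHLLM implementation.
type Judge = (prompt: string) => boolean; // true means "judged unsafe"

// Seeded linear congruential generator returning values in [0, 1).
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Replace each character with a random lowercase letter with probability `rate`.
function perturb(prompt: string, rate: number, rng: () => number): string {
  const alphabet = "abcdefghijklmnopqrstuvwxyz";
  return prompt
    .split("")
    .map((ch) =>
      rng() < rate ? alphabet.charAt(Math.floor(rng() * alphabet.length)) : ch,
    )
    .join("");
}

// Judge several perturbed copies and return the majority verdict.
function smoothedVerdict(
  prompt: string,
  judge: Judge,
  copies = 9,
  rate = 0.1,
  seed = 1,
): boolean {
  const rng = makeRng(seed);
  let unsafeVotes = 0;
  for (let i = 0; i < copies; i++) {
    if (judge(perturb(prompt, rate, rng))) unsafeVotes++;
  }
  return unsafeVotes * 2 > copies;
}
```

The intuition is that optimized adversarial suffixes tend to be brittle, so small random edits break the attack in most copies while leaving an ordinary request recognizable. The cost is one judge call per copy, which matters for latency budgets.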
ShieldX Layer Mapping — Text Attack Vectors
| Attack Category | Specific Technique | ShieldX Layer | Module |
|---|---|---|---|
| Character/encoding obfuscation | FlipAttack, ArtPrompt, Cipher | L0 Preprocessing | UnicodeNormalizer.ts, TokenizerNormalizer.ts |
| Structural encoding | StructTransform, CodeAttack, SQL Injection | L0 Preprocessing | CompressedPayloadDetector.ts |
| Keyword evasion (emoji) | Emoji Attack | L0 Preprocessing | TokenizerNormalizer.ts |
| Role-play / DAN | AutoDAN, DAN, DeepInception | L1 Detection | RuleEngine.ts (role-play rules) |
| Token injection | Virtual Context, Special Tokens | L1 Detection | RuleEngine.ts, EntropyScanner.ts |
| Many-shot / few-shot | Many-shot jailbreaking (MSJ) | L6 Behavioral | ConversationTracker.ts, SessionProfiler.ts |
| Multi-turn escalation | Crescendo, Foot-in-Door, Jigsaw | L6 Behavioral | ConversationTracker.ts, ContextIntegrity.ts, AnomalyDetector.ts |
| Gradient suffix (white-box) | GCG, AutoDAN, I-GCG | L1 Detection | EntropyScanner.ts (entropy spike) |
| RAG poisoning | Pandora, UnleashingWorms | L8 Validation | RAGShield.ts, ScopeValidator.ts |
| Attention shifting | Multi-turn attention manipulation | L6 Behavioral | ContextDriftDetector.ts |
| Refusal ablation | Single-direction refusal bypass | Future L2 | Needs hidden-state layer (see RCS above) |
| Low-resource language | Multilingual jailbreaks | L0 Preprocessing | UnicodeNormalizer.ts |
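The table maps gradient-optimized suffixes (GCG and variants) to an entropy-spike check. One rough way to express that check is to compare the character-level Shannon entropy of the end of a prompt with the entropy of the text before it. The sketch below is an assumption about how such a check could look, not a description of EntropyScanner.ts: `shannonEntropy` and `tailEntropyGap` are hypothetical names, and the 40-character tail length is arbitrary.

```typescript
// Hypothetical sketch; not the EntropyScanner.ts implementation.

// Shannon entropy in bits per character over the characters of `text`.
function shannonEntropy(text: string): number {
  if (text.length === 0) return 0;
  const counts: Record<string, number> = {};
  for (const ch of text.split("")) {
    counts[ch] = (counts[ch] ?? 0) + 1;
  }
  let entropy = 0;
  for (const key of Object.keys(counts)) {
    const p = (counts[key] ?? 0) / text.length;
    entropy -= (p * Math.log(p)) / Math.LN2;
  }
  return entropy;
}

// Entropy of the last `tailLength` characters minus entropy of the rest.
// Returns 0 when the prompt is too short to split.
function tailEntropyGap(prompt: string, tailLength = 40): number {
  if (prompt.length <= tailLength) return 0;
  const head = prompt.slice(0, prompt.length - tailLength);
  const tail = prompt.slice(prompt.length - tailLength);
  return shannonEntropy(tail) - shannonEntropy(head);
}
```

Character entropy is a crude proxy: natural-language text in a long prompt can also have high character entropy, and a readable suffix will not stand out at all. A perplexity score from a language model is a common alternative signal for the same purpose, at higher cost.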
4. Awesome-LVLM-Safety — Key Defense Patterns
Paper: "A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations" arXiv: 2502.14881
Defense Taxonomy
4.1 Training-Phase Defenses
- Safety Fine-Tuning (VLGuard, SPA-VL): Curate safety preference datasets, fine-tune with RLHF/DPO
- Adversarial Training (ASTRA, DREAM): Include adversarial examples in fine-tuning
- Safe RLHF-V: Multimodal extension of RLHF with explicit safety constraints
- Machine Unlearning: Remove harmful knowledge without full retraining (Single Image Unlearning)
- Robust CLIP / Sim-CLIP: Adversarially fine-tune vision encoder to resist perturbations
- Backdoor Cleaning (2025 NeurIPS): Remove backdoors without external guidance during fine-tuning
4.2 Inference-Phase Defenses
- ECSO (Eyes Closed, Safety On): Convert image to text description before processing — removes adversarial visual features
- AdaShield: Adaptive shield prompting — dynamically inject safety prompts based on input structure
- HiddenDetect (ACL'25): Monitor hidden states at safety-critical layers during inference
- RCS (this repo, arXiv 2512.12069): Representational contrastive scoring for jailbreak detection
- JailDAM (COLM'25): Jailbreak detection with adaptive memory — stores representations of known attacks
- MirrorCheck: Adversarial defense via input mirroring and comparison
- CIDER (EMNLP'24): Cross-modality information check — verify consistency between image and text signals
- PIP (MM'24): Use attention patterns of irrelevant probe questions to detect adversarial inputs
- ETA (ICLR'25): Evaluate-then-align — runtime safety evaluation before generation
- CoCA: Constitutional calibration — realign safety-awareness at inference via constitutional rules
- VLMGuard-R1 (2025): Reasoning-driven prompt optimization for proactive safety
- OmniGuard (2025): Unified omni-modal guardrails with deliberate reasoning
- InferAligner: Cross-model guidance for harmlessness — use a reference safe model to steer generation
- BlueSuffix (ICLR'25): Adversarial blue-teaming — train model to be robust against jailbreaks
4.3 Guard Models
- LLaMA Guard 3 Vision (Meta): Dedicated vision-language safety classifier
- GuardReasoner-VL (NeurIPS'25): Reasoning-based guard with reinforced chain-of-thought
- LLavaGuard (ICML'25): VLM-based dataset curation and safety assessment
- VLMGuard: Unlabeled data-based defense against malicious prompts
- UniGuard: Universal safety guardrail across modalities
4.4 Evaluation Benchmarks
- MM-SafetyBench (ECCV'24): Multimodal safety evaluation benchmark
- JailBreakV-28K (COLM'24): 28K multimodal jailbreak samples
- MMJ-Bench: Comprehensive jailbreak evaluation for MLLMs
- MLLMGuard: Multi-dimensional safety evaluation suite
- MOSSBench (ICLR'25): Tests for oversensitivity to safe queries
ShieldX Layer Mapping — Defense Patterns
| Defense Pattern | Method | ShieldX Layer | Module | Gap / Enhancement |
|---|---|---|---|---|
| Hidden state monitoring | HiddenDetect, RCS | L1 Detection (future L2) | EntropyScanner.ts → needs hidden-state hook | Gap: No hidden-state layer yet |
| Adaptive memory for attacks | JailDAM | L9 Learning | EmbeddingStore.ts, PatternStore.ts | Already partially implemented |
| Constitutional rules at inference | CoCA, AdaShield | L8 Validation | IntentGuardValidator.ts, RoleIntegrityChecker.ts | Could add constitutional rule set |
| Cross-modal consistency check | CIDER, MirrorCheck | L6 Behavioral | ContextIntegrity.ts | Extends to vision inputs |
| Guard model (dedicated classifier) | LLaMA Guard 3 Vision, GuardReasoner-VL | L1 Detection | RuleEngine.ts → could add LLM-guard integration | Ollama-based guard model possible |
| Reasoning-based safety | GuardReasoner, VLMGuard-R1 | L1 Detection | Could add CoT safety evaluation via Ollama | Enhancement opportunity |
| Adversarial prompt blue-teaming | BlueSuffix, MART | L9 Learning | RedTeamEngine.ts, ActiveLearner.ts | Already designed for this |
| Input-to-text conversion (visual) | ECSO | L0 Preprocessing | Would need vision-to-text preprocessing hook | Future vision support |
| Robust vision encoder | Robust CLIP, Sim-CLIP | L9 Supply Chain | ModelProvenanceChecker.ts | Could verify encoder provenance |
| Unlearning harmful knowledge | Machine Unlearning | L9 Learning | Not implemented — research item | Gap |
5. ShieldX Layer-by-Layer Integration Summary
ShieldX's current 10-layer pipeline and how the research maps to each:
| Layer | Name | Current Modules | Research Enhancements from sarendis56 |
|---|---|---|---|
| L0 | Preprocessing | UnicodeNormalizer, TokenizerNormalizer, CompressedPayloadDetector | Add low-resource language normalization; cipher/encoding detection (ArtPrompt, FlipAttack patterns) |
| L1 | Rule-based Detection | RuleEngine, EntropyScanner, UnicodeScanner | Add GCG suffix entropy patterns; DAN/DeepInception rule templates; typographic prompt patterns (FigStep) |
| L2 | Semantic Layer | (EmbeddingStore in learning) | Priority gap: Add RCS-style hidden-state contrastive scoring for jailbreak detection |
| L3 | Classification | (via RuleEngine + behavioral) | Integrate GuardReasoner-style CoT classification via Ollama LLM guard call |
| L4 | Compliance | ATLASMapper, OWASPMapper, EUAIActReporter | Map new attack types to MITRE ATLAS; add JailBreakV-28K as test suite |
| L5 | Sanitization | InputSanitizer, OutputSanitizer, SpotlightingEncoder | Add vision-space canary injection for LVLM inputs; delimiter hardening against structural attacks |
| L6 | Behavioral | ConversationTracker, IntentMonitor, ContextDriftDetector, KillChainMapper | Add multi-turn escalation detection (Crescendo, Jigsaw, Foot-in-Door patterns); attention-shift detection |
| L7 | MCP Guard | PrivilegeChecker, ToolChainGuard, ResourceGovernor, ToolPoisonDetector | Add Agent Smith multi-agent viral spread detection; resource exhaustion from Verbose Images attack class |
| L8 | Validation | RAGShield, ScopeValidator, IntentGuardValidator, LeakageDetector | Add RAG poison detection (Pandora, UnleashingWorms patterns); cross-modal consistency check (CIDER) |
| L9 | Learning / Supply Chain | PatternEvolver, RedTeamEngine, ActiveLearner, SupplyChainVerifier | Feed JailBreakV-28K, MM-SafetyBench into PatternEvolver; add backdoor/trojan model detection (TrojVLM) |
6. Priority Action Items for ShieldX
High Priority
- Hidden-State Layer (L2): The RCS paper (this exact repo) demonstrates that surface-text detection misses many jailbreaks. ShieldX needs an embedding/hidden-state analysis layer. Implement via EmbeddingStore.ts + pgvector similarity search using known-harmful representation clusters.
- Multi-turn Escalation Detection (L6): Crescendo, Jigsaw Puzzles, and Foot-in-the-Door are proven against production systems. ConversationTracker.ts needs escalation-pattern scoring across session turns, not just per-message analysis.
- Cipher/Encoding Preprocessor (L0): FlipAttack, ArtPrompt, CodeChameleon, and CipherChat all bypass text-level rules. TokenizerNormalizer.ts should add cipher detection and normalization.
Medium Priority
- RAG Poison Shield Enhancement (L8): RAGShield.ts should include retrieval-result anomaly scoring based on Pandora and UnleashingWorms patterns.
- GuardReasoner-style CoT Check (L3): Add an optional Ollama-based reasoning guard step that evaluates intent via chain-of-thought before allowing high-risk operations.
- Agent Smith Pattern (L7): ToolChainGuard.ts should detect exponential replication patterns in multi-agent tool calls, a key emerging threat.
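As a rough illustration of what detecting an exponential replication pattern could mean in practice, the sketch below flags a series of per-round counts that keeps multiplying. Everything here is an assumption: the source does not specify what ToolChainGuard.ts would count, so `countsPerRound` is taken to be the number of agents (or tool calls) carrying the same payload in each successive round, and `looksExponential`, the 1.8 growth ratio, and the 3-round streak are hypothetical choices.

```typescript
// Hypothetical sketch; not the ToolChainGuard.ts implementation.

// True when the count grows by at least `minRatio` for `minRounds`
// consecutive rounds, which is the signature of geometric spreading.
function looksExponential(
  countsPerRound: readonly number[],
  minRatio = 1.8,
  minRounds = 3,
): boolean {
  let streak = 0;
  for (let i = 1; i < countsPerRound.length; i++) {
    const prev = countsPerRound[i - 1] ?? 0;
    const cur = countsPerRound[i] ?? 0;
    streak = prev > 0 && cur / prev >= minRatio ? streak + 1 : 0;
    if (streak >= minRounds) return true;
  }
  return false;
}
```

A doubling series such as `[1, 2, 4, 8]` is flagged, while linear growth such as `[1, 2, 3, 4, 5]` is not, because its ratio falls below the threshold after the first step. Requiring a streak rather than a single jump avoids flagging one-off fan-out, for example a legitimate broadcast to several agents.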
Research / Future
- Vision Input Support: ECSO, RCS, and CIDER all address multimodal inputs. If ShieldX expands to guard vision-language agents, these are the starting points.
- Machine Unlearning Integration: Not currently in ShieldX — would allow removal of specific harmful patterns without retraining the guard model.
7. Key Papers to Read
| Paper | Why | arXiv |
|---|---|---|
| RCS (Jailbreak_Detection_RCS) | Core detection method, directly integrable | 2512.12069 |
| HiddenDetect (ACL'25) | Best prior work on hidden-state detection | 2502.14744 |
| Agent Smith (ICML'24) | Multi-agent viral spread — critical for agentic ShieldX | 2402.08567 |
| GCG (Universal Adversarial Attacks) | Foundational white-box attack, defines entropy patterns | 2307.15043 |
| Crescendo (Microsoft Azure) | Multi-turn escalation — most realistic production threat | 2404.01833 |
| GuardReasoner (ICLR Workshop'25) | Best current reasoning-based guard | 2501.18492 |
| JailBreakV-28K (COLM'24) | Primary evaluation benchmark for multimodal | 2404.03027 |
| FlipAttack (ICML'25) | Trivially bypasses keyword detection — should be in L0 test suite | 2410.02832 |
| SMOOTHLLM | Randomized smoothing defense — certifiable robustness | 2310.03684 |
| PAIR (NeurIPS'24) | Automated red teaming; maps to RedTeamEngine.ts | 2310.08419 |
Reference created: 2026-04-04
Source repos: /Users/renefichtmueller/Desktop/Claude Code/github-repos/Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety
ShieldX path: /Users/renefichtmueller/shieldx/