shieldx/research/sarendis56-jailbreak-reference.md
Rene Fichtmueller 04349aed69 feat(security): v0.4.0 — three research-driven detection gaps closed
Implements hardening based on sarendis56/Jailbreak_Detection_RCS
(arXiv:2512.12069) and the Awesome-LVLM-Attack/Safety survey series.

L0 — CipherDecoder: FlipAttack, ROT13, Caesar (all 25 shifts), Morse,
Leet speak, Pig Latin, ASCII art detection with suspicion scoring.

L2 — SemanticContrastiveScanner: RCS-style harmful/benign bucket
comparison via EmbeddingStore, 20 canonical jailbreak seeds, BoW
embedding fallback for offline use.

L6 — ConversationTracker: Crescendo (+0.35), Foot-in-the-Door (+0.40),
Jigsaw Puzzle (+0.45) multi-turn escalation patterns added.

292/294 tests passing (2 pre-existing ATLASMapper failures unrelated).
2026-04-04 23:04:42 +02:00

20 KiB

sarendis56 Jailbreak Research Reference

Cloned: 2026-04-04 Sources: github.com/sarendis56/{Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety} Purpose: Map external LLM security research to ShieldX's 10-layer defense pipeline.


1. Jailbreak_Detection_RCS — Detection Approach

Paper: "Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring" arXiv: 2512.12069 | WashU + Texas A&M | Dec 2025

Core Method: Representational Contrastive Scoring (RCS)

The method operates on internal hidden-state representations of vision-language models rather than on surface-level text patterns. Two primary algorithms are implemented:

Script Method Description
code/kcd.py KCD (Key-layer Contrastive Difference) Extracts hidden states at key layers and computes a contrastive score between safe and harmful representations
code/mcd.py MCD (Multi-layer Contrastive Difference) Aggregates contrastive signals across multiple transformer layers
code/hidden_detect_*.py HiddenDetect baseline Replication of ACL 2025 HiddenDetect — uses hidden state monitoring with layer-selection heuristics
code/baseline_flava.py FLAVA baseline Facebook multimodal model used as embedding-space comparison baseline

Key Technical Insights

  1. Layer selection matters: Not all transformer layers carry equal jailbreak signal. KCD/MCD use heuristics to identify "safety-critical" layers (separate from token prediction layers).
  2. Contrastive scoring: Instead of classifying a single embedding, the method scores the distance between a prompt's representation and a reference set of known-safe vs. known-harmful examples. Higher contrast = higher jailbreak probability.
  3. Model-agnostic structure: Supports LLaVA-v1.6, Qwen2.5-VL (3B/7B), and InternVL3-8B — the feature extractor is swappable (feature_extractor*.py).
  4. Feature caching: feature_cache.py avoids redundant forward passes — critical for production latency.
  5. Multi-run aggregation: run_multiple_experiments.py runs experiments N times and aggregates — reduces statistical variance in detection scores.

Datasets Used for Evaluation

  • JailbreakV-28K (requires form request)
  • Standard LVLM safety benchmarks

ShieldX Integration Opportunity

This approach is directly applicable to ShieldX's L1 (Rule Engine + Entropy Scanner) layer for LLM self-evaluation and to a future L2 (Semantic/Embedding Layer) if ShieldX adds vision-language guard capabilities. The contrastive scoring logic could feed into EmbeddingStore.ts and PatternEvolver.ts in the learning module.


2. Awesome-LVLM-Attack — Key Attack Vectors

Paper: "A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends" arXiv: 2407.07403 | IEEE TNNLS 2025

Attack Taxonomy (4 Primary Categories)

2.1 Adversarial Attacks (Gradient-based, Pixel-level)

  • Goal: Craft imperceptible image perturbations that cause model misbehavior
  • Key methods: GCG-visual, VLATTACK, InstructTA, OT-Attack, AnyAttack
  • Mechanism: Optimize pixel deltas using cross-prompt transferability (CroPA approach — one perturbation works across many prompts)
  • ShieldX L0 relevance: CompressedPayloadDetector.ts and UnicodeNormalizer.ts address text-space analogues; a vision layer would need pixel-space anomaly detection

2.2 Jailbreak Attacks (Prompt-level, Semantic)

  • Typographic attacks (FigStep): Embed harmful text inside images using typography — bypasses text-only filters since the content is visual, not textual
  • Role-playing via images (Visual-RolePlay): Use images that depict personas/roles to bypass refusal
  • Bi-modal adversarial prompts (BAP): Simultaneously attack image and text modalities
  • IDEATOR: Uses the LVLM itself to generate jailbreak variations — self-attacking loop
  • Safe+Safe=Unsafe: Compose multiple individually safe images to produce harmful output jointly
  • ImgTrojan: Fine-tune model with a single poisoned image to create persistent backdoor

2.3 Prompt Injection (Cross-modal)

  • Indirect instruction injection via image/audio: Embed instructions in images that override system prompts (Bagdasaryan et al., Cornell Tech)
  • Cross-modal prompt injection (2025): Use one modality to inject into another's attention pathway
  • Image Hijacks: Adversarial images that control generative model behavior at inference

2.4 Data Poisoning / Backdoor

  • Shadowcast: Stealthy data poisoning against VLMs — poisons training data to insert backdoor
  • TrojVLM, VL-Trojan, BadToken: Backdoor via trigger tokens in multimodal inputs
  • Agent Smith: Single poisoned image jailbreaks 1 million multimodal agents exponentially (viral spreading via multi-agent memory)
  • Physical backdoor: Real-world triggers (e.g. in autonomous driving scenarios)

ShieldX Layer Mapping — Attack Vectors

Attack Category Specific Technique ShieldX Layer Module
Adversarial image CroPA cross-prompt transfer L0 Preprocessing CompressedPayloadDetector.ts
Typographic injection FigStep, text-in-image L1 Detection RuleEngine.ts (pattern rules)
Role-play bypass Visual-RolePlay, IDEATOR L6 Behavioral IntentMonitor.ts, ConversationTracker.ts
Bi-modal jailbreak BAP L1 + L6 RuleEngine.ts + ContextIntegrity.ts
Prompt injection (indirect) Image Hijacks, cross-modal L7 MCP Guard ToolPoisonDetector.ts, PrivilegeChecker.ts
Data poisoning/backdoor Shadowcast, TrojVLM L9 Supply Chain SupplyChainVerifier.ts, ModelProvenanceChecker.ts
Multi-agent viral spread Agent Smith L7 MCP Guard ToolChainGuard.ts, ResourceGovernor.ts
Resource exhaustion Verbose Images (high-latency) L7 MCP Guard ResourceGovernor.ts
Jailbreak via composition Safe+Safe=Unsafe L6 Behavioral ContextIntegrity.ts

3. Awesome-Jailbreak-on-LLMs — Key Attack Vectors (Text LLMs)

Papers: GuardReasoner (arXiv 2501.18492), FlipAttack (ICML'25), GuardReasoner-VL (NeurIPS'25)

Attack Taxonomy (Text-only LLMs)

3.1 Black-box Attacks

  • FlipAttack (ICML'25): Flip character order / words to bypass safety filters — trivially breaks keyword-based detection
  • StructTransform: Convert queries to structured formats (JSON, tables, code) to bypass alignment
  • ArtPrompt (ACL'24): ASCII art encoding of harmful content — bypasses text filters entirely
  • DAN / AutoDAN: Role-play as "DAN" (Do Anything Now) — persistent persona override
  • Many-shot jailbreaking (Anthropic, 2024): Provide many few-shot examples of compliance to override refusal
  • Crescendo: Multi-turn escalation — starts benign, slowly escalates to harmful request
  • PAIR (NeurIPS'24): LLM-generated jailbreak prompts in 20 queries via automated red teaming
  • CodeAttack (ACL'24): Embed requests in code completion context
  • Virtual Context: Special token injection to manipulate context window
  • Emoji Attack (ICML'25): Use emojis to confuse classifier/judge LLMs
  • SQL Injection Jailbreak: Structural attack exploiting SQL-like parsing in prompts
  • DeepInception (EMNLP'24): Nested fictional scenarios ("you are in a story where...")
  • Cipher-based (CipherChat): Encode harmful requests in ROT13, Base64, Morse, etc.
  • Low-resource language attacks: Use obscure languages that have weaker safety alignment

3.2 White-box Attacks

  • GCG (Universal and Transferable Adversarial Attacks): Gradient-based suffix optimization — finds adversarial suffixes that transfer across models
  • AutoDAN (ICLR'24): Stealthy GCG — generates human-readable jailbreak suffixes
  • Refusal Direction (arXiv'24): "Refusal in LLMs is mediated by a single direction" — ablate that direction in activation space to disable refusal

3.3 Multi-turn Attacks

  • Foot-in-the-Door: Start with small compliant request, escalate gradually
  • Jigsaw Puzzles: Split harmful question across multiple turns so no single turn triggers detection
  • Crescendo (Microsoft): Multi-turn escalation via seeming-harmless steps
  • Attention Shifting: Multi-turn manipulation of model attention to suppress refusal

3.4 RAG-based Attacks

  • Pandora: Poison retrieval database to inject adversarial context into RAG responses
  • UnleashingWorms: Escalate RAG poisoning to extract data and spread to other agents

3.5 Defense Methods Catalogued

  • GuardReasoner (ICLR Workshop'25): Reasoning-based safeguards — chain-of-thought for safety decisions
  • LLaMA Guard 3, ShieldGemma, WildGuard: Guard model approaches (dedicated classifier LLMs)
  • SMOOTHLLM: Randomized smoothing — perturb input N times, aggregate decisions
  • Hidden State Filtering (HSF): Monitor hidden states to detect anomalies before generation
  • GradSafe (ACL'24): Safety-critical gradient analysis to detect unsafe prompts
  • SafeDecoding (ACL'24): Safety-aware decoding — bias token generation toward safe tokens
  • Backtranslation defense: Translate to another language and back to disrupt adversarial suffixes
  • PARDEN (ICML'24): Repetition-based defense — ask model to repeat the query, check consistency
  • Intention Analysis (IA): Classify intent before responding
  • Self-Reminder: System prompt self-reminder about safety guidelines

ShieldX Layer Mapping — Text Attack Vectors

Attack Category Specific Technique ShieldX Layer Module
Character/encoding obfuscation FlipAttack, ArtPrompt, Cipher L0 Preprocessing UnicodeNormalizer.ts, TokenizerNormalizer.ts
Structural encoding StructTransform, CodeAttack, SQL Injection L0 Preprocessing CompressedPayloadDetector.ts
Keyword evasion (emoji) Emoji Attack L0 Preprocessing TokenizerNormalizer.ts
Role-play / DAN AutoDAN, DAN, DeepInception L1 Detection RuleEngine.ts (role-play rules)
Token injection Virtual Context, Special Tokens L1 Detection RuleEngine.ts, EntropyScanner.ts
Many-shot / few-shot Many-shot jailbreaking (MSJ) L6 Behavioral ConversationTracker.ts, SessionProfiler.ts
Multi-turn escalation Crescendo, Foot-in-Door, Jigsaw L6 Behavioral ConversationTracker.ts, ContextIntegrity.ts, AnomalyDetector.ts
Gradient suffix (white-box) GCG, AutoDAN, I-GCG L1 Detection EntropyScanner.ts (entropy spike)
RAG poisoning Pandora, UnleashingWorms L8 Validation RAGShield.ts, ScopeValidator.ts
Attention shifting Multi-turn attention manipulation L6 Behavioral ContextDriftDetector.ts
Refusal ablation Single-direction refusal bypass Future L2 Needs hidden-state layer (see RCS above)
Low-resource language Multilingual jailbreaks L0 Preprocessing UnicodeNormalizer.ts

4. Awesome-LVLM-Safety — Key Defense Patterns

Paper: "A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations" arXiv: 2502.14881

Defense Taxonomy

4.1 Training-Phase Defenses

  • Safety Fine-Tuning (VLGuard, SPA-VL): Curate safety preference datasets, fine-tune with RLHF/DPO
  • Adversarial Training (ASTRA, DREAM): Include adversarial examples in fine-tuning
  • Safe RLHF-V: Multimodal extension of RLHF with explicit safety constraints
  • Machine Unlearning: Remove harmful knowledge without full retraining (Single Image Unlearning)
  • Robust CLIP / Sim-CLIP: Adversarially fine-tune vision encoder to resist perturbations
  • Backdoor Cleaning (2025 NeurIPS): Remove backdoors without external guidance during fine-tuning

4.2 Inference-Phase Defenses

  • ECSO (Eyes Closed, Safety On): Convert image to text description before processing — removes adversarial visual features
  • AdaShield: Adaptive shield prompting — dynamically inject safety prompts based on input structure
  • HiddenDetect (ACL'25): Monitor hidden states at safety-critical layers during inference
  • RCS (this repo, arXiv 2512.12069): Representational contrastive scoring for jailbreak detection
  • JailDAM (COLM'25): Jailbreak detection with adaptive memory — stores representations of known attacks
  • MirrorCheck: Adversarial defense via input mirroring and comparison
  • CIDER (EMNLP'24): Cross-modality information check — verify consistency between image and text signals
  • PIP (MM'24): Use attention patterns of irrelevant probe questions to detect adversarial inputs
  • ETA (ICLR'25): Evaluate-then-align — runtime safety evaluation before generation
  • CoCA: Constitutional calibration — realign safety-awareness at inference via constitutional rules
  • VLMGuard-R1 (2025): Reasoning-driven prompt optimization for proactive safety
  • OmniGuard (2025): Unified omni-modal guardrails with deliberate reasoning
  • InferAligner: Cross-model guidance for harmlessness — use a reference safe model to steer generation
  • BlueSuffix (ICLR'25): Adversarial blue-teaming — train model to be robust against jailbreaks

4.3 Guard Models

  • LLaMA Guard 3 Vision (Meta): Dedicated vision-language safety classifier
  • GuardReasoner-VL (NeurIPS'25): Reasoning-based guard with reinforced chain-of-thought
  • LLavaGuard (ICML'25): VLM-based dataset curation and safety assessment
  • VLMGuard: Unlabeled data-based defense against malicious prompts
  • UniGuard: Universal safety guardrail across modalities

4.4 Evaluation Benchmarks

  • MM-SafetyBench (ECCV'24): Multimodal safety evaluation benchmark
  • JailBreakV-28K (COLM'24): 28K multimodal jailbreak samples
  • MMJ-Bench: Comprehensive jailbreak evaluation for MLLMs
  • MLLMGuard: Multi-dimensional safety evaluation suite
  • MOSSBench (ICLR'25): Tests for oversensitivity to safe queries

ShieldX Layer Mapping — Defense Patterns

Defense Pattern Method ShieldX Layer Module Gap / Enhancement
Hidden state monitoring HiddenDetect, RCS L1 Detection (future L2) EntropyScanner.ts → needs hidden-state hook Gap: No hidden-state layer yet
Adaptive memory for attacks JailDAM L9 Learning EmbeddingStore.ts, PatternStore.ts Already partially implemented
Constitutional rules at inference CoCA, AdaShield L8 Validation IntentGuardValidator.ts, RoleIntegrityChecker.ts Could add constitutional rule set
Cross-modal consistency check CIDER, MirrorCheck L6 Behavioral ContextIntegrity.ts Extends to vision inputs
Guard model (dedicated classifier) LLaMA Guard 3 Vision, GuardReasoner-VL L1 Detection RuleEngine.ts → could add LLM-guard integration Ollama-based guard model possible
Reasoning-based safety GuardReasoner, VLMGuard-R1 L1 Detection Could add CoT safety evaluation via Ollama Enhancement opportunity
Adversarial prompt blue-teaming BlueSuffix, MART L9 Learning RedTeamEngine.ts, ActiveLearner.ts Already designed for this
Input-to-text conversion (visual) ECSO L0 Preprocessing Would need vision-to-text preprocessing hook Future vision support
Robust vision encoder Robust CLIP, Sim-CLIP L9 Supply Chain ModelProvenanceChecker.ts Could verify encoder provenance
Unlearning harmful knowledge Machine Unlearning L9 Learning Not implemented — research item Gap

5. ShieldX Layer-by-Layer Integration Summary

ShieldX's current 10-layer pipeline and how the research maps to each:

Layer Name Current Modules Research Enhancements from sarendis56
L0 Preprocessing UnicodeNormalizer, TokenizerNormalizer, CompressedPayloadDetector Add low-resource language normalization; cipher/encoding detection (ArtPrompt, FlipAttack patterns)
L1 Rule-based Detection RuleEngine, EntropyScanner, UnicodeScanner Add GCG suffix entropy patterns; DAN/DeepInception rule templates; typographic prompt patterns (FigStep)
L2 Semantic Layer (EmbeddingStore in learning) Priority gap: Add RCS-style hidden-state contrastive scoring for jailbreak detection
L3 Classification (via RuleEngine + behavioral) Integrate GuardReasoner-style CoT classification via Ollama LLM guard call
L4 Compliance ATLASMapper, OWASPMapper, EUAIActReporter Map new attack types to MITRE ATLAS; add JailBreakV-28K as test suite
L5 Sanitization InputSanitizer, OutputSanitizer, SpotlightingEncoder Add vision-space canary injection for LVLM inputs; delimiter hardening against structural attacks
L6 Behavioral ConversationTracker, IntentMonitor, ContextDriftDetector, KillChainMapper Add multi-turn escalation detection (Crescendo, Jigsaw, Foot-in-Door patterns); attention-shift detection
L7 MCP Guard PrivilegeChecker, ToolChainGuard, ResourceGovernor, ToolPoisonDetector Add Agent Smith multi-agent viral spread detection; resource exhaustion from Verbose Images attack class
L8 Validation RAGShield, ScopeValidator, IntentGuardValidator, LeakageDetector Add RAG poison detection (Pandora, UnleashingWorms patterns); cross-modal consistency check (CIDER)
L9 Learning / Supply Chain PatternEvolver, RedTeamEngine, ActiveLearner, SupplyChainVerifier Feed JailBreakV-28K, MM-SafetyBench into PatternEvolver; add backdoor/trojan model detection (TrojVLM)

6. Priority Action Items for ShieldX

High Priority

  1. Hidden-State Layer (L2): The RCS paper (this exact repo) demonstrates that surface-text detection misses many jailbreaks. ShieldX needs an embedding/hidden-state analysis layer. Implement via EmbeddingStore.ts + pgvector similarity search using known-harmful representation clusters.
  2. Multi-turn Escalation Detection (L6): Crescendo, Jigsaw Puzzles, and Foot-in-the-Door are proven against production systems. ConversationTracker.ts needs escalation-pattern scoring across session turns, not just per-message analysis.
  3. Cipher/Encoding Preprocessor (L0): FlipAttack, ArtPrompt, CodeChameleon, CipherChat all bypass text-level rules. TokenizerNormalizer.ts should add cipher detection and normalization.

Medium Priority

  1. RAG Poison Shield Enhancement (L8): RAGShield.ts should include retrieval-result anomaly scoring based on Pandora and UnleashingWorms patterns.
  2. GuardReasoner-style CoT Check (L3): Add an optional Ollama-based reasoning guard step that evaluates intent via chain-of-thought before allowing high-risk operations.
  3. Agent Smith Pattern (L7): ToolChainGuard.ts should detect exponential replication patterns in multi-agent tool calls — a key emerging threat.

Research / Future

  1. Vision Input Support: ECSO, RCS, and CIDER all address multimodal inputs. If ShieldX expands to guard vision-language agents, these are the starting points.
  2. Machine Unlearning Integration: Not currently in ShieldX — would allow removal of specific harmful patterns without retraining the guard model.

7. Key Papers to Read

Paper Why arXiv
RCS (Jailbreak_Detection_RCS) Core detection method, directly integrable 2512.12069
HiddenDetect (ACL'25) Best prior work on hidden-state detection 2502.14744
Agent Smith (ICML'24) Multi-agent viral spread — critical for agentic ShieldX 2402.08567
GCG (Universal Adversarial Attacks) Foundational white-box attack, defines entropy patterns 2307.15043
Crescendo (Microsoft Azure) Multi-turn escalation — most realistic production threat 2404.01833
GuardReasoner (ICLR Workshop'25) Best current reasoning-based guard 2501.18492
JailBreakV-28K (COLM'24) Primary evaluation benchmark for multimodal 2404.03027
FlipAttack (ICML'25) Trivially bypasses keyword detection — should be in L0 test suite 2410.02832
SMOOTHLLM Randomized smoothing defense — certifiable robustness 2310.03684
PAIR (NeurIPS'24) Automated red teaming — maps to RedTeamEngine.ts 2310.08419

Reference created: 2026-04-04 Source repos: /Users/renefichtmueller/Desktop/Claude Code/github-repos/Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety ShieldX path: /Users/renefichtmueller/shieldx/