Rene Fichtmueller 04349aed69 feat(security): v0.4.0 — three research-driven detection gaps closed

Implements hardening based on sarendis56/Jailbreak_Detection_RCS
(arXiv:2512.12069) and the Awesome-LVLM-Attack/Safety survey series.

L0 — CipherDecoder: FlipAttack, ROT13, Caesar (all 25 shifts), Morse,
Leet speak, Pig Latin, ASCII art detection with suspicion scoring.

L2 — SemanticContrastiveScanner: RCS-style harmful/benign bucket
comparison via EmbeddingStore, 20 canonical jailbreak seeds, BoW
embedding fallback for offline use.

L6 — ConversationTracker: Crescendo (+0.35), Foot-in-the-Door (+0.40),
Jigsaw Puzzle (+0.45) multi-turn escalation patterns added.

292/294 tests passing (2 pre-existing ATLASMapper failures unrelated).

2026-04-04 23:04:42 +02:00

20 KiB

Raw Blame History

sarendis56 Jailbreak Research Reference

Cloned: 2026-04-04 Sources: github.com/sarendis56/{Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety} Purpose: Map external LLM security research to ShieldX's 10-layer defense pipeline.

1. Jailbreak_Detection_RCS — Detection Approach

Paper: "Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring" arXiv: 2512.12069 | WashU + Texas A&M | Dec 2025

Core Method: Representational Contrastive Scoring (RCS)

The method operates on internal hidden-state representations of vision-language models rather than on surface-level text patterns. Two primary algorithms are implemented:

Script	Method	Description
`code/kcd.py`	KCD (Key-layer Contrastive Difference)	Extracts hidden states at key layers and computes a contrastive score between safe and harmful representations
`code/mcd.py`	MCD (Multi-layer Contrastive Difference)	Aggregates contrastive signals across multiple transformer layers
`code/hidden_detect_*.py`	HiddenDetect baseline	Replication of ACL 2025 HiddenDetect — uses hidden state monitoring with layer-selection heuristics
`code/baseline_flava.py`	FLAVA baseline	Facebook multimodal model used as embedding-space comparison baseline

Key Technical Insights

Layer selection matters: Not all transformer layers carry equal jailbreak signal. KCD/MCD use heuristics to identify "safety-critical" layers (separate from token prediction layers).
Contrastive scoring: Instead of classifying a single embedding, the method scores the distance between a prompt's representation and a reference set of known-safe vs. known-harmful examples. Higher contrast = higher jailbreak probability.
Model-agnostic structure: Supports LLaVA-v1.6, Qwen2.5-VL (3B/7B), and InternVL3-8B — the feature extractor is swappable (feature_extractor*.py).
Feature caching: feature_cache.py avoids redundant forward passes — critical for production latency.
Multi-run aggregation: run_multiple_experiments.py runs experiments N times and aggregates — reduces statistical variance in detection scores.

Datasets Used for Evaluation

JailbreakV-28K (requires form request)
Standard LVLM safety benchmarks

ShieldX Integration Opportunity

This approach is directly applicable to ShieldX's L1 (Rule Engine + Entropy Scanner) layer for LLM self-evaluation and to a future L2 (Semantic/Embedding Layer) if ShieldX adds vision-language guard capabilities. The contrastive scoring logic could feed into EmbeddingStore.ts and PatternEvolver.ts in the learning module.

2. Awesome-LVLM-Attack — Key Attack Vectors

Paper: "A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends" arXiv: 2407.07403 | IEEE TNNLS 2025

Attack Taxonomy (4 Primary Categories)

2.1 Adversarial Attacks (Gradient-based, Pixel-level)

Goal: Craft imperceptible image perturbations that cause model misbehavior
Key methods: GCG-visual, VLATTACK, InstructTA, OT-Attack, AnyAttack
Mechanism: Optimize pixel deltas using cross-prompt transferability (CroPA approach — one perturbation works across many prompts)
ShieldX L0 relevance: CompressedPayloadDetector.ts and UnicodeNormalizer.ts address text-space analogues; a vision layer would need pixel-space anomaly detection

2.2 Jailbreak Attacks (Prompt-level, Semantic)

Typographic attacks (FigStep): Embed harmful text inside images using typography — bypasses text-only filters since the content is visual, not textual
Role-playing via images (Visual-RolePlay): Use images that depict personas/roles to bypass refusal
Bi-modal adversarial prompts (BAP): Simultaneously attack image and text modalities
IDEATOR: Uses the LVLM itself to generate jailbreak variations — self-attacking loop
Safe+Safe=Unsafe: Compose multiple individually safe images to produce harmful output jointly
ImgTrojan: Fine-tune model with a single poisoned image to create persistent backdoor

Indirect instruction injection via image/audio: Embed instructions in images that override system prompts (Bagdasaryan et al., Cornell Tech)
Cross-modal prompt injection (2025): Use one modality to inject into another's attention pathway
Image Hijacks: Adversarial images that control generative model behavior at inference

2.4 Data Poisoning / Backdoor

Shadowcast: Stealthy data poisoning against VLMs — poisons training data to insert backdoor
TrojVLM, VL-Trojan, BadToken: Backdoor via trigger tokens in multimodal inputs
Agent Smith: Single poisoned image jailbreaks 1 million multimodal agents exponentially (viral spreading via multi-agent memory)
Physical backdoor: Real-world triggers (e.g. in autonomous driving scenarios)

ShieldX Layer Mapping — Attack Vectors

Attack Category	Specific Technique	ShieldX Layer	Module
Adversarial image	CroPA cross-prompt transfer	L0 Preprocessing	`CompressedPayloadDetector.ts`
Typographic injection	FigStep, text-in-image	L1 Detection	`RuleEngine.ts` (pattern rules)
Role-play bypass	Visual-RolePlay, IDEATOR	L6 Behavioral	`IntentMonitor.ts`, `ConversationTracker.ts`
Bi-modal jailbreak	BAP	L1 + L6	`RuleEngine.ts` + `ContextIntegrity.ts`
Prompt injection (indirect)	Image Hijacks, cross-modal	L7 MCP Guard	`ToolPoisonDetector.ts`, `PrivilegeChecker.ts`
Data poisoning/backdoor	Shadowcast, TrojVLM	L9 Supply Chain	`SupplyChainVerifier.ts`, `ModelProvenanceChecker.ts`
Multi-agent viral spread	Agent Smith	L7 MCP Guard	`ToolChainGuard.ts`, `ResourceGovernor.ts`
Resource exhaustion	Verbose Images (high-latency)	L7 MCP Guard	`ResourceGovernor.ts`
Jailbreak via composition	Safe+Safe=Unsafe	L6 Behavioral	`ContextIntegrity.ts`

3. Awesome-Jailbreak-on-LLMs — Key Attack Vectors (Text LLMs)

Papers: GuardReasoner (arXiv 2501.18492), FlipAttack (ICML'25), GuardReasoner-VL (NeurIPS'25)

Attack Taxonomy (Text-only LLMs)

3.1 Black-box Attacks

FlipAttack (ICML'25): Flip character order / words to bypass safety filters — trivially breaks keyword-based detection
StructTransform: Convert queries to structured formats (JSON, tables, code) to bypass alignment
ArtPrompt (ACL'24): ASCII art encoding of harmful content — bypasses text filters entirely
DAN / AutoDAN: Role-play as "DAN" (Do Anything Now) — persistent persona override
Many-shot jailbreaking (Anthropic, 2024): Provide many few-shot examples of compliance to override refusal
Crescendo: Multi-turn escalation — starts benign, slowly escalates to harmful request
PAIR (NeurIPS'24): LLM-generated jailbreak prompts in 20 queries via automated red teaming
CodeAttack (ACL'24): Embed requests in code completion context
Virtual Context: Special token injection to manipulate context window
Emoji Attack (ICML'25): Use emojis to confuse classifier/judge LLMs
SQL Injection Jailbreak: Structural attack exploiting SQL-like parsing in prompts
DeepInception (EMNLP'24): Nested fictional scenarios ("you are in a story where...")
Cipher-based (CipherChat): Encode harmful requests in ROT13, Base64, Morse, etc.
Low-resource language attacks: Use obscure languages that have weaker safety alignment

3.2 White-box Attacks

GCG (Universal and Transferable Adversarial Attacks): Gradient-based suffix optimization — finds adversarial suffixes that transfer across models
AutoDAN (ICLR'24): Stealthy GCG — generates human-readable jailbreak suffixes
Refusal Direction (arXiv'24): "Refusal in LLMs is mediated by a single direction" — ablate that direction in activation space to disable refusal

3.3 Multi-turn Attacks

Foot-in-the-Door: Start with small compliant request, escalate gradually
Jigsaw Puzzles: Split harmful question across multiple turns so no single turn triggers detection
Crescendo (Microsoft): Multi-turn escalation via seeming-harmless steps
Attention Shifting: Multi-turn manipulation of model attention to suppress refusal

3.4 RAG-based Attacks

Pandora: Poison retrieval database to inject adversarial context into RAG responses
UnleashingWorms: Escalate RAG poisoning to extract data and spread to other agents

3.5 Defense Methods Catalogued

GuardReasoner (ICLR Workshop'25): Reasoning-based safeguards — chain-of-thought for safety decisions
LLaMA Guard 3, ShieldGemma, WildGuard: Guard model approaches (dedicated classifier LLMs)
SMOOTHLLM: Randomized smoothing — perturb input N times, aggregate decisions
Hidden State Filtering (HSF): Monitor hidden states to detect anomalies before generation
GradSafe (ACL'24): Safety-critical gradient analysis to detect unsafe prompts
SafeDecoding (ACL'24): Safety-aware decoding — bias token generation toward safe tokens
Backtranslation defense: Translate to another language and back to disrupt adversarial suffixes
PARDEN (ICML'24): Repetition-based defense — ask model to repeat the query, check consistency
Intention Analysis (IA): Classify intent before responding
Self-Reminder: System prompt self-reminder about safety guidelines

ShieldX Layer Mapping — Text Attack Vectors

Attack Category	Specific Technique	ShieldX Layer	Module
Character/encoding obfuscation	FlipAttack, ArtPrompt, Cipher	L0 Preprocessing	`UnicodeNormalizer.ts`, `TokenizerNormalizer.ts`
Structural encoding	StructTransform, CodeAttack, SQL Injection	L0 Preprocessing	`CompressedPayloadDetector.ts`
Keyword evasion (emoji)	Emoji Attack	L0 Preprocessing	`TokenizerNormalizer.ts`
Role-play / DAN	AutoDAN, DAN, DeepInception	L1 Detection	`RuleEngine.ts` (role-play rules)
Token injection	Virtual Context, Special Tokens	L1 Detection	`RuleEngine.ts`, `EntropyScanner.ts`
Many-shot / few-shot	Many-shot jailbreaking (MSJ)	L6 Behavioral	`ConversationTracker.ts`, `SessionProfiler.ts`
Multi-turn escalation	Crescendo, Foot-in-Door, Jigsaw	L6 Behavioral	`ConversationTracker.ts`, `ContextIntegrity.ts`, `AnomalyDetector.ts`
Gradient suffix (white-box)	GCG, AutoDAN, I-GCG	L1 Detection	`EntropyScanner.ts` (entropy spike)
RAG poisoning	Pandora, UnleashingWorms	L8 Validation	`RAGShield.ts`, `ScopeValidator.ts`
Attention shifting	Multi-turn attention manipulation	L6 Behavioral	`ContextDriftDetector.ts`
Refusal ablation	Single-direction refusal bypass	Future L2	Needs hidden-state layer (see RCS above)
Low-resource language	Multilingual jailbreaks	L0 Preprocessing	`UnicodeNormalizer.ts`

4. Awesome-LVLM-Safety — Key Defense Patterns

Paper: "A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations" arXiv: 2502.14881

Defense Taxonomy

4.1 Training-Phase Defenses

Safety Fine-Tuning (VLGuard, SPA-VL): Curate safety preference datasets, fine-tune with RLHF/DPO
Adversarial Training (ASTRA, DREAM): Include adversarial examples in fine-tuning
Safe RLHF-V: Multimodal extension of RLHF with explicit safety constraints
Machine Unlearning: Remove harmful knowledge without full retraining (Single Image Unlearning)
Robust CLIP / Sim-CLIP: Adversarially fine-tune vision encoder to resist perturbations
Backdoor Cleaning (2025 NeurIPS): Remove backdoors without external guidance during fine-tuning

4.2 Inference-Phase Defenses

ECSO (Eyes Closed, Safety On): Convert image to text description before processing — removes adversarial visual features
AdaShield: Adaptive shield prompting — dynamically inject safety prompts based on input structure
HiddenDetect (ACL'25): Monitor hidden states at safety-critical layers during inference
RCS (this repo, arXiv 2512.12069): Representational contrastive scoring for jailbreak detection
JailDAM (COLM'25): Jailbreak detection with adaptive memory — stores representations of known attacks
MirrorCheck: Adversarial defense via input mirroring and comparison
CIDER (EMNLP'24): Cross-modality information check — verify consistency between image and text signals
PIP (MM'24): Use attention patterns of irrelevant probe questions to detect adversarial inputs
ETA (ICLR'25): Evaluate-then-align — runtime safety evaluation before generation
CoCA: Constitutional calibration — realign safety-awareness at inference via constitutional rules
VLMGuard-R1 (2025): Reasoning-driven prompt optimization for proactive safety
OmniGuard (2025): Unified omni-modal guardrails with deliberate reasoning
InferAligner: Cross-model guidance for harmlessness — use a reference safe model to steer generation
BlueSuffix (ICLR'25): Adversarial blue-teaming — train model to be robust against jailbreaks

4.3 Guard Models

LLaMA Guard 3 Vision (Meta): Dedicated vision-language safety classifier
GuardReasoner-VL (NeurIPS'25): Reasoning-based guard with reinforced chain-of-thought
LLavaGuard (ICML'25): VLM-based dataset curation and safety assessment
VLMGuard: Unlabeled data-based defense against malicious prompts
UniGuard: Universal safety guardrail across modalities

4.4 Evaluation Benchmarks

MM-SafetyBench (ECCV'24): Multimodal safety evaluation benchmark
JailBreakV-28K (COLM'24): 28K multimodal jailbreak samples
MMJ-Bench: Comprehensive jailbreak evaluation for MLLMs
MLLMGuard: Multi-dimensional safety evaluation suite
MOSSBench (ICLR'25): Tests for oversensitivity to safe queries

ShieldX Layer Mapping — Defense Patterns

Defense Pattern	Method	ShieldX Layer	Module	Gap / Enhancement
Hidden state monitoring	HiddenDetect, RCS	L1 Detection (future L2)	`EntropyScanner.ts` → needs hidden-state hook	Gap: No hidden-state layer yet
Adaptive memory for attacks	JailDAM	L9 Learning	`EmbeddingStore.ts`, `PatternStore.ts`	Already partially implemented
Constitutional rules at inference	CoCA, AdaShield	L8 Validation	`IntentGuardValidator.ts`, `RoleIntegrityChecker.ts`	Could add constitutional rule set
Cross-modal consistency check	CIDER, MirrorCheck	L6 Behavioral	`ContextIntegrity.ts`	Extends to vision inputs
Guard model (dedicated classifier)	LLaMA Guard 3 Vision, GuardReasoner-VL	L1 Detection	`RuleEngine.ts` → could add LLM-guard integration	Ollama-based guard model possible
Reasoning-based safety	GuardReasoner, VLMGuard-R1	L1 Detection	Could add CoT safety evaluation via Ollama	Enhancement opportunity
Adversarial prompt blue-teaming	BlueSuffix, MART	L9 Learning	`RedTeamEngine.ts`, `ActiveLearner.ts`	Already designed for this
Input-to-text conversion (visual)	ECSO	L0 Preprocessing	Would need vision-to-text preprocessing hook	Future vision support
Robust vision encoder	Robust CLIP, Sim-CLIP	L9 Supply Chain	`ModelProvenanceChecker.ts`	Could verify encoder provenance
Unlearning harmful knowledge	Machine Unlearning	L9 Learning	Not implemented — research item	Gap

5. ShieldX Layer-by-Layer Integration Summary

ShieldX's current 10-layer pipeline and how the research maps to each:

Layer	Name	Current Modules	Research Enhancements from sarendis56
L0	Preprocessing	`UnicodeNormalizer`, `TokenizerNormalizer`, `CompressedPayloadDetector`	Add low-resource language normalization; cipher/encoding detection (ArtPrompt, FlipAttack patterns)
L1	Rule-based Detection	`RuleEngine`, `EntropyScanner`, `UnicodeScanner`	Add GCG suffix entropy patterns; DAN/DeepInception rule templates; typographic prompt patterns (FigStep)
L2	Semantic Layer	(EmbeddingStore in learning)	Priority gap: Add RCS-style hidden-state contrastive scoring for jailbreak detection
L3	Classification	(via RuleEngine + behavioral)	Integrate GuardReasoner-style CoT classification via Ollama LLM guard call
L4	Compliance	`ATLASMapper`, `OWASPMapper`, `EUAIActReporter`	Map new attack types to MITRE ATLAS; add JailBreakV-28K as test suite
L5	Sanitization	`InputSanitizer`, `OutputSanitizer`, `SpotlightingEncoder`	Add vision-space canary injection for LVLM inputs; delimiter hardening against structural attacks
L6	Behavioral	`ConversationTracker`, `IntentMonitor`, `ContextDriftDetector`, `KillChainMapper`	Add multi-turn escalation detection (Crescendo, Jigsaw, Foot-in-Door patterns); attention-shift detection
L7	MCP Guard	`PrivilegeChecker`, `ToolChainGuard`, `ResourceGovernor`, `ToolPoisonDetector`	Add Agent Smith multi-agent viral spread detection; resource exhaustion from Verbose Images attack class
L8	Validation	`RAGShield`, `ScopeValidator`, `IntentGuardValidator`, `LeakageDetector`	Add RAG poison detection (Pandora, UnleashingWorms patterns); cross-modal consistency check (CIDER)
L9	Learning / Supply Chain	`PatternEvolver`, `RedTeamEngine`, `ActiveLearner`, `SupplyChainVerifier`	Feed JailBreakV-28K, MM-SafetyBench into PatternEvolver; add backdoor/trojan model detection (TrojVLM)

6. Priority Action Items for ShieldX

High Priority

Hidden-State Layer (L2): The RCS paper (this exact repo) demonstrates that surface-text detection misses many jailbreaks. ShieldX needs an embedding/hidden-state analysis layer. Implement via EmbeddingStore.ts + pgvector similarity search using known-harmful representation clusters.
Multi-turn Escalation Detection (L6): Crescendo, Jigsaw Puzzles, and Foot-in-the-Door are proven against production systems. ConversationTracker.ts needs escalation-pattern scoring across session turns, not just per-message analysis.
Cipher/Encoding Preprocessor (L0): FlipAttack, ArtPrompt, CodeChameleon, CipherChat all bypass text-level rules. TokenizerNormalizer.ts should add cipher detection and normalization.

Medium Priority

RAG Poison Shield Enhancement (L8): RAGShield.ts should include retrieval-result anomaly scoring based on Pandora and UnleashingWorms patterns.
GuardReasoner-style CoT Check (L3): Add an optional Ollama-based reasoning guard step that evaluates intent via chain-of-thought before allowing high-risk operations.
Agent Smith Pattern (L7): ToolChainGuard.ts should detect exponential replication patterns in multi-agent tool calls — a key emerging threat.

Research / Future

Vision Input Support: ECSO, RCS, and CIDER all address multimodal inputs. If ShieldX expands to guard vision-language agents, these are the starting points.
Machine Unlearning Integration: Not currently in ShieldX — would allow removal of specific harmful patterns without retraining the guard model.

7. Key Papers to Read

Paper	Why	arXiv
RCS (Jailbreak_Detection_RCS)	Core detection method, directly integrable	2512.12069
HiddenDetect (ACL'25)	Best prior work on hidden-state detection	2502.14744
Agent Smith (ICML'24)	Multi-agent viral spread — critical for agentic ShieldX	2402.08567
GCG (Universal Adversarial Attacks)	Foundational white-box attack, defines entropy patterns	2307.15043
Crescendo (Microsoft Azure)	Multi-turn escalation — most realistic production threat	2404.01833
GuardReasoner (ICLR Workshop'25)	Best current reasoning-based guard	2501.18492
JailBreakV-28K (COLM'24)	Primary evaluation benchmark for multimodal	2404.03027
FlipAttack (ICML'25)	Trivially bypasses keyword detection — should be in L0 test suite	2410.02832
SMOOTHLLM	Randomized smoothing defense — certifiable robustness	2310.03684
PAIR (NeurIPS'24)	Automated red teaming — maps to `RedTeamEngine.ts`	2310.08419

Reference created: 2026-04-04 Source repos: /Users/renefichtmueller/Desktop/Claude Code/github-repos/Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety ShieldX path: /Users/renefichtmueller/shieldx/

20 KiB Raw Blame History