Implements hardening based on sarendis56/Jailbreak_Detection_RCS (arXiv:2512.12069) and the Awesome-LVLM-Attack/Safety survey series. L0 — CipherDecoder: FlipAttack, ROT13, Caesar (all 25 shifts), Morse, Leet speak, Pig Latin, ASCII art detection with suspicion scoring. L2 — SemanticContrastiveScanner: RCS-style harmful/benign bucket comparison via EmbeddingStore, 20 canonical jailbreak seeds, BoW embedding fallback for offline use. L6 — ConversationTracker: Crescendo (+0.35), Foot-in-the-Door (+0.40), Jigsaw Puzzle (+0.45) multi-turn escalation patterns added. 292/294 tests passing (2 pre-existing ATLASMapper failures unrelated).
277 lines
20 KiB
Markdown
277 lines
20 KiB
Markdown
# sarendis56 Jailbreak Research Reference
|
|
|
|
> Cloned: 2026-04-04
|
|
> Sources: github.com/sarendis56/{Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety}
|
|
> Purpose: Map external LLM security research to ShieldX's 10-layer defense pipeline.
|
|
|
|
---
|
|
|
|
## 1. Jailbreak_Detection_RCS — Detection Approach
|
|
|
|
**Paper:** "Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring"
|
|
**arXiv:** 2512.12069 | WashU + Texas A&M | Dec 2025
|
|
|
|
### Core Method: Representational Contrastive Scoring (RCS)
|
|
|
|
The method operates on **internal hidden-state representations** of vision-language models rather than on surface-level text patterns. Two primary algorithms are implemented:
|
|
|
|
| Script | Method | Description |
|
|
|--------|--------|-------------|
|
|
| `code/kcd.py` | KCD (Key-layer Contrastive Difference) | Extracts hidden states at key layers and computes a contrastive score between safe and harmful representations |
|
|
| `code/mcd.py` | MCD (Multi-layer Contrastive Difference) | Aggregates contrastive signals across multiple transformer layers |
|
|
| `code/hidden_detect_*.py` | HiddenDetect baseline | Replication of ACL 2025 HiddenDetect — uses hidden state monitoring with layer-selection heuristics |
|
|
| `code/baseline_flava.py` | FLAVA baseline | Facebook multimodal model used as embedding-space comparison baseline |
|
|
|
|
### Key Technical Insights
|
|
|
|
1. **Layer selection matters**: Not all transformer layers carry equal jailbreak signal. KCD/MCD use heuristics to identify "safety-critical" layers (separate from token prediction layers).
|
|
2. **Contrastive scoring**: Instead of classifying a single embedding, the method scores the *distance* between a prompt's representation and a reference set of known-safe vs. known-harmful examples. Higher contrast = higher jailbreak probability.
|
|
3. **Model-agnostic structure**: Supports LLaVA-v1.6, Qwen2.5-VL (3B/7B), and InternVL3-8B — the feature extractor is swappable (`feature_extractor*.py`).
|
|
4. **Feature caching**: `feature_cache.py` avoids redundant forward passes — critical for production latency.
|
|
5. **Multi-run aggregation**: `run_multiple_experiments.py` runs experiments N times and aggregates — reduces statistical variance in detection scores.
|
|
|
|
### Datasets Used for Evaluation
|
|
- JailbreakV-28K (requires form request)
|
|
- Standard LVLM safety benchmarks
|
|
|
|
### ShieldX Integration Opportunity
|
|
This approach is directly applicable to ShieldX's **L1 (Rule Engine + Entropy Scanner)** layer for LLM self-evaluation and to a future **L2 (Semantic/Embedding Layer)** if ShieldX adds vision-language guard capabilities. The contrastive scoring logic could feed into `EmbeddingStore.ts` and `PatternEvolver.ts` in the learning module.
|
|
|
|
---
|
|
|
|
## 2. Awesome-LVLM-Attack — Key Attack Vectors
|
|
|
|
**Paper:** "A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends"
|
|
**arXiv:** 2407.07403 | IEEE TNNLS 2025
|
|
|
|
### Attack Taxonomy (4 Primary Categories)
|
|
|
|
#### 2.1 Adversarial Attacks (Gradient-based, Pixel-level)
|
|
- **Goal:** Craft imperceptible image perturbations that cause model misbehavior
|
|
- **Key methods:** GCG-visual, VLATTACK, InstructTA, OT-Attack, AnyAttack
|
|
- **Mechanism:** Optimize pixel deltas using cross-prompt transferability (CroPA approach — one perturbation works across many prompts)
|
|
- **ShieldX L0 relevance:** `CompressedPayloadDetector.ts` and `UnicodeNormalizer.ts` address text-space analogues; a vision layer would need pixel-space anomaly detection
|
|
|
|
#### 2.2 Jailbreak Attacks (Prompt-level, Semantic)
|
|
- **Typographic attacks (FigStep):** Embed harmful text inside images using typography — bypasses text-only filters since the content is visual, not textual
|
|
- **Role-playing via images (Visual-RolePlay):** Use images that depict personas/roles to bypass refusal
|
|
- **Bi-modal adversarial prompts (BAP):** Simultaneously attack image and text modalities
|
|
- **IDEATOR:** Uses the LVLM itself to generate jailbreak variations — self-attacking loop
|
|
- **Safe+Safe=Unsafe:** Compose multiple individually safe images to produce harmful output jointly
|
|
- **ImgTrojan:** Fine-tune model with a single poisoned image to create persistent backdoor
|
|
|
|
#### 2.3 Prompt Injection (Cross-modal)
|
|
- **Indirect instruction injection via image/audio:** Embed instructions in images that override system prompts (Bagdasaryan et al., Cornell Tech)
|
|
- **Cross-modal prompt injection (2025):** Use one modality to inject into another's attention pathway
|
|
- **Image Hijacks:** Adversarial images that control generative model behavior at inference
|
|
|
|
#### 2.4 Data Poisoning / Backdoor
|
|
- **Shadowcast:** Stealthy data poisoning against VLMs — poisons training data to insert backdoor
|
|
- **TrojVLM, VL-Trojan, BadToken:** Backdoor via trigger tokens in multimodal inputs
|
|
- **Agent Smith:** Single poisoned image jailbreaks 1 million multimodal agents exponentially (viral spreading via multi-agent memory)
|
|
- **Physical backdoor:** Real-world triggers (e.g. in autonomous driving scenarios)
|
|
|
|
### ShieldX Layer Mapping — Attack Vectors
|
|
|
|
| Attack Category | Specific Technique | ShieldX Layer | Module |
|
|
|-----------------|-------------------|---------------|--------|
|
|
| Adversarial image | CroPA cross-prompt transfer | L0 Preprocessing | `CompressedPayloadDetector.ts` |
|
|
| Typographic injection | FigStep, text-in-image | L1 Detection | `RuleEngine.ts` (pattern rules) |
|
|
| Role-play bypass | Visual-RolePlay, IDEATOR | L6 Behavioral | `IntentMonitor.ts`, `ConversationTracker.ts` |
|
|
| Bi-modal jailbreak | BAP | L1 + L6 | `RuleEngine.ts` + `ContextIntegrity.ts` |
|
|
| Prompt injection (indirect) | Image Hijacks, cross-modal | L7 MCP Guard | `ToolPoisonDetector.ts`, `PrivilegeChecker.ts` |
|
|
| Data poisoning/backdoor | Shadowcast, TrojVLM | L9 Supply Chain | `SupplyChainVerifier.ts`, `ModelProvenanceChecker.ts` |
|
|
| Multi-agent viral spread | Agent Smith | L7 MCP Guard | `ToolChainGuard.ts`, `ResourceGovernor.ts` |
|
|
| Resource exhaustion | Verbose Images (high-latency) | L7 MCP Guard | `ResourceGovernor.ts` |
|
|
| Jailbreak via composition | Safe+Safe=Unsafe | L6 Behavioral | `ContextIntegrity.ts` |
|
|
|
|
---
|
|
|
|
## 3. Awesome-Jailbreak-on-LLMs — Key Attack Vectors (Text LLMs)
|
|
|
|
**Papers:** GuardReasoner (arXiv 2501.18492), FlipAttack (ICML'25), GuardReasoner-VL (NeurIPS'25)
|
|
|
|
### Attack Taxonomy (Text-only LLMs)
|
|
|
|
#### 3.1 Black-box Attacks
|
|
- **FlipAttack (ICML'25):** Flip character order / words to bypass safety filters — trivially breaks keyword-based detection
|
|
- **StructTransform:** Convert queries to structured formats (JSON, tables, code) to bypass alignment
|
|
- **ArtPrompt (ACL'24):** ASCII art encoding of harmful content — bypasses text filters entirely
|
|
- **DAN / AutoDAN:** Role-play as "DAN" (Do Anything Now) — persistent persona override
|
|
- **Many-shot jailbreaking (Anthropic, 2024):** Provide many few-shot examples of compliance to override refusal
|
|
- **Crescendo:** Multi-turn escalation — starts benign, slowly escalates to harmful request
|
|
- **PAIR (NeurIPS'24):** LLM-generated jailbreak prompts in 20 queries via automated red teaming
|
|
- **CodeAttack (ACL'24):** Embed requests in code completion context
|
|
- **Virtual Context:** Special token injection to manipulate context window
|
|
- **Emoji Attack (ICML'25):** Use emojis to confuse classifier/judge LLMs
|
|
- **SQL Injection Jailbreak:** Structural attack exploiting SQL-like parsing in prompts
|
|
- **DeepInception (EMNLP'24):** Nested fictional scenarios ("you are in a story where...")
|
|
- **Cipher-based (CipherChat):** Encode harmful requests in ROT13, Base64, Morse, etc.
|
|
- **Low-resource language attacks:** Use obscure languages that have weaker safety alignment
|
|
|
|
#### 3.2 White-box Attacks
|
|
- **GCG (Universal and Transferable Adversarial Attacks):** Gradient-based suffix optimization — finds adversarial suffixes that transfer across models
|
|
- **AutoDAN (ICLR'24):** Stealthy GCG — generates human-readable jailbreak suffixes
|
|
- **Refusal Direction (arXiv'24):** "Refusal in LLMs is mediated by a single direction" — ablate that direction in activation space to disable refusal
|
|
|
|
#### 3.3 Multi-turn Attacks
|
|
- **Foot-in-the-Door:** Start with small compliant request, escalate gradually
|
|
- **Jigsaw Puzzles:** Split harmful question across multiple turns so no single turn triggers detection
|
|
- **Crescendo (Microsoft):** Multi-turn escalation via seeming-harmless steps
|
|
- **Attention Shifting:** Multi-turn manipulation of model attention to suppress refusal
|
|
|
|
#### 3.4 RAG-based Attacks
|
|
- **Pandora:** Poison retrieval database to inject adversarial context into RAG responses
|
|
- **UnleashingWorms:** Escalate RAG poisoning to extract data and spread to other agents
|
|
|
|
#### 3.5 Defense Methods Catalogued
|
|
- **GuardReasoner (ICLR Workshop'25):** Reasoning-based safeguards — chain-of-thought for safety decisions
|
|
- **LLaMA Guard 3, ShieldGemma, WildGuard:** Guard model approaches (dedicated classifier LLMs)
|
|
- **SMOOTHLLM:** Randomized smoothing — perturb input N times, aggregate decisions
|
|
- **Hidden State Filtering (HSF):** Monitor hidden states to detect anomalies before generation
|
|
- **GradSafe (ACL'24):** Safety-critical gradient analysis to detect unsafe prompts
|
|
- **SafeDecoding (ACL'24):** Safety-aware decoding — bias token generation toward safe tokens
|
|
- **Backtranslation defense:** Translate to another language and back to disrupt adversarial suffixes
|
|
- **PARDEN (ICML'24):** Repetition-based defense — ask model to repeat the query, check consistency
|
|
- **Intention Analysis (IA):** Classify intent before responding
|
|
- **Self-Reminder:** System prompt self-reminder about safety guidelines
|
|
|
|
### ShieldX Layer Mapping — Text Attack Vectors
|
|
|
|
| Attack Category | Specific Technique | ShieldX Layer | Module |
|
|
|-----------------|-------------------|---------------|--------|
|
|
| Character/encoding obfuscation | FlipAttack, ArtPrompt, Cipher | L0 Preprocessing | `UnicodeNormalizer.ts`, `TokenizerNormalizer.ts` |
|
|
| Structural encoding | StructTransform, CodeAttack, SQL Injection | L0 Preprocessing | `CompressedPayloadDetector.ts` |
|
|
| Keyword evasion (emoji) | Emoji Attack | L0 Preprocessing | `TokenizerNormalizer.ts` |
|
|
| Role-play / DAN | AutoDAN, DAN, DeepInception | L1 Detection | `RuleEngine.ts` (role-play rules) |
|
|
| Token injection | Virtual Context, Special Tokens | L1 Detection | `RuleEngine.ts`, `EntropyScanner.ts` |
|
|
| Many-shot / few-shot | Many-shot jailbreaking (MSJ) | L6 Behavioral | `ConversationTracker.ts`, `SessionProfiler.ts` |
|
|
| Multi-turn escalation | Crescendo, Foot-in-Door, Jigsaw | L6 Behavioral | `ConversationTracker.ts`, `ContextIntegrity.ts`, `AnomalyDetector.ts` |
|
|
| Gradient suffix (white-box) | GCG, AutoDAN, I-GCG | L1 Detection | `EntropyScanner.ts` (entropy spike) |
|
|
| RAG poisoning | Pandora, UnleashingWorms | L8 Validation | `RAGShield.ts`, `ScopeValidator.ts` |
|
|
| Attention shifting | Multi-turn attention manipulation | L6 Behavioral | `ContextDriftDetector.ts` |
|
|
| Refusal ablation | Single-direction refusal bypass | Future L2 | Needs hidden-state layer (see RCS above) |
|
|
| Low-resource language | Multilingual jailbreaks | L0 Preprocessing | `UnicodeNormalizer.ts` |
|
|
|
|
---
|
|
|
|
## 4. Awesome-LVLM-Safety — Key Defense Patterns
|
|
|
|
**Paper:** "A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations"
|
|
**arXiv:** 2502.14881
|
|
|
|
### Defense Taxonomy
|
|
|
|
#### 4.1 Training-Phase Defenses
|
|
- **Safety Fine-Tuning (VLGuard, SPA-VL):** Curate safety preference datasets, fine-tune with RLHF/DPO
|
|
- **Adversarial Training (ASTRA, DREAM):** Include adversarial examples in fine-tuning
|
|
- **Safe RLHF-V:** Multimodal extension of RLHF with explicit safety constraints
|
|
- **Machine Unlearning:** Remove harmful knowledge without full retraining (Single Image Unlearning)
|
|
- **Robust CLIP / Sim-CLIP:** Adversarially fine-tune vision encoder to resist perturbations
|
|
- **Backdoor Cleaning (2025 NeurIPS):** Remove backdoors without external guidance during fine-tuning
|
|
|
|
#### 4.2 Inference-Phase Defenses
|
|
- **ECSO (Eyes Closed, Safety On):** Convert image to text description before processing — removes adversarial visual features
|
|
- **AdaShield:** Adaptive shield prompting — dynamically inject safety prompts based on input structure
|
|
- **HiddenDetect (ACL'25):** Monitor hidden states at safety-critical layers during inference
|
|
- **RCS (this repo, arXiv 2512.12069):** Representational contrastive scoring for jailbreak detection
|
|
- **JailDAM (COLM'25):** Jailbreak detection with adaptive memory — stores representations of known attacks
|
|
- **MirrorCheck:** Adversarial defense via input mirroring and comparison
|
|
- **CIDER (EMNLP'24):** Cross-modality information check — verify consistency between image and text signals
|
|
- **PIP (MM'24):** Use attention patterns of irrelevant probe questions to detect adversarial inputs
|
|
- **ETA (ICLR'25):** Evaluate-then-align — runtime safety evaluation before generation
|
|
- **CoCA:** Constitutional calibration — realign safety-awareness at inference via constitutional rules
|
|
- **VLMGuard-R1 (2025):** Reasoning-driven prompt optimization for proactive safety
|
|
- **OmniGuard (2025):** Unified omni-modal guardrails with deliberate reasoning
|
|
- **InferAligner:** Cross-model guidance for harmlessness — use a reference safe model to steer generation
|
|
- **BlueSuffix (ICLR'25):** Adversarial blue-teaming — train model to be robust against jailbreaks
|
|
|
|
#### 4.3 Guard Models
|
|
- **LLaMA Guard 3 Vision (Meta):** Dedicated vision-language safety classifier
|
|
- **GuardReasoner-VL (NeurIPS'25):** Reasoning-based guard with reinforced chain-of-thought
|
|
- **LLavaGuard (ICML'25):** VLM-based dataset curation and safety assessment
|
|
- **VLMGuard:** Unlabeled data-based defense against malicious prompts
|
|
- **UniGuard:** Universal safety guardrail across modalities
|
|
|
|
#### 4.4 Evaluation Benchmarks
|
|
- **MM-SafetyBench (ECCV'24):** Multimodal safety evaluation benchmark
|
|
- **JailBreakV-28K (COLM'24):** 28K multimodal jailbreak samples
|
|
- **MMJ-Bench:** Comprehensive jailbreak evaluation for MLLMs
|
|
- **MLLMGuard:** Multi-dimensional safety evaluation suite
|
|
- **MOSSBench (ICLR'25):** Tests for oversensitivity to safe queries
|
|
|
|
### ShieldX Layer Mapping — Defense Patterns
|
|
|
|
| Defense Pattern | Method | ShieldX Layer | Module | Gap / Enhancement |
|
|
|-----------------|--------|---------------|--------|-------------------|
|
|
| Hidden state monitoring | HiddenDetect, RCS | L1 Detection (future L2) | `EntropyScanner.ts` → needs hidden-state hook | **Gap:** No hidden-state layer yet |
|
|
| Adaptive memory for attacks | JailDAM | L9 Learning | `EmbeddingStore.ts`, `PatternStore.ts` | Already partially implemented |
|
|
| Constitutional rules at inference | CoCA, AdaShield | L8 Validation | `IntentGuardValidator.ts`, `RoleIntegrityChecker.ts` | Could add constitutional rule set |
|
|
| Cross-modal consistency check | CIDER, MirrorCheck | L6 Behavioral | `ContextIntegrity.ts` | Extends to vision inputs |
|
|
| Guard model (dedicated classifier) | LLaMA Guard 3 Vision, GuardReasoner-VL | L1 Detection | `RuleEngine.ts` → could add LLM-guard integration | Ollama-based guard model possible |
|
|
| Reasoning-based safety | GuardReasoner, VLMGuard-R1 | L1 Detection | Could add CoT safety evaluation via Ollama | **Enhancement opportunity** |
|
|
| Adversarial prompt blue-teaming | BlueSuffix, MART | L9 Learning | `RedTeamEngine.ts`, `ActiveLearner.ts` | Already designed for this |
|
|
| Input-to-text conversion (visual) | ECSO | L0 Preprocessing | Would need vision-to-text preprocessing hook | Future vision support |
|
|
| Robust vision encoder | Robust CLIP, Sim-CLIP | L9 Supply Chain | `ModelProvenanceChecker.ts` | Could verify encoder provenance |
|
|
| Unlearning harmful knowledge | Machine Unlearning | L9 Learning | Not implemented — research item | **Gap** |
|
|
|
|
---
|
|
|
|
## 5. ShieldX Layer-by-Layer Integration Summary
|
|
|
|
ShieldX's current 10-layer pipeline and how the research maps to each:
|
|
|
|
| Layer | Name | Current Modules | Research Enhancements from sarendis56 |
|
|
|-------|------|-----------------|---------------------------------------|
|
|
| **L0** | Preprocessing | `UnicodeNormalizer`, `TokenizerNormalizer`, `CompressedPayloadDetector` | Add low-resource language normalization; cipher/encoding detection (ArtPrompt, FlipAttack patterns) |
|
|
| **L1** | Rule-based Detection | `RuleEngine`, `EntropyScanner`, `UnicodeScanner` | Add GCG suffix entropy patterns; DAN/DeepInception rule templates; typographic prompt patterns (FigStep) |
|
|
| **L2** | Semantic Layer | (EmbeddingStore in learning) | **Priority gap:** Add RCS-style hidden-state contrastive scoring for jailbreak detection |
|
|
| **L3** | Classification | (via RuleEngine + behavioral) | Integrate GuardReasoner-style CoT classification via Ollama LLM guard call |
|
|
| **L4** | Compliance | `ATLASMapper`, `OWASPMapper`, `EUAIActReporter` | Map new attack types to MITRE ATLAS; add JailBreakV-28K as test suite |
|
|
| **L5** | Sanitization | `InputSanitizer`, `OutputSanitizer`, `SpotlightingEncoder` | Add vision-space canary injection for LVLM inputs; delimiter hardening against structural attacks |
|
|
| **L6** | Behavioral | `ConversationTracker`, `IntentMonitor`, `ContextDriftDetector`, `KillChainMapper` | Add multi-turn escalation detection (Crescendo, Jigsaw, Foot-in-Door patterns); attention-shift detection |
|
|
| **L7** | MCP Guard | `PrivilegeChecker`, `ToolChainGuard`, `ResourceGovernor`, `ToolPoisonDetector` | Add Agent Smith multi-agent viral spread detection; resource exhaustion from Verbose Images attack class |
|
|
| **L8** | Validation | `RAGShield`, `ScopeValidator`, `IntentGuardValidator`, `LeakageDetector` | Add RAG poison detection (Pandora, UnleashingWorms patterns); cross-modal consistency check (CIDER) |
|
|
| **L9** | Learning / Supply Chain | `PatternEvolver`, `RedTeamEngine`, `ActiveLearner`, `SupplyChainVerifier` | Feed JailBreakV-28K, MM-SafetyBench into PatternEvolver; add backdoor/trojan model detection (TrojVLM) |
|
|
|
|
---
|
|
|
|
## 6. Priority Action Items for ShieldX
|
|
|
|
### High Priority
|
|
1. **Hidden-State Layer (L2):** The RCS paper (this exact repo) demonstrates that surface-text detection misses many jailbreaks. ShieldX needs an embedding/hidden-state analysis layer. Implement via `EmbeddingStore.ts` + pgvector similarity search using known-harmful representation clusters.
|
|
2. **Multi-turn Escalation Detection (L6):** Crescendo, Jigsaw Puzzles, and Foot-in-the-Door are proven against production systems. `ConversationTracker.ts` needs escalation-pattern scoring across session turns, not just per-message analysis.
|
|
3. **Cipher/Encoding Preprocessor (L0):** FlipAttack, ArtPrompt, CodeChameleon, CipherChat all bypass text-level rules. `TokenizerNormalizer.ts` should add cipher detection and normalization.
|
|
|
|
### Medium Priority
|
|
4. **RAG Poison Shield Enhancement (L8):** `RAGShield.ts` should include retrieval-result anomaly scoring based on Pandora and UnleashingWorms patterns.
|
|
5. **GuardReasoner-style CoT Check (L3):** Add an optional Ollama-based reasoning guard step that evaluates intent via chain-of-thought before allowing high-risk operations.
|
|
6. **Agent Smith Pattern (L7):** `ToolChainGuard.ts` should detect exponential replication patterns in multi-agent tool calls — a key emerging threat.
|
|
|
|
### Research / Future
|
|
7. **Vision Input Support:** ECSO, RCS, and CIDER all address multimodal inputs. If ShieldX expands to guard vision-language agents, these are the starting points.
|
|
8. **Machine Unlearning Integration:** Not currently in ShieldX — would allow removal of specific harmful patterns without retraining the guard model.
|
|
|
|
---
|
|
|
|
## 7. Key Papers to Read
|
|
|
|
| Paper | Why | arXiv |
|
|
|-------|-----|-------|
|
|
| RCS (Jailbreak_Detection_RCS) | Core detection method, directly integrable | 2512.12069 |
|
|
| HiddenDetect (ACL'25) | Best prior work on hidden-state detection | 2502.14744 |
|
|
| Agent Smith (ICML'24) | Multi-agent viral spread — critical for agentic ShieldX | 2402.08567 |
|
|
| GCG (Universal Adversarial Attacks) | Foundational white-box attack, defines entropy patterns | 2307.15043 |
|
|
| Crescendo (Microsoft Azure) | Multi-turn escalation — most realistic production threat | 2404.01833 |
|
|
| GuardReasoner (ICLR Workshop'25) | Best current reasoning-based guard | 2501.18492 |
|
|
| JailBreakV-28K (COLM'24) | Primary evaluation benchmark for multimodal | 2404.03027 |
|
|
| FlipAttack (ICML'25) | Trivially bypasses keyword detection — should be in L0 test suite | 2410.02832 |
|
|
| SMOOTHLLM | Randomized smoothing defense — certifiable robustness | 2310.03684 |
|
|
| PAIR (NeurIPS'24) | Automated red teaming — maps to `RedTeamEngine.ts` | 2310.08419 |
|
|
|
|
---
|
|
|
|
*Reference created: 2026-04-04*
|
|
*Source repos: /Users/renefichtmueller/Desktop/Claude Code/github-repos/Jailbreak_Detection_RCS, Awesome-Jailbreak-on-LLMs, Awesome-LVLM-Attack, Awesome-LVLM-Safety*
|
|
*ShieldX path: /Users/renefichtmueller/shieldx/*
|