Rene Fichtmueller a3793a1357 feat: ShieldX v0.1.0 — Self-Evolving LLM Prompt Injection Defense

10-layer defense pipeline with kill chain mapping, self-healing,
self-learning, and compliance reporting. Local-first, zero cloud deps.

- 72 detection rules across 7 kill chain phases
- 294 unit tests, 500+ attack corpus samples
- Management dashboard (Next.js 15, 10 pages)
- Automated resistance testing (2x daily, 31 probes)
- MITRE ATLAS, OWASP LLM Top 10, EU AI Act compliance
- Integrations: Next.js middleware, Ollama, n8n
- PostgreSQL 17 + pgvector for persistent learning

2026-03-27 15:07:27 +13:00

13 KiB

Raw Blame History

Self-Evolution Engine

Overview

ShieldX models its self-learning system on biological immune systems. The defense evolves continuously without manual rule updates. Five mechanisms work together: innate immunity (static rules), adaptive immunity (ML classifiers), immune memory (vector database), antibody generation (GAN red team), and herd immunity (federated sync).

All evolution happens locally by default. No data leaves your infrastructure unless you explicitly enable community sync.

Architecture

                    New Scan Results
                          |
            +-------------+-------------+
            |             |             |
            v             v             v
     [Feedback       [Drift         [Attack
      Processor]      Detector]      Graph]
            |             |             |
            v             v             v
     [Active         [Threshold     [Pattern
      Learner]        Adaptor]       Evolver]
            |             |             |
            +------+------+------+-----+
                   |             |
                   v             v
            [Pattern Store]  [Embedding Store]
                   |             |
                   +------+------+
                          |
                   +------+------+
                   |             |
                   v             v
            [Red Team       [Federated
             Engine]         Sync]

1. Innate Immunity (Static Rules)

Concept

Like the body's innate immune system (skin, mucous membranes, white blood cells), innate immunity provides immediate, non-specific defense against known threats. These rules are present from installation and never change at runtime.

Implementation

The RuleEngine loads 500+ regex patterns from the seed database. These patterns are organized by:

Kill chain phase: each pattern maps to a specific phase
Severity: default threat level for the pattern
Category: injection type (role override, delimiter manipulation, encoding trick, etc.)

Patterns are compiled once at initialization and evaluated sequentially with short-circuit on first critical match.

Characteristics

Property	Value
Latency	<2ms for 500+ patterns
False positive rate	Low (patterns are precise, not probabilistic)
Evasion resistance	Low (attackers can rephrase to avoid regex)
Update mechanism	Seed script (`npm run db:seed`)

Strengths and Limitations

Strengths:

Zero latency overhead
Deterministic, auditable, explainable
No external dependencies
Catches the majority of unsophisticated attacks

Limitations:

Cannot detect novel or paraphrased attacks
Regex patterns are brittle against encoding tricks (handled by L0 preprocessing)
Cannot capture semantic meaning

2. Adaptive Immunity (ML Classifiers)

Concept

Like T-cells and B-cells that learn to recognize specific pathogens, adaptive immunity develops targeted defenses against attacks that bypass static rules. These classifiers improve over time through exposure to new attack patterns and feedback.

Implementation

Sentinel Classifier (L2): Binary classifier trained on labeled examples of benign and malicious prompts. Outputs a confidence score that maps to threat levels via configurable thresholds.

Active Learner (src/learning/ActiveLearner.ts): Identifies samples near the classifier's decision boundary -- inputs where the model is most uncertain. These samples are the most valuable for improving the classifier and are prioritized for human review.

Feedback Processor (src/learning/FeedbackProcessor.ts): Processes submitFeedback() calls to refine classifier weights. True positives reinforce existing patterns. False positives adjust the decision boundary to avoid future misclassification.

Threshold Adaptor (src/learning/ThresholdAdaptor.ts): Dynamically adjusts confidence thresholds based on observed false positive and false negative rates. If the false positive rate exceeds a configurable target, thresholds are raised. If the false negative rate increases (detected through red team testing), thresholds are lowered.

Learning Loop

User Input -> Scan Pipeline -> ShieldXResult
                                    |
                              User Feedback
                              (true/false positive)
                                    |
                            Feedback Processor
                                    |
                    +---------------+---------------+
                    |               |               |
              Pattern Store   Classifier Weights  Thresholds
              (new/refined    (retrained on       (adjusted by
               patterns)       feedback)           ThresholdAdaptor)

Characteristics

Property	Value
Latency	<10ms per classification
False positive rate	Adaptive (adjusts via feedback)
Evasion resistance	Medium (learns from confirmed attacks)
Update mechanism	Continuous via feedback loop

3. Immune Memory (Vector Database)

Concept

Like immunological memory that enables faster response to previously encountered pathogens, the embedding store provides long-term memory of every attack pattern ShieldX has seen. New inputs are compared against this memory for semantic similarity, catching paraphrased variants.

Implementation

Embedding Store (src/learning/EmbeddingStore.ts): Stores attack pattern embeddings in PostgreSQL with pgvector. Each embedding is associated with its kill chain phase, severity, scanner origin, and confirmation status.

Semantic Similarity: New inputs are embedded (via Ollama) and compared against stored attack vectors using cosine similarity. A match above the configured threshold triggers detection even if no regex pattern or classifier fires.

Conversation Learner (src/learning/ConversationLearner.ts): Learns from conversation-level attack patterns -- multi-turn sequences that individually appear benign but collectively form an attack. Stores conversation fingerprints, not individual messages.

Storage Schema

Pattern Record:
  id: string
  embedding: float[] (pgvector)
  killChainPhase: KillChainPhase
  severity: ThreatLevel
  source: 'builtin' | 'learned' | 'community' | 'red_team'
  confirmedBy: 'human' | 'classifier' | 'red_team' | null
  createdAt: timestamp
  lastMatchedAt: timestamp
  matchCount: number
  falsePositiveCount: number

Characteristics

Property	Value
Latency	<200ms (embedding generation + similarity search)
False positive rate	Medium (semantic similarity can match unrelated content)
Evasion resistance	High (semantic meaning is preserved across paraphrasing)
Update mechanism	Continuous -- new confirmed patterns added automatically

4. Antibody Generation (GAN Red Team)

Concept

Like the immune system generating antibodies to neutralize specific pathogens, the red team engine proactively generates new attack variants to test the defense pipeline before real attackers discover those variants.

Implementation

Red Team Engine (src/learning/RedTeamEngine.ts): Takes known attack patterns and generates variants using adversarial mutation strategies:

Mutation Strategy	Description
Synonym replacement	Replaces key terms with synonyms that preserve attack intent
Encoding shift	Re-encodes payloads using different encoding schemes
Structural rearrangement	Changes the order of injection components
Delimiter mutation	Uses different delimiter styles
Language mixing	Introduces multilingual elements
Token splitting	Splits critical words across token boundaries
Homoglyph substitution	Replaces characters with visually similar Unicode variants
Case manipulation	Changes capitalization patterns

Pattern Evolver (src/learning/PatternEvolver.ts): Orchestrates the red team process:

Select a set of known attack patterns from the pattern store
Generate N variants per pattern using mutation strategies
Run each variant through the full ShieldX pipeline
Variants that bypass detection are flagged as "gap patterns"
Gap patterns are added to the pattern store with source 'red_team'
The rule engine and classifiers are updated to detect the new patterns

Red Team Cycle

Known Patterns --> [Mutation Engine] --> Variant Attacks
                                              |
                                     [ShieldX Pipeline]
                                              |
                                  +--------+--------+
                                  |                 |
                              Detected          Bypassed
                              (good)            (gap found!)
                                                    |
                                            [Pattern Store]
                                            [Classifier Update]
                                            [Embedding Store]

Self-Test

The npm run self-test command executes a full red team cycle against the current pipeline and reports:

Total variants generated
Attack success rate (ASR) -- percentage that bypassed detection
New gap patterns discovered
Pipeline coverage improvement after adding gap patterns

Characteristics

Property	Value
Execution frequency	Configurable (default: weekly batch, or on-demand)
Variants per pattern	Configurable (default: 50)
Gap discovery rate	Varies (typically 5-15% of variants bypass detection)
Update mechanism	Automatic -- gap patterns added to stores immediately

5. Herd Immunity (Federated Sync)

Concept

Like herd immunity in a population, where widespread vaccination protects even unvaccinated individuals, federated sync allows ShieldX instances to share anonymized pattern intelligence. An attack detected by one deployment strengthens all others.

Implementation

Federated Sync (src/learning/FederatedSync.ts): Manages bidirectional sync with the community endpoint.

What is Shared

Data	Shared	Format
Attack pattern hash	Yes	SHA-256 of normalized pattern
Kill chain phase	Yes	Phase enum value
Severity level	Yes	Threat level enum value
Scanner type	Yes	Scanner ID that detected it
Confidence score	Yes	Anonymized (rounded to 0.1)
Pattern category	Yes	Category tag
Raw user input	NEVER	Not transmitted
Session ID	NEVER	Not transmitted
User ID	NEVER	Not transmitted
System prompt	NEVER	Not transmitted
IP address	NEVER	Not transmitted
Conversation context	NEVER	Not transmitted

Sync Protocol

Push: After a pattern is confirmed (via feedback or red team), a sync record is created containing only the hash, phase, severity, and category. This is sent to the community endpoint.
Pull: Periodically (configurable interval), the instance fetches new community patterns. These are stored with source 'community' and require local confirmation before they affect detection thresholds.
Conflict Resolution: If a local pattern conflicts with a community pattern (different severity or phase), the local classification takes precedence. Community patterns serve as additional signals, not overrides.

Enabling Community Sync

const shield = new ShieldX({
  learning: {
    communitySync: true,
    communitySyncUrl: 'https://sync.shieldx.dev/v1/patterns',
  },
})

Characteristics

Property	Value
Default state	Disabled (opt-in only)
Sync interval	Configurable (default: 1 hour)
Data transmitted	Hashes and metadata only
Privacy guarantee	No raw input ever leaves the instance

Supporting Components

Drift Detector

Module: src/learning/DriftDetector.ts

Monitors the distribution of detected attack patterns over time. Detects concept drift -- when the nature of attacks changes and existing patterns become less effective.

Drift indicators:

Rising false negative rate (detected through red team testing)
Shift in kill chain phase distribution
New scanner types triggering that previously did not
Declining confidence scores for existing patterns

When drift is detected, the DriftReport triggers:

Increased red team frequency
Threshold recalibration
Active learning sample prioritization
Alert to operators

Attack Graph

Module: src/learning/AttackGraph.ts

Builds a directed graph of attack patterns and their relationships. Nodes represent individual attack patterns. Edges represent observed progressions (e.g., an initial_access pattern followed by privilege_escalation).

The graph enables:

Predictive detection: if phase 1 of a known attack chain is detected, pre-emptively guard against the expected phase 2
Attack campaign identification: correlate related attacks across sessions
Pattern clustering: identify families of related attack techniques

Evolution Metrics

The getStats() method on the ShieldX instance returns LearningStats:

interface LearningStats {
  totalPatterns: number
  builtinPatterns: number
  learnedPatterns: number
  communityPatterns: number
  redTeamPatterns: number
  totalIncidents: number
  falsePositiveRate: number
  topPatterns: string[]
  recentIncidents: number
  driftDetected: boolean
}

These metrics provide visibility into the evolution engine's state and effectiveness.

13 KiB Raw Blame History

Self-Evolution Engine

Overview

Architecture

1. Innate Immunity (Static Rules)

Concept

Implementation

Characteristics

Strengths and Limitations

2. Adaptive Immunity (ML Classifiers)

Concept

Implementation

Learning Loop

Characteristics

3. Immune Memory (Vector Database)

Concept

Implementation

Storage Schema

Characteristics

4. Antibody Generation (GAN Red Team)

Concept

Implementation

Red Team Cycle

Self-Test

Characteristics

5. Herd Immunity (Federated Sync)

Concept

Implementation

What is Shared

Sync Protocol

Enabling Community Sync

Characteristics

Supporting Components

Drift Detector

Attack Graph

Evolution Metrics

13 KiB

Raw Blame History