shieldx/docs/self-evolution.md
Rene Fichtmueller a3793a1357 feat: ShieldX v0.1.0 — Self-Evolving LLM Prompt Injection Defense
10-layer defense pipeline with kill chain mapping, self-healing,
self-learning, and compliance reporting. Local-first, zero cloud deps.

- 72 detection rules across 7 kill chain phases
- 294 unit tests, 500+ attack corpus samples
- Management dashboard (Next.js 15, 10 pages)
- Automated resistance testing (2x daily, 31 probes)
- MITRE ATLAS, OWASP LLM Top 10, EU AI Act compliance
- Integrations: Next.js middleware, Ollama, n8n
- PostgreSQL 17 + pgvector for persistent learning
2026-03-27 15:07:27 +13:00


Self-Evolution Engine

Overview

ShieldX models its self-learning system on biological immune systems. The defense evolves continuously without manual rule updates. Five mechanisms work together: innate immunity (static rules), adaptive immunity (ML classifiers), immune memory (vector database), antibody generation (GAN red team), and herd immunity (federated sync).

All evolution happens locally by default. No data leaves your infrastructure unless you explicitly enable community sync.

Architecture

                    New Scan Results
                          |
            +-------------+-------------+
            |             |             |
            v             v             v
     [Feedback       [Drift         [Attack
      Processor]      Detector]      Graph]
            |             |             |
            v             v             v
     [Active         [Threshold     [Pattern
      Learner]        Adaptor]       Evolver]
            |             |             |
            +------+------+------+-----+
                   |             |
                   v             v
            [Pattern Store]  [Embedding Store]
                   |             |
                   +------+------+
                          |
                   +------+------+
                   |             |
                   v             v
            [Red Team       [Federated
             Engine]         Sync]

1. Innate Immunity (Static Rules)

Concept

Like the body's innate immune system (skin, mucous membranes, white blood cells), the static rule layer provides immediate, non-specific defense against known threats. These rules are present from installation and never change at runtime.

Implementation

The RuleEngine loads 500+ regex patterns from the seed database. These patterns are organized by:

  • Kill chain phase: each pattern maps to a specific phase
  • Severity: default threat level for the pattern
  • Category: injection type (role override, delimiter manipulation, encoding trick, etc.)

Patterns are compiled once at initialization and evaluated sequentially with short-circuit on first critical match.
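
The compile-once, short-circuit evaluation described above can be sketched as follows. This is a minimal illustration, not the actual RuleEngine: the `Rule` shape, the `evaluate` function, and the two sample patterns are all hypothetical.

```typescript
// Hypothetical rule shape; field names are illustrative, not the ShieldX schema.
type Severity = 'low' | 'medium' | 'high' | 'critical'

interface Rule {
  id: string
  pattern: RegExp // compiled once, when the rule list is built
  phase: string
  severity: Severity
  category: string
}

// Built once at initialization; the RegExp literals are compiled here,
// not on every scan. No `g` flag, so `test()` carries no state between calls.
const rules: Rule[] = [
  {
    id: 'r1',
    pattern: /ignore (all )?previous instructions/i,
    phase: 'initial_access',
    severity: 'critical',
    category: 'role_override',
  },
  {
    id: 'r2',
    pattern: /you are now\b/i,
    phase: 'initial_access',
    severity: 'high',
    category: 'role_override',
  },
]

// Evaluate rules in order and stop at the first critical match.
function evaluate(input: string, ruleSet: Rule[]): Rule[] {
  const matches: Rule[] = []
  for (const rule of ruleSet) {
    if (rule.pattern.test(input)) {
      matches.push(rule)
      if (rule.severity === 'critical') break // short-circuit
    }
  }
  return matches
}
```

Because evaluation stops at the first critical match, rule order determines which non-critical matches are reported alongside it.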

Characteristics

| Property | Value |
| --- | --- |
| Latency | <2ms for 500+ patterns |
| False positive rate | Low (patterns are precise, not probabilistic) |
| Evasion resistance | Low (attackers can rephrase to avoid regex) |
| Update mechanism | Seed script (npm run db:seed) |

Strengths and Limitations

Strengths:

  • Negligible latency overhead (under 2ms)
  • Deterministic, auditable, explainable
  • No external dependencies
  • Catches the majority of unsophisticated attacks

Limitations:

  • Cannot detect novel or paraphrased attacks
  • Regex patterns are brittle against encoding tricks (mitigated by L0 preprocessing, which runs before the rules)
  • Cannot capture semantic meaning

2. Adaptive Immunity (ML Classifiers)

Concept

Like T-cells and B-cells that learn to recognize specific pathogens, adaptive immunity develops targeted defenses against attacks that bypass static rules. These classifiers improve over time through exposure to new attack patterns and feedback.

Implementation

Sentinel Classifier (L2): Binary classifier trained on labeled examples of benign and malicious prompts. Outputs a confidence score that maps to threat levels via configurable thresholds.
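
As a rough sketch of how a confidence score might map to a threat level, assuming a hypothetical `ThreatLevel` union and illustrative default thresholds (neither the level names nor the numbers come from the ShieldX source):

```typescript
// Hypothetical threat levels and thresholds, for illustration only.
type ThreatLevel = 'none' | 'low' | 'medium' | 'high' | 'critical'

interface ThreatThresholds {
  low: number
  medium: number
  high: number
  critical: number
}

const defaultThresholds: ThreatThresholds = {
  low: 0.3,
  medium: 0.5,
  high: 0.7,
  critical: 0.9,
}

// Map a classifier confidence in [0, 1] to the highest level whose
// threshold it meets or exceeds.
function toThreatLevel(
  confidence: number,
  t: ThreatThresholds = defaultThresholds,
): ThreatLevel {
  if (confidence >= t.critical) return 'critical'
  if (confidence >= t.high) return 'high'
  if (confidence >= t.medium) return 'medium'
  if (confidence >= t.low) return 'low'
  return 'none'
}
```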

Active Learner (src/learning/ActiveLearner.ts): Identifies samples near the classifier's decision boundary -- inputs where the model is most uncertain. These samples are the most valuable for improving the classifier and are prioritized for human review.
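
Selecting samples near the decision boundary is commonly called uncertainty sampling. A minimal sketch, assuming a hypothetical `ScoredSample` shape and a boundary of 0.5 (the real ActiveLearner may use a different criterion):

```typescript
// Hypothetical shape: `score` is the classifier's confidence in [0, 1].
interface ScoredSample {
  id: string
  score: number
}

// Rank samples by distance from the decision boundary (closest first)
// and return the k most uncertain ones for human review.
function selectForReview(
  samples: ScoredSample[],
  boundary = 0.5,
  k = 2,
): ScoredSample[] {
  return [...samples]
    .sort(
      (a, b) => Math.abs(a.score - boundary) - Math.abs(b.score - boundary),
    )
    .slice(0, k)
}
```

The input array is copied before sorting, so the caller's ordering is left untouched.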

Feedback Processor (src/learning/FeedbackProcessor.ts): Processes submitFeedback() calls to refine classifier weights. True positives reinforce existing patterns. False positives adjust the decision boundary to avoid future misclassification.

Threshold Adaptor (src/learning/ThresholdAdaptor.ts): Dynamically adjusts confidence thresholds based on observed false positive and false negative rates. If the false positive rate exceeds a configurable target, thresholds are raised. If the false negative rate increases (detected through red team testing), thresholds are lowered.
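
The adjustment rule can be illustrated with a small step-based sketch. The `Rates` and `Targets` shapes, the step size, and the clamping bounds are assumptions made for this example; the text above describes only the direction of each adjustment.

```typescript
// Hypothetical inputs: observed rates and operator-configured targets.
interface Rates {
  falsePositiveRate: number
  falseNegativeRate: number
}

interface Targets {
  maxFalsePositiveRate: number
  maxFalseNegativeRate: number
}

// Raise the threshold when false positives exceed the target (flag less),
// lower it when false negatives exceed the target (flag more).
// If both exceed their targets, this sketch prioritizes false positives,
// which is an arbitrary choice for illustration.
function adaptThreshold(
  current: number,
  rates: Rates,
  targets: Targets,
  step = 0.02,
): number {
  let next = current
  if (rates.falsePositiveRate > targets.maxFalsePositiveRate) {
    next += step
  } else if (rates.falseNegativeRate > targets.maxFalseNegativeRate) {
    next -= step
  }
  // Clamp so the threshold never disables or saturates detection.
  return Math.min(0.95, Math.max(0.05, next))
}
```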

Learning Loop

User Input -> Scan Pipeline -> ShieldXResult
                                    |
                              User Feedback
                              (true/false positive)
                                    |
                            Feedback Processor
                                    |
                    +---------------+---------------+
                    |               |               |
              Pattern Store   Classifier Weights  Thresholds
              (new/refined    (retrained on       (adjusted by
               patterns)       feedback)           ThresholdAdaptor)

Characteristics

| Property | Value |
| --- | --- |
| Latency | <10ms per classification |
| False positive rate | Adaptive (adjusts via feedback) |
| Evasion resistance | Medium (learns from confirmed attacks) |
| Update mechanism | Continuous via feedback loop |

3. Immune Memory (Vector Database)

Concept

Like immunological memory that enables faster response to previously encountered pathogens, the embedding store provides long-term memory of every attack pattern ShieldX has seen. New inputs are compared against this memory for semantic similarity, catching paraphrased variants.

Implementation

Embedding Store (src/learning/EmbeddingStore.ts): Stores attack pattern embeddings in PostgreSQL with pgvector. Each embedding is associated with its kill chain phase, severity, scanner origin, and confirmation status.

Semantic Similarity: New inputs are embedded (via Ollama) and compared against stored attack vectors using cosine similarity. A match above the configured threshold triggers detection even if no regex pattern or classifier fires.
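
For illustration, the similarity check can be written in memory as below. The `StoredPattern` shape and `findMatch` function are hypothetical, and the embedding step (via Ollama) is assumed to have already produced the input vector.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Hypothetical in-memory stand-in for rows in the embedding store.
interface StoredPattern {
  id: string
  embedding: number[]
}

// Return the best match at or above the threshold, or null if none qualifies.
function findMatch(
  input: number[],
  store: StoredPattern[],
  threshold: number,
): { id: string; similarity: number } | null {
  let best: { id: string; similarity: number } | null = null
  for (const p of store) {
    const similarity = cosineSimilarity(input, p.embedding)
    if (similarity >= threshold && (best === null || similarity > best.similarity)) {
      best = { id: p.id, similarity }
    }
  }
  return best
}
```

With pgvector, the same comparison is normally pushed into SQL: the `<=>` operator returns cosine distance, so similarity is `1 - distance`.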

Conversation Learner (src/learning/ConversationLearner.ts): Learns from conversation-level attack patterns -- multi-turn sequences that individually appear benign but collectively form an attack. Stores conversation fingerprints, not individual messages.

Storage Schema

Pattern Record:
  id: string
  embedding: float[] (pgvector)
  killChainPhase: KillChainPhase
  severity: ThreatLevel
  source: 'builtin' | 'learned' | 'community' | 'red_team'
  confirmedBy: 'human' | 'classifier' | 'red_team' | null
  createdAt: timestamp
  lastMatchedAt: timestamp
  matchCount: number
  falsePositiveCount: number

Characteristics

| Property | Value |
| --- | --- |
| Latency | <200ms (embedding generation + similarity search) |
| False positive rate | Medium (semantic similarity can match unrelated content) |
| Evasion resistance | High (semantic meaning is preserved across paraphrasing) |
| Update mechanism | Continuous -- new confirmed patterns added automatically |

4. Antibody Generation (GAN Red Team)

Concept

Like the immune system generating antibodies to neutralize specific pathogens, the red team engine proactively generates new attack variants to test the defense pipeline before real attackers discover those variants.

Implementation

Red Team Engine (src/learning/RedTeamEngine.ts): Takes known attack patterns and generates variants using adversarial mutation strategies:

| Mutation Strategy | Description |
| --- | --- |
| Synonym replacement | Replaces key terms with synonyms that preserve attack intent |
| Encoding shift | Re-encodes payloads using different encoding schemes |
| Structural rearrangement | Changes the order of injection components |
| Delimiter mutation | Uses different delimiter styles |
| Language mixing | Introduces multilingual elements |
| Token splitting | Splits critical words across token boundaries |
| Homoglyph substitution | Replaces characters with visually similar Unicode variants |
| Case manipulation | Changes capitalization patterns |
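
Three of these strategies are simple enough to sketch as pure string functions. The `Mutation` type, the function names, and the specific character choices below are hypothetical; the actual RedTeamEngine may implement them differently.

```typescript
// A mutation takes a payload and returns a variant of it.
type Mutation = (payload: string) => string

// Case manipulation: alternate upper and lower case per character.
const alternateCase: Mutation = (p) =>
  p
    .split('')
    .map((ch, i) => (i % 2 === 0 ? ch.toUpperCase() : ch.toLowerCase()))
    .join('')

// Homoglyph substitution: swap selected Latin letters for Cyrillic look-alikes
// (U+0430, U+0435, U+043E).
const HOMOGLYPHS = new Map<string, string>([
  ['a', '\u0430'],
  ['e', '\u0435'],
  ['o', '\u043e'],
])
const substituteHomoglyphs: Mutation = (p) =>
  p
    .split('')
    .map((ch) => HOMOGLYPHS.get(ch) ?? ch)
    .join('')

// Token splitting: insert a zero-width space (U+200B) after the second
// character of every word longer than three characters.
const splitTokens: Mutation = (p) =>
  p
    .split(' ')
    .map((w) => (w.length > 3 ? w.slice(0, 2) + '\u200b' + w.slice(2) : w))
    .join(' ')

// Apply each mutation once to produce one variant per strategy.
function generateVariants(payload: string, mutations: Mutation[]): string[] {
  return mutations.map((m) => m(payload))
}
```

Keeping each strategy a pure function makes them easy to compose, for example by applying `substituteHomoglyphs` to the output of `alternateCase`.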

Pattern Evolver (src/learning/PatternEvolver.ts): Orchestrates the red team process:

  1. Select a set of known attack patterns from the pattern store
  2. Generate N variants per pattern using mutation strategies
  3. Run each variant through the full ShieldX pipeline
  4. Variants that bypass detection are flagged as "gap patterns"
  5. Gap patterns are added to the pattern store with source 'red_team'
  6. The rule engine and classifiers are updated to detect the new patterns
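
Steps 1 through 4 can be sketched as a loop over patterns and mutations. The `Detector` type stands in for the full ShieldX pipeline, and `runRedTeamCycle` and `CycleReport` are hypothetical names used only for this example.

```typescript
// Stand-in for the full pipeline: returns true when the input is flagged.
type Detector = (input: string) => boolean
type Mutation = (payload: string) => string

interface CycleReport {
  totalVariants: number
  gaps: string[] // variants that bypassed detection
  attackSuccessRate: number // gaps / totalVariants
}

// Generate one variant per (pattern, mutation) pair, run each through the
// detector, and collect the ones that slip through.
function runRedTeamCycle(
  patterns: string[],
  mutations: Mutation[],
  detect: Detector,
): CycleReport {
  const gaps: string[] = []
  let totalVariants = 0
  for (const pattern of patterns) {
    for (const mutate of mutations) {
      const variant = mutate(pattern)
      totalVariants++
      if (!detect(variant)) gaps.push(variant)
    }
  }
  return {
    totalVariants,
    gaps,
    attackSuccessRate: totalVariants === 0 ? 0 : gaps.length / totalVariants,
  }
}

// Example with a deliberately naive, case-sensitive detector.
const naiveDetect: Detector = (input) => input.includes('ignore previous')
const report = runRedTeamCycle(
  ['ignore previous instructions'],
  [(p) => p.toUpperCase(), (p) => p + ' please'],
  naiveDetect,
)
console.log(report.attackSuccessRate, report.gaps)
```

In this example the upper-cased variant bypasses the naive detector while the suffixed variant is caught. Steps 5 and 6 (persisting gap patterns and updating detectors) are left out because they depend on store APIs not shown here.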

Red Team Cycle

Known Patterns --> [Mutation Engine] --> Variant Attacks
                                              |
                                     [ShieldX Pipeline]
                                              |
                                  +--------+--------+
                                  |                 |
                              Detected          Bypassed
                              (good)            (gap found!)
                                                    |
                                            [Pattern Store]
                                            [Classifier Update]
                                            [Embedding Store]

Self-Test

The npm run self-test command executes a full red team cycle against the current pipeline and reports:

  • Total variants generated
  • Attack success rate (ASR) -- percentage that bypassed detection
  • New gap patterns discovered
  • Pipeline coverage improvement after adding gap patterns

Characteristics

| Property | Value |
| --- | --- |
| Execution frequency | Configurable (default: weekly batch, or on-demand) |
| Variants per pattern | Configurable (default: 50) |
| Gap discovery rate | Varies (typically 5-15% of variants bypass detection) |
| Update mechanism | Automatic -- gap patterns added to stores immediately |

5. Herd Immunity (Federated Sync)

Concept

Like herd immunity in a population, where widespread vaccination protects even unvaccinated individuals, federated sync allows ShieldX instances to share anonymized pattern intelligence. An attack detected by one deployment strengthens all others.

Implementation

Federated Sync (src/learning/FederatedSync.ts): Manages bidirectional sync with the community endpoint.

What is Shared

| Data | Shared | Format |
| --- | --- | --- |
| Attack pattern hash | Yes | SHA-256 of normalized pattern |
| Kill chain phase | Yes | Phase enum value |
| Severity level | Yes | Threat level enum value |
| Scanner type | Yes | Scanner ID that detected it |
| Confidence score | Yes | Anonymized (rounded to 0.1) |
| Pattern category | Yes | Category tag |
| Raw user input | NEVER | Not transmitted |
| Session ID | NEVER | Not transmitted |
| User ID | NEVER | Not transmitted |
| System prompt | NEVER | Not transmitted |
| IP address | NEVER | Not transmitted |
| Conversation context | NEVER | Not transmitted |

Sync Protocol

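  1. Push: After a pattern is confirmed (via feedback or red team), a sync record is created containing only the fields listed as shared under "What is Shared": pattern hash, phase, severity, scanner type, rounded confidence score, and category. This record is sent to the community endpoint.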

  2. Pull: Periodically (configurable interval), the instance fetches new community patterns. These are stored with source 'community' and require local confirmation before they affect detection thresholds.

  3. Conflict Resolution: If a local pattern conflicts with a community pattern (different severity or phase), the local classification takes precedence. Community patterns serve as additional signals, not overrides.
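
A minimal sketch of the push record and the conflict rule, using Node's built-in `crypto` module. The `ConfirmedPattern` and `SyncRecord` shapes and both function names are hypothetical; the field set follows the "What is Shared" table.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical local record; rawInput is included only to show it is dropped.
interface ConfirmedPattern {
  normalizedPattern: string
  phase: string
  severity: string
  scannerId: string
  confidence: number
  category: string
  rawInput: string
}

// Hypothetical wire format: hashes and metadata only.
interface SyncRecord {
  patternHash: string
  phase: string
  severity: string
  scannerId: string
  confidence: number
  category: string
}

// Build the record field by field (no object spread) so that nothing
// outside the shared set can leak into the payload.
function toSyncRecord(p: ConfirmedPattern): SyncRecord {
  return {
    patternHash: createHash('sha256')
      .update(p.normalizedPattern, 'utf8')
      .digest('hex'),
    phase: p.phase,
    severity: p.severity,
    scannerId: p.scannerId,
    confidence: Math.round(p.confidence * 10) / 10, // coarsen to 0.1
    category: p.category,
  }
}

// Local classification wins; a community record is used only when no
// local record exists for the same hash.
function resolveConflict(
  local: SyncRecord | undefined,
  community: SyncRecord,
): SyncRecord {
  return local ?? community
}
```

Listing fields explicitly rather than copying the source object is a deliberate choice here: a field added to the local record later cannot reach the community endpoint by accident.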

Enabling Community Sync

```typescript
// Community sync is disabled by default; set communitySync to opt in.
const shield = new ShieldX({
  learning: {
    communitySync: true,
    communitySyncUrl: 'https://sync.shieldx.dev/v1/patterns',
  },
})
```

Characteristics

| Property | Value |
| --- | --- |
| Default state | Disabled (opt-in only) |
| Sync interval | Configurable (default: 1 hour) |
| Data transmitted | Hashes and metadata only |
| Privacy guarantee | No raw input ever leaves the instance |

Supporting Components

Drift Detector

Module: src/learning/DriftDetector.ts

Monitors the distribution of detected attack patterns over time. Detects concept drift -- when the nature of attacks changes and existing patterns become less effective.

Drift indicators:

  • Rising false negative rate (detected through red team testing)
  • Shift in kill chain phase distribution
  • New scanner types triggering that previously did not
  • Declining confidence scores for existing patterns

When drift is detected, the DriftReport triggers:

  • Increased red team frequency
  • Threshold recalibration
  • Active learning sample prioritization
  • Alert to operators
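
One of the indicators above, a shift in kill chain phase distribution, can be quantified with total variation distance between a baseline window and a recent window. This is one possible metric chosen for illustration, not necessarily what DriftDetector uses; the function names and the 0.2 threshold are assumptions.

```typescript
// Detection counts per kill chain phase within a time window.
type PhaseCounts = Record<string, number>

function total(counts: PhaseCounts): number {
  return Object.keys(counts).reduce((sum, k) => sum + (counts[k] ?? 0), 0)
}

// Total variation distance: half the sum of absolute differences between
// the two normalized distributions. 0 = identical, 1 = disjoint.
function totalVariationDistance(
  baseline: PhaseCounts,
  recent: PhaseCounts,
): number {
  const baseTotal = total(baseline)
  const recentTotal = total(recent)
  const phases = Object.keys({ ...baseline, ...recent }) // union of phases
  let sum = 0
  for (const phase of phases) {
    const p = baseTotal === 0 ? 0 : (baseline[phase] ?? 0) / baseTotal
    const q = recentTotal === 0 ? 0 : (recent[phase] ?? 0) / recentTotal
    sum += Math.abs(p - q)
  }
  return 0.5 * sum
}

// Flag drift when the distance exceeds a configurable threshold.
function phaseDriftDetected(
  baseline: PhaseCounts,
  recent: PhaseCounts,
  threshold = 0.2,
): boolean {
  return totalVariationDistance(baseline, recent) > threshold
}
```

For example, a baseline of 80% `initial_access` and 20% `privilege_escalation` against a recent window of 40% / 20% / 40% (the last being a phase unseen in the baseline) gives a distance of 0.4.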

Attack Graph

Module: src/learning/AttackGraph.ts

Builds a directed graph of attack patterns and their relationships. Nodes represent individual attack patterns. Edges represent observed progressions (e.g., an initial_access pattern followed by privilege_escalation).

The graph enables:

  • Predictive detection: if phase 1 of a known attack chain is detected, pre-emptively guard against the expected phase 2
  • Attack campaign identification: correlate related attacks across sessions
  • Pattern clustering: identify families of related attack techniques
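
Predictive detection can be sketched with a small weighted directed graph. The class below is a simplified stand-in with hypothetical method names; it tracks only how often one node was observed to follow another.

```typescript
// Directed graph with edge weights counting observed progressions.
class ProgressionGraph {
  private edges = new Map<string, Map<string, number>>()

  // Record that `to` was observed following `from`.
  recordProgression(from: string, to: string): void {
    let successors = this.edges.get(from)
    if (successors === undefined) {
      successors = new Map<string, number>()
      this.edges.set(from, successors)
    }
    successors.set(to, (successors.get(to) ?? 0) + 1)
  }

  // Return the most frequently observed successor of `from`,
  // or null if nothing has been observed to follow it.
  predictNext(from: string): string | null {
    const successors = this.edges.get(from)
    if (successors === undefined) return null
    let best: string | null = null
    let bestCount = 0
    successors.forEach((count, node) => {
      if (count > bestCount) {
        best = node
        bestCount = count
      }
    })
    return best
  }
}

const graph = new ProgressionGraph()
graph.recordProgression('initial_access', 'privilege_escalation')
graph.recordProgression('initial_access', 'privilege_escalation')
graph.recordProgression('initial_access', 'exfiltration') // illustrative phase name
```

After these observations, `graph.predictNext('initial_access')` returns `'privilege_escalation'`, which a caller could use to tighten scanning for that phase in the same session.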

Evolution Metrics

The getStats() method on the ShieldX instance returns LearningStats:

```typescript
interface LearningStats {
  totalPatterns: number
  builtinPatterns: number
  learnedPatterns: number
  communityPatterns: number
  redTeamPatterns: number
  totalIncidents: number
  falsePositiveRate: number
  topPatterns: string[]
  recentIncidents: number
  driftDetected: boolean
}
```

These metrics provide visibility into the evolution engine's state and effectiveness.
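
As one way to consume these metrics, the sketch below computes what share of the pattern store comes from each source. The `sourceBreakdown` helper is hypothetical, and it assumes `totalPatterns` is the denominator for all four counts, which the interface itself does not state.

```typescript
// Copied from the interface above so the example is self-contained.
interface LearningStats {
  totalPatterns: number
  builtinPatterns: number
  learnedPatterns: number
  communityPatterns: number
  redTeamPatterns: number
  totalIncidents: number
  falsePositiveRate: number
  topPatterns: string[]
  recentIncidents: number
  driftDetected: boolean
}

interface SourceBreakdown {
  builtin: number
  learned: number
  community: number
  redTeam: number
}

// Fraction of stored patterns per source; all zeros for an empty store.
function sourceBreakdown(stats: LearningStats): SourceBreakdown {
  const t = stats.totalPatterns
  const share = (n: number): number => (t === 0 ? 0 : n / t)
  return {
    builtin: share(stats.builtinPatterns),
    learned: share(stats.learnedPatterns),
    community: share(stats.communityPatterns),
    redTeam: share(stats.redTeamPatterns),
  }
}
```

A rising `learned` or `redTeam` share over time is a rough signal that the store is growing beyond its seed patterns.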