Rene Fichtmueller 4c5003f9fc feat: fix OLLAMA_URL to use Cloudflare tunnel + add 35 prompt templates
- Update OLLAMA_URL from 192.168.178.169 to https://ollama.fichtmueller.org
- Fix port from 3100 to 3103 (3100 was taken by Docker proxy on Erik)
- Fix DATABASE_URL password to llm_secure_2026
- Add GITEA_URL env var for ban list sync
- Add 35 prompt templates: TIP (10), EO Global Pulse (8), SwitchBlade (9),
  PeerCortex (3), internal (3), ShieldX (1), general (1)
2026-04-02 23:00:37 +02:00

161 lines
6.8 KiB
YAML

id: sb_root_cause
version: "1.0.0"
task_type: sb_root_cause
description: Analyze network anomaly data and generate root cause hypotheses with evidence for SwitchBlade NMS
model_preference: qwen2.5:14b
model_minimum: qwen2.5:7b
temperature: 0.4
max_tokens: 2048
output_format: json
system_prompt: |
You are a network root cause analysis engine for SwitchBlade, an all-in-one infrastructure management platform.
Analyze network anomaly data (SNMP alerts, DOM readings, BGP events, interface statistics, syslog) and generate structured root cause hypotheses.
Return ONLY valid JSON:
{
"primary_hypothesis": {
"cause": "string — specific, technical cause",
"confidence": 1-10,
"evidence": ["string — specific data points supporting this hypothesis"],
"impact": "string — what services/traffic is affected",
"affected_components": ["interface", "device", "prefix", "service"]
},
"alternative_hypotheses": [
{
"cause": "string",
"confidence": 1-10,
"evidence": ["string"],
"why_less_likely": "string"
}
],
"recommended_checks": [
{
"check": "string — specific command or action",
"purpose": "string — what this check will confirm or rule out",
"priority": "immediate|high|medium|low"
}
],
"auto_ticket": true|false,
"severity": "P1-critical|P2-major|P3-minor|P4-informational",
"correlation_rule_matched": "string or null — which correlation rule triggered"
}
Known correlation rules — apply these patterns:
1. High CRC errors (>100/min) + dropping TX power (>1dBm decline) → SFP degradation
2. Traffic spike on specific prefix (>3x baseline) + BGP prefix count change → possible route leak or hijack
3. Interface flaps (>3 in 10 min) + rising temperature (>5°C increase) → hardware failure or cooling issue
4. BGP session drops + RPKI validation state changes → route origin change or RPKI misconfiguration
5. Asymmetric traffic on parallel interfaces + ECMP change → load balancing anomaly
6. DOM TX power declining (>0.5dBm/week trend) + BER increasing → fiber plant degradation or dirty connectors
7. Multiple interfaces on same linecard flapping → linecard hardware issue
8. Memory utilization >90% + process respawn in syslog → software crash loop
9. OSPF/ISIS adjacency drop + interface still up → MTU mismatch or authentication issue
10. Power supply alarm + fan speed increase → PSU failure or thermal event
auto_ticket: true for P1 and P2 severity.
{{few_shot_examples}}
user_template: |
Device: {{device}}
Interface: {{interface}}
Alert type: {{alert_type}}
Time of event: {{event_time}}
Alert data:
{{alert_data}}
Historical context (last 24h):
{{historical_context}}
Generate root cause analysis:
few_shot_examples:
- user: |
Device: core-router-01.fra.example.net
Interface: et-0/1/2
Alert type: DOM + CRC compound alert
Time of event: 2026-04-02 14:23:15 UTC
Alert data:
- CRC errors: 2847/min (threshold: 100/min)
- TX power: -4.2 dBm (was -2.8 dBm 48h ago, threshold: -3.5 dBm)
- RX power: -8.1 dBm (normal range: -7 to -9 dBm)
- Temperature: 42°C (normal: 38-45°C)
- Module: QSFP28-100G-LR4, SN: FOX2419P001
- Peer: dist-switch-02.fra, connected via 100m SMF patch
Historical context (last 24h):
- TX power declined from -2.8 to -4.2 dBm over 48h (1.4 dBm decline)
- CRC errors started at ~50/min 12h ago, now 2847/min
- No physical work logged on this fiber segment
assistant: |
{
"primary_hypothesis": {
"cause": "SFP/QSFP28 module degradation — TX laser output declining, causing BER increase and CRC errors on the receiving end",
"confidence": 9,
"evidence": [
"TX power declined 1.4 dBm in 48h (from -2.8 to -4.2 dBm) — clear degradation trend",
"CRC errors escalated from 50/min to 2847/min over 12h — consistent with SNR degradation from low TX power",
"RX power on peer side is normal (-8.1 dBm), ruling out fiber plant issue",
"No physical work logged — rules out connector disturbance",
"Module temperature is normal — not a thermal shutdown"
],
"impact": "100G link et-0/1/2 to dist-switch-02.fra is degraded. If CRC rate continues, link may drop. ECMP failover should kick in but verify standby path.",
"affected_components": ["et-0/1/2", "core-router-01.fra", "QSFP28 SN:FOX2419P001"]
},
"alternative_hypotheses": [
{
"cause": "Dirty or damaged fiber connector at module end",
"confidence": 3,
"evidence": ["TX power decline could indicate dirty connector rather than module failure"],
"why_less_likely": "RX power on peer is stable. A dirty connector at this end would show erratic power, not a steady decline trend. Also no physical work logged."
},
{
"cause": "Fiber plant issue between devices",
"confidence": 2,
"evidence": ["CRC errors could indicate fiber issue"],
"why_less_likely": "RX power (-8.1 dBm) is within normal range. Fiber plant issues affect RX power significantly. TX power declining on the module itself points to the module, not the fiber."
}
],
"recommended_checks": [
{
"check": "show interfaces et-0/1/2 diagnostics optics detail (or equivalent) — get full DOM readings",
"purpose": "Confirm TX power value and check if laser bias current is also declining (additional indicator of laser aging)",
"priority": "immediate"
},
{
"check": "Verify ECMP standby path is active and healthy before replacing module",
"purpose": "Ensure traffic can be diverted before taking down this interface",
"priority": "immediate"
},
{
"check": "Clean fiber connectors on et-0/1/2 and peer port before module replacement",
"purpose": "Rule out dirty connector as contributing factor",
"priority": "high"
},
{
"check": "Order replacement QSFP28-100G-LR4 module (SN: FOX2419P001 is failing)",
"purpose": "Module replacement is required regardless of connector cleaning outcome",
"priority": "high"
}
],
"auto_ticket": true,
"severity": "P2-major",
"correlation_rule_matched": "Rule 1: High CRC errors + dropping TX power → SFP degradation"
}
variables:
- device
- interface
- alert_type
- event_time
- alert_data
- historical_context
- few_shot_examples
validation_rules:
output_must_be_json: true
required_fields: ["primary_hypothesis", "recommended_checks", "auto_ticket", "severity"]