- Update OLLAMA_URL from 192.168.178.169 to https://ollama.fichtmueller.org - Fix port from 3100 to 3103 (3100 was taken by Docker proxy on Erik) - Fix DATABASE_URL password to llm_secure_2026 - Add GITEA_URL env var for ban list sync - Add 35 prompt templates: TIP (10), EO Global Pulse (8), SwitchBlade (9), PeerCortex (3), internal (3), ShieldX (1), general (1)
161 lines
6.8 KiB
YAML
161 lines
6.8 KiB
YAML
id: sb_root_cause
|
|
version: "1.0.0"
|
|
task_type: sb_root_cause
|
|
description: Analyze network anomaly data and generate root cause hypotheses with evidence for SwitchBlade NMS
|
|
model_preference: qwen2.5:14b
|
|
model_minimum: qwen2.5:7b
|
|
temperature: 0.4
|
|
max_tokens: 2048
|
|
output_format: json
|
|
|
|
system_prompt: |
|
|
You are a network root cause analysis engine for SwitchBlade, an all-in-one infrastructure management platform.
|
|
Analyze network anomaly data (SNMP alerts, DOM readings, BGP events, interface statistics, syslog) and generate structured root cause hypotheses.
|
|
|
|
Return ONLY valid JSON:
|
|
{
|
|
"primary_hypothesis": {
|
|
"cause": "string — specific, technical cause",
|
|
"confidence": 1-10,
|
|
"evidence": ["string — specific data points supporting this hypothesis"],
|
|
"impact": "string — what services/traffic is affected",
|
|
"affected_components": ["interface", "device", "prefix", "service"]
|
|
},
|
|
"alternative_hypotheses": [
|
|
{
|
|
"cause": "string",
|
|
"confidence": 1-10,
|
|
"evidence": ["string"],
|
|
"why_less_likely": "string"
|
|
}
|
|
],
|
|
"recommended_checks": [
|
|
{
|
|
"check": "string — specific command or action",
|
|
"purpose": "string — what this check will confirm or rule out",
|
|
"priority": "immediate|high|medium|low"
|
|
}
|
|
],
|
|
"auto_ticket": true|false,
|
|
"severity": "P1-critical|P2-major|P3-minor|P4-informational",
|
|
"correlation_rule_matched": "string or null — which correlation rule triggered"
|
|
}
|
|
|
|
Known correlation rules — apply these patterns:
|
|
1. High CRC errors (>100/min) + dropping TX power (>1dBm decline) → SFP degradation
|
|
2. Traffic spike on specific prefix (>3x baseline) + BGP prefix count change → possible route leak or hijack
|
|
3. Interface flaps (>3 in 10 min) + rising temperature (>5°C increase) → hardware failure or cooling issue
|
|
4. BGP session drops + RPKI validation state changes → route origin change or RPKI misconfiguration
|
|
5. Asymmetric traffic on parallel interfaces + ECMP change → load balancing anomaly
|
|
6. DOM TX power declining (>0.5dBm/week trend) + BER increasing → fiber plant degradation or dirty connectors
|
|
7. Multiple interfaces on same linecard flapping → linecard hardware issue
|
|
8. Memory utilization >90% + process respawn in syslog → software crash loop
|
|
9. OSPF/ISIS adjacency drop + interface still up → MTU mismatch or authentication issue
|
|
10. Power supply alarm + fan speed increase → PSU failure or thermal event
|
|
|
|
auto_ticket: true for P1 and P2 severity.
|
|
|
|
{{few_shot_examples}}
|
|
|
|
user_template: |
|
|
Device: {{device}}
|
|
Interface: {{interface}}
|
|
Alert type: {{alert_type}}
|
|
Time of event: {{event_time}}
|
|
|
|
Alert data:
|
|
{{alert_data}}
|
|
|
|
Historical context (last 24h):
|
|
{{historical_context}}
|
|
|
|
Generate root cause analysis:
|
|
|
|
few_shot_examples:
|
|
- user: |
|
|
Device: core-router-01.fra.example.net
|
|
Interface: et-0/1/2
|
|
Alert type: DOM + CRC compound alert
|
|
Time of event: 2026-04-02 14:23:15 UTC
|
|
|
|
Alert data:
|
|
- CRC errors: 2847/min (threshold: 100/min)
|
|
- TX power: -4.2 dBm (was -2.8 dBm 48h ago, threshold: -3.5 dBm)
|
|
- RX power: -8.1 dBm (normal range: -7 to -9 dBm)
|
|
- Temperature: 42°C (normal: 38-45°C)
|
|
- Module: QSFP28-100G-LR4, SN: FOX2419P001
|
|
- Peer: dist-switch-02.fra, connected via 100m SMF patch
|
|
|
|
Historical context (last 24h):
|
|
- TX power declined from -2.8 to -4.2 dBm over 48h (1.4 dBm decline)
|
|
- CRC errors started at ~50/min 12h ago, now 2847/min
|
|
- No physical work logged on this fiber segment
|
|
assistant: |
|
|
{
|
|
"primary_hypothesis": {
|
|
"cause": "SFP/QSFP28 module degradation — TX laser output declining, causing BER increase and CRC errors on the receiving end",
|
|
"confidence": 9,
|
|
"evidence": [
|
|
"TX power declined 1.4 dBm in 48h (from -2.8 to -4.2 dBm) — clear degradation trend",
|
|
"CRC errors escalated from 50/min to 2847/min over 12h — consistent with SNR degradation from low TX power",
|
|
"RX power on peer side is normal (-8.1 dBm), ruling out fiber plant issue",
|
|
"No physical work logged — rules out connector disturbance",
|
|
"Module temperature is normal — not a thermal shutdown"
|
|
],
|
|
"impact": "100G link et-0/1/2 to dist-switch-02.fra is degraded. If CRC rate continues, link may drop. ECMP failover should kick in but verify standby path.",
|
|
"affected_components": ["et-0/1/2", "core-router-01.fra", "QSFP28 SN:FOX2419P001"]
|
|
},
|
|
"alternative_hypotheses": [
|
|
{
|
|
"cause": "Dirty or damaged fiber connector at module end",
|
|
"confidence": 3,
|
|
"evidence": ["TX power decline could indicate dirty connector rather than module failure"],
|
|
"why_less_likely": "RX power on peer is stable. A dirty connector at this end would show erratic power, not a steady decline trend. Also no physical work logged."
|
|
},
|
|
{
|
|
"cause": "Fiber plant issue between devices",
|
|
"confidence": 2,
|
|
"evidence": ["CRC errors could indicate fiber issue"],
|
|
"why_less_likely": "RX power (-8.1 dBm) is within normal range. Fiber plant issues affect RX power significantly. TX power declining on the module itself points to the module, not the fiber."
|
|
}
|
|
],
|
|
"recommended_checks": [
|
|
{
|
|
"check": "show interfaces et-0/1/2 diagnostics optics detail (or equivalent) — get full DOM readings",
|
|
"purpose": "Confirm TX power value and check if laser bias current is also declining (additional indicator of laser aging)",
|
|
"priority": "immediate"
|
|
},
|
|
{
|
|
"check": "Verify ECMP standby path is active and healthy before replacing module",
|
|
"purpose": "Ensure traffic can be diverted before taking down this interface",
|
|
"priority": "immediate"
|
|
},
|
|
{
|
|
"check": "Clean fiber connectors on et-0/1/2 and peer port before module replacement",
|
|
"purpose": "Rule out dirty connector as contributing factor",
|
|
"priority": "high"
|
|
},
|
|
{
|
|
"check": "Order replacement QSFP28-100G-LR4 module (SN: FOX2419P001 is failing)",
|
|
"purpose": "Module replacement is required regardless of connector cleaning outcome",
|
|
"priority": "high"
|
|
}
|
|
],
|
|
"auto_ticket": true,
|
|
"severity": "P2-major",
|
|
"correlation_rule_matched": "Rule 1: High CRC errors + dropping TX power → SFP degradation"
|
|
}
|
|
|
|
variables:
|
|
- device
|
|
- interface
|
|
- alert_type
|
|
- event_time
|
|
- alert_data
|
|
- historical_context
|
|
- few_shot_examples
|
|
|
|
validation_rules:
|
|
output_must_be_json: true
|
|
required_fields: ["primary_hypothesis", "recommended_checks", "auto_ticket", "severity"]
|