transceiver-db/sync/history/2026-05-09-magatama-atlas-rematerialization-and-stale-resolver-fix.md
2026-05-09 08:02:54 +02:00

# MAGATAMA Atlas Rematerialization and Stale Resolver Fix
Date: 2026-05-09
## Problem
MAGATAMA had fallen back into an untrustworthy state:
- Atlas raw sources on Erik still existed and were current:
  - `security-atlas-audits.json` with `3` audits
  - `security-atlas-snapshot.json` with `32` devices
- but open findings in Postgres had collapsed back to `0`
- Atlas UI therefore looked implausibly empty / clean

The operator requirement was explicit:
- this must not silently happen again
- MAGATAMA must reflect real protection gaps honestly
## Root Cause
Two independent backend problems combined:
1. `buildProtectionProofResponse()` read Atlas raw files but did not resync findings from them.
2. Generic stale finding auto-resolution in the scheduler treated Atlas-managed findings like ordinary guard findings and resolved them too aggressively.
## Code Changes
### `packages/core/src/routes/health-builders.ts`
- added `readAtlasSnapshot()`
- imported `syncAtlasAuditFindings(...)`
- imported `syncAtlasExposureFindings(...)`
- introduced `syncAtlasOperationalFindings(...)`
- `buildProtectionProofResponse()` now calls `syncAtlasOperationalFindings(...)` before building the proof payload

Effect:
- normal proof/Atlas reads now rematerialize current Atlas findings from the raw audit/snapshot files
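
A minimal sketch of the sync-before-build order described above. Only the function names `syncAtlasOperationalFindings` and `buildProtectionProofResponse` come from this note; the signatures, the `AtlasAudit` / `AtlasSnapshot` shapes, the in-memory `FindingStore`, and the mapping from raw data to finding sources are assumptions for illustration, not the real implementation.

```typescript
// Hypothetical shapes: the real raw-file and finding types are not shown in this note.
interface AtlasAudit { host: string; issues: string[]; }
interface AtlasSnapshot { devices: { id: string; hasTelemetry: boolean }[]; }

// In-memory stand-in for the Postgres findings table (assumption).
class FindingStore {
  private open = new Set<string>();

  // Keyed by source + subject, so repeated syncs do not create duplicates.
  upsertOpen(source: string, subject: string): void {
    this.open.add(`${source}:${subject}`);
  }

  openCount(): number {
    return this.open.size;
  }
}

// Sketch of the helper: re-derive open findings from the raw Atlas inputs.
// Assumed mapping: audit issues -> atlas-host-audit, devices without
// telemetry -> atlas-coverage-gap.
function syncAtlasOperationalFindings(
  store: FindingStore,
  audits: AtlasAudit[],
  snapshot: AtlasSnapshot,
): void {
  for (const audit of audits) {
    for (const issue of audit.issues) {
      store.upsertOpen("atlas-host-audit", `${audit.host}/${issue}`);
    }
  }
  for (const device of snapshot.devices) {
    if (!device.hasTelemetry) {
      store.upsertOpen("atlas-coverage-gap", device.id);
    }
  }
}

// The proof builder syncs first, so a proof read rematerializes findings
// from the raw inputs before it reports any counts.
function buildProtectionProofResponse(
  store: FindingStore,
  audits: AtlasAudit[],
  snapshot: AtlasSnapshot,
): { auditedHosts: number; openFindings: number } {
  syncAtlasOperationalFindings(store, audits, snapshot);
  return { auditedHosts: audits.length, openFindings: store.openCount() };
}
```

Because the sketch upserts by key, calling the builder repeatedly leaves the open count unchanged, which is the property that makes it safe to sync on every read.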
### `packages/core/src/scheduler.ts`
- added:
  - `ATLAS_MANAGED_FINDING_SOURCES`
  - `isAtlasManagedFindingSource(...)`
- generic stale resolution now skips:
  - `atlas-coverage-gap`
  - `atlas-exposure`
  - `atlas-host-audit`

Effect:
- Atlas-managed findings are no longer erased by the generic guard stale resolver
- they stay under their own verification-aware lifecycle
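
The skip rule can be sketched roughly as follows. `ATLAS_MANAGED_FINDING_SOURCES`, `isAtlasManagedFindingSource` and the three source names are taken from this note; the `TrackedFinding` shape, the `resolveStaleFindings` function, the age threshold, and the `guard` source name are hypothetical stand-ins for the real scheduler code.

```typescript
// Source names as listed in this note.
const ATLAS_MANAGED_FINDING_SOURCES: ReadonlySet<string> = new Set([
  "atlas-coverage-gap",
  "atlas-exposure",
  "atlas-host-audit",
]);

function isAtlasManagedFindingSource(source: string): boolean {
  return ATLAS_MANAGED_FINDING_SOURCES.has(source);
}

// Hypothetical finding shape; the real schema is not shown here.
interface TrackedFinding {
  id: number;
  source: string;
  status: "open" | "resolved";
  lastSeenAt: number; // epoch milliseconds
}

// Sketch of a generic stale resolver: open findings not seen within
// maxAgeMs are resolved, except Atlas-managed ones, which are left alone.
function resolveStaleFindings(
  findings: TrackedFinding[],
  now: number,
  maxAgeMs: number,
): TrackedFinding[] {
  return findings.map((f): TrackedFinding => {
    if (f.status !== "open") return f;
    if (isAtlasManagedFindingSource(f.source)) return f; // own lifecycle
    if (now - f.lastSeenAt <= maxAgeMs) return f; // still fresh
    return { ...f, status: "resolved" };
  });
}
```

Keeping the source list in one set means a new Atlas-managed source only has to be registered in one place to be excluded from generic resolution.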
## Live Deployment
Deployed to Erik:
- rebuilt `@magatama/core`
- synced:
  - `/opt/magatama/packages/core/dist/routes/health-builders.js`
  - `/opt/magatama/packages/core/dist/scheduler.js`
- restarted PM2 app:
  - `magatama`
## Live Verification
### Before
- raw files existed:
  - audits: `3`
  - devices: `32`
- DB open findings: `0`
### After protected proof rebuild
- authenticated local `/api/protection-proof` trigger on Erik
- DB open findings rematerialized to: `28`
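
The before/after states above suggest a simple check that could flag this regression if it recurs. The sketch below is an assumption, not something deployed as part of this fix: `AtlasStateCounts` and `looksSilentlyCollapsed` are hypothetical names, and zero open findings alongside raw data can be legitimate, so a positive result is a prompt for review rather than proof of a fault.

```typescript
// Hypothetical input: counts an operator or health check could collect.
interface AtlasStateCounts {
  rawAudits: number;    // entries in security-atlas-audits.json
  rawDevices: number;   // entries in security-atlas-snapshot.json
  openFindings: number; // open findings in Postgres
}

// True when raw Atlas data exists but no findings are open, i.e. the
// state recorded under "Before".
function looksSilentlyCollapsed(c: AtlasStateCounts): boolean {
  const hasRawData = c.rawAudits > 0 || c.rawDevices > 0;
  return hasRawData && c.openFindings === 0;
}

// Values recorded in this note.
const beforeFix: AtlasStateCounts = { rawAudits: 3, rawDevices: 32, openFindings: 0 };
const afterFix: AtlasStateCounts = { rawAudits: 3, rawDevices: 32, openFindings: 28 };

const beforeFlagged = looksSilentlyCollapsed(beforeFix); // true
const afterFlagged = looksSilentlyCollapsed(afterFix);   // false
```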
### Public verification
Public MAGATAMA APIs once again expose the real open state:
- `/api/findings?limit=5`
  - returns open `atlas-coverage-gap` findings again
- `/api/protection-proof`
  - `knownAssets: 57`
  - `hostsWithTelemetry: 22`
  - `assetsWithoutTelemetry: 35`
  - `auditedHosts: 3`
  - `queueBlocked: 28`
  - `switchbladeAssets: 5`
  - `switchbladeRacks: 1`
  - `switchbladeNmsNodes: 5`
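
The telemetry counters above add up: `22 + 35 = 57`. A small sanity check over the proof payload could assert that relation. Treating it as an invariant is an assumption drawn from these numbers only; `ProofCounters`, `telemetrySplitIsConsistent` and `telemetryCoverage` are hypothetical names.

```typescript
// Subset of the /api/protection-proof fields listed above.
interface ProofCounters {
  knownAssets: number;
  hostsWithTelemetry: number;
  assetsWithoutTelemetry: number;
}

// Assumed invariant: every known asset either has telemetry or does not.
function telemetrySplitIsConsistent(p: ProofCounters): boolean {
  return p.hostsWithTelemetry + p.assetsWithoutTelemetry === p.knownAssets;
}

// Share of known assets that report telemetry, as a value in [0, 1].
function telemetryCoverage(p: ProofCounters): number {
  return p.knownAssets === 0 ? 0 : p.hostsWithTelemetry / p.knownAssets;
}

// Values recorded in this note.
const observedProof: ProofCounters = {
  knownAssets: 57,
  hostsWithTelemetry: 22,
  assetsWithoutTelemetry: 35,
};

const splitOk = telemetrySplitIsConsistent(observedProof); // true: 22 + 35 === 57
const coverage = telemetryCoverage(observedProof);         // 22 / 57
```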
## Operational Truth
The major Atlas truthfulness regression is fixed:
- Atlas and Findings no longer silently collapse to a fake clean state when raw Atlas data still contains real problems

What remains true:
- most currently open Atlas findings are coverage gaps
- they represent real missing live telemetry on known assets
## Remaining Work
Still not fully closed:
- lane-specific RunPod artifact adoption and automatic version switching
- further Atlas policy refinement so inventory-only assets can be split more cleanly into:
  - actionable operational gaps
  - informational inventory/discovery context
## Operator Note
If the browser still shows the older empty Atlas state after deployment:
- hard refresh:
  - `Cmd + Shift + R` (macOS)
  - `Ctrl + Shift + R` (Windows/Linux)