# MAGATAMA Atlas Rematerialization and Stale Resolver Fix

Date: 2026-05-09

## Problem

MAGATAMA had fallen back into an untrustworthy state:

- Atlas raw sources on Erik still existed and were current:
  - `security-atlas-audits.json` with `3` audits
  - `security-atlas-snapshot.json` with `32` devices
- but open findings in Postgres had collapsed back to `0`
- Atlas UI therefore looked implausibly empty / clean

The operator requirement was explicit:

- this must not silently happen again
- MAGATAMA must reflect real protection gaps honestly

## Root Cause

Two independent backend problems combined:

1. `buildProtectionProofResponse()` read Atlas raw files but did not resync findings from them.
2. Generic stale finding auto-resolution in the scheduler treated Atlas-managed findings like ordinary guard findings and resolved them too aggressively.

## Code Changes

### `packages/core/src/routes/health-builders.ts`

- added `readAtlasSnapshot()`
- imported `syncAtlasAuditFindings(...)`
- imported `syncAtlasExposureFindings(...)`
- introduced `syncAtlasOperationalFindings(...)`
- `buildProtectionProofResponse()` now calls `syncAtlasOperationalFindings(...)` before building the proof payload

Effect:

- normal proof/Atlas reads now rematerialize current Atlas findings from the raw audit/snapshot files

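The sync-before-build ordering in `health-builders.ts` can be sketched as follows. This is a minimal illustration, not the actual code: every type and function name in it (`AtlasRawSketch`, `ProofDepsSketch`, `buildProofSketch`) is hypothetical, the stubbed `syncFindings` only stands in for `syncAtlasOperationalFindings(...)`, and the "one finding per device" rule exists only to make the example runnable.

```typescript
// Hypothetical shapes: the real types in health-builders.ts are not shown in this note.
interface AtlasRawSketch {
  audits: string[]; // stand-in for entries of security-atlas-audits.json
  devices: string[]; // stand-in for entries of security-atlas-snapshot.json
}

interface ProofDepsSketch {
  readRaw(): AtlasRawSketch;
  // Stand-in for syncAtlasOperationalFindings(...); returns the open finding count.
  syncFindings(raw: AtlasRawSketch): number;
}

interface ProofSketch {
  auditedHosts: number;
  knownDevices: number;
  openFindings: number;
}

// Sync first, then build, so the payload is derived from freshly
// rematerialized findings rather than whatever the database held before.
function buildProofSketch(deps: ProofDepsSketch): ProofSketch {
  const raw = deps.readRaw();
  const openFindings = deps.syncFindings(raw);
  return {
    auditedHosts: raw.audits.length,
    knownDevices: raw.devices.length,
    openFindings,
  };
}

// Usage with in-memory stubs: the fake "database" starts empty (the collapsed state).
const fakeDb: string[] = [];
const proof = buildProofSketch({
  readRaw: () => ({ audits: ["host-a"], devices: ["dev-1", "dev-2"] }),
  syncFindings: (raw) => {
    // Assumed rule for this sketch only: one finding per device.
    fakeDb.length = 0;
    for (const device of raw.devices) fakeDb.push(`gap:${device}`);
    return fakeDb.length;
  },
});
```

Injecting the read and sync steps keeps the ordering visible and testable without Postgres or the raw files. Whether the real builder is structured this way is an assumption.
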
### `packages/core/src/scheduler.ts`

- added:
  - `ATLAS_MANAGED_FINDING_SOURCES`
  - `isAtlasManagedFindingSource(...)`
- generic stale resolution now skips:
  - `atlas-coverage-gap`
  - `atlas-exposure`
  - `atlas-host-audit`

Effect:

- Atlas-managed findings are no longer erased by the generic guard stale resolver
- they stay under their own verification-aware lifecycle

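A rough sketch of how the scheduler guard might look. The names `ATLAS_MANAGED_FINDING_SOURCES` and `isAtlasManagedFindingSource` and the three source strings come from this note. Their bodies, the `FindingSketch` shape, the `selectForGenericStaleResolution` helper, the age threshold, and the `guard-process` source are assumptions made for illustration.

```typescript
// Names taken from this note; the bodies are assumptions.
const ATLAS_MANAGED_FINDING_SOURCES = [
  "atlas-coverage-gap",
  "atlas-exposure",
  "atlas-host-audit",
] as const;

function isAtlasManagedFindingSource(source: string): boolean {
  return ATLAS_MANAGED_FINDING_SOURCES.some((managed) => managed === source);
}

// Hypothetical finding shape for the sketch.
interface FindingSketch {
  id: string;
  source: string;
  lastSeenAt: number; // epoch milliseconds
}

// Returns the findings the generic resolver may close:
// older than maxAgeMs AND not Atlas-managed.
function selectForGenericStaleResolution(
  findings: FindingSketch[],
  nowMs: number,
  maxAgeMs: number
): FindingSketch[] {
  return findings.filter(
    (f) => !isAtlasManagedFindingSource(f.source) && nowMs - f.lastSeenAt > maxAgeMs
  );
}

// "guard-process" is a made-up non-Atlas source used only in this example.
const sampleFindings: FindingSketch[] = [
  { id: "f1", source: "guard-process", lastSeenAt: 0 }, // stale, generic
  { id: "f2", source: "atlas-coverage-gap", lastSeenAt: 0 }, // stale, Atlas-managed
  { id: "f3", source: "guard-process", lastSeenAt: 999 }, // fresh, generic
];
const resolvable = selectForGenericStaleResolution(sampleFindings, 1000, 500);
```

Here only `f1` is selected: `f2` is equally stale but is skipped because of its source, and `f3` is still fresh.
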
## Live Deployment

Deployed to Erik:

- rebuilt `@magatama/core`
- synced:
  - `/opt/magatama/packages/core/dist/routes/health-builders.js`
  - `/opt/magatama/packages/core/dist/scheduler.js`
- restarted PM2 app:
  - `magatama`

## Live Verification

### Before

- raw files existed:
  - audits: `3`
  - devices: `32`
- DB open findings: `0`

### After protected proof rebuild

- triggered an authenticated local `/api/protection-proof` request on Erik
- DB open findings rematerialized to: `28`

### Public verification

Public MAGATAMA APIs expose the real open state again:

- `/api/findings?limit=5`
  - returns open `atlas-coverage-gap` findings again
- `/api/protection-proof`
  - `knownAssets: 57`
  - `hostsWithTelemetry: 22`
  - `assetsWithoutTelemetry: 35`
  - `auditedHosts: 3`
  - `queueBlocked: 28`
  - `switchbladeAssets: 5`
  - `switchbladeRacks: 1`
  - `switchbladeNmsNodes: 5`

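The reported `/api/protection-proof` counts are internally consistent: `hostsWithTelemetry` (22) plus `assetsWithoutTelemetry` (35) equals `knownAssets` (57). A small check along these lines could be run against the payload as a sanity test. The field names are taken from the output above, but the assumption that this sum must always hold is inferred from these numbers, not confirmed by the code, and `ProofCountsSketch` and `telemetrySplitIsConsistent` are hypothetical names.

```typescript
// Field names come from the /api/protection-proof output in this note;
// the relation between them is inferred from the reported numbers.
interface ProofCountsSketch {
  knownAssets: number;
  hostsWithTelemetry: number;
  assetsWithoutTelemetry: number;
}

// True when the with/without telemetry split accounts for every known asset.
function telemetrySplitIsConsistent(p: ProofCountsSketch): boolean {
  return p.hostsWithTelemetry + p.assetsWithoutTelemetry === p.knownAssets;
}

// Values as reported after the fix.
const reportedCounts: ProofCountsSketch = {
  knownAssets: 57,
  hostsWithTelemetry: 22,
  assetsWithoutTelemetry: 35,
};
const splitOk = telemetrySplitIsConsistent(reportedCounts);
```
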
## Operational Truth

The major Atlas truthfulness regression is fixed:

- Atlas and Findings no longer silently collapse to a fake clean state when raw Atlas data still contains real problems

What remains true:

- most currently open Atlas findings are coverage gaps
- they represent real missing live telemetry on known assets

## Remaining Work

Still not fully closed:

- lane-specific RunPod artifact adoption and automatic version switching
- further Atlas policy refinement so inventory-only assets can be split more cleanly into:
  - actionable operational gaps
  - informational inventory/discovery context

## Operator Note

If the browser still shows the older empty Atlas state after deployment:

- hard refresh:
  - `Cmd + Shift + R`