transceiver-db/sync/history/2026-05-06-magatama-coverage-gap-hardening-and-training-fix.md

# MAGATAMA Coverage-Gap Hardening And Training Fix
Date: 2026-05-06
Author: Codex
## Why this handoff exists
MAGATAMA looked clean at one point, then reopened a large batch of medium guard findings that were not all meaningful operational incidents. At the same time, verified MAGATAMA fixes were still at risk of landing in the wrong training output path.
This handoff records the concrete fixes, live verification, and training-pool updates.
## Problem observed
- Earlier live state:
  - `49` open findings
  - all `guard`
  - all `medium`
  - source: `atlas-coverage-gap`
- Root cause:
  - Atlas coverage logic treated passive inventory/discovery assets as if they were operationally managed assets that had failed telemetry obligations.
  - This created noisy coverage findings for:
    - loopback
    - passive inventory-only entries
    - external reference assets
    - assets without explicit operator intent
## Code changes made
MAGATAMA repo:
- `packages/core/src/routes/health-atlas.ts`
- `packages/core/src/routes/health-builders.ts`
- `packages/core/src/routes/health-types.ts`
- `packages/core/src/learning/fix-tracking.ts`
### Coverage-gap hardening
Added scope-aware logic so Atlas coverage findings only open for genuinely managed operational assets.
Signals now used to justify an operational coverage finding include:
- exposure evidence (`atlas-exposure`)
- explicit non-auto owner
- configured telemetry expectation
- high/critical criticality
- infrastructure metadata (`vendor`, `model`, `platform`, `rack`)
- managed infra device types (`server`, `switch`, `router`, `firewall`, `storage`, `nas`, `hypervisor`)
Explicit exclusions:
- loopback (`127.0.0.1`, `localhost`, `::1`)
- passive external reference assets
- inventory-only noise without operational scope signals
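
The scope rules above can be sketched as a single predicate. This is a minimal illustration, not the code in `health-atlas.ts`: the `AtlasAsset` shape, its field names, and the function name `isOperationalScope` are all hypothetical, and the sketch assumes that any one signal is enough to put an asset in scope.

```typescript
// Hypothetical shapes for illustration; the real types live in health-types.ts.
interface InfraMetadata {
  vendor?: string;
  model?: string;
  platform?: string;
  rack?: string;
}

interface AtlasAsset {
  address: string;
  deviceType?: string;
  owner?: string; // assumption: "auto" marks an auto-assigned owner
  criticality?: "low" | "medium" | "high" | "critical";
  hasExposureEvidence?: boolean; // assumption: set when atlas-exposure evidence exists
  telemetryExpected?: boolean;
  passiveExternalReference?: boolean;
  metadata?: InfraMetadata;
}

const LOOPBACK = new Set(["127.0.0.1", "localhost", "::1"]);
const MANAGED_TYPES = new Set([
  "server", "switch", "router", "firewall", "storage", "nas", "hypervisor",
]);

// Returns true only when the asset may open an operational coverage finding.
function isOperationalScope(asset: AtlasAsset): boolean {
  // Explicit exclusions are checked first and win over any signal.
  if (LOOPBACK.has(asset.address.toLowerCase())) return false;
  if (asset.passiveExternalReference === true) return false;

  const meta: InfraMetadata = asset.metadata ?? {};
  const signals: boolean[] = [
    asset.hasExposureEvidence === true,
    asset.owner !== undefined && asset.owner !== "auto",
    asset.telemetryExpected === true,
    asset.criticality === "high" || asset.criticality === "critical",
    Boolean(meta.vendor || meta.model || meta.platform || meta.rack),
    asset.deviceType !== undefined &&
      MANAGED_TYPES.has(asset.deviceType.toLowerCase()),
  ];

  // Inventory-only entries carry no signal and therefore stay out of scope.
  return signals.some(Boolean);
}
```

Checking exclusions before signals keeps a loopback entry out of scope even if it is typed as a `server`, which matches the intent of the exclusion list.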
### Training integrity fix
Fixed inverted training output behavior:
- verified successes now write to:
  - `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
- failed/escalated/report-only runs should go to:
  - `training-data/gitea-learning-pool/magatamallm/errors.jsonl`
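
A minimal sketch of the corrected routing, assuming a run carries one of four outcome labels. The `RunOutcome` type and the `trainingOutputPath` name are hypothetical and are not taken from `fix-tracking.ts`; only the two file paths come from this handoff.

```typescript
// Hypothetical outcome labels; the real tracker may name these differently.
type RunOutcome = "verified" | "failed" | "escalated" | "report-only";

const POOL_DIR = "training-data/gitea-learning-pool/magatamallm";

// Only verified successes may reach the fixes corpus; everything else is an error sample.
function trainingOutputPath(outcome: RunOutcome): string {
  return outcome === "verified"
    ? `${POOL_DIR}/fixes.jsonl`
    : `${POOL_DIR}/errors.jsonl`;
}
```

Defaulting every non-verified outcome to `errors.jsonl` means a newly added outcome label cannot pollute the fixes corpus by accident.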
## Deployment and verification
Build:
- local `npm run build` in MAGATAMA completed successfully
Deploy:
- synced updated core `dist/routes/`
- synced updated core `dist/learning/`
- restarted PM2 app:
  - `magatama`
Live verification on Erik:
- deployed file timestamps updated under `/opt/magatama/packages/core/dist/...`
- first post-restart guard scan log:
  - `guard — first scan`
  - `AutoResolve guard stale findings resolved: 33`
- Postgres after deploy:
  - `open findings = 0`
## Training pool updates
Two new explicit solution entries were appended to the Gitea-backed MAGATAMA fixes corpus:
1. `Atlas coverage gaps should only become findings for managed operational assets`
2. `Verified fixes must never be written into the error corpus`
The updated `fixes.jsonl` was also synced to Erik:
- `/opt/magatama/training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
## Important remaining note
Historic pollution may still exist from older runs where failed/escalated items were appended to `fixes.jsonl` before this path correction. The pathing bug is fixed for future writes, but a later cleanup pass may still be needed to scrub old invalid rows and backfill `errors.jsonl`.
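
One way such a cleanup pass could sort existing rows is sketched below. It assumes each JSONL row carries an `outcome` field with the values shown; that field name and those values are hypothetical, and the real row schema should be checked before anything is rewritten. Rows that cannot be classified are set aside for manual review instead of being guessed at.

```typescript
interface PartitionResult {
  fixes: string[];  // rows that stay in fixes.jsonl
  errors: string[]; // rows to backfill into errors.jsonl
  review: string[]; // unparsable or unlabeled rows for manual inspection
}

// Hypothetical labels for non-verified runs.
const ERROR_OUTCOMES = new Set(["failed", "escalated", "report-only"]);

function partitionFixRows(jsonl: string): PartitionResult {
  const result: PartitionResult = { fixes: [], errors: [], review: [] };
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines

    let row: unknown;
    try {
      row = JSON.parse(trimmed);
    } catch {
      result.review.push(trimmed); // not valid JSON
      continue;
    }

    // Assumption: the run outcome is stored under an "outcome" key.
    const outcome =
      typeof row === "object" && row !== null
        ? (row as Record<string, unknown>)["outcome"]
        : undefined;

    if (outcome === "verified") {
      result.fixes.push(trimmed);
    } else if (typeof outcome === "string" && ERROR_OUTCOMES.has(outcome)) {
      result.errors.push(trimmed);
    } else {
      result.review.push(trimmed);
    }
  }
  return result;
}
```

The function works on file contents passed in as a string and writes nothing, so it can be run against a copy of `fixes.jsonl` as a dry run first.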
## Operational takeaway
This was a real solution, not a suppression:
- Atlas noise is reduced by narrowing operational scope.
- Real managed assets can still produce coverage findings.
- Verified fixes and failed runs are now separated for future MagatamaLLM training quality.