diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index bc3b7fe..f0a5cc3 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State
 
-Updated: 2026-05-07 02:58 UTC
+Updated: 2026-05-07 08:05 UTC
 
 ## Active Policy
 
@@ -27,6 +27,60 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 
 ## Latest Work
 
+- RunPod/MAGATAMA training live follow-up on 2026-05-07:
+  - latest `magatamallm` serverless run verified on Erik:
+    - job id:
+      - `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`
+    - registry truth in:
+      - `/opt/magatama/training-data/model-registry/training-runs.json`
+    - observed states:
+      - `submitted`
+      - then `completed_without_model_artifact`
+    - exact recorded warning:
+      - `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.`
+    - interpretation:
+      - dataset build and RunPod submit are working
+      - the worker still does not return a verifiable adoptable model artifact
+      - this is a real training return-path failure, not just a cosmetic UI issue
+  - local training API truth rechecked:
+    - `GET http://127.0.0.1:3214/health`
+    - service responds with:
+      - `status = ok`
+      - `service = magatama-train-api`
+      - `running = false`
+      - `pid = null`
+    - meaning:
+      - API is healthy/reachable
+      - currently idle
+      - ready for adoption/import calls once a valid RunPod artifact exists
+  - one UI bug in the training modal was fixed live:
+    - root cause:
+      - during long `IN_PROGRESS` and post-`COMPLETED` artifact verification phases, MAGATAMA sent no heartbeat for too long
+      - browser/proxy could then terminate the stream and surface only:
+        - `network error`
+      - even though Erik had already written the more truthful registry state
+    - fix:
+      - `magatama/packages/dashboard/src/server.ts`
+      - added server-sent heartbeat messages while:
+        - RunPod status remains unchanged
+        - Hugging Face / artifact propagation checks are still running
+    - concrete live strings now deployed in Erik dashboard server:
+      - `⏳ RunPod arbeitet weiter (...)`
+      - `⏳ Prüfe Modellartefakt ...`
+    - deployment:
+      - rebuilt dashboard
+      - rsynced `packages/dashboard/dist/server.js` to Erik
+      - restarted `pm2 magatama-dashboard`
+      - remote `server.js` verified to contain heartbeat strings
+    - expected operator effect:
+      - future training runs should no longer collapse into a late generic `network error` while RunPod/adoption checks are still active
+      - the UI should stay alive long enough to show the real terminal result:
+        - `completed_and_adopted`
+        - or
+        - `completed_without_model_artifact`
+        - or
+        - worker/adoption failure
+
 - MAGATAMA live follow-up on 2026-05-07:
   - local Mac training API was rechecked after the lane-specific automation changes.
   - current live truth:
diff --git a/sync/history/2026-05-07-magatama-runpod-heartbeat-and-real-terminal-status.md b/sync/history/2026-05-07-magatama-runpod-heartbeat-and-real-terminal-status.md
new file mode 100644
index 0000000..04d9635
--- /dev/null
+++ b/sync/history/2026-05-07-magatama-runpod-heartbeat-and-real-terminal-status.md
@@ -0,0 +1,125 @@
+# MAGATAMA RunPod Heartbeat and Real Terminal Status
+
+Date: 2026-05-07 UTC
+
+## Scope
+
+- MAGATAMA dashboard training modal
+- RunPod serverless training status truth
+- local Mac training API sanity check
+
+## What Was Observed
+
+Latest verified `magatamallm` run:
+
+- job id:
+  - `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`
+
+On Erik, the run registry recorded:
+
+- `submitted`
+- then:
+  - `completed_without_model_artifact`
+
+Registry source:
+
+- `/opt/magatama/training-data/model-registry/training-runs.json`
+
+Recorded warning:
+
+- `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.`
+
+## Conclusion
+
+This proves:
+
+- training dataset build worked
+- RunPod submit worked
+- the return path still failed because no adoptable model artifact was verified
+
+This was not just a cosmetic issue.
+
+## Separate UI Failure
+
+There was also a UX/runtime bug in the MAGATAMA dashboard modal:
+
+- while RunPod stayed `IN_PROGRESS`
+- or while MAGATAMA waited for artifact visibility after `COMPLETED`
+
+the SSE stream could go quiet too long.
+
+Result:
+
+- browser/proxy would terminate the stream
+- user only saw:
+  - `network error`
+
+even though Erik already had the more truthful internal status.
+
+## Fix Applied
+
+File:
+
+- `magatama/packages/dashboard/src/server.ts`
+
+Changes:
+
+- added periodic SSE heartbeat messages while the RunPod status remains unchanged:
+  - `⏳ RunPod arbeitet weiter (...)`
+- added periodic SSE heartbeat messages while MAGATAMA checks artifact visibility:
+  - `⏳ Prüfe Modellartefakt ...`
+
+## Local Training API Recheck
+
+Local service:
+
+- `http://127.0.0.1:3214/health`
+
+Verified response:
+
+- `status = ok`
+- `service = magatama-train-api`
+- `running = false`
+- `pid = null`
+
+Interpretation:
+
+- the local training/adoption API is healthy and reachable
+- it is currently idle, not broken
+- it is ready for adoption once a valid RunPod artifact exists
+
+## Live Deployment
+
+Deployed to Erik:
+
+- rebuilt dashboard server
+- rsynced:
+  - `/opt/magatama/packages/dashboard/dist/server.js`
+- restarted:
+  - `pm2 restart magatama-dashboard`
+
+Remote verification confirmed the new server bundle contains:
+
+- `⏳ RunPod arbeitet weiter`
+
+## Operational Impact
+
+Future runs should no longer collapse into a misleading generic `network error` during long polling/verification silence.
+
+The expected visible end states should now be the real ones:
+
+- `completed_and_adopted`
+- `completed_without_model_artifact`
+- adoption failure
+- worker failure
+
+## Remaining Hard Truth
+
+The artifact/adoption problem itself is **not fixed yet**.
+
+Current state:
+
+- RunPod jobs can reach `COMPLETED`
+- but MAGATAMA still has no verified returned model artifact to import and version-switch
+
+That remains the next required fix block.
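
Reviewer sketch (not part of the patch): the heartbeat fix described in the history note can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the actual `server.ts` code. The names `PollState`, `onPoll`, `sseData`, the JSON payload shape, and the 15-second interval are all hypothetical. The heartbeat text reuses only the prefix that was verified in the deployed bundle; the parenthesised detail is not shown in the notes and is left out.

```typescript
// Hypothetical sketch of an SSE heartbeat decision for a polling loop.
// None of these names are taken from the real server.ts.

type RunStatus = "IN_QUEUE" | "IN_PROGRESS" | "COMPLETED" | "FAILED";

interface PollState {
  lastStatus: RunStatus | null; // last status reported to the client
  lastEmitMs: number;           // timestamp of the last frame sent
}

// Assumed interval; pick a value below the proxy/browser idle timeout.
const HEARTBEAT_INTERVAL_MS = 15_000;

// Format one SSE data frame. A frame ends with a blank line.
function sseData(text: string): string {
  return `data: ${JSON.stringify({ message: text })}\n\n`;
}

// Decide what to send after each poll of the remote job status.
// - status changed: always emit the new status
// - status unchanged and the stream has been quiet too long: emit a heartbeat
// - otherwise: emit nothing
function onPoll(
  state: PollState,
  status: RunStatus,
  nowMs: number
): { state: PollState; frame: string | null } {
  if (status !== state.lastStatus) {
    return {
      state: { lastStatus: status, lastEmitMs: nowMs },
      frame: sseData(`RunPod status: ${status}`),
    };
  }
  if (nowMs - state.lastEmitMs >= HEARTBEAT_INTERVAL_MS) {
    return {
      state: { lastStatus: status, lastEmitMs: nowMs },
      // Prefix as verified in the deployed bundle; detail suffix omitted.
      frame: sseData("⏳ RunPod arbeitet weiter"),
    };
  }
  return { state, frame: null };
}
```

In a real handler, each non-null `frame` would be written to the open response stream (for example with Node's `res.write(frame)`), so the connection never stays silent longer than the chosen interval. The same decision function could cover the artifact-visibility phase by treating it as one more unchanged status.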