transceiver-db/sync/history/2026-05-07-magatama-runpod-heartbeat-and-real-terminal-status.md
2026-05-07 10:06:40 +02:00

MAGATAMA RunPod Heartbeat and Real Terminal Status

Date: 2026-05-07 UTC

Scope

  • MAGATAMA dashboard training modal
  • RunPod serverless training status truth
  • local Mac training API sanity check

What Was Observed

Latest verified magatamallm run:

  • job id:
    • ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2

On Erik, the run registry recorded:

  • submitted
  • then:
    • completed_without_model_artifact

Registry source:

  • /opt/magatama/training-data/model-registry/training-runs.json

Recorded warning:

  • RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden. (English: RunPod reported COMPLETED, but the expected HuggingFace model repo was not found.)

Conclusion

This proves:

  • training dataset build worked
  • RunPod submit worked
  • the return path still failed: RunPod reported COMPLETED, but no adoptable model artifact could be verified

This was not just a cosmetic issue.
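
The distinction above, that COMPLETED on the RunPod side is not the same as a usable result, can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the code running on Erik: `checkCompletion`, `CompletionCheck`, and the `artifactVerified` flag are hypothetical names. Only the status strings `submitted` and `completed_without_model_artifact` and the RunPod status `COMPLETED` are taken from this note.

```typescript
// Sketch only: type and function names are hypothetical.
type CompletionCheck =
  | { adoptable: true }
  | {
      adoptable: false;
      registryStatus: "submitted" | "completed_without_model_artifact";
    };

function checkCompletion(
  runpodStatus: string,
  artifactVerified: boolean,
): CompletionCheck {
  // Any status other than COMPLETED leaves the run in its submitted state.
  if (runpodStatus !== "COMPLETED") {
    return { adoptable: false, registryStatus: "submitted" };
  }
  // COMPLETED alone is not success: the model artifact must also be verified.
  if (!artifactVerified) {
    return {
      adoptable: false,
      registryStatus: "completed_without_model_artifact",
    };
  }
  return { adoptable: true };
}
```

For the run described here, the inputs would be `COMPLETED` and an unverified artifact, which yields the recorded `completed_without_model_artifact` state.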

Separate UI Failure

There was also a UX/runtime bug in the MAGATAMA dashboard modal:

  • while RunPod stayed IN_PROGRESS
  • or while MAGATAMA waited for artifact visibility after COMPLETED

the SSE stream could stay silent for too long.

Result:

  • the browser or an intermediate proxy would terminate the idle stream
  • the user only saw:
    • network error

even though Erik had already recorded the more accurate internal status.

Fix Applied

File:

  • magatama/packages/dashboard/src/server.ts

Changes:

  • added periodic SSE heartbeat messages while the RunPod status remains unchanged:
    • ⏳ RunPod arbeitet weiter (...) (English: RunPod is still working)
  • added periodic SSE heartbeat messages while MAGATAMA checks artifact visibility:
    • ⏳ Prüfe Modellartefakt ... (English: Checking model artifact)
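
The heartbeat idea can be sketched as follows. This is a simplified illustration, not the actual change in server.ts: `createStatusReporter`, `sseData`, and the `intervalMs` parameter are hypothetical, and the clock is passed in explicitly so the logic stays deterministic. The reporter writes a status line whenever the polled status changes, and writes the heartbeat text once the stream has been silent for `intervalMs`.

```typescript
// Sketch only. All names here are hypothetical; the real server.ts is not
// reproduced in this note.
type SseWrite = (chunk: string) => void;

// Formats one SSE message; the blank line terminates the event.
function sseData(message: string): string {
  return `data: ${message}\n\n`;
}

// Returns a function to call on every poll. It writes a status line when the
// status changes, and a heartbeat when nothing has been written for
// `intervalMs` (an assumed tuning value, not taken from the source).
function createStatusReporter(
  write: SseWrite,
  heartbeatText: string,
  intervalMs: number,
  startMs: number,
): (status: string, nowMs: number) => void {
  let lastStatus: string | null = null;
  let lastWriteMs = startMs;

  return (status: string, nowMs: number): void => {
    if (status !== lastStatus) {
      // Real status change: report it and reset the silence timer.
      lastStatus = status;
      write(sseData(status));
      lastWriteMs = nowMs;
    } else if (nowMs - lastWriteMs >= intervalMs) {
      // Status unchanged and the stream has been quiet: keep it alive.
      write(sseData(heartbeatText));
      lastWriteMs = nowMs;
    }
  };
}
```

In a real handler, `write` would wrap the HTTP response writer and `nowMs` would come from `Date.now()`. The same shape would cover the artifact-visibility wait by passing the artifact-check message as the heartbeat text.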

Local Training API Recheck

Local service:

  • http://127.0.0.1:3214/health

Verified response:

  • status = ok
  • service = magatama-train-api
  • running = false
  • pid = null

Interpretation:

  • the local training/adoption API is healthy and reachable
  • it is currently idle, not broken
  • it is ready for adoption once a valid RunPod artifact exists
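
The interpretation above can be expressed as a small check on the parsed health response. This is a sketch under assumptions: the field names `status`, `service`, `running`, and `pid` mirror the verified response, while `interpretHealth` and the `idle` / `busy` / `unhealthy` labels are hypothetical and not part of the API.

```typescript
// Sketch only: the function name and the returned labels are hypothetical.
type TrainApiState = "idle" | "busy" | "unhealthy";

function interpretHealth(body: unknown): TrainApiState {
  if (typeof body !== "object" || body === null) return "unhealthy";
  const h = body as Record<string, unknown>;
  // The service must identify itself and report status = ok.
  if (h["status"] !== "ok" || h["service"] !== "magatama-train-api") {
    return "unhealthy";
  }
  if (typeof h["running"] !== "boolean") return "unhealthy";
  // running = false means idle and ready for adoption, not broken.
  return h["running"] ? "busy" : "idle";
}
```

The `body` argument would be the parsed JSON returned by a GET request to http://127.0.0.1:3214/health.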

Live Deployment

Deployed to Erik:

  • rebuilt dashboard server
  • rsynced:
    • /opt/magatama/packages/dashboard/dist/server.js
  • restarted:
    • pm2 restart magatama-dashboard

Remote verification confirmed the new server bundle contains:

  • ⏳ RunPod arbeitet weiter

Operational Impact

Future runs should no longer collapse into a misleading generic network error when the stream is silent for a long time during status polling or artifact verification.

The expected visible end states should now be the real ones:

  • completed_and_adopted
  • completed_without_model_artifact
  • adoption failure
  • worker failure
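
One way to make these end states explicit on the stream is to send a final named SSE event before closing it. The sketch below is illustrative only: `adoption_failed` and `worker_failed` are hypothetical identifiers for the "adoption failure" and "worker failure" outcomes, and the `done` event name and the payload shape are assumptions, not the dashboard's actual protocol.

```typescript
// Sketch only. "adoption_failed" and "worker_failed" are hypothetical
// identifiers; the other two strings come from this note.
type TerminalState =
  | "completed_and_adopted"
  | "completed_without_model_artifact"
  | "adoption_failed"
  | "worker_failed";

// Builds the final SSE event. The "done" event name is an assumption.
function terminalEvent(state: TerminalState): string {
  // Only an adopted model counts as success.
  const payload = { state, success: state === "completed_and_adopted" };
  return `event: done\ndata: ${JSON.stringify(payload)}\n\n`;
}
```

With an explicit final event, the modal can tell a deliberate end of stream apart from a dropped connection.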

Remaining Hard Truth

The artifact/adoption problem itself is not fixed yet.

Current state:

  • RunPod jobs can reach COMPLETED
  • but MAGATAMA still receives no verified model artifact that it could import and switch versions to

Fixing that remains the next required block of work.