transceiver-db/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md
2026-05-06 13:07:26 +02:00

2026-05-06 — MAGATAMA RunPod serverless materialization check

Summary

MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint dheii186pfcuq7.

What changed

  • Payload alignment was tightened toward the official Axolotl serverless schema:
    • added model_type=AutoModelForCausalLM
    • added tokenizer_type=AutoTokenizer
    • switched dataset split declaration to split: train
    • switched optimizer from adamw_8bit to adamw_torch_fused
  • Both submit paths now distinguish between two outcomes:
    • /run accepted the submission
    • /status/{job} confirms that the job actually exists
  • Updated files:
    • magatama/packages/dashboard/src/server.ts
    • magatama/scripts/submit_runpod_training.ts
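
The payload changes listed above can be sketched as a small normalization step. This is a minimal illustration, not the code in server.ts or submit_runpod_training.ts: the TrainingConfig and DatasetEntry shapes, the alignToAxolotlSchema name, and the dataset path are hypothetical. Only the four field values come from this note.

```typescript
// Hypothetical config shape: only model_type, tokenizer_type, optimizer and
// the dataset split value are taken from this note.
interface DatasetEntry {
  path: string;
  split?: string;
}

interface TrainingConfig {
  model_type?: string;
  tokenizer_type?: string;
  optimizer?: string;
  datasets: DatasetEntry[];
}

// Returns a new config with the four schema-alignment changes applied.
// The input object is not mutated.
function alignToAxolotlSchema(config: TrainingConfig): TrainingConfig {
  return {
    ...config,
    model_type: "AutoModelForCausalLM",
    tokenizer_type: "AutoTokenizer",
    optimizer: "adamw_torch_fused", // previously adamw_8bit
    datasets: config.datasets.map((d) => ({ ...d, split: "train" })),
  };
}

const before: TrainingConfig = {
  optimizer: "adamw_8bit",
  datasets: [{ path: "data/example.jsonl" }], // placeholder path
};
const aligned = alignToAxolotlSchema(before);
console.log(JSON.stringify(aligned, null, 2));
```

Keeping the alignment in one pure function would let both submit paths share it, so the dashboard and the script cannot drift apart on these fields.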

Verified behavior

Full run attempt

  • Submitted magatamallm 500-step run.
  • Returned job id: 9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2
  • Reconcile result shortly after:
    • not_found_after_submit
    • HTTP 404
    • job not found

Canary run after payload/schema fix

  • Submitted magatamallm seed-only canary.
  • Returned job id: a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2
  • Immediate submit-side verification saw real queue materialization:
    • runpod_status: IN_QUEUE
  • Reconcile roughly 45 seconds later still observed:
    • not_found_after_submit
    • HTTP 404
    • job not found
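
The two observations above (IN_QUEUE right after submit, then HTTP 404 on reconcile) can be told apart with a small classifier over the status response. This is a sketch under assumptions: the SubmitCheck type and both function names are hypothetical, and the base URL is the publicly documented RunPod serverless API root, which should be checked against the actual MAGATAMA configuration.

```typescript
// Hypothetical result type; the label not_found_after_submit is from this note.
type SubmitCheck =
  | { kind: "materialized"; runpodStatus: string }
  | { kind: "not_found_after_submit"; httpStatus: 404 }
  | { kind: "status_error"; httpStatus: number };

// Assumed API root for RunPod serverless endpoints.
const RUNPOD_API_BASE = "https://api.runpod.ai/v2";

// Builds the status URL for a job id returned by /run.
function statusUrl(endpointId: string, jobId: string): string {
  return `${RUNPOD_API_BASE}/${endpointId}/status/${jobId}`;
}

// Classifies one /status response. A 404 means /run accepted the job
// but the job cannot be found; a 2xx with a status field means it exists.
function classifyStatusResponse(
  httpStatus: number,
  body: { status?: string } | null,
): SubmitCheck {
  if (httpStatus === 404) {
    return { kind: "not_found_after_submit", httpStatus: 404 };
  }
  if (httpStatus >= 200 && httpStatus < 300 && body && body.status) {
    return { kind: "materialized", runpodStatus: body.status };
  }
  return { kind: "status_error", httpStatus };
}

// The canary from this note: first check, then the later reconcile.
const url = statusUrl("dheii186pfcuq7", "a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2");
console.log(url);
console.log(classifyStatusResponse(200, { status: "IN_QUEUE" }));
console.log(classifyStatusResponse(404, null));
```

Separating classification from the HTTP call keeps the logic testable without network access. The actual request can be made with any client and its status code and parsed body passed in.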

Conclusion

The old MAGATAMA bug (blindly trusting /run) is fixed.

The remaining problem is now narrower and likely external to MAGATAMA itself:

  • RunPod serverless currently accepts the submission and briefly materializes the job as IN_QUEUE,
  • but the job disappears before a durable status/progress/completion lifecycle can be observed.

The endpoint/release therefore cannot be trusted for a full production training launch until it keeps a job alive beyond the initial queue stage.

Operational rule

Do not treat a submitted state, or even a brief IN_QUEUE status, as proof of a usable serverless training run. A MAGATAMA serverless training run is only trustworthy when at least one of these is true:

  • status progresses to IN_PROGRESS, or
  • a durable terminal state is observed with artifact evidence.
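
The rule above can be expressed as a predicate over the statuses observed for a job. This is a sketch, not MAGATAMA code: the Observation shape, the hasArtifactEvidence flag, and the isTrustworthyRun name are hypothetical. The terminal status names are assumed from RunPod's documented job states and should be confirmed against the live API.

```typescript
// Hypothetical record of one status poll for a job.
interface Observation {
  status: string;
  hasArtifactEvidence?: boolean; // e.g. a checkpoint or adapter was found
}

// Assumed terminal job states for RunPod serverless.
const TERMINAL_STATES = new Set(["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"]);

// True only if the job reached IN_PROGRESS at some point, or a terminal
// state was observed together with artifact evidence.
// Submitted-only or IN_QUEUE-only histories return false.
function isTrustworthyRun(observations: Observation[]): boolean {
  return observations.some(
    (o) =>
      o.status === "IN_PROGRESS" ||
      (TERMINAL_STATES.has(o.status) && o.hasArtifactEvidence === true),
  );
}

// The canary in this note: IN_QUEUE once, then the job was not found.
console.log(isTrustworthyRun([{ status: "IN_QUEUE" }]));
// A run that progressed would pass the rule.
console.log(isTrustworthyRun([{ status: "IN_QUEUE" }, { status: "IN_PROGRESS" }]));
```

A gate like this could sit in front of the full-run launch, so that a canary has to pass it before the 500-step run is submitted.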

Open next step

  • Inspect the actual RunPod serverless endpoint/release configuration and the worker-side logs in the RunPod UI.
  • Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.