transceiver-db/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md
2026-05-06 13:07:26 +02:00


# 2026-05-06 — MAGATAMA RunPod serverless materialization check
## Summary
MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint `dheii186pfcuq7`.
## What changed
- Payload alignment was tightened toward the official Axolotl serverless schema:
  - added `model_type=AutoModelForCausalLM`
  - added `tokenizer_type=AutoTokenizer`
  - switched dataset split declaration to `split: train`
  - switched optimizer from `adamw_8bit` to `adamw_torch_fused`
- Both submit paths now distinguish between two separate facts:
  - `/run` accepted the submission
  - the job actually exists at `/status/{job}`
- Updated files:
  - `magatama/packages/dashboard/src/server.ts`
  - `magatama/scripts/submit_runpod_training.ts`
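
The distinction between an accepted `/run` and a job that is actually visible at `/status/{job}` can be sketched as a small pure classifier. This is an illustration only, not the code in `server.ts` or `submit_runpod_training.ts`: the names `HttpResult`, `SubmitOutcome`, and `classifySubmit` are hypothetical, and the response shape (`id`, `status`) is an assumption about the fields RunPod returns.

```typescript
// Hypothetical sketch: none of these names come from the MAGATAMA codebase.
interface HttpResult {
  httpStatus: number;
  // Assumed response shape: a job id and a status string.
  body: { id?: string; status?: string } | null;
}

type SubmitOutcome =
  | { kind: "rejected"; httpStatus: number }
  | { kind: "accepted_unverified"; jobId: string }
  | { kind: "materialized"; jobId: string; runpodStatus: string }
  | { kind: "not_found_after_submit"; jobId: string; httpStatus: number };

function isSuccess(code: number): boolean {
  return code >= 200 && code < 300;
}

// `run` is the result of POST /run; `status` is the result of a follow-up
// GET /status/{job}, or null if no follow-up check was made.
function classifySubmit(run: HttpResult, status: HttpResult | null): SubmitOutcome {
  const jobId = run.body?.id;
  if (!isSuccess(run.httpStatus) || !jobId) {
    return { kind: "rejected", httpStatus: run.httpStatus };
  }
  if (status === null) {
    // /run said yes, but nothing has confirmed that the job exists.
    return { kind: "accepted_unverified", jobId };
  }
  if (status.httpStatus === 404) {
    return { kind: "not_found_after_submit", jobId, httpStatus: 404 };
  }
  const runpodStatus = status.body?.status;
  if (isSuccess(status.httpStatus) && runpodStatus) {
    return { kind: "materialized", jobId, runpodStatus };
  }
  // Any other response (for example a 5xx) proves nothing either way.
  return { kind: "accepted_unverified", jobId };
}
```

Keeping the classifier free of network calls means both submit paths could share it and test it without touching the endpoint.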
## Verified behavior
### Full run attempt
- Submitted `magatamallm` 500-step run.
- Returned job id: `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
- Reconcile result shortly after:
  - `not_found_after_submit`
  - HTTP `404`
  - `job not found`
### Canary run after payload/schema fix
- Submitted `magatamallm` seed-only canary.
- Returned job id: `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
- Immediate submit-side verification saw real queue materialization:
  - `runpod_status: IN_QUEUE`
- Reconcile roughly 45 seconds later still observed:
  - `not_found_after_submit`
  - HTTP `404`
  - `job not found`
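
The canary timeline above (found as `IN_QUEUE`, then `404` about 45 seconds later) can be summarized with a small helper that tells a job that never appeared apart from one that appeared and then vanished. This is a sketch under assumptions: `ReconcileSample`, `Lifecycle`, and `summarizeLifecycle` are hypothetical names, and "found" is assumed to mean an HTTP `200` carrying a status string.

```typescript
// Hypothetical sketch: names and the "found" criterion are assumptions.
interface ReconcileSample {
  atSeconds: number;      // seconds elapsed since submit
  httpStatus: number;     // HTTP code returned by /status/{job}
  runpodStatus?: string;  // present only when the job was found
}

type Lifecycle = "never_materialized" | "vanished" | "alive";

function summarizeLifecycle(samples: ReconcileSample[]): Lifecycle {
  // Sort a copy so the caller's array is not mutated.
  const ordered = [...samples].sort((a, b) => a.atSeconds - b.atSeconds);
  let everFound = false;
  let lastFound = false;
  for (const s of ordered) {
    const found = s.httpStatus === 200 && s.runpodStatus !== undefined;
    if (found) everFound = true;
    lastFound = found;
  }
  if (!everFound) return "never_materialized";
  return lastFound ? "alive" : "vanished";
}

// The canary run as recorded in this note:
const canary: ReconcileSample[] = [
  { atSeconds: 0, httpStatus: 200, runpodStatus: "IN_QUEUE" },
  { atSeconds: 45, httpStatus: 404 },
];
console.log(summarizeLifecycle(canary)); // prints "vanished"
```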
## Conclusion
The old MAGATAMA bug (blindly trusting `/run`) is fixed.
The remaining problem is now narrower and likely external to MAGATAMA itself:
- RunPod serverless currently accepts the submit and briefly materializes the job as `IN_QUEUE`,
- but the job disappears before a durable status/progress/completion lifecycle can be observed.
This means the endpoint/release cannot be trusted for a full production training launch until it keeps a job alive beyond the initial queue stage.
## Operational rule
Do **not** treat `submitted` or even a brief `IN_QUEUE` as proof of a usable serverless training run.
A MAGATAMA serverless training run is only trustworthy when at least one of these is true:
- status progresses to `IN_PROGRESS`, or
- a durable terminal state is observed with artifact evidence.
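
The rule above can be written as a predicate over observed job states. This is a minimal sketch with hypothetical names (`Observation`, `isTrustworthyRun`); the set of terminal status strings is an assumption based on RunPod's documented job states, and how "artifact evidence" is established is left to the caller.

```typescript
// Hypothetical sketch: not MAGATAMA's actual reconcile logic.
interface Observation {
  runpodStatus: string;          // e.g. "IN_QUEUE", "IN_PROGRESS", "COMPLETED"
  hasArtifactEvidence: boolean;  // caller-defined, e.g. an output artifact was verified
}

// Assumed terminal states; adjust to whatever the endpoint actually reports.
const TERMINAL_STATES = new Set(["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"]);

function isTrustworthyRun(observations: Observation[]): boolean {
  return observations.some(
    (o) =>
      o.runpodStatus === "IN_PROGRESS" ||
      (TERMINAL_STATES.has(o.runpodStatus) && o.hasArtifactEvidence),
  );
}
```

Under this predicate, a run that was only ever seen as `IN_QUEUE` is never trusted, which matches the canary result recorded in this note.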
## Open next step
- Inspect the actual RunPod serverless endpoint/release configuration and the worker-side logs in the RunPod UI.
- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.