sync: record runpod serverless materialization check

2026-05-06 13:07:26 +02:00 · 9bc84a89ee
commit 9bc84a89ee
parent b5d9b4df03
2 changed files with 85 additions and 0 deletions
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -70,6 +70,26 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
    - Hugging Face dataset source
    - or MAGATAMA URL-bundle dataset source
  - this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
+  - follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
+    - MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
+    - payloads were aligned more closely with the official Axolotl serverless schema:
+      - `model_type=AutoModelForCausalLM`
+      - `tokenizer_type=AutoTokenizer`
+      - dataset `split: train`
+      - optimizer `adamw_torch_fused`
+    - verified full run attempt:
+      - job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
+      - disappeared as `not_found_after_submit` (`404 job not found`)
+    - verified canary after payload fix:
+      - job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
+      - immediately materialized as `IN_QUEUE`
+      - then still disappeared on later reconcile as `not_found_after_submit`
+    - current conclusion:
+      - the old MAGATAMA bug is fixed.
+      - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
+    - operational rule:
+      - do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
+      - only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.

 - MAGATAMA was repaired end-to-end to a clean operational baseline:
  - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
--- a/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md
+++ b/sync/history/2026-05-06-magatama-runpod-serverless-materialization-check.md
@@ -0,0 +1,65 @@
+# 2026-05-06 — MAGATAMA RunPod serverless materialization check
+
+## Summary
+
+MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint `dheii186pfcuq7`.
+
+## What changed
+
+- Payload alignment was tightened toward the official Axolotl serverless schema:
+  - added `model_type=AutoModelForCausalLM`
+  - added `tokenizer_type=AutoTokenizer`
+  - switched dataset split declaration to `split: train`
+  - switched optimizer from `adamw_8bit` to `adamw_torch_fused`
+- Both submit paths now distinguish between:
+  - `/run` accepted
+  - `/status/{jobId}` actually exists
+- Updated files:
+  - `magatama/packages/dashboard/src/server.ts`
+  - `magatama/scripts/submit_runpod_training.ts`
+
+## Verified behavior
+
+### Full run attempt
+
+- Submitted `magatamallm` 500-step run.
+- Returned job id: `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
+- Reconcile result shortly after:
+  - `not_found_after_submit`
+  - HTTP `404`
+  - `job not found`
+
+### Canary run after payload/schema fix
+
+- Submitted `magatamallm` seed-only canary.
+- Returned job id: `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
+- Immediate submit-side verification saw real queue materialization:
+  - `runpod_status: IN_QUEUE`
+- Reconcile roughly 45 seconds later still observed:
+  - `not_found_after_submit`
+  - HTTP `404`
+  - `job not found`
+
+## Conclusion
+
+The old MAGATAMA bug (blindly trusting `/run`) is fixed.
+
+The remaining problem is now narrower and likely external to MAGATAMA itself:
+
+- RunPod serverless currently accepts the submit and briefly materializes the job as `IN_QUEUE`,
+- but the job disappears before a durable status/progress/completion lifecycle can be observed.
+
+This means the endpoint/release is still not trustworthy enough for a full production training launch until it can keep a job alive beyond the initial queue stage.
+
+## Operational rule
+
+Do **not** treat `submitted` or even a brief `IN_QUEUE` as proof of a usable serverless training run.
+A MAGATAMA serverless training run is only trustworthy when at least one of these is true:
+
+- status progresses to `IN_PROGRESS`, or
+- a durable terminal state is observed with artifact evidence.
+
+## Open next step
+
+- Inspect the actual RunPod serverless endpoint/release configuration and worker-side logs in the RunPod UI.
+- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.
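
Reviewer sketch (not part of the commit): the operational rule recorded above can be expressed as a small pure classifier over what reconcile observes. This is a minimal illustration, not the code in `server.ts` or `submit_runpod_training.ts`. The names `Observation`, `Trust`, `TERMINAL_STATES`, and `classifyRun` are hypothetical, the `hasArtifacts` flag stands in for whatever artifact evidence MAGATAMA actually checks, and the list of terminal states is an assumption about RunPod's job lifecycle.

```typescript
// Hypothetical types; none of these names come from the MAGATAMA codebase.
type Observation =
  | { kind: "status"; status: string; hasArtifacts?: boolean }
  | { kind: "not_found" }; // e.g. HTTP 404 "job not found" on /status/{jobId}

type Trust = "trusted" | "untrusted" | "lost";

// Assumed terminal states of a RunPod serverless job.
const TERMINAL_STATES = new Set(["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"]);

// Classify a run from its observations, oldest first.
// Design choice in this sketch: only the latest observation counts, so a job
// that was seen earlier and then returns 404 is reported as "lost".
function classifyRun(observations: Observation[]): Trust {
  const last = observations[observations.length - 1];
  if (last === undefined) return "untrusted"; // nothing observed yet
  if (last.kind === "not_found") return "lost"; // not_found_after_submit
  if (last.status === "IN_PROGRESS") return "trusted";
  if (TERMINAL_STATES.has(last.status) && last.hasArtifacts === true) {
    return "trusted"; // durable terminal state with artifact evidence
  }
  // "submitted", a brief IN_QUEUE, or a terminal state without artifacts
  return "untrusted";
}

// The canary from this commit: seen as IN_QUEUE, then 404 on reconcile.
const canary: Observation[] = [
  { kind: "status", status: "IN_QUEUE" },
  { kind: "not_found" },
];
console.log(classifyRun(canary)); // prints "lost"
```

Keeping the rule as a pure function over observations would let both submit paths share it and makes it testable without calling the RunPod API.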