From 2a3576135c963f21c98bc5b2df0593cc35f2dbd4 Mon Sep 17 00:00:00 2001
From: Rene Fichtmueller
Date: Thu, 7 May 2026 10:47:57 +0200
Subject: [PATCH] sync: record runpod managed endpoint root cause

---
 sync/CURRENT.md                               |  88 +++++++++++
 ...point-root-cause-and-custom-worker-path.md | 147 ++++++++++++++++++
 2 files changed, 235 insertions(+)
 create mode 100644 sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md

diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index f0a5cc3..34f153c 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -2,6 +2,94 @@
 
 Updated: 2026-05-07 08:05 UTC
 
+## Newest Work
+
+- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
+  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
+    - `magatama/packages/dashboard/public/index-v2.html`
+    - real behavior now:
+      - if graph node maps to a real finding, open the existing ticket/finding drawer
+      - if node is only synthetic, show an explicit warning instead of doing nothing
+    - deployed to:
+      - `/opt/magatama/packages/dashboard/public/index-v2.html`
+    - `pm2 restart magatama-dashboard` executed
+  - local Mac train API truth rechecked:
+    - `GET http://127.0.0.1:3214/health`
+    - returns `status = ok`
+    - service is idle/reachable, not broken
+  - RunPod heartbeat/UI stream issue was fixed live:
+    - dashboard server now emits keepalive progress messages during:
+      - long `IN_PROGRESS` phases
+      - post-`COMPLETED` artifact verification loops
+    - deployed live to Erik dashboard
+  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
+    - tiny 1-step `tip_llm` canary job:
+      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
+    - observed raw status sequence:
+      - `IN_QUEUE`
+      - `IN_PROGRESS`
+      - `COMPLETED`
+    - **critical truth**:
+      - `/status/{job}` returned no `output`
+      - `/stream/{job}` returned:
+        - `{"status":"COMPLETED","stream":[]}`
+    - interpretation:
+      - the currently configured endpoint is the managed Axolotl serverless endpoint
+      - it does not return a programmatically adoptable artifact reference to MAGATAMA
+      - this is why all lanes keep ending in:
+        - `completed_without_model_artifact`
+  - Erik secrets reality rechecked:
+    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
+    - therefore the current failure is **not** caused by a missing HF token on Erik
+  - root cause now considered confirmed:
+    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
+    - but not sufficient for MAGATAMA's required full automation:
+      - train
+      - return explicit artifact
+      - adopt locally
+      - smoke-test
+      - create new release alias
+      - switch active alias
+  - code path for the correct architecture is now prepared:
+    - `magatama/packages/fine-tuner/runpod_handler.py`
+    - `magatama/packages/fine-tuner/train_cuda.py`
+    - `magatama/packages/fine-tuner/requirements-runpod.txt`
+    - `magatama/packages/dashboard/src/server.ts`
+  - what changed in that path:
+    - custom RunPod worker now accepts:
+      - `target_model`
+      - `credentials.hf_token`
+    - training script now:
+      - trains lane-specific bundle
+      - uploads the resulting adapter folder to Hugging Face
+      - returns `adapter_repo_id`
+    - dashboard custom-worker submit path now includes:
+      - `run_id`
+      - `target_model`
+      - HF credential pass-through for the worker
+    - dashboard error text is now explicit:
+      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
+  - live deployment status:
+    - updated dashboard server was rebuilt and deployed to Erik
+    - updated custom worker source files were synced into Erik repo state
+    - BUT:
+      - the currently active RunPod endpoint is still the managed Axolotl endpoint
+      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
+  - operational conclusion:
+    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
+    - the final missing infrastructure step is:
+      - build/publish `packages/fine-tuner/Dockerfile.runpod`
+      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
+      - set:
+        - `RUNPOD_WORKER_KIND=custom-magatama`
+        - `RUNPOD_ENDPOINT_ID=`
+    - only then can MAGATAMA honestly achieve:
+      - automatic training
+      - automatic artifact return
+      - automatic adoption
+      - automatic version bump
+      - automatic alias switch after smoke tests
+
 ## Active Policy
 
 - Put coordination notes and handoffs in this `sync/` folder and push to Gitea.
diff --git a/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md b/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
new file mode 100644
index 0000000..12a4b17
--- /dev/null
+++ b/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
@@ -0,0 +1,147 @@
+# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path
+
+## Summary
+
+We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.
+
+The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.
+
+That means:
+
+- dataset refresh works
+- lane-specific exports work
+- training submit works
+- local adoption API is healthy
+
+But the full automation chain still breaks on the return path.
+
+## Live Findings
+
+### Attack Paths fix guidance
+
+- `Open Fix Guidance` on Attack Paths was a placebo button.
+- Fixed in:
+  - `magatama/packages/dashboard/public/index-v2.html`
+- Live behavior now:
+  - opens the real finding/ticket drawer when the graph node maps to a finding
+  - otherwise shows an explicit warning
+
+### Local train API rechecked
+
+- `GET http://127.0.0.1:3214/health`
+- result:
+  - `status = ok`
+  - service reachable
+  - service idle
+
+Conclusion:
+
+- local adoption/import service is not the current blocker
+
+### RunPod raw status canary
+
+A tiny direct canary was executed against the same endpoint:
+
+- lane: `tip_llm`
+- steps: `1`
+- job:
+  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
+
+Observed via raw `/status/{job}` polling:
+
+- `IN_QUEUE`
+- `IN_PROGRESS`
+- `COMPLETED`
+
+Critical detail:
+
+- `/status/{job}` had no `output`
+- `/stream/{job}` returned:
+  - `{"status":"COMPLETED","stream":[]}`
+
+This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
+
+### HF token check
+
+Erik was checked directly:
+
+- `/opt/magatama/secrets/hf-token`
+  - exists
+  - readable
+
+Conclusion:
+
+- the current failure is not a missing Hugging Face token on Erik
+
+## Root Cause
+
+The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.
+
+MAGATAMA needs a worker that can:
+
+1. train the lane-specific dataset
+2. upload the resulting adapter/model artifact explicitly
+3. return a machine-readable artifact reference
+4. let MAGATAMA adopt/import that artifact
+5. run smoke tests
+6. bump version
+7. switch the active alias
+
+The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
+
+## Code Completed
+
+Prepared the correct custom-worker path in:
+
+- `magatama/packages/fine-tuner/train_cuda.py`
+- `magatama/packages/fine-tuner/runpod_handler.py`
+- `magatama/packages/fine-tuner/requirements-runpod.txt`
+- `magatama/packages/dashboard/src/server.ts`
+
+### What changed
+
+- custom RunPod worker input now supports:
+  - `target_model`
+  - `credentials.hf_token`
+- `train_cuda.py` now:
+  - trains from the signed MAGATAMA lane bundle
+  - uploads the resulting adapter folder to Hugging Face
+  - returns `adapter_repo_id`
+- dashboard custom-worker submit path now sends:
+  - `run_id`
+  - `target_model`
+  - worker HF credential
+- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
+
+## Live Deployment Status
+
+Deployed live to Erik:
+
+- rebuilt and rsynced dashboard server
+- synced updated custom worker source files into repo state on Erik
+- restarted `pm2 magatama-dashboard`
+
+Not yet completed in infrastructure:
+
+- the active RunPod endpoint itself is still the managed Axolotl endpoint
+
+## Required Final Infra Step
+
+To get true full automation:
+
+1. build/publish:
+   - `magatama/packages/fine-tuner/Dockerfile.runpod`
+2. create or switch to a custom RunPod serverless endpoint running:
+   - `runpod_handler.py`
+3. set on Erik:
+   - `RUNPOD_WORKER_KIND=custom-magatama`
+   - `RUNPOD_ENDPOINT_ID=`
+
+Only then will MAGATAMA be able to:
+
+- pull the lane-specific training pool
+- train on RunPod
+- get back a real adapter artifact
+- adopt it locally into Ollama
+- write a new version number
+- repoint the active alias after smoke tests
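
Reviewer sketch (not part of the patch): the return-path check described in the notes can be illustrated roughly as below. It is a minimal sketch, not the actual `server.ts` logic. The function name `classify_runpod_result` and the assumption that the custom worker places `adapter_repo_id` inside the job's `output` object are hypothetical; only the status names, the empty `/stream` payload, `adapter_repo_id`, and the `completed_without_model_artifact` label come from the notes.

```python
from typing import Any, Mapping, Optional, Tuple


def classify_runpod_result(
    status_payload: Mapping[str, Any],
    stream_payload: Optional[Mapping[str, Any]] = None,
) -> Tuple[str, Optional[str]]:
    """Classify raw /status and /stream payloads for the adoption step.

    Returns (state, adapter_repo_id). Hypothetical helper for illustration.
    """
    if status_payload.get("status") != "COMPLETED":
        # IN_QUEUE, IN_PROGRESS, or any other non-terminal/non-success state.
        return ("not_completed", None)

    # Assumption: the custom worker returns its result under `output`,
    # with the Hugging Face repo in `adapter_repo_id`.
    output = status_payload.get("output")
    repo_id = output.get("adapter_repo_id") if isinstance(output, Mapping) else None
    if isinstance(repo_id, str) and repo_id.strip():
        return ("adoptable", repo_id.strip())

    # COMPLETED but nothing to adopt. The stream payload is accepted only so
    # callers can pass both raw responses; an empty stream changes nothing.
    _ = stream_payload
    return ("completed_without_model_artifact", None)


if __name__ == "__main__":
    # Payloads shaped like the canary observation in the notes.
    managed_status = {"status": "COMPLETED"}
    managed_stream = {"status": "COMPLETED", "stream": []}
    print(classify_runpod_result(managed_status, managed_stream))
```

With the payloads observed from the managed endpoint, the job is classified as completed but not adoptable, which matches the behavior the notes describe for every lane. A payload that carries a non-empty `adapter_repo_id` would be the signal to continue with adoption and smoke tests.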