sync: record runpod managed endpoint root cause

2026-05-07 10:47:57 +02:00 · 2a3576135c
commit 2a3576135c
parent 21b56ead81
2 changed files with 235 additions and 0 deletions
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -2,6 +2,94 @@

 Updated: 2026-05-07 08:05 UTC

+## Newest Work
+
+- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
+  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
+    - `magatama/packages/dashboard/public/index-v2.html`
+    - real behavior now:
+      - if graph node maps to a real finding, open the existing ticket/finding drawer
+      - if node is only synthetic, show an explicit warning instead of doing nothing
+    - deployed to:
+      - `/opt/magatama/packages/dashboard/public/index-v2.html`
+    - `pm2 restart magatama-dashboard` executed
+  - local Mac train API health rechecked:
+    - `GET http://127.0.0.1:3214/health`
+    - returns `status = ok`
+    - service is idle/reachable, not broken
+  - RunPod heartbeat/UI stream issue was fixed live:
+    - dashboard server now emits keepalive progress messages during:
+      - long `IN_PROGRESS` phases
+      - post-`COMPLETED` artifact verification loops
+    - deployed live to Erik dashboard
+  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
+    - tiny 1-step `tip_llm` canary job:
+      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
+    - observed raw status sequence:
+      - `IN_QUEUE`
+      - `IN_PROGRESS`
+      - `COMPLETED`
+    - **critical detail**:
+      - the `/status/{job}` response contained no `output` field
+      - `/stream/{job}` returned:
+        - `{"status":"COMPLETED","stream":[]}`
+    - interpretation:
+      - the currently configured endpoint is the managed Axolotl serverless endpoint
+      - it does not return a programmatically adoptable artifact reference to MAGATAMA
+      - this is why all lanes keep ending in:
+        - `completed_without_model_artifact`
+  - Erik secrets rechecked:
+    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
+    - therefore the current failure is **not** caused by a missing HF token on Erik
+  - root cause now considered confirmed:
+    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
+    - but not sufficient for MAGATAMA's required full automation:
+      - train
+      - return explicit artifact
+      - adopt locally
+      - smoke-test
+      - create new release alias
+      - switch active alias
+  - code path for the correct architecture is now prepared:
+    - `magatama/packages/fine-tuner/runpod_handler.py`
+    - `magatama/packages/fine-tuner/train_cuda.py`
+    - `magatama/packages/fine-tuner/requirements-runpod.txt`
+    - `magatama/packages/dashboard/src/server.ts`
+  - what changed in that path:
+    - custom RunPod worker now accepts:
+      - `target_model`
+      - `credentials.hf_token`
+    - training script now:
+      - trains lane-specific bundle
+      - uploads the resulting adapter folder to Hugging Face
+      - returns `adapter_repo_id`
+    - dashboard custom-worker submit path now includes:
+      - `run_id`
+      - `target_model`
+      - HF credential pass-through for the worker
+    - dashboard error text is now explicit:
+      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
+  - live deployment status:
+    - updated dashboard server was rebuilt and deployed to Erik
+    - updated custom worker source files were synced into Erik repo state
+    - BUT:
+      - the currently active RunPod endpoint is still the managed Axolotl endpoint
+      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
+  - operational conclusion:
+    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
+    - the final missing infrastructure step is:
+      - build/publish `packages/fine-tuner/Dockerfile.runpod`
+      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
+      - set:
+        - `RUNPOD_WORKER_KIND=custom-magatama`
+        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
+    - only then can MAGATAMA honestly achieve:
+      - automatic training
+      - automatic artifact return
+      - automatic adoption
+      - automatic version bump
+      - automatic alias switch after smoke tests
+
 ## Active Policy

 - Put coordination notes and handoffs in this `sync/` folder and push to Gitea.
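
The following sketch is not part of this commit. It illustrates the decision described in the `sync/CURRENT.md` notes above: a job that reaches `COMPLETED` without an adoptable artifact reference ends as `completed_without_model_artifact`. The payload shape (`status`, optional `output`) follows the canary observation. The location of `adapter_repo_id` inside `output`, the function name `classify_job`, and the labels `pending`, `failed`, and `completed_with_artifact` are assumptions made for illustration, not the actual code in `server.ts`.

```python
from typing import Any, Mapping

# Lifecycle states observed in the canary before completion.
PENDING_STATES = {"IN_QUEUE", "IN_PROGRESS"}


def classify_job(payload: Mapping[str, Any]) -> str:
    """Classify a raw /status/{job} payload (hypothetical helper)."""
    status = payload.get("status")
    if status in PENDING_STATES:
        return "pending"
    if status != "COMPLETED":
        # Any other state is treated as a failure in this sketch.
        return "failed"

    # Assumption: a custom worker would place adapter_repo_id inside `output`.
    output = payload.get("output")
    repo_id = output.get("adapter_repo_id") if isinstance(output, Mapping) else None
    if isinstance(repo_id, str) and repo_id.strip():
        return "completed_with_artifact"

    # The managed endpoint case: COMPLETED, but nothing to adopt.
    return "completed_without_model_artifact"
```

Feeding it the payload seen in the canary, `{"status": "COMPLETED", "stream": []}`, yields `completed_without_model_artifact`, which matches the outcome reported for all lanes.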
--- a/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
+++ b/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
@@ -0,0 +1,147 @@
+# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path
+
+## Summary
+
+We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.
+
+The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.
+
+That means:
+
+- dataset refresh works
+- lane-specific exports work
+- training submit works
+- local adoption API is healthy
+
+But the full automation chain still breaks on the return path.
+
+## Live Findings
+
+### Attack Paths fix guidance
+
+- `Open Fix Guidance` on Attack Paths was a placebo button: clicking it did nothing.
+- Fixed in:
+  - `magatama/packages/dashboard/public/index-v2.html`
+- Live behavior now:
+  - opens the real finding/ticket drawer when the graph node maps to a finding
+  - otherwise shows an explicit warning
+
+### Local train API rechecked
+
+- `GET http://127.0.0.1:3214/health`
+- result:
+  - `status = ok`
+  - service reachable
+  - service idle
+
+Conclusion:
+
+- local adoption/import service is not the current blocker
+
+### RunPod raw status canary
+
+A tiny direct canary was executed against the current endpoint (`dheii186pfcuq7`):
+
+- lane: `tip_llm`
+- steps: `1`
+- job:
+  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
+
+Observed via raw `/status/{job}` polling:
+
+- `IN_QUEUE`
+- `IN_PROGRESS`
+- `COMPLETED`
+
+Critical detail:
+
+- the `/status/{job}` response contained no `output` field
+- `/stream/{job}` returned:
+  - `{"status":"COMPLETED","stream":[]}`
+
+This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
+
+### HF token check
+
+Erik was checked directly:
+
+- `/opt/magatama/secrets/hf-token`
+  - exists
+  - readable
+
+Conclusion:
+
+- the current failure is not a missing Hugging Face token on Erik
+
+## Root Cause
+
+The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.
+
+MAGATAMA needs a worker, and a return path around it, that can:
+
+1. train the lane-specific dataset
+2. upload the resulting adapter/model artifact explicitly
+3. return a machine-readable artifact reference
+4. let MAGATAMA adopt/import that artifact
+5. run smoke tests
+6. bump version
+7. switch the active alias
+
+The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
+
+## Code Completed
+
+Prepared the correct custom-worker path in:
+
+- `magatama/packages/fine-tuner/train_cuda.py`
+- `magatama/packages/fine-tuner/runpod_handler.py`
+- `magatama/packages/fine-tuner/requirements-runpod.txt`
+- `magatama/packages/dashboard/src/server.ts`
+
+### What changed
+
+- custom RunPod worker input now supports:
+  - `target_model`
+  - `credentials.hf_token`
+- `train_cuda.py` now:
+  - trains from the signed MAGATAMA lane bundle
+  - uploads the resulting adapter folder to Hugging Face
+  - returns `adapter_repo_id`
+- dashboard custom-worker submit path now sends:
+  - `run_id`
+  - `target_model`
+  - worker HF credential
+- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
+
+## Live Deployment Status
+
+Deployed live to Erik:
+
+- rebuilt and rsynced dashboard server
+- synced updated custom worker source files into repo state on Erik
+- ran `pm2 restart magatama-dashboard`
+
+Not yet completed in infrastructure:
+
+- the active RunPod endpoint itself is still the managed Axolotl endpoint
+
+## Required Final Infra Step
+
+To get true full automation:
+
+1. build/publish:
+   - `magatama/packages/fine-tuner/Dockerfile.runpod`
+2. create or switch to a custom RunPod serverless endpoint running:
+   - `runpod_handler.py`
+3. set on Erik:
+   - `RUNPOD_WORKER_KIND=custom-magatama`
+   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`
+
+Only then will MAGATAMA be able to:
+
+- pull the lane-specific training pool
+- train on RunPod
+- get back a real adapter artifact
+- adopt it locally into Ollama
+- write a new version number
+- repoint the active alias after smoke tests
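
The following sketch is not part of this commit. It shows one way the "Required Final Infra Step" could be checked before submitting a training run. The variable names `RUNPOD_WORKER_KIND` and `RUNPOD_ENDPOINT_ID`, the value `custom-magatama`, and the managed endpoint ID come from the notes above. The function name `return_path_problems` and the message texts are hypothetical.

```python
import os
from typing import List, Mapping

# Managed Axolotl endpoint named in the notes; it cannot return an artifact.
MANAGED_ENDPOINT_ID = "dheii186pfcuq7"
REQUIRED_WORKER_KIND = "custom-magatama"


def return_path_problems(env: Mapping[str, str]) -> List[str]:
    """Return reasons why the full return path cannot work yet (hypothetical)."""
    problems: List[str] = []

    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    if kind != REQUIRED_WORKER_KIND:
        problems.append(
            f"RUNPOD_WORKER_KIND is {kind!r}, expected {REQUIRED_WORKER_KIND!r}"
        )

    endpoint = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    if not endpoint:
        problems.append("RUNPOD_ENDPOINT_ID is not set")
    elif endpoint == MANAGED_ENDPOINT_ID:
        problems.append(
            "RUNPOD_ENDPOINT_ID still points at the managed Axolotl endpoint"
        )
    return problems


if __name__ == "__main__":
    # Print every blocker found in the current process environment.
    for problem in return_path_problems(os.environ):
        print(problem)
```

An empty list means both settings match what the notes require. It does not prove that the custom endpoint actually runs `runpod_handler.py`, so a canary job remains the real test.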