sync: record runpod managed endpoint root cause

2026-05-07 10:47:57 +02:00 · 2a3576135c
commit 2a3576135c
parent 21b56ead81
2 changed files with 235 additions and 0 deletions
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -2,6 +2,94 @@
 Updated: 2026-05-07 08:05 UTC
 ## Newest Work
 - MAGATAMA RunPod training return-path deep dive on 2026-05-07:
  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
    - `magatama/packages/dashboard/public/index-v2.html`
    - real behavior now:
      - if graph node maps to a real finding, open the existing ticket/finding drawer
      - if node is only synthetic, show an explicit warning instead of doing nothing
    - deployed to:
      - `/opt/magatama/packages/dashboard/public/index-v2.html`
    - `pm2 restart magatama-dashboard` executed
  - local Mac train API rechecked:
    - `GET http://127.0.0.1:3214/health`
    - returns `status = ok`
    - service is idle/reachable, not broken
  - RunPod heartbeat/UI stream issue was fixed live:
    - dashboard server now emits keepalive progress messages during:
      - long `IN_PROGRESS` phases
      - post-`COMPLETED` artifact verification loops
    - deployed live to Erik dashboard
  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
    - tiny 1-step `tip_llm` canary job:
      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
    - observed raw status sequence:
      - `IN_QUEUE`
      - `IN_PROGRESS`
      - `COMPLETED`
    - **critical detail**:
      - `/status/{job}` returned no `output`
      - `/stream/{job}` returned:
        - `{"status":"COMPLETED","stream":[]}`
    - interpretation:
      - the currently configured endpoint is the managed Axolotl serverless endpoint
      - it does not return a programmatically adoptable artifact reference to MAGATAMA
      - this is why all lanes keep ending in:
        - `completed_without_model_artifact`
  - Erik secrets rechecked:
    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
    - therefore the current failure is **not** caused by a missing HF token on Erik
  - root cause now considered confirmed:
    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
    - but not sufficient for MAGATAMA's required full automation:
      - train
      - return explicit artifact
      - adopt locally
      - smoke-test
      - create new release alias
      - switch active alias
  - code path for the correct architecture is now prepared:
    - `magatama/packages/fine-tuner/runpod_handler.py`
    - `magatama/packages/fine-tuner/train_cuda.py`
    - `magatama/packages/fine-tuner/requirements-runpod.txt`
    - `magatama/packages/dashboard/src/server.ts`
  - what changed in that path:
    - custom RunPod worker now accepts:
      - `target_model`
      - `credentials.hf_token`
    - training script now:
      - trains the lane-specific bundle
      - uploads the resulting adapter folder to Hugging Face
      - returns `adapter_repo_id`
    - dashboard custom-worker submit path now includes:
      - `run_id`
      - `target_model`
      - HF credential pass-through for the worker
    - dashboard error text is now explicit:
      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
  - live deployment status:
    - updated dashboard server was rebuilt and deployed to Erik
    - updated custom worker source files were synced into Erik repo state
    - BUT:
      - the currently active RunPod endpoint is still the managed Axolotl endpoint
      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
  - operational conclusion:
    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
    - the final missing infrastructure step is:
      - build/publish `packages/fine-tuner/Dockerfile.runpod`
      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
      - set:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
    - only then can MAGATAMA honestly achieve:
      - automatic training
      - automatic artifact return
      - automatic adoption
      - automatic version bump
      - automatic alias switch after smoke tests
 ## Active Policy
 - Put coordination notes and handoffs in this `sync/` folder and push to Gitea.
--- a/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
+++ b/sync/history/2026-05-07-magatama-runpod-managed-endpoint-root-cause-and-custom-worker-path.md
@@ -0,0 +1,147 @@
 # 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path
 ## Summary
 We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.
 The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.
 The surrounding pieces all work:
 - dataset refresh works
 - lane-specific exports work
 - training submit works
 - local adoption API is healthy
 But the full automation chain still breaks on the return path.
 ## Live Findings
 ### Attack Paths fix guidance
 - `Open Fix Guidance` on Attack Paths was a placebo button.
 - Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
 - Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning
 ### Local train API rechecked
 - `GET http://127.0.0.1:3214/health`
 - result:
  - `status = ok`
  - service reachable
  - service idle
 Conclusion:
 - local adoption/import service is not the current blocker
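A minimal sketch of the health recheck above, using only the standard library. The URL comes from the notes; the assumption that the response body is a JSON object with a top-level `status` field is ours (the notes only record `status = ok`), and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "http://127.0.0.1:3214/health"  # local train API, as recorded in the notes


def is_healthy(body: str) -> bool:
    """Return True when the health body reports status 'ok'.

    Assumption: the body is a JSON object with a top-level "status" field.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_local_train_api(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; False on any network error or non-ok body."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8"))
    except OSError:  # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False
```

Keeping the parsing in `is_healthy` separate from the network call makes the check easy to exercise without the service running.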
 ### RunPod raw status canary
 A tiny direct canary job was executed against the same endpoint (`dheii186pfcuq7`):
 - lane: `tip_llm`
 - steps: `1`
 - job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
 Observed via raw `/status/{job}` polling:
 - `IN_QUEUE`
 - `IN_PROGRESS`
 - `COMPLETED`
 Critical detail:
 - `/status/{job}` had no `output`
 - `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`
 This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
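The canary result can be expressed as a small classifier over the raw status payload. This is only a sketch: `completed_without_model_artifact` and `adapter_repo_id` come from these notes, while the other labels, the function name, and the assumption that the artifact reference sits directly under `output` are illustrative.

```python
from typing import Any, Mapping

# Field the custom worker is described as returning; its exact position
# inside "output" is an assumption made for this sketch.
ARTIFACT_KEY = "adapter_repo_id"


def classify_job(status_payload: Mapping[str, Any]) -> str:
    """Map a raw RunPod status payload to a coarse outcome label."""
    status = status_payload.get("status")
    if status in ("IN_QUEUE", "IN_PROGRESS"):
        return "pending"  # hypothetical label
    if status != "COMPLETED":
        return "failed"  # hypothetical label for any other terminal state
    output = status_payload.get("output")
    if isinstance(output, Mapping) and output.get(ARTIFACT_KEY):
        return "completed_with_model_artifact"  # hypothetical label
    # COMPLETED but nothing adoptable came back: the case observed in the canary
    return "completed_without_model_artifact"
```

Fed the observed `/stream/{job}` body, `classify_job({"status": "COMPLETED", "stream": []})` lands in the `completed_without_model_artifact` branch, which is where all lanes currently end.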
 ### HF token check
 Erik was checked directly:
 - `/opt/magatama/secrets/hf-token`
  - exists
  - readable
 Conclusion:
 - the current failure is not a missing Hugging Face token on Erik
 ## Root Cause
 The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.
 MAGATAMA needs a worker that can:
 1. train on the lane-specific dataset
 2. upload the resulting adapter/model artifact explicitly
 3. return a machine-readable artifact reference
 4. let MAGATAMA adopt/import that artifact
 5. run smoke tests
 6. bump version
 7. switch the active alias
 The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
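The seven steps above form a strict chain: each one depends on the one before it, so the alias must never switch if an earlier step did not succeed. A rough sketch of that gating follows; the stage names mirror the list, but the runner, the context dictionary, and the placeholder steps are hypothetical and stand in for the real work.

```python
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, object]
Stage = Tuple[str, Callable[[Context], bool]]


def run_chain(stages: List[Stage], ctx: Context) -> Tuple[List[str], Optional[str]]:
    """Run stages in order; return (completed stage names, first failed stage or None)."""
    completed: List[str] = []
    for name, step in stages:
        if not step(ctx):
            return completed, name  # stop here; later stages never run
        completed.append(name)
    return completed, None


def has_artifact_ref(ctx: Context) -> bool:
    # The return path only succeeds if the worker handed back a reference.
    return bool(ctx.get("adapter_repo_id"))


def always_ok(ctx: Context) -> bool:
    # Placeholder for real work (training, upload, adoption, and so on).
    return True


STAGES: List[Stage] = [
    ("train", always_ok),
    ("upload_artifact", always_ok),
    ("return_artifact_ref", has_artifact_ref),
    ("adopt", always_ok),
    ("smoke_test", always_ok),
    ("bump_version", always_ok),
    ("switch_alias", always_ok),
]
```

With an empty context, which models a worker that returns no artifact reference, `run_chain(STAGES, {})` stops at `return_artifact_ref` and never reaches adoption or the alias switch.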
 ## Code Completed
 Prepared the correct custom-worker path in:
 - `magatama/packages/fine-tuner/train_cuda.py`
 - `magatama/packages/fine-tuner/runpod_handler.py`
 - `magatama/packages/fine-tuner/requirements-runpod.txt`
 - `magatama/packages/dashboard/src/server.ts`
 ### What changed
 - custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
 - `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
 - dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
 - dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
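To make the input/output contract concrete, here is a hedged sketch of the submit payload and a handler shape. The field names `run_id`, `target_model`, `credentials.hf_token`, and `adapter_repo_id` come from the notes. The function names and the injected `train_and_upload` callable are hypothetical, and this is not the actual content of `runpod_handler.py` or `server.ts`.

```python
from typing import Any, Callable, Dict


def build_submit_payload(run_id: str, target_model: str, hf_token: str) -> Dict[str, Any]:
    """Request body for a job submission; RunPod wraps job arguments under "input"."""
    return {
        "input": {
            "run_id": run_id,
            "target_model": target_model,
            "credentials": {"hf_token": hf_token},
        }
    }


def make_handler(
    train_and_upload: Callable[[str, str, str], str],
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a job handler around a training function.

    `train_and_upload(run_id, target_model, hf_token)` is a hypothetical stand-in
    for the real training script; it must return the uploaded adapter repo id.
    """

    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        job_input = job.get("input") or {}
        target_model = job_input.get("target_model")
        hf_token = (job_input.get("credentials") or {}).get("hf_token")
        if not target_model or not hf_token:
            return {"error": "target_model and credentials.hf_token are required"}
        repo_id = train_and_upload(job_input.get("run_id", ""), target_model, hf_token)
        # The explicit, machine-readable artifact reference MAGATAMA can adopt
        return {"adapter_repo_id": repo_id}

    return handler
```

In a real worker the handler would typically be registered with `runpod.serverless.start({"handler": handler})`, and the upload step could use `huggingface_hub.HfApi().upload_folder(...)`; whether the actual scripts do exactly that is not stated in these notes.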
 ## Live Deployment Status
 Deployed live to Erik:
 - rebuilt and rsynced dashboard server
 - synced updated custom worker source files into repo state on Erik
 - restarted the dashboard with `pm2 restart magatama-dashboard`
 Not yet completed in infrastructure:
 - the active RunPod endpoint itself is still the managed Axolotl endpoint
 ## Required Final Infra Step
 To get true full automation:
 1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
 2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
 3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`
 Only then will MAGATAMA be able to:
 - pull the lane-specific training pool
 - train on RunPod
 - get back a real adapter artifact
 - adopt it locally into Ollama
 - write a new version number
 - repoint the active alias after smoke tests
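The two environment settings from the final infra step can be checked up front before a run is submitted. The variable names and the `custom-magatama` value come from the notes; the function itself is a hypothetical sketch, not existing dashboard code.

```python
import os
from typing import Mapping, Optional

CUSTOM_KIND = "custom-magatama"  # worker kind named in the notes


def automation_ready(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the environment points at a custom worker endpoint.

    Rejects a missing or empty endpoint id and an unfilled "<...>" placeholder.
    """
    env = os.environ if env is None else env
    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    endpoint = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    return kind == CUSTOM_KIND and bool(endpoint) and not endpoint.startswith("<")
```

This only validates configuration. It cannot tell whether the configured endpoint id really runs the custom worker image, so a canary job that checks for a returned `adapter_repo_id` remains the real proof.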