sync: record runpod managed endpoint root cause

Rene Fichtmueller 2026-05-07 10:47:57 +02:00
parent 21b56ead81
commit 2a3576135c
2 changed files with 235 additions and 0 deletions

@@ -2,6 +2,94 @@
Updated: 2026-05-07 08:05 UTC
## Newest Work
- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
    - `magatama/packages/dashboard/public/index-v2.html`
    - real behavior now:
      - if graph node maps to a real finding, open the existing ticket/finding drawer
      - if node is only synthetic, show an explicit warning instead of doing nothing
    - deployed to:
      - `/opt/magatama/packages/dashboard/public/index-v2.html`
      - `pm2 restart magatama-dashboard` executed
  - local Mac train API truth rechecked:
    - `GET http://127.0.0.1:3214/health`
      - returns `status = ok`
      - service is idle/reachable, not broken
  - RunPod heartbeat/UI stream issue was fixed live:
    - dashboard server now emits keepalive progress messages during:
      - long `IN_PROGRESS` phases
      - post-`COMPLETED` artifact verification loops
    - deployed live to Erik dashboard
  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
    - tiny 1-step `tip_llm` canary job:
      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
    - observed raw status sequence:
      - `IN_QUEUE`
      - `IN_PROGRESS`
      - `COMPLETED`
    - **critical truth**:
      - `/status/{job}` returned no `output`
      - `/stream/{job}` returned:
        - `{"status":"COMPLETED","stream":[]}`
    - interpretation:
      - the currently configured endpoint is the managed Axolotl serverless endpoint
      - it does not return a programmatically adoptable artifact reference to MAGATAMA
      - this is why all lanes keep ending in:
        - `completed_without_model_artifact`
  - Erik secrets reality rechecked:
    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
    - therefore the current failure is **not** caused by a missing HF token on Erik
  - root cause now considered confirmed:
    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
    - but not sufficient for MAGATAMA's required full automation:
      - train
      - return explicit artifact
      - adopt locally
      - smoke-test
      - create new release alias
      - switch active alias
  - code path for the correct architecture is now prepared:
    - `magatama/packages/fine-tuner/runpod_handler.py`
    - `magatama/packages/fine-tuner/train_cuda.py`
    - `magatama/packages/fine-tuner/requirements-runpod.txt`
    - `magatama/packages/dashboard/src/server.ts`
  - what changed in that path:
    - custom RunPod worker now accepts:
      - `target_model`
      - `credentials.hf_token`
    - training script now:
      - trains lane-specific bundle
      - uploads the resulting adapter folder to Hugging Face
      - returns `adapter_repo_id`
    - dashboard custom-worker submit path now includes:
      - `run_id`
      - `target_model`
      - HF credential pass-through for the worker
    - dashboard error text is now explicit:
      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
  - live deployment status:
    - updated dashboard server was rebuilt and deployed to Erik
    - updated custom worker source files were synced into Erik repo state
    - BUT:
      - the currently active RunPod endpoint is still the managed Axolotl endpoint
      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
  - operational conclusion:
    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
    - the final missing infrastructure step is:
      - build/publish `packages/fine-tuner/Dockerfile.runpod`
      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
      - set:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
    - only then can MAGATAMA honestly achieve:
      - automatic training
      - automatic artifact return
      - automatic adoption
      - automatic version bump
      - automatic alias switch after smoke tests
## Active Policy
- Put coordination notes and handoffs in this `sync/` folder and push to Gitea.

@@ -0,0 +1,147 @@
# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path
## Summary
We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.
The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.
That means:
- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy
But the full automation chain still breaks on the return path.
## Live Findings
### Attack Paths fix guidance
- `Open Fix Guidance` on Attack Paths was a placebo button.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning
### Local train API rechecked
- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle
Conclusion:
- local adoption/import service is not the current blocker
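
The recheck above can be repeated with a small script. This is a minimal sketch, assuming the health endpoint answers with a JSON object that carries a `status` field (the notes only record `status = ok`, so the exact response shape is an assumption). The function names `is_healthy` and `check_health` are hypothetical and not part of the MAGATAMA code base.

```python
import json
from urllib.request import urlopen

# URL taken from the notes above.
HEALTH_URL = "http://127.0.0.1:3214/health"


def is_healthy(payload: object) -> bool:
    """Return True only for a JSON object whose `status` field equals "ok".

    The `status` field name is an assumption about the response shape.
    """
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; network or parse errors count as unhealthy."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return is_healthy(json.load(response))
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; bad JSON raises ValueError.
        return False
```

Keeping the interpretation (`is_healthy`) separate from the network call makes the check easy to reuse against a recorded response.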
### RunPod raw status canary
A tiny direct canary was executed against the same endpoint:
- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
Observed via raw `/status/{job}` polling:
- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`
Critical detail:
- `/status/{job}` had no `output`
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`
This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
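
The distinction the canary exposed can be sketched as a small classifier over the raw `/status/{job}` payload. This is an illustration under assumptions, not the dashboard's actual logic: it assumes an adoptable artifact would show up as `output.adapter_repo_id` (the key the custom worker is described as returning), and the labels `not_completed` and `completed_with_model_artifact` are hypothetical. Only `completed_without_model_artifact` appears in the existing notes.

```python
def classify_job(status_payload: dict) -> str:
    """Classify a raw RunPod status payload by whether it carries an artifact.

    Assumes an adoptable artifact is reported as output["adapter_repo_id"].
    """
    if status_payload.get("status") != "COMPLETED":
        return "not_completed"  # hypothetical label
    output = status_payload.get("output")
    if isinstance(output, dict) and output.get("adapter_repo_id"):
        return "completed_with_model_artifact"  # hypothetical label
    # Lifecycle finished, but nothing MAGATAMA could adopt came back.
    return "completed_without_model_artifact"
```

Applied to the canary, a payload with `status` set to `COMPLETED` and no `output` falls into the last branch, which matches the behaviour observed on the managed endpoint.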
### HF token check
Erik was checked directly:
- `/opt/magatama/secrets/hf-token`
  - exists
  - readable
Conclusion:
- the current failure is not a missing Hugging Face token on Erik
## Root Cause
The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.
MAGATAMA needs a worker that can:
1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias
The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
## Code Completed
Prepared the correct custom-worker path in:
- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`
### What changed
- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
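
The input and output contract described above could look roughly like the following. This is a hedged sketch, not the contents of `runpod_handler.py`: the `make_handler` factory and the injected `train` and `upload` callables are hypothetical stand-ins for the real training and Hugging Face upload steps, and the error payload shape is an assumption. Only the field names `run_id`, `target_model`, `credentials.hf_token`, and `adapter_repo_id` come from the notes.

```python
from typing import Callable, Dict


def make_handler(
    train: Callable[[str, str], str],
    upload: Callable[[str, str, str], str],
) -> Callable[[Dict], Dict]:
    """Build a job handler from injected (hypothetical) train/upload steps.

    train(target_model, run_id) -> local adapter directory
    upload(adapter_dir, target_model, hf_token) -> adapter repo id
    """

    def handler(job: Dict) -> Dict:
        job_input = job.get("input") or {}
        run_id = job_input.get("run_id", "unknown")
        target_model = job_input.get("target_model")
        hf_token = (job_input.get("credentials") or {}).get("hf_token")
        if not target_model or not hf_token:
            # Assumed error shape; fail early instead of training without
            # a way to return the artifact.
            return {"error": "target_model and credentials.hf_token are required"}
        adapter_dir = train(target_model, run_id)
        adapter_repo_id = upload(adapter_dir, target_model, hf_token)
        # The explicit, machine-readable reference the dashboard can adopt.
        return {"run_id": run_id, "adapter_repo_id": adapter_repo_id}

    return handler
```

In a real worker the resulting function would presumably be registered with the RunPod SDK via `runpod.serverless.start({"handler": handler})`, and `upload` could wrap `huggingface_hub`'s `HfApi.upload_folder`. Both are left out here so the sketch stays free of external dependencies.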
## Live Deployment Status
Deployed live to Erik:
- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted `magatama-dashboard` via `pm2 restart magatama-dashboard`
Not yet completed in infrastructure:
- the active RunPod endpoint itself is still the managed Axolotl endpoint
## Required Final Infra Step
To get true full automation:
1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`
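
A pre-flight check for the two settings in step 3 might be sketched like this. It is an illustration only: the function name `full_automation_ready` and the message texts are hypothetical, and how the dashboard actually reads these variables is not covered by these notes. It also cannot tell whether the configured ID really points at the custom worker, only that both values are set as described.

```python
import os
from typing import Mapping, Tuple


def full_automation_ready(env: Mapping[str, str] = os.environ) -> Tuple[bool, str]:
    """Report whether the environment selects the custom MAGATAMA worker."""
    kind = env.get("RUNPOD_WORKER_KIND", "")
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "")
    if kind != "custom-magatama":
        return False, "RUNPOD_WORKER_KIND is not custom-magatama"
    if not endpoint_id:
        return False, "RUNPOD_ENDPOINT_ID is not set"
    return True, "custom worker configured"
```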
Only then will MAGATAMA be able to:
- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests