sync: record runpod managed endpoint root cause
parent 21b56ead81
commit 2a3576135c
@ -2,6 +2,94 @@

Updated: 2026-05-07 08:05 UTC

## Newest Work

- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
    - `magatama/packages/dashboard/public/index-v2.html`
    - real behavior now:
      - if graph node maps to a real finding, open the existing ticket/finding drawer
      - if node is only synthetic, show an explicit warning instead of doing nothing
    - deployed to:
      - `/opt/magatama/packages/dashboard/public/index-v2.html`
      - `pm2 restart magatama-dashboard` executed
  - local Mac train API truth rechecked:
    - `GET http://127.0.0.1:3214/health`
    - returns `status = ok`
    - service is idle/reachable, not broken
  - RunPod heartbeat/UI stream issue was fixed live:
    - dashboard server now emits keepalive progress messages during:
      - long `IN_PROGRESS` phases
      - post-`COMPLETED` artifact verification loops
    - deployed live to Erik dashboard
  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
    - tiny 1-step `tip_llm` canary job:
      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
    - observed raw status sequence:
      - `IN_QUEUE`
      - `IN_PROGRESS`
      - `COMPLETED`
    - **critical truth**:
      - `/status/{job}` returned no `output`
      - `/stream/{job}` returned:
        - `{"status":"COMPLETED","stream":[]}`
    - interpretation:
      - the currently configured endpoint is the managed Axolotl serverless endpoint
      - it does not return a programmatically adoptable artifact reference to MAGATAMA
      - this is why all lanes keep ending in:
        - `completed_without_model_artifact`
  - Erik secrets reality rechecked:
    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
    - therefore the current failure is **not** caused by a missing HF token on Erik
  - root cause now considered confirmed:
    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
    - but not sufficient for MAGATAMA's required full automation:
      - train
      - return explicit artifact
      - adopt locally
      - smoke-test
      - create new release alias
      - switch active alias
  - code path for the correct architecture is now prepared:
    - `magatama/packages/fine-tuner/runpod_handler.py`
    - `magatama/packages/fine-tuner/train_cuda.py`
    - `magatama/packages/fine-tuner/requirements-runpod.txt`
    - `magatama/packages/dashboard/src/server.ts`
  - what changed in that path:
    - custom RunPod worker now accepts:
      - `target_model`
      - `credentials.hf_token`
    - training script now:
      - trains lane-specific bundle
      - uploads the resulting adapter folder to Hugging Face
      - returns `adapter_repo_id`
    - dashboard custom-worker submit path now includes:
      - `run_id`
      - `target_model`
      - HF credential pass-through for the worker
    - dashboard error text is now explicit:
      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
  - live deployment status:
    - updated dashboard server was rebuilt and deployed to Erik
    - updated custom worker source files were synced into Erik repo state
    - BUT:
      - the currently active RunPod endpoint is still the managed Axolotl endpoint
      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
  - operational conclusion:
    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
    - the final missing infrastructure step is:
      - build/publish `packages/fine-tuner/Dockerfile.runpod`
      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
      - set:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
    - only then can MAGATAMA honestly achieve:
      - automatic training
      - automatic artifact return
      - automatic adoption
      - automatic version bump
      - automatic alias switch after smoke tests

## Active Policy

- Put coordination notes and handoffs in this `sync/` folder and push to Gitea.

@ -0,0 +1,147 @@
# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

## Summary

We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.

The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.

That means:

- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy

But the full automation chain still breaks on the return path.

## Live Findings

### Attack Paths fix guidance

- `Open Fix Guidance` on Attack Paths was a placebo button: clicking it did nothing.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning

### Local train API rechecked

- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle

Conclusion:

- local adoption/import service is not the current blocker
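
The health recheck above can be repeated with a small script. This is a minimal sketch, assuming the endpoint answers with a JSON object that carries a `status` field (the note only records `status = ok`, so the exact response shape is an assumption); `is_healthy` and `check_health` are hypothetical helper names, not part of MAGATAMA.

```python
import json
from urllib.request import urlopen

# URL taken from the note above; the JSON response shape is an assumption.
HEALTH_URL = "http://127.0.0.1:3214/health"


def is_healthy(payload: object) -> bool:
    """Return True only when the payload is an object reporting status == "ok"."""
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; network or parse errors count as unhealthy."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return is_healthy(json.load(resp))
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; bad JSON is a ValueError.
        return False
```

Treating every error as "unhealthy" keeps the check usable as a simple gate before a training submit, at the cost of hiding why the service was unreachable.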

### RunPod raw status canary

A tiny direct canary was executed against the same endpoint:

- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`

Observed via raw `/status/{job}` polling:

- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`

Critical detail:

- `/status/{job}` had no `output`
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
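
The canary result can be turned into an explicit check on the terminal status payload. The sketch below is illustrative only: it assumes the artifact reference would arrive as `adapter_repo_id` inside the `output` object of the `/status/{job}` response, and the labels `not_finished` and `completed_with_artifact` are hypothetical. Only `completed_without_model_artifact` and `adapter_repo_id` appear in these notes.

```python
from collections.abc import Mapping
from typing import Any, Optional


def extract_artifact_ref(status_payload: Mapping[str, Any]) -> Optional[str]:
    """Return output.adapter_repo_id if present and non-empty, else None."""
    output = status_payload.get("output")
    if isinstance(output, Mapping):
        ref = output.get("adapter_repo_id")
        if isinstance(ref, str) and ref:
            return ref
    return None


def classify(status_payload: Mapping[str, Any]) -> str:
    """Map a raw status payload onto an outcome label."""
    if status_payload.get("status") != "COMPLETED":
        return "not_finished"  # hypothetical label
    if extract_artifact_ref(status_payload) is None:
        # Matches what the canary saw: COMPLETED, but nothing to adopt.
        return "completed_without_model_artifact"
    return "completed_with_artifact"  # hypothetical label
```

Fed the payloads recorded above (a `COMPLETED` status with no `output`, or the empty `stream` response), `classify` yields `completed_without_model_artifact`, which is the outcome every lane currently ends in.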

### HF token check

Erik was checked directly:

- `/opt/magatama/secrets/hf-token`
  - exists
  - readable

Conclusion:

- the current failure is not a missing Hugging Face token on Erik

## Root Cause

The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that can:

1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias

The managed Axolotl endpoint currently only gives lifecycle state, not an adoptable return artifact.
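
The dependency between steps 3 to 7 can be sketched as a small gate: nothing is adopted without an artifact reference, and the alias only moves after the smoke test passes. This is a hedged illustration, not MAGATAMA's implementation. `run_return_path`, `ReleaseResult`, and the injected `adopt`, `smoke_test`, and `switch_alias` callables are all hypothetical names, and the version bump is assumed to happen inside `adopt`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReleaseResult:
    adopted: bool
    alias_switched: bool
    reason: str


def run_return_path(
    artifact_ref: Optional[str],
    adopt: Callable[[str], str],          # imports the artifact, returns the new version name
    smoke_test: Callable[[str], bool],    # True when the new version passes
    switch_alias: Callable[[str], None],  # repoints the active alias
) -> ReleaseResult:
    """Run adopt -> smoke test -> alias switch, stopping at the first failed gate."""
    if not artifact_ref:
        # Current behavior with the managed endpoint: nothing to adopt.
        return ReleaseResult(False, False, "completed_without_model_artifact")
    version = adopt(artifact_ref)
    if not smoke_test(version):
        # Keep the old alias in place when the candidate fails.
        return ReleaseResult(True, False, "smoke_test_failed")
    switch_alias(version)
    return ReleaseResult(True, True, "ok")
```

Passing the side-effecting steps in as callables keeps the ordering logic testable without touching RunPod or the local adoption API.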

## Code Completed

Prepared the correct custom-worker path in:

- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`

### What changed

- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
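
The worker contract described above might look roughly like the following. This is a sketch under stated assumptions, not the contents of `runpod_handler.py`: it assumes the fields arrive under the job's `input` key (the usual shape for a RunPod serverless handler), and `make_handler`, `train`, and `upload` are hypothetical names standing in for the real training and Hugging Face upload code.

```python
from typing import Any, Callable, Dict


def make_handler(
    train: Callable[[str, str], str],        # (target_model, run_id) -> local adapter dir
    upload: Callable[[str, str, str], str],  # (adapter_dir, hf_token, run_id) -> repo id
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a handler that trains, uploads, and returns an explicit artifact reference."""

    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        payload = job.get("input") or {}
        run_id = payload.get("run_id", "")
        target_model = payload.get("target_model")
        hf_token = (payload.get("credentials") or {}).get("hf_token")
        if not target_model or not hf_token:
            # Fail loudly instead of completing without an artifact.
            return {"error": "target_model and credentials.hf_token are required"}
        adapter_dir = train(target_model, run_id)
        repo_id = upload(adapter_dir, hf_token, run_id)
        # This machine-readable reference is what the managed endpoint never returned.
        return {"adapter_repo_id": repo_id, "run_id": run_id}

    return handler

# In a real worker the resulting handler would be registered with the RunPod SDK,
# e.g. runpod.serverless.start({"handler": handler}).
```

Returning `adapter_repo_id` in the handler result is the piece that lets the dashboard distinguish a real artifact from a bare `COMPLETED` state.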

## Live Deployment Status

Deployed live to Erik:

- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted `magatama-dashboard` via `pm2`

Not yet completed in infrastructure:

- the active RunPod endpoint itself is still the managed Axolotl endpoint

## Required Final Infra Step

To get true full automation:

1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`
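
How the two variables from step 3 might be consumed is sketched below. The actual dashboard reads them in `server.ts`; this Python version only illustrates the intended decision, and `RunpodConfig`, `load_runpod_config`, and the `expects_artifact` flag are hypothetical names.

```python
import os
from typing import Mapping, NamedTuple


class RunpodConfig(NamedTuple):
    worker_kind: str
    endpoint_id: str
    expects_artifact: bool  # True only for the custom MAGATAMA worker


def load_runpod_config(env: Mapping[str, str] = os.environ) -> RunpodConfig:
    """Read the RunPod settings and decide whether an artifact return is expected."""
    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    # Any other worker kind is treated as lifecycle-only (assumption).
    return RunpodConfig(kind, endpoint_id, kind == "custom-magatama")
```

With such a flag, a `COMPLETED` job without `adapter_repo_id` can be reported as a misconfiguration on the custom worker and as expected behavior on the managed endpoint.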

Only then will MAGATAMA be able to:

- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests