sync: record runpod managed endpoint root cause
parent 21b56ead81
commit 2a3576135c
@ -2,6 +2,94 @@
Updated: 2026-05-07 08:05 UTC

## Newest Work

- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
    - `magatama/packages/dashboard/public/index-v2.html`
    - real behavior now:
      - if graph node maps to a real finding, open the existing ticket/finding drawer
      - if node is only synthetic, show an explicit warning instead of doing nothing
    - deployed to:
      - `/opt/magatama/packages/dashboard/public/index-v2.html`
    - `pm2 restart magatama-dashboard` executed
  - local Mac train API truth rechecked:
    - `GET http://127.0.0.1:3214/health`
      - returns `status = ok`
      - service is idle/reachable, not broken
  - RunPod heartbeat/UI stream issue was fixed live:
    - dashboard server now emits keepalive progress messages during:
      - long `IN_PROGRESS` phases
      - post-`COMPLETED` artifact verification loops
    - deployed live to Erik dashboard
  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
    - tiny 1-step `tip_llm` canary job:
      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
    - observed raw status sequence:
      - `IN_QUEUE`
      - `IN_PROGRESS`
      - `COMPLETED`
    - **critical truth**:
      - `/status/{job}` returned no `output`
      - `/stream/{job}` returned:
        - `{"status":"COMPLETED","stream":[]}`
    - interpretation:
      - the currently configured endpoint is the managed Axolotl serverless endpoint
      - it does not return a programmatically adoptable artifact reference to MAGATAMA
      - this is why all lanes keep ending in:
        - `completed_without_model_artifact`
  - Erik secrets reality rechecked:
    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
    - therefore the current failure is **not** caused by a missing HF token on Erik
  - root cause now considered confirmed:
    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
    - but not sufficient for MAGATAMA's required full automation:
      - train
      - return explicit artifact
      - adopt locally
      - smoke-test
      - create new release alias
      - switch active alias
  - code path for the correct architecture is now prepared:
    - `magatama/packages/fine-tuner/runpod_handler.py`
    - `magatama/packages/fine-tuner/train_cuda.py`
    - `magatama/packages/fine-tuner/requirements-runpod.txt`
    - `magatama/packages/dashboard/src/server.ts`
  - what changed in that path:
    - custom RunPod worker now accepts:
      - `target_model`
      - `credentials.hf_token`
    - training script now:
      - trains lane-specific bundle
      - uploads the resulting adapter folder to Hugging Face
      - returns `adapter_repo_id`
    - dashboard custom-worker submit path now includes:
      - `run_id`
      - `target_model`
      - HF credential pass-through for the worker
    - dashboard error text is now explicit:
      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
  - live deployment status:
    - updated dashboard server was rebuilt and deployed to Erik
    - updated custom worker source files were synced into Erik repo state
    - BUT:
      - the currently active RunPod endpoint is still the managed Axolotl endpoint
      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
  - operational conclusion:
    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
    - the final missing infrastructure step is:
      - build/publish `packages/fine-tuner/Dockerfile.runpod`
      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
      - set:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
    - only then can MAGATAMA honestly achieve:
      - automatic training
      - automatic artifact return
      - automatic adoption
      - automatic version bump
      - automatic alias switch after smoke tests
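
The keepalive behavior noted above (progress messages during long `IN_PROGRESS` phases) can be illustrated with a small polling loop. This is only a sketch in Python, not the dashboard's actual TypeScript code: `poll_with_keepalive`, the message shape, the interval, and the set of terminal statuses are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Iterator

# Assumed terminal statuses; only COMPLETED was observed in these notes.
TERMINAL = ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT")


def poll_with_keepalive(
    fetch_status: Callable[[], str],
    poll_interval_s: float = 5.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict[str, object]]:
    """Yield one message per poll so the UI stream never goes silent.

    `fetch_status` is a hypothetical callable that returns the raw job
    status string, for example "IN_QUEUE" or "IN_PROGRESS".
    """
    last = None
    for attempt in range(1, max_polls + 1):
        status = fetch_status()
        if status != last:
            # Status changed: report the transition.
            yield {"type": "status", "status": status, "poll": attempt}
            last = status
        else:
            # Status unchanged: emit a keepalive instead of staying quiet.
            yield {"type": "keepalive", "status": status, "poll": attempt}
        if status in TERMINAL:
            return
        sleep(poll_interval_s)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a caller would forward each yielded message to the UI stream.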

## Active Policy

- Put coordination notes and handoffs in this `sync/` folder and push to Gitea.

@ -0,0 +1,147 @@
# 2026-05-07 MAGATAMA RunPod Managed-Endpoint Root Cause and Custom Worker Path

## Summary

We continued the MAGATAMA RunPod training automation investigation live and closed the remaining ambiguity.

The current RunPod endpoint (`dheii186pfcuq7`) is the managed Axolotl serverless endpoint. It accepts jobs and reports lifecycle states such as `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED`, but it does **not** return a programmatically adoptable model artifact back to MAGATAMA.

That means:

- dataset refresh works
- lane-specific exports work
- training submit works
- local adoption API is healthy

But the full automation chain still breaks on the return path.

## Live Findings

### Attack Paths fix guidance

- `Open Fix Guidance` on Attack Paths was a placebo button.
- Fixed in:
  - `magatama/packages/dashboard/public/index-v2.html`
- Live behavior now:
  - opens the real finding/ticket drawer when the graph node maps to a finding
  - otherwise shows an explicit warning

### Local train API rechecked

- `GET http://127.0.0.1:3214/health`
- result:
  - `status = ok`
  - service reachable
  - service idle

Conclusion:

- local adoption/import service is not the current blocker

### RunPod raw status canary

A tiny direct canary was executed against the same endpoint (`dheii186pfcuq7`):

- lane: `tip_llm`
- steps: `1`
- job:
  - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`

Status sequence observed via raw `/status/{job}` polling:

- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`

Critical detail:

- `/status/{job}` had no `output`
- `/stream/{job}` returned:
  - `{"status":"COMPLETED","stream":[]}`

This confirms that the current endpoint does not hand MAGATAMA the explicit artifact metadata it needs for automatic adoption.
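
The check behind that conclusion can be sketched as a small classifier over the decoded `/status/{job}` payload. This is an illustration under assumptions, not the dashboard's code: `classify_job` is a hypothetical name, and it assumes a custom worker's `adapter_repo_id` would appear under the payload's `output` key.

```python
from typing import Any, Dict

# Lane end state named in these notes.
NO_ARTIFACT = "completed_without_model_artifact"


def classify_job(status_payload: Dict[str, Any]) -> str:
    """Classify a decoded /status/{job} payload (hypothetical helper)."""
    if status_payload.get("status") != "COMPLETED":
        return "not_completed"
    output = status_payload.get("output")
    # Assumption: an adoptable result carries a non-empty `adapter_repo_id`.
    repo_id = output.get("adapter_repo_id") if isinstance(output, dict) else None
    if isinstance(repo_id, str) and repo_id:
        return "adoptable"
    return NO_ARTIFACT


# The canary's payload had a COMPLETED status and no `output`:
print(classify_job({"status": "COMPLETED"}))  # completed_without_model_artifact
```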

### HF token check

Erik was checked directly:

- `/opt/magatama/secrets/hf-token`
  - exists
  - readable

Conclusion:

- the current failure is not a missing Hugging Face token on Erik

## Root Cause

The managed Axolotl serverless endpoint is not enough for MAGATAMA's required end-to-end automation.

MAGATAMA needs a worker that supports the full chain:

1. train the lane-specific dataset
2. upload the resulting adapter/model artifact explicitly
3. return a machine-readable artifact reference
4. let MAGATAMA adopt/import that artifact
5. run smoke tests
6. bump version
7. switch the active alias

The managed Axolotl endpoint currently reports only lifecycle state, not an adoptable return artifact, so the chain stops after step 1.

## Code Completed

Prepared the correct custom-worker path in:

- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`

### What changed

- custom RunPod worker input now supports:
  - `target_model`
  - `credentials.hf_token`
- `train_cuda.py` now:
  - trains from the signed MAGATAMA lane bundle
  - uploads the resulting adapter folder to Hugging Face
  - returns `adapter_repo_id`
- dashboard custom-worker submit path now sends:
  - `run_id`
  - `target_model`
  - worker HF credential
- dashboard errors are now explicit when the managed endpoint completes without an adoptable artifact
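
The input/output contract described above can be sketched as follows. This is not the content of `runpod_handler.py`: the function names, the `input` wrapper key, treating all three fields as required, and the exact output shape are assumptions made for illustration.

```python
from typing import Any, Dict


def parse_worker_input(job: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the fields the notes say the custom worker receives."""
    job_input = job.get("input") or {}  # assumed wrapper key
    credentials = job_input.get("credentials") or {}
    fields = {
        "run_id": job_input.get("run_id"),
        "target_model": job_input.get("target_model"),
        "credentials.hf_token": credentials.get("hf_token"),
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError("missing worker input: " + ", ".join(missing))
    return {
        "run_id": fields["run_id"],
        "target_model": fields["target_model"],
        "hf_token": fields["credentials.hf_token"],
    }


def build_worker_output(run_id: str, adapter_repo_id: str) -> Dict[str, str]:
    """Build the machine-readable result; the token is deliberately omitted."""
    return {"run_id": run_id, "adapter_repo_id": adapter_repo_id}
```

Failing fast on missing fields means a misconfigured submit is rejected before any training time is spent, and the returned payload carries only the artifact reference the dashboard needs for adoption.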

## Live Deployment Status

Deployed live to Erik:

- rebuilt and rsynced dashboard server
- synced updated custom worker source files into repo state on Erik
- restarted `magatama-dashboard` under pm2

Not yet completed in infrastructure:

- the active RunPod endpoint itself is still the managed Axolotl endpoint

## Required Final Infra Step

To get true full automation:

1. build/publish:
   - `magatama/packages/fine-tuner/Dockerfile.runpod`
2. create or switch to a custom RunPod serverless endpoint running:
   - `runpod_handler.py`
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom-endpoint-id>`
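
How the two settings from step 3 could drive the return-path decision is sketched below. `resolve_worker` and the `expects_artifact` flag are hypothetical; the actual handling lives in `server.ts` and may differ.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_worker(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Read the RunPod settings and decide whether an artifact is expected."""
    env = os.environ if env is None else env
    kind = env.get("RUNPOD_WORKER_KIND", "").strip()
    endpoint_id = env.get("RUNPOD_ENDPOINT_ID", "").strip()
    if not endpoint_id:
        raise ValueError("RUNPOD_ENDPOINT_ID is not set")
    return {
        "endpoint_id": endpoint_id,
        "kind": kind,
        # Assumption: only the custom worker returns an adoptable artifact.
        "expects_artifact": kind == "custom-magatama",
    }
```

Accepting an explicit `env` mapping keeps the check testable without touching the process environment.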

Only then will MAGATAMA be able to:

- pull the lane-specific training pool
- train on RunPod
- get back a real adapter artifact
- adopt it locally into Ollama
- write a new version number
- repoint the active alias after smoke tests
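
The last two bullets imply a gate: the version number and the active alias change only once smoke tests pass. A minimal sketch of that gate follows; `ReleaseState` and `promote` are hypothetical names, and the integer version scheme is an assumption, not MAGATAMA's actual release logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseState:
    """Hypothetical record of which adapter version the active alias points at."""
    active_version: int


def promote(state: ReleaseState, smoke_passed: bool) -> ReleaseState:
    """Bump the version and repoint the alias only after smoke tests pass."""
    if not smoke_passed:
        # Failed smoke test: the active alias stays where it was.
        return state
    return ReleaseState(active_version=state.active_version + 1)
```

Returning a new immutable state instead of mutating in place makes a failed smoke test a visible no-op.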