sync: record custom runpod worker build prep
parent 2a3576135c
commit 72d61add47
@ -4,6 +4,75 @@ Updated: 2026-05-07 08:05 UTC
## Newest Work

- MAGATAMA RunPod custom worker preparation continued on 2026-05-07:
  - the pending sync handoff was committed and **successfully pushed to Gitea**:
    - commit:
      - `2a35761 sync: record runpod managed endpoint root cause`
  - MAGATAMA repo now includes an explicit helper for building/publishing the custom RunPod worker image:
    - `magatama/scripts/runpod_worker_publish.sh`
    - new package script:
      - `pnpm runpod:worker:publish`
    - helper behavior:
      - expects:
        - `RUNPOD_WORKER_IMAGE`
      - supports:
        - `GHCR_USERNAME`
        - `GHCR_TOKEN`
        - `RUNPOD_WORKER_TAG`
        - `RUNPOD_WORKER_PUSH_MODE=push|load`
      - prints the exact next environment variables required on Erik after image publication:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
  - `magatama/packages/fine-tuner/RUNPOD.md` was extended so the full automation target is now documented end-to-end:
    - lane pool sync
    - RunPod dataset URL bundle
    - custom worker training
    - adapter upload
    - local adoption
    - smoke tests
    - release alias minting
    - active alias switch
  - Erik infrastructure truth was rechecked:
    - `docker` exists:
      - `/usr/bin/docker`
    - `docker buildx` exists:
      - `github.com/docker/buildx v0.33.0`
    - **no docker registry login/config** is currently present on Erik:
      - `~/.docker/config.json` absent
    - interpretation:
      - Erik can build images
      - but cannot yet push a public/private worker image to GHCR/Docker Hub without credentials or a pre-authenticated registry path
  - the missing custom worker files were synced live to Erik:
    - `/opt/magatama/packages/fine-tuner/Dockerfile.runpod`
    - `/opt/magatama/packages/fine-tuner/RUNPOD.md`
  - a real remote worker image build was then attempted on Erik:
    - image tag requested:
      - `magatama-runpod-worker:test`
    - build truth:
      - base `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04` pulled successfully
      - Python dependencies for the worker installed successfully
      - build reached:
        - `COPY train_cuda.py runpod_handler.py ./`
        - `exporting to image`
    - however:
      - final image was **not yet visible** in `docker images`
      - therefore the build still needs one more clean verification pass before being treated as green
  - current operational conclusion:
    - MAGATAMA training pools, lane separation, signed dataset URL path, and local adoption API are ready
    - the final blocking step remains infrastructure:
      - publish the custom worker image to a registry RunPod can consume
      - create/switch the endpoint
      - then set on Erik:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom endpoint id>`
    - once that is done, MAGATAMA's already-prepared code path can finally perform:
      - train
      - verify artifact
      - adopt locally
      - smoke-test
      - bump version
      - switch alias
- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
- Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
- `magatama/packages/dashboard/public/index-v2.html`
@ -0,0 +1,77 @@
# MAGATAMA Custom RunPod Worker Build/Publish Prep

Date: 2026-05-07

## What Changed

- committed and pushed the previously pending RunPod root-cause sync handoff:
  - `2a35761 sync: record runpod managed endpoint root cause`
- added a real custom-worker build/publish helper to MAGATAMA:
  - `magatama/scripts/runpod_worker_publish.sh`
- added package entrypoint:
  - `pnpm runpod:worker:publish`
- extended:
  - `magatama/packages/fine-tuner/RUNPOD.md`
  so the target end-to-end automation path is documented from lane pool through alias switch

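
The helper's actual contents are not reproduced in these notes. As a rough sketch only, a script driven by the variables recorded for it (`RUNPOD_WORKER_IMAGE` required, `RUNPOD_WORKER_TAG` and `RUNPOD_WORKER_PUSH_MODE=push|load` optional) could resolve its build target along these lines. The function names, the `latest` default tag, and the `push` default mode are all assumptions made for illustration, not confirmed behavior of `runpod_worker_publish.sh`.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the contents of runpod_worker_publish.sh.

# Print "<image>:<tag>". Fails if RUNPOD_WORKER_IMAGE is unset or empty.
# The "latest" default tag is an assumption.
worker_image_ref() {
  if [ -z "${RUNPOD_WORKER_IMAGE:-}" ]; then
    echo "RUNPOD_WORKER_IMAGE is required" >&2
    return 1
  fi
  printf '%s:%s\n' "$RUNPOD_WORKER_IMAGE" "${RUNPOD_WORKER_TAG:-latest}"
}

# Map RUNPOD_WORKER_PUSH_MODE to the matching `docker buildx build` flag.
# The "push" default is an assumption.
worker_output_flag() {
  case "${RUNPOD_WORKER_PUSH_MODE:-push}" in
    push) echo "--push" ;;
    load) echo "--load" ;;
    *)
      echo "RUNPOD_WORKER_PUSH_MODE must be push or load" >&2
      return 1
      ;;
  esac
}

# Print the build command instead of running it, so the sketch is safe to try.
# Assumes it would be run from the directory that holds Dockerfile.runpod.
worker_build_command() {
  ref=$(worker_image_ref) || return 1
  flag=$(worker_output_flag) || return 1
  printf 'docker buildx build %s -t %s -f Dockerfile.runpod .\n' "$flag" "$ref"
}
```

Printing the command rather than executing it keeps the sketch free of side effects; a real helper would run the build and then echo the follow-up variables for Erik.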
## Erik Reality Check

- `docker` exists on Erik:
  - `/usr/bin/docker`
- `docker buildx` exists:
  - `github.com/docker/buildx v0.33.0`
- no preexisting docker registry login/config found:
  - `~/.docker/config.json` absent

Interpretation:

- Erik can act as a builder
- but cannot yet publish a worker image to GHCR/Docker Hub without credentials or a registry login

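
The checks above can be repeated with a small script. The following is a minimal sketch, assuming the Docker client keeps registry credentials under the `auths`, `credsStore`, or `credHelpers` keys of `~/.docker/config.json`; the function names `registry_login_state` and `builder_report` are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of the builder/registry reality check.

# Report the registry login state for a Docker client config file.
# Prints "absent" (no file), "no-auths" (file without credential keys),
# or "present" (file mentions auths, credsStore, or credHelpers).
registry_login_state() {
  config_path=${1:-"$HOME/.docker/config.json"}
  if [ ! -f "$config_path" ]; then
    echo "absent"
  elif grep -Eq '"(auths|credsStore|credHelpers)"' "$config_path"; then
    echo "present"
  else
    echo "no-auths"
  fi
}

# Summarize whether this host can build and whether it looks able to push.
builder_report() {
  if command -v docker >/dev/null 2>&1; then
    echo "docker: $(command -v docker)"
    echo "buildx: $(docker buildx version 2>/dev/null || echo missing)"
  else
    echo "docker: missing"
  fi
  echo "registry login: $(registry_login_state)"
}
```

If the state is `absent`, a GHCR login is normally established with `docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin`, with the token piped on standard input; whether the helper does exactly this is not stated in these notes.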
## Live Remote Worker Build Attempt

Synced to Erik:

- `/opt/magatama/packages/fine-tuner/Dockerfile.runpod`
- `/opt/magatama/packages/fine-tuner/RUNPOD.md`

Then attempted:

- build image tag:
  - `magatama-runpod-worker:test`

Observed build truth:

- base `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04` pulled successfully
- worker dependencies installed successfully
- build progressed through:
  - `COPY train_cuda.py runpod_handler.py ./`
  - `exporting to image`

But:

- the image was not yet visible afterward in `docker images`
- therefore the build still needs one more clean verification pass

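
One possible explanation, not confirmed on Erik, is that the build ran on a buildx builder that keeps its result in the build cache unless `--load` or `--push` is passed, in which case `exporting to image` can complete without the tag appearing in `docker images`. A verification pass could look like the sketch below. The function names are hypothetical, and it assumes the command is run from the directory containing `Dockerfile.runpod`.

```shell
#!/bin/sh
# Hypothetical sketch of a clean verification pass.

# Succeed only if the tag exists in the local Docker image store.
image_is_loaded() {
  docker image inspect "$1" >/dev/null 2>&1
}

# Build with --load so the result is exported to the local image store,
# then confirm the tag is really there before calling the build green.
build_and_verify() {
  tag=${1:-magatama-runpod-worker:test}
  docker buildx build --load -t "$tag" -f Dockerfile.runpod . || return 1
  if image_is_loaded "$tag"; then
    echo "green: $tag is in the local image store"
  else
    echo "not green: build finished but $tag is not visible" >&2
    return 1
  fi
}
```

Checking with `docker image inspect` rather than reading the `docker images` listing gives an exit status the script can act on directly.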
## Current Bottleneck

The remaining blocker is no longer MAGATAMA lane logic or adoption code.

It is now:

1. publish the custom worker image to a registry RunPod can consume
2. create/switch the endpoint to that image
3. set on Erik:
   - `RUNPOD_WORKER_KIND=custom-magatama`
   - `RUNPOD_ENDPOINT_ID=<custom endpoint id>`

Only then can MAGATAMA complete the intended full automation:

- training pool refresh
- lane-specific dataset build
- RunPod fine-tune
- returned artifact reference
- local adoption/import
- smoke tests
- new release alias
- active alias switch
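
Before kicking off the automation, step 3 can be guarded with a small preflight. This is a sketch under the assumption that the two variables are read from the process environment on Erik; `runpod_env_ready` is a hypothetical name, and the placeholder check simply rejects values still written as `<...>`.

```shell
#!/bin/sh
# Hypothetical preflight: confirm the environment points at the custom worker.
runpod_env_ready() {
  if [ "${RUNPOD_WORKER_KIND:-}" != "custom-magatama" ]; then
    echo "RUNPOD_WORKER_KIND must be custom-magatama" >&2
    return 1
  fi
  case "${RUNPOD_ENDPOINT_ID:-}" in
    "")
      echo "RUNPOD_ENDPOINT_ID is not set" >&2
      return 1
      ;;
    "<"*)
      echo "RUNPOD_ENDPOINT_ID still holds a placeholder" >&2
      return 1
      ;;
  esac
  echo "ready: endpoint $RUNPOD_ENDPOINT_ID"
}
```

Failing early here avoids starting a training pool refresh that would later be sent to the wrong endpoint.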