transceiver-db/sync/history/2026-05-09-magatama-training-live-cleanup.md
2026-05-09 17:55:17 +02:00


# MAGATAMA Training Live Cleanup and TIP_LLM Adoption Closure
Date: 2026-05-09
## Context
MAGATAMA training automation previously treated a RunPod `COMPLETED` status as sufficient proof of success, even when the expected model artifact was not visible or had not been imported. The UI also kept showing stale RunPod jobs as active training. In addition, the operator required that local training on the Mac Studio stay throttled so that normal workstation use remains possible.
## Completed
- Adopted the custom-worker TIP_LLM artifact locally:
  - artifact: `renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14`
  - active alias: `tip-llm-v1`
  - release alias: `tip-llm-v1-r1`
  - live smoke: prompt "Reply with exactly TIP_OK" returned `TIP_OK`
- Copied the local TIP_LLM last-run metadata back to Erik:
  - source: `/Users/renefichtmueller/magatama-llm/fine-tuning/tip_llm-last_run.json`
  - target: `/root/magatama-llm/fine-tuning/tip_llm-last_run.json`
- Appended a remote registry event marking the real successful custom-worker run as `completed_and_adopted`:
  - job: `dd35df4a-99f7-468f-8c9e-be19baa78338-e1`
  - run id: `tip_llm-2026-05-09T13-16-14`
  - endpoint: `0rmkf28w2g5gip`
- Cancelled stale old-endpoint work that kept the UI confused:
  - endpoint: `ocnuj82cowe2ym`
  - job: `83baffe9-d702-43fc-a2b0-bd5818b74059-e2`
  - final status: `CANCELLED`
- Hardened dashboard active-run detection:
  - collapses registry rows by job/run key
  - ignores terminal, stale, cancelled, expired, 404, and otherwise non-active RunPod jobs
  - passes the dynamic lane endpoint into active-run lookup
  - deployed patched dashboard server bundle and restarted `magatama-dashboard`
- Hardened local Mac Studio training defaults:
  - `nice=+10`
  - `OMP_NUM_THREADS=4`
  - `MKL_NUM_THREADS=4`
  - `OPENBLAS_NUM_THREADS=4`
  - `VECLIB_MAXIMUM_THREADS=4`
  - `NUMEXPR_NUM_THREADS=4`
  - `TOKENIZERS_PARALLELISM=false`
  - `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.70`
  - full unthrottled local training now requires explicit `MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1`
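
The throttled defaults listed above could be applied roughly as in the following sketch. It is not the actual MAGATAMA launcher: the function names `build_training_env` and `launch_command` are hypothetical, and the sketch assumes the defaults overwrite any values already present in the environment unless `MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1` is set.

```python
# Hypothetical sketch: apply the throttled Mac Studio defaults unless the
# operator explicitly opts out. Function names are illustrative only.

THROTTLED_ENV = {
    "OMP_NUM_THREADS": "4",
    "MKL_NUM_THREADS": "4",
    "OPENBLAS_NUM_THREADS": "4",
    "VECLIB_MAXIMUM_THREADS": "4",
    "NUMEXPR_NUM_THREADS": "4",
    "TOKENIZERS_PARALLELISM": "false",
    "PYTORCH_MPS_HIGH_WATERMARK_RATIO": "0.70",
}
NICE_INCREMENT = 10  # corresponds to nice=+10


def build_training_env(base_env):
    """Return (env, nice_increment) for a local training process."""
    env = dict(base_env)
    if env.get("MAGATAMA_LOCAL_TRAIN_UNTHROTTLED") == "1":
        # Explicit opt-out: leave the environment and priority untouched.
        return env, 0
    # Assumption: the defaults win over any pre-existing values.
    env.update(THROTTLED_ENV)
    return env, NICE_INCREMENT


def launch_command(cmd, nice_increment):
    """Prefix the command with `nice -n N` when throttling is active."""
    if nice_increment:
        return ["nice", "-n", str(nice_increment)] + list(cmd)
    return list(cmd)
```

A caller would then pass both results to the process launcher, for example `subprocess.run(launch_command(cmd, inc), env=env)`.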
## Live Verification
- `tip_llm` live status:
  - active provider: `ollama:tip-llm-v1`
  - model version: `tip-llm-v1-r1`
  - last registry status: `completed_and_adopted`
  - active run: `null`
  - last training timestamp: `2026-05-09T14:48:24Z`
- `fo_blogllm` live status:
  - active provider: `ollama:fo-blog-v7`
  - lane-specific source: `/opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
  - current pool: `17322` train, `1926` eval, `19267` total
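
An `active run: null` result like the one above is what the hardened active-run detection should return once only terminal or stale jobs remain. The sketch below illustrates that logic under stated assumptions; it is not the dashboard's actual code. The row fields (`job_id`, `run_id`, `endpoint`, `status`, `updated_at`), the allowlist of active statuses, and the 12-hour staleness window are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: only these RunPod statuses count as active; everything else
# (terminal, cancelled, expired, 404 markers, unknown) is ignored.
ACTIVE_STATUSES = {"IN_QUEUE", "IN_PROGRESS"}
STALE_AFTER = timedelta(hours=12)  # hypothetical staleness window


def find_active_run(rows, endpoint, now):
    """Return the newest active registry row for `endpoint`, or None."""
    # Collapse registry rows by job/run key, keeping the latest row per key.
    latest = {}
    for row in rows:
        key = row.get("job_id") or row.get("run_id")
        if key is None:
            continue
        prev = latest.get(key)
        if prev is None or row["updated_at"] > prev["updated_at"]:
            latest[key] = row

    # Keep only rows for the lane's current endpoint that are still active
    # and have been updated recently enough.
    candidates = [
        r for r in latest.values()
        if r.get("endpoint") == endpoint
        and r.get("status") in ACTIVE_STATUSES
        and now - r["updated_at"] <= STALE_AFTER
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["updated_at"])
```

Using an allowlist rather than a blocklist means an unrecognized status never shows up as active training, which matches the "otherwise non-active" rule in the list of changes.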
## Decisions
- A training run is not successful unless all gates pass:
  - dataset prepared from the lane's own pool
  - RunPod job completes
  - expected artifact exists
  - artifact imports locally
  - Ollama alias/version is switched
  - smoke tests pass
  - metadata and registry are written back
- Mac Studio local training stays throttled by default.
- RunPod Serverless can stay, but the generic managed Axolotl endpoint is not trustworthy for adoption unless it publishes artifacts. The custom MAGATAMA worker path is the reliable path.
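
The all-gates rule above can be expressed as a simple conjunction over per-gate results. The following is a minimal sketch, assuming each gate has already been evaluated to a boolean elsewhere; the `RunGates` class and its field names are hypothetical and only mirror the decision list.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunGates:
    """One boolean per gate from the decision list (names are illustrative)."""
    dataset_from_lane_pool: bool
    runpod_job_completed: bool
    artifact_exists: bool
    artifact_imported: bool
    alias_switched: bool
    smoke_tests_passed: bool
    metadata_written_back: bool


def failed_gates(gates):
    """Return the names of all gates that did not pass, in declaration order."""
    return [f.name for f in fields(gates) if not getattr(gates, f.name)]


def is_successful(gates):
    """A run counts as successful only if every gate passed."""
    return not failed_gates(gates)
```

Under this model, a run where RunPod reports completion but no artifact is visible yields `failed_gates(...)` containing `artifact_exists` and later gates, so it would not be recorded as successful.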
## Open
- Repeat the hardened custom-worker end-to-end path for `magatamallm`.
- Repeat the hardened custom-worker end-to-end path for the next `fo_blogllm` version.
- Mirror the Gitea learning pools between hosted Gitea and Proxmox Gitea as a separate infrastructure task.