MAGATAMA Training Pipeline Recovery And TIP_LLM Adoption
Date: 2026-05-09
Actor: Codex
Scope: MAGATAMA training automation, RunPod custom worker, local Ollama adoption
Mode: local Mac Studio verification plus Gitea sync handoff
Operator Intent
Training must be real end-to-end training, not a cosmetic COMPLETED state. A run is only successful when all of these are true:
- lane-specific training pool was exported from Gitea/RunPod data
- RunPod worker produced a visible model/adapter artifact
- artifact was downloaded and converted locally
- local Ollama model tag was updated
- version alias was advanced
- smoke tests passed
- last-run metadata was written back
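The checklist above can be sketched as a single gate, so that a remote COMPLETED status alone never counts as success. This is a minimal illustration, not the actual train API code: the class, field, and function names are hypothetical and simply mirror the seven criteria.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunChecks:
    """One flag per success criterion (all names are hypothetical)."""
    pool_exported: bool = False
    artifact_published: bool = False
    artifact_converted: bool = False
    ollama_tag_updated: bool = False
    alias_advanced: bool = False
    smoke_passed: bool = False
    metadata_written: bool = False


def failed_checks(checks: RunChecks) -> list:
    """Return the names of all criteria that are not yet satisfied."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def is_real_success(remote_status: str, checks: RunChecks) -> bool:
    """COMPLETED is necessary but never sufficient on its own."""
    return remote_status == "COMPLETED" and not failed_checks(checks)
```

Returning the list of failed checks, rather than only a boolean, makes it easy to record in the last-run metadata which step blocked a run.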
Local Mac Studio training must also stay workstation-friendly and must not consume the whole machine by default.
Completed
TIP_LLM was successfully adopted locally from the custom RunPod worker output.
- RunPod custom endpoint: 0rmkf28w2g5gip
- worker image: gitea.context-x.org/rene/magatama-runpod-worker:20260509-1246
- published adapter: renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14
- worker summary: RunPod QLoRA complete - train=144 - valid=17
- local candidate: tip-llm-runpod-tip_llm-2026-05-09t13-16-14
- release alias: tip-llm-v1-r1
- active alias: tip-llm-v1
- final smoke: Ollama tip-llm-v1 answered exactly TIP_OK
Pipeline Fixes Applied
- Built a local Python venv for the MAGATAMA train API under /Users/renefichtmueller/magatama-llm/service/.venv.
- Installed missing runtime dependencies including peft, torch, accelerate, safetensors, transformers, huggingface_hub, sentencepiece, protobuf, fastapi, uvicorn, and gguf.
- Started/reconfirmed local Ollama runtime.
- Hardened the Ollama converter:
  - primary path still uses Ollama HTTP API
  - if HTTP streaming/create fails, falls back to ollama create -f <Modelfile>
  - resolves Homebrew Ollama paths such as /opt/homebrew/bin/ollama
- Hardened adoption scripts:
  - reuse existing valid GGUF output instead of restarting huge conversions unnecessarily
  - remove stale tiny/failed GGUF files before reconversion
  - allow a smoke threshold of at least 80 percent instead of failing a whole adoption on one brittle prompt
- Fixed training_api.py Ollama binary resolution so alias creation works from LaunchAgent/service context.
- Added local Mac Studio resource guardrails:
  - default nice=+10
  - BLAS/tokenizer worker limits default to 4 threads
  - PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.70
  - explicit override only with MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1
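The resource guardrails listed above could be applied when the local training process is launched. The sketch below is an assumption about how that might look, not the real training_api.py code: the function names are hypothetical, and the specific thread-limit variables (OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, VECLIB_MAXIMUM_THREADS, RAYON_NUM_THREADS) are a guess at which libraries are in play. Only PYTORCH_MPS_HIGH_WATERMARK_RATIO and MAGATAMA_LOCAL_TRAIN_UNTHROTTLED come from this note.

```python
import os
import subprocess

# Assumed set of thread-limit variables for BLAS backends and tokenizer workers.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "RAYON_NUM_THREADS",
)


def is_unthrottled(env):
    """True only when the operator explicitly opted out of the guardrails."""
    return env.get("MAGATAMA_LOCAL_TRAIN_UNTHROTTLED") == "1"


def guarded_train_env(base_env=None, threads=4, mps_watermark="0.70"):
    """Return a copy of the environment with conservative defaults filled in."""
    env = dict(os.environ if base_env is None else base_env)
    if is_unthrottled(env):
        return env
    for name in THREAD_VARS:
        # setdefault keeps any value the operator already set explicitly.
        env.setdefault(name, str(threads))
    env.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", mps_watermark)
    return env


def launch_training(cmd, base_env=None, niceness=10):
    """Start the training command at lowered priority unless unthrottled (POSIX only)."""
    env = guarded_train_env(base_env)
    lower_priority = None if is_unthrottled(env) else (lambda: os.nice(niceness))
    return subprocess.Popen(cmd, env=env, preexec_fn=lower_priority)
```

Using setdefault rather than plain assignment keeps the limits as defaults, so a deliberately chosen value survives without needing the full unthrottled override.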
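The converter hardening described above (HTTP first, CLI fallback, explicit binary resolution) might be structured roughly as follows. This is a sketch under stated assumptions: the HTTP call is passed in as a hypothetical create_via_http callable rather than reimplemented here, /usr/local/bin/ollama is an assumed extra candidate path, and the model name argument to ollama create is added because the CLI expects one.

```python
import os
import shutil
import subprocess

# Only /opt/homebrew/bin/ollama is named in this note; the second path is an assumption.
FALLBACK_PATHS = ("/opt/homebrew/bin/ollama", "/usr/local/bin/ollama")


def resolve_ollama(which=shutil.which, is_executable=None, candidates=FALLBACK_PATHS):
    """Find the ollama binary even when PATH is minimal, e.g. under a LaunchAgent."""
    if is_executable is None:
        is_executable = lambda p: os.path.isfile(p) and os.access(p, os.X_OK)
    found = which("ollama")
    if found:
        return found
    for path in candidates:
        if is_executable(path):
            return path
    raise FileNotFoundError("ollama binary not found on PATH or in known locations")


def create_model(name, modelfile, create_via_http, run=subprocess.run, binary=None):
    """Try the HTTP path first; on any failure fall back to the ollama CLI.

    Returns "http" or "cli" so the caller can log which path was used.
    """
    try:
        create_via_http(name, modelfile)
        return "http"
    except Exception:
        # Broad catch is deliberate in this sketch: any HTTP failure triggers the fallback.
        exe = binary or resolve_ollama()
        run([exe, "create", name, "-f", modelfile], check=True)
        return "cli"
```

Resolving the binary to an absolute path matters most in service contexts, where the Homebrew directory is often missing from PATH.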
Verification
- Python syntax checks passed for:
  - operational MAGATAMA train API scripts
  - MAGATAMA repo fine-tuner scripts
  - operational and repo LLM gateway converter scripts
- Local train API health endpoint is reachable after restart.
- Ollama /api/tags shows:
  - tip-llm-v1
  - tip-llm-v1-r1
  - tip-llm-runpod-tip_llm-2026-05-09t13-16-14
- Active model smoke succeeded:
  - prompt: Reply with exactly TIP_OK
  - response: TIP_OK
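A smoke run with the 80 percent pass threshold mentioned under Pipeline Fixes could be evaluated along these lines. This is an illustrative sketch only: run_smoke and its ask callable are hypothetical names, and the exact-match-after-strip comparison is an assumption about how answers such as TIP_OK are checked.

```python
def run_smoke(ask, cases, threshold=0.8):
    """Run smoke prompts and apply a pass-rate threshold.

    ask(prompt) -> str is injected by the caller (for example a thin wrapper
    around the local model); cases is a list of (prompt, expected) pairs.
    """
    if not cases:
        raise ValueError("at least one smoke case is required")
    failed = [prompt for prompt, expected in cases
              if ask(prompt).strip() != expected]
    rate = (len(cases) - len(failed)) / len(cases)
    return {"rate": rate, "ok": rate >= threshold, "failed": failed}
```

Reporting the failed prompts alongside the rate keeps a brittle prompt visible even when the adoption as a whole is allowed to pass.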
Current Truth
- TIP_LLM now has a real trained RunPod-returned model behind tip-llm-v1.
- The old managed Axolotl serverless path is not sufficient for automatic success detection because it can report COMPLETED without publishing the expected HuggingFace artifact.
- The custom MAGATAMA RunPod worker is the right path because it must explicitly upload the adapter/model artifact.
- COMPLETED must never be accepted as success unless artifact retrieval, import, alias switch and smoke tests pass.
- Local training is safe-by-default and should not fully saturate the Mac Studio unless the operator explicitly opts out.
Open Follow-Up
- Apply the same custom-worker artifact contract to magatamallm and fo_blogllm.
- Run new end-to-end training for magatamallm and fo_blogllm only through the hardened custom worker path.
- Add TIP_LLM training data for controller policy:
  - Erik is only a cautious controller
  - heavy crawler/scraper/browser work belongs on Proxmox/Pis
  - TIP_LLM plans robots/crawlers, but does not overload Erik
- Public MAGATAMA status check should be rechecked from a non-sandbox network context if DNS or Cloudflare route looks stale.
Sync Note
User requested all new decisions and current chat state to be written into sync/ so Codex, Claude and the laptop share one handoff. This file captures the training pipeline recovery state and should be treated as the current binding training handoff until superseded.