# MAGATAMA Training Pipeline Recovery And TIP_LLM Adoption Date: 2026-05-09 Actor: Codex Scope: MAGATAMA training automation, RunPod custom worker, local Ollama adoption Mode: local Mac Studio verification plus Gitea sync handoff ## Operator Intent Training must be real end-to-end training, not a cosmetic `COMPLETED` state. A run is only successful when all of these are true: - lane-specific training pool was exported from Gitea/RunPod data - RunPod worker produced a visible model/adaptor artifact - artifact was downloaded and converted locally - local Ollama model tag was updated - version alias was advanced - smoke tests passed - last-run metadata was written back Local Mac Studio training must also stay workstation-friendly and must not consume the whole machine by default. ## Completed TIP_LLM was successfully adopted locally from the custom RunPod worker output. - RunPod custom endpoint: `0rmkf28w2g5gip` - worker image: `gitea.context-x.org/rene/magatama-runpod-worker:20260509-1246` - published adapter: `renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14` - worker summary: `RunPod QLoRA complete - train=144 - valid=17` - local candidate: `tip-llm-runpod-tip_llm-2026-05-09t13-16-14` - release alias: `tip-llm-v1-r1` - active alias: `tip-llm-v1` - final smoke: Ollama `tip-llm-v1` answered exactly `TIP_OK` ## Pipeline Fixes Applied - Built a local Python venv for the MAGATAMA train API under `/Users/renefichtmueller/magatama-llm/service/.venv`. - Installed missing runtime dependencies including `peft`, `torch`, `accelerate`, `safetensors`, `transformers`, `huggingface_hub`, `sentencepiece`, `protobuf`, `fastapi`, `uvicorn`, and `gguf`. - Started/reconfirmed local Ollama runtime. - Hardened the Ollama converter: - primary path still uses Ollama HTTP API - if HTTP streaming/create fails, falls back to `ollama create -f ` - resolves Homebrew Ollama paths such as `/opt/homebrew/bin/ollama` - Hardened adoption scripts: - reuse existing valid GGUF output instead of restarting huge conversions unnecessarily - remove stale tiny/failed GGUF files before reconversion - allow a smoke threshold of at least 80 percent instead of failing a whole adoption on one brittle prompt - Fixed `training_api.py` Ollama binary resolution so alias creation works from LaunchAgent/service context. - Added local Mac Studio resource guardrails: - default `nice=+10` - BLAS/tokenizer worker limits default to `4` threads - `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.70` - explicit override only with `MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1` ## Verification - Python syntax checks passed for: - operational MAGATAMA train API scripts - MAGATAMA repo fine-tuner scripts - operational and repo LLM gateway converter scripts - Local train API health endpoint is reachable after restart. - Ollama `/api/tags` shows: - `tip-llm-v1` - `tip-llm-v1-r1` - `tip-llm-runpod-tip_llm-2026-05-09t13-16-14` - Active model smoke succeeded: - prompt: `Reply with exactly TIP_OK` - response: `TIP_OK` ## Current Truth - TIP_LLM now has a real trained RunPod-returned model behind `tip-llm-v1`. - The old managed Axolotl serverless path is not sufficient for automatic success detection because it can report `COMPLETED` without publishing the expected HuggingFace artifact. - The custom MAGATAMA RunPod worker is the right path because it must explicitly upload the adapter/model artifact. - `COMPLETED` must never be accepted as success unless artifact retrieval, import, alias switch and smoke tests pass. - Local training is safe-by-default and should not fully saturate the Mac Studio unless the operator explicitly opts out. ## Open Follow-Up - Apply the same custom-worker artifact contract to `magatamallm` and `fo_blogllm`. - Run new end-to-end training for `magatamallm` and `fo_blogllm` only through the hardened custom worker path. - Add TIP_LLM training data for controller policy: - Erik is only a cautious controller - heavy crawler/scraper/browser work belongs on Proxmox/Pis - TIP_LLM plans robots/crawlers, but does not overload Erik - Public MAGATAMA status check should be rechecked from a non-sandbox network context if DNS or Cloudflare route looks stale. ## Sync Note User requested all new decisions and current chat state to be written into `sync/` so Codex, Claude and the laptop share one handoff. This file captures the training pipeline recovery state and should be treated as the current binding training handoff until superseded.