transceiver-db/sync/history/2026-05-09-magatama-training-pipeline-tip-llm-adoption.md
2026-05-09 17:18:35 +02:00

MAGATAMA Training Pipeline Recovery And TIP_LLM Adoption

Date: 2026-05-09
Actor: Codex
Scope: MAGATAMA training automation, RunPod custom worker, local Ollama adoption
Mode: local Mac Studio verification plus Gitea sync handoff

Operator Intent

Training must be real end-to-end training, not a cosmetic COMPLETED state. A run is only successful when all of these are true:

  • lane-specific training pool was exported from Gitea/RunPod data
  • RunPod worker produced a visible model/adapter artifact
  • artifact was downloaded and converted locally
  • local Ollama model tag was updated
  • version alias was advanced
  • smoke tests passed
  • last-run metadata was written back
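
The seven criteria above can be treated as one all-or-nothing gate. The following is a minimal sketch, not the actual pipeline code: the class name `RunEvidence` and every field name are hypothetical labels for the bullets above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunEvidence:
    """One flag per success criterion. All names here are hypothetical."""

    pool_exported: bool = False        # lane-specific training pool exported
    artifact_published: bool = False   # worker produced a visible artifact
    artifact_converted: bool = False   # artifact downloaded and converted locally
    ollama_tag_updated: bool = False   # local Ollama model tag updated
    alias_advanced: bool = False       # version alias advanced
    smoke_passed: bool = False         # smoke tests passed
    metadata_written: bool = False     # last-run metadata written back


def missing_criteria(evidence: RunEvidence) -> list[str]:
    """Return the names of all criteria that are not yet satisfied."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def run_succeeded(evidence: RunEvidence) -> bool:
    """A run counts as successful only when no criterion is missing."""
    return not missing_criteria(evidence)
```

Every flag defaults to `False`, so a step that never reports back counts as a failure rather than being silently skipped.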

Local Mac Studio training must also stay workstation-friendly and must not consume the whole machine by default.

Completed

TIP_LLM was successfully adopted locally from the custom RunPod worker output.

  • RunPod custom endpoint: 0rmkf28w2g5gip
  • worker image: gitea.context-x.org/rene/magatama-runpod-worker:20260509-1246
  • published adapter: renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14
  • worker summary: RunPod QLoRA complete - train=144 - valid=17
  • local candidate: tip-llm-runpod-tip_llm-2026-05-09t13-16-14
  • release alias: tip-llm-v1-r1
  • active alias: tip-llm-v1
  • final smoke: Ollama tip-llm-v1 answered exactly TIP_OK
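
The final smoke above is an exact-match check against the local Ollama HTTP API. A minimal sketch of such a check follows, assuming Ollama listens on its default address `http://localhost:11434`. The helper names `ask_ollama` and `smoke_ok` are hypothetical, and the 80 percent default mirrors the threshold noted under Pipeline Fixes Applied.

```python
import json
import urllib.request


def ask_ollama(model, prompt, base_url="http://localhost:11434"):
    """Send one non-streaming prompt to Ollama and return the response text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["response"]


def smoke_ok(cases, answer, min_percent=80):
    """Return True when at least min_percent of the cases match exactly.

    cases:  list of (prompt, expected_reply) pairs
    answer: callable that maps a prompt to the model reply
    """
    if not cases:
        return False  # an empty smoke suite proves nothing
    hits = sum(1 for prompt, expected in cases if answer(prompt).strip() == expected)
    # Integer comparison avoids floating-point edge cases at the threshold.
    return hits * 100 >= min_percent * len(cases)
```

A possible call against the active alias would be `smoke_ok([("Reply with exactly TIP_OK", "TIP_OK")], lambda p: ask_ollama("tip-llm-v1", p))`.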

Pipeline Fixes Applied

  • Built a local Python venv for the MAGATAMA train API under /Users/renefichtmueller/magatama-llm/service/.venv.
  • Installed missing runtime dependencies including peft, torch, accelerate, safetensors, transformers, huggingface_hub, sentencepiece, protobuf, fastapi, uvicorn, and gguf.
  • Started/reconfirmed local Ollama runtime.
  • Hardened the Ollama converter:
    • primary path still uses Ollama HTTP API
    • if HTTP streaming/create fails, falls back to ollama create -f <Modelfile>
    • resolves Homebrew Ollama paths such as /opt/homebrew/bin/ollama
  • Hardened adoption scripts:
    • reuse existing valid GGUF output instead of restarting huge conversions unnecessarily
    • remove stale tiny/failed GGUF files before reconversion
    • accept an adoption when at least 80 percent of smoke prompts pass, instead of failing the whole adoption on one brittle prompt
  • Fixed training_api.py Ollama binary resolution so alias creation works from LaunchAgent/service context.
  • Added local Mac Studio resource guardrails:
    • default nice=+10
    • BLAS/tokenizer worker limits default to 4 threads
    • PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.70
    • explicit override only with MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1
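
The resource guardrails in the last item can be sketched as a small wrapper that prepares the environment and command line of a local training subprocess. This is an illustration under assumptions, not the code in the MAGATAMA scripts: the function names are hypothetical, and the exact set of thread-limit variables is a guess at what "BLAS/tokenizer worker limits" covers.

```python
import os

# Assumed set of thread-limit variables; the source does not name them.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "RAYON_NUM_THREADS",
)

OVERRIDE_VAR = "MAGATAMA_LOCAL_TRAIN_UNTHROTTLED"


def guarded_env(base=None, threads=4, mps_ratio="0.70"):
    """Return a copy of the environment with default resource limits applied.

    Values that are already set are kept, so the limits act as defaults only.
    """
    env = dict(os.environ if base is None else base)
    if env.get(OVERRIDE_VAR) == "1":
        return env  # operator explicitly opted out of throttling
    for name in THREAD_ENV_VARS:
        env.setdefault(name, str(threads))
    env.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", mps_ratio)
    return env


def guarded_command(cmd, env, niceness=10):
    """Prefix the command with nice unless the override is set."""
    if env.get(OVERRIDE_VAR) == "1":
        return list(cmd)
    return ["nice", "-n", str(niceness), *cmd]
```

Both results would then be passed together, for example `subprocess.run(guarded_command(cmd, env), env=env)`, so the child process inherits the limits without changing the parent service.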

Verification

  • Python syntax checks passed for:
    • operational MAGATAMA train API scripts
    • MAGATAMA repo fine-tuner scripts
    • operational and repo LLM gateway converter scripts
  • Local train API health endpoint is reachable after restart.
  • Ollama /api/tags shows:
    • tip-llm-v1
    • tip-llm-v1-r1
    • tip-llm-runpod-tip_llm-2026-05-09t13-16-14
  • Active model smoke succeeded:
    • prompt: Reply with exactly TIP_OK
    • response: TIP_OK
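
The tag check above can be repeated with a short script. The sketch below assumes Ollama on its default address `http://localhost:11434`, and the helper names are hypothetical. Ollama lists models with a tag suffix such as `tip-llm-v1:latest`, so names are normalized before comparison.

```python
import json
import urllib.request

EXPECTED_TAGS = (
    "tip-llm-v1",
    "tip-llm-v1-r1",
    "tip-llm-runpod-tip_llm-2026-05-09t13-16-14",
)


def normalize(name):
    """Drop the implicit ':latest' suffix that Ollama appends to model names."""
    suffix = ":latest"
    return name[: -len(suffix)] if name.endswith(suffix) else name


def missing_tags(tags_payload, expected=EXPECTED_TAGS):
    """Return the expected model names absent from an /api/tags payload."""
    present = {normalize(model["name"]) for model in tags_payload.get("models", [])}
    return [tag for tag in expected if tag not in present]


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the parsed JSON body of the Ollama /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return json.load(response)
```

An empty list from `missing_tags(fetch_tags())` means all three models are registered locally.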

Current Truth

  • TIP_LLM now has a real trained RunPod-returned model behind tip-llm-v1.
  • The old managed Axolotl serverless path is not sufficient for automatic success detection because it can report COMPLETED without publishing the expected HuggingFace artifact.
  • The custom MAGATAMA RunPod worker is the right path because it must explicitly upload the adapter/model artifact.
  • COMPLETED must never be accepted as success unless artifact retrieval, import, alias switch and smoke tests all pass.
  • Local training is safe-by-default and should not fully saturate the Mac Studio unless the operator explicitly opts out with MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1.
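
The rule that COMPLETED alone is not success can be sketched as a classification over the worker status plus the file listing of the published HuggingFace repository. This is a sketch under assumptions: the function names and result labels are hypothetical, and the file names are the standard PEFT adapter files, which the custom worker may or may not use. The listing itself could come from `huggingface_hub.list_repo_files(repo_id)`.

```python
ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")


def has_adapter_artifact(repo_files):
    """True when the listing holds an adapter config and a weights file."""
    names = set(repo_files)
    return ADAPTER_CONFIG in names and any(w in names for w in ADAPTER_WEIGHTS)


def classify_run(status, repo_files):
    """Map a worker status and a repo file listing to a pipeline verdict.

    'artifact_ready' is still not success: retrieval, import, alias switch
    and smoke tests have to pass afterwards.
    """
    if status != "COMPLETED":
        return "not_completed"
    if not has_adapter_artifact(repo_files):
        return "completed_without_artifact"  # the cosmetic COMPLETED case
    return "artifact_ready"
```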

Open Follow-Up

  • Apply the same custom-worker artifact contract to magatamallm and fo_blogllm.
  • Run new end-to-end training for magatamallm and fo_blogllm only through the hardened custom worker path.
  • Add TIP_LLM training data for controller policy:
    • Erik is only a cautious controller
    • heavy crawler/scraper/browser work belongs on Proxmox/Pis
    • TIP_LLM plans robots/crawlers, but does not overload Erik
  • The public MAGATAMA status check should be rerun from a non-sandbox network context if the DNS or Cloudflare route looks stale.

Sync Note

User requested all new decisions and current chat state to be written into sync/ so Codex, Claude and the laptop share one handoff. This file captures the training pipeline recovery state and should be treated as the current binding training handoff until superseded.