transceiver-db/sync/history/2026-05-09-magatama-all-lane-runpod-training-start.md
2026-05-10 01:11:21 +02:00

MAGATAMA All-Lane RunPod Training Start

Date: 2026-05-09 23:09 UTC

Scope

  • Train all current MAGATAMA LLM lanes via RunPod:
    • magatamallm
    • fo_blogllm
    • tip_llm
    • pulso_llm
    • contact_llm

Preflight

  • MAGATAMA services were online on Erik.
  • Active RunPod endpoint reported by MAGATAMA: 0rmkf28w2g5gip.
  • RunPod worker kind: custom-magatama.
  • Dataset source: URL-based lane export.
  • Previous successful/adopted runs existed for:
    • magatamallm
    • fo_blogllm
    • tip_llm
  • No previous run existed yet for:
    • pulso_llm
    • contact_llm

Runner Fix

  • Fixed scripts/trigger_lane_training_once.py locally and on Erik.
  • The script previously sent stale request field names:
    • iterations
    • seedOnly
  • The MAGATAMA training API expects:
    • iters
    • seed_only
  • Added an "all" mode that runs every lane sequentially.
  • Added streamed SSE logging so progress is visible during long RunPod runs.
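
The field rename and the streamed logging can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/trigger_lane_training_once.py: only the names iters and seed_only come from this note, while the "lane" field and the helper names build_training_payload, migrate_payload, and iter_sse_data are hypothetical.

```python
from typing import Iterable, Iterator

# Stale field names mapped to the names the training API expects (per this note).
RENAMED_FIELDS = {"iterations": "iters", "seedOnly": "seed_only"}


def build_training_payload(lane: str, iters: int, seed_only: bool) -> dict:
    """Build a request body using the current field names."""
    # ASSUMPTION: "lane" as a field name is hypothetical; the note only
    # confirms "iters" and "seed_only".
    return {"lane": lane, "iters": iters, "seed_only": seed_only}


def migrate_payload(old: dict) -> dict:
    """Rename stale fields and leave every other field untouched."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in old.items()}


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, commonly used as a keep-alive; ignore it.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
    # Lenient flush of a trailing event without a closing blank line
    # (a strict SSE parser would discard it).
    if buffer:
        yield "\n".join(buffer)
```

Printing each yielded payload as it arrives (with python3 -u, so stdout is unbuffered) is one way to keep progress visible during a long run.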

Live Run

  • Started on Erik:
    • python3 -u scripts/trigger_lane_training_once.py all 500 false
  • Log:
    • /opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log
  • First active lane:
    • magatamallm
  • First RunPod job:
    • 89627e7e-8533-45db-9fe8-eca994018aa6-e2
  • Initial magatamallm dataset:
    • 1375 train
    • 153 eval
    • 1528 total
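
As a rough sketch of how the command above might be interpreted: this assumes the three positional arguments are a lane selector (or "all"), the iteration count, and the seed-only flag, which the note does not state explicitly. The lane order is taken from the Scope list; parse_bool and parse_args are hypothetical helper names.

```python
from typing import List, Tuple

# Lane names from the Scope list; running them in this order is an assumption.
LANES = ["magatamallm", "fo_blogllm", "tip_llm", "pulso_llm", "contact_llm"]


def parse_bool(text: str) -> bool:
    """Parse a command-line boolean such as 'true' or 'false'."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_args(argv: List[str]) -> Tuple[List[str], int, bool]:
    """Turn ['all', '500', 'false'] into (lanes, iters, seed_only)."""
    if len(argv) != 3:
        raise ValueError("usage: <lane|all> <iters> <seed_only>")
    selector, iters_text, seed_text = argv
    if selector == "all":
        lanes = list(LANES)  # every lane, one after another
    elif selector in LANES:
        lanes = [selector]
    else:
        raise ValueError(f"unknown lane: {selector!r}")
    iters = int(iters_text)
    if iters <= 0:
        raise ValueError("iters must be positive")
    return lanes, iters, parse_bool(seed_text)
```

Under these assumptions, parse_args(["all", "500", "false"]) selects all five lanes with 500 iterations and seed-only disabled.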

Success Rule

  • Do not treat a RunPod job status of COMPLETED as success by itself.
  • A lane is only successful when:
    • the model artifact exists,
    • MAGATAMA imports/adopts it locally,
    • smoke checks pass,
    • the active alias/version is updated.
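
The rule above could be encoded as a small check like the one below. This is a sketch only: the LaneOutcome type and its field names are hypothetical, and how each condition is actually verified in MAGATAMA is not covered by this note.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LaneOutcome:
    """Hypothetical record of what is known about one lane after a run."""
    runpod_status: str          # e.g. "COMPLETED"; informational only
    artifact_exists: bool       # the model artifact exists
    adopted_locally: bool       # MAGATAMA imported/adopted it locally
    smoke_checks_passed: bool   # smoke checks pass
    alias_updated: bool         # the active alias/version is updated


def failed_checks(outcome: LaneOutcome) -> List[str]:
    """Return the names of the success conditions that are not yet met."""
    checks = {
        "artifact_exists": outcome.artifact_exists,
        "adopted_locally": outcome.adopted_locally,
        "smoke_checks_passed": outcome.smoke_checks_passed,
        "alias_updated": outcome.alias_updated,
    }
    return [name for name, ok in checks.items() if not ok]


def lane_succeeded(outcome: LaneOutcome) -> bool:
    """A lane counts as successful only when all four conditions hold."""
    # The RunPod status is deliberately not consulted here: COMPLETED alone
    # says nothing about the artifact, adoption, smoke checks, or alias.
    return not failed_checks(outcome)
```

Returning the list of unmet conditions, rather than a bare boolean, makes it easier to log why a lane that RunPod reports as COMPLETED is still not counted as successful.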