sync: record magatama all-lane runpod training start

2026-05-10 01:11:21 +02:00 · 2026-05-10 01:11:21 +02:00 · 5819eb5eb0
commit 5819eb5eb0
parent b51901abdb
2 changed files with 94 additions and 1 deletion
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,9 +1,39 @@
 # Current TIP Sync State

-Updated: 2026-05-09 22:32 UTC
+Updated: 2026-05-09 23:09 UTC

 ## Newest Work

+- MAGATAMA all-lane RunPod training block started on 2026-05-09:
+  - user requested all trainable LLM lanes via RunPod
+  - lanes in scope:
+    - `magatamallm`
+    - `fo_blogllm`
+    - `tip_llm`
+    - `pulso_llm`
+    - `contact_llm`
+  - preflight:
+    - MAGATAMA services online on Erik
+    - active RunPod endpoint: `0rmkf28w2g5gip`
+    - worker kind: `custom-magatama`
+    - dataset source: URL lane export
+    - latest previous adopted runs existed for `magatamallm`, `fo_blogllm`, `tip_llm`
+    - `pulso_llm` and `contact_llm` had no previous adopted RunPod run
+  - fixed live/local helper script:
+    - `scripts/trigger_lane_training_once.py`
+    - API payload now uses `iters` and `seed_only` instead of stale `iterations` and `seedOnly`
+    - added `all` mode for sequential full-lane training
+    - streams SSE lines to the log instead of buffering until the response closes
+  - live sequence started on Erik:
+    - command: `python3 -u scripts/trigger_lane_training_once.py all 500 false`
+    - log: `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
+    - first active lane: `magatamallm`
+    - first RunPod job: `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
+    - `magatamallm` dataset at start: `1375 train`, `153 eval`, `1528 total`
+  - success rule remains strict:
+    - RunPod `COMPLETED` alone is not sufficient
+    - artifact must exist, import/adoption must succeed, smoke checks must pass, and active alias/version must update
+
 - TIP verification continuation on 2026-05-09:
  - expanded deterministic non-transceiver quarantine for GBICS and T&S Communication artifacts
  - live quarantine result:
--- a/sync/history/2026-05-09-magatama-all-lane-runpod-training-start.md
+++ b/sync/history/2026-05-09-magatama-all-lane-runpod-training-start.md
@@ -0,0 +1,63 @@
+# MAGATAMA All-Lane RunPod Training Start
+
+Date: 2026-05-09 23:09 UTC
+
+## Scope
+
+- Train all current MAGATAMA LLM lanes via RunPod:
+  - `magatamallm`
+  - `fo_blogllm`
+  - `tip_llm`
+  - `pulso_llm`
+  - `contact_llm`
+
+## Preflight
+
+- MAGATAMA services were online on Erik.
+- Active RunPod endpoint reported by MAGATAMA: `0rmkf28w2g5gip`.
+- RunPod worker kind: `custom-magatama`.
+- Dataset source: URL-based lane export.
+- Previous successful/adopted runs existed for:
+  - `magatamallm`
+  - `fo_blogllm`
+  - `tip_llm`
+- No previous adopted RunPod run existed yet for:
+  - `pulso_llm`
+  - `contact_llm`
+
+## Runner Fix
+
+- Fixed `scripts/trigger_lane_training_once.py` locally and on Erik.
+- The script previously sent stale API payload keys:
+  - `iterations`
+  - `seedOnly`
+- The MAGATAMA training API expects:
+  - `iters`
+  - `seed_only`
+- Added `all` mode to run all lanes sequentially.
+- Added streamed SSE logging so progress is visible during long RunPod runs.
+
+## Live Run
+
+- Started on Erik:
+  - `python3 -u scripts/trigger_lane_training_once.py all 500 false`
+- Log:
+  - `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
+- First active lane:
+  - `magatamallm`
+- First RunPod job:
+  - `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
+- Initial `magatamallm` dataset:
+  - `1375 train`
+  - `153 eval`
+  - `1528 total`
+
+## Success Rule
+
+- Do not treat RunPod `COMPLETED` as success by itself.
+- A lane is only successful when:
+  - the model artifact exists,
+  - MAGATAMA imports/adopts it locally,
+  - smoke checks pass,
+  - the active alias/version is updated.
+
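The success rule recorded above could be expressed as a small per-lane gate. This is a minimal sketch and not code from `scripts/trigger_lane_training_once.py`: the `LaneResult` type, its field names, and both helper functions are hypothetical names introduced for illustration. It also assumes that RunPod `COMPLETED` is a necessary precondition, which the note implies but does not state outright.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LaneResult:
    """Hypothetical per-lane outcome record; field names are assumptions."""
    lane: str
    runpod_status: str      # e.g. "COMPLETED", as reported by RunPod
    artifact_exists: bool   # the model artifact exists
    adopted: bool           # MAGATAMA imported/adopted it locally
    smoke_passed: bool      # smoke checks passed
    alias_updated: bool     # the active alias/version was updated


def failed_checks(result: LaneResult) -> List[str]:
    """Return the names of all success conditions the lane did not meet."""
    checks = [
        ("runpod_completed", result.runpod_status == "COMPLETED"),
        ("artifact_exists", result.artifact_exists),
        ("adopted", result.adopted),
        ("smoke_passed", result.smoke_passed),
        ("alias_updated", result.alias_updated),
    ]
    return [name for name, ok in checks if not ok]


def lane_succeeded(result: LaneResult) -> bool:
    """A lane counts as successful only when every condition holds."""
    return not failed_checks(result)


if __name__ == "__main__":
    # RunPod reports COMPLETED, but adoption and later steps did not happen.
    partial = LaneResult(
        lane="magatamallm",
        runpod_status="COMPLETED",
        artifact_exists=True,
        adopted=False,
        smoke_passed=False,
        alias_updated=False,
    )
    print(lane_succeeded(partial), failed_checks(partial))
```

Returning the list of failed checks, rather than a bare boolean, keeps the log explicit about which step blocked a lane when a job finishes on RunPod but is never adopted.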