sync: record magatama all-lane runpod training start
parent b51901abdb
commit 5819eb5eb0
@@ -1,9 +1,39 @@
# Current TIP Sync State

-Updated: 2026-05-09 22:32 UTC
+Updated: 2026-05-09 23:09 UTC

## Newest Work

- MAGATAMA all-lane RunPod training block started on 2026-05-09:
  - user requested all trainable LLM lanes via RunPod
  - lanes in scope:
    - `magatamallm`
    - `fo_blogllm`
    - `tip_llm`
    - `pulso_llm`
    - `contact_llm`
  - preflight:
    - MAGATAMA services online on Erik
    - active RunPod endpoint: `0rmkf28w2g5gip`
    - worker kind: `custom-magatama`
    - dataset source: URL lane export
    - latest previous adopted runs existed for `magatamallm`, `fo_blogllm`, `tip_llm`
    - `pulso_llm` and `contact_llm` had no previous adopted RunPod run
  - fixed live/local helper script:
    - `scripts/trigger_lane_training_once.py`
    - API payload now uses `iters` and `seed_only` instead of stale `iterations` and `seedOnly`
    - added `all` mode for sequential full-lane training
    - streams SSE lines to the log instead of buffering until the response closes
  - live sequence started on Erik:
    - command: `python3 -u scripts/trigger_lane_training_once.py all 500 false`
    - log: `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
    - first active lane: `magatamallm`
    - first RunPod job: `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
    - `magatamallm` dataset at start: `1375 train`, `153 eval`, `1528 total`
  - success rule remains strict:
    - RunPod `COMPLETED` alone is not sufficient
    - artifact must exist, import/adoption must succeed, smoke checks must pass, and active alias/version must update

- TIP verification continuation on 2026-05-09:
  - expanded deterministic non-transceiver quarantine for GBICS and T&S Communication artifacts
  - live quarantine result:
@@ -0,0 +1,63 @@
# MAGATAMA All-Lane RunPod Training Start

Date: 2026-05-09 23:09 UTC

## Scope

- Train all current MAGATAMA LLM lanes via RunPod:
  - `magatamallm`
  - `fo_blogllm`
  - `tip_llm`
  - `pulso_llm`
  - `contact_llm`

## Preflight

- MAGATAMA services were online on Erik.
- Active RunPod endpoint reported by MAGATAMA: `0rmkf28w2g5gip`.
- RunPod worker kind: `custom-magatama`.
- Dataset source: URL-based lane export.
- Previous successful/adopted runs existed for:
  - `magatamallm`
  - `fo_blogllm`
  - `tip_llm`
- No previous adopted RunPod run existed for:
  - `pulso_llm`
  - `contact_llm`

## Runner Fix

- Fixed `scripts/trigger_lane_training_once.py` locally and on Erik.
- The script previously sent stale payload keys:
  - `iterations`
  - `seedOnly`
- The MAGATAMA training API expects:
  - `iters`
  - `seed_only`
- Added `all` mode to run all lanes sequentially.
- Added streamed SSE logging so progress is visible during long RunPod runs.
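
The key rename and the streamed logging can be sketched as below. This is a minimal illustration, not the contents of `scripts/trigger_lane_training_once.py`: the function names, the `lane` payload field, and the lane order are assumptions, and the SSE handling is simplified to write each `data:` line as it arrives instead of assembling multi-line events.

```python
from typing import IO, Dict, Iterable, Iterator

# Lane names from the Scope list; the order used by the real `all` mode is an assumption.
LANES = ["magatamallm", "fo_blogllm", "tip_llm", "pulso_llm", "contact_llm"]

# Stale key -> key the training API expects (taken from the notes above).
KEY_RENAMES = {"iterations": "iters", "seedOnly": "seed_only"}


def build_payload(lane: str, iters: int, seed_only: bool) -> Dict[str, object]:
    """Build a request body with the current key names.

    Only `iters` and `seed_only` are confirmed by the notes;
    the `lane` field name is hypothetical.
    """
    return {"lane": lane, "iters": iters, "seed_only": seed_only}


def migrate_payload(old: Dict[str, object]) -> Dict[str, object]:
    """Rename stale keys in an old-style payload, leaving other keys untouched."""
    return {KEY_RENAMES.get(key, key): value for key, value in old.items()}


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each SSE `data:` line as soon as the line arrives.

    Comment lines (starting with ':'), other fields, and blank separators are skipped.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips one optional space after the colon.
            yield value[1:] if value.startswith(" ") else value


def stream_to_log(lines: Iterable[str], log: IO[str]) -> None:
    """Write each data payload to the log immediately and flush, so progress is visible."""
    for data in iter_sse_data(lines):
        log.write(data + "\n")
        log.flush()
```

If the script uses `requests`, the line source could be `response.iter_lines(decode_unicode=True)` on a request made with `stream=True`; whether the real script does this is not stated in the notes.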

## Live Run

- Started on Erik:
  - `python3 -u scripts/trigger_lane_training_once.py all 500 false`
- Log:
  - `/opt/magatama/logs/runpod-all-lanes-20260509T230549Z.log`
- First active lane:
  - `magatamallm`
- First RunPod job:
  - `89627e7e-8533-45db-9fe8-eca994018aa6-e2`
- Initial `magatamallm` dataset:
  - `1375 train`
  - `153 eval`
  - `1528 total`
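
As a rough sketch of how the command line above might be interpreted, assuming the three positional arguments are lane (or `all`), iteration count, and the seed-only flag. That mapping is an inference from the payload keys, not something the notes confirm, and all names below are hypothetical. The second helper just checks the dataset arithmetic.

```python
from typing import List, Tuple

# Lane names from the Scope list; the real script's ordering is an assumption.
LANES = ["magatamallm", "fo_blogllm", "tip_llm", "pulso_llm", "contact_llm"]


def parse_bool(text: str) -> bool:
    """Parse a textual flag explicitly; bool("false") would be True in Python."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise ValueError(f"not a boolean flag: {text!r}")


def parse_args(argv: List[str]) -> Tuple[List[str], int, bool]:
    """Map `<lane|all> <iters> <seed_only>` to (lanes, iters, seed_only)."""
    lane_arg, iters_arg, seed_only_arg = argv
    lanes = list(LANES) if lane_arg == "all" else [lane_arg]
    return lanes, int(iters_arg), parse_bool(seed_only_arg)


def split_summary(train: int, eval_count: int) -> Tuple[int, float]:
    """Return the total example count and the eval share of that total."""
    total = train + eval_count
    return total, eval_count / total


lanes, iters, seed_only = parse_args(["all", "500", "false"])
total, eval_share = split_summary(1375, 153)
print(lanes[0], iters, seed_only, total, round(eval_share, 3))
```

Under these assumptions the run covers all five lanes at 500 iterations with seed-only disabled, and the reported split is consistent: 1375 + 153 = 1528, an eval share of about 10%.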

## Success Rule

- Do not treat RunPod `COMPLETED` as success by itself.
- A lane is only successful when:
  - the model artifact exists,
  - MAGATAMA imports/adopts it locally,
  - smoke checks pass,
  - the active alias/version is updated.
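
The rule above can be expressed as a single predicate. This is a sketch under stated assumptions: the `LaneResult` record and its field names are hypothetical, and treating `COMPLETED` as necessary (but not sufficient) is one reading of the rule, not confirmed MAGATAMA behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LaneResult:
    """Hypothetical per-lane outcome record; field names are illustrative."""
    runpod_status: str
    artifact_exists: bool
    adopted: bool
    smoke_passed: bool
    alias_updated: bool


def failed_checks(result: LaneResult) -> List[str]:
    """Return the names of all checks that did not pass, in rule order."""
    checks = [
        ("runpod_completed", result.runpod_status == "COMPLETED"),
        ("artifact_exists", result.artifact_exists),
        ("adopted", result.adopted),
        ("smoke_passed", result.smoke_passed),
        ("alias_updated", result.alias_updated),
    ]
    return [name for name, ok in checks if not ok]


def lane_succeeded(result: LaneResult) -> bool:
    """A lane succeeds only when every check passes."""
    return not failed_checks(result)
```

Returning the list of failed checks, and not only a boolean, makes it easier to log why a lane with a `COMPLETED` job was still counted as a failure.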