Compare commits
2 Commits
d61c3f7982
...
7f4e7f03ad
| Author | SHA1 | Date |
|---|---|---|
| | 7f4e7f03ad | |
| | b60fb362e8 | |
@@ -1,9 +1,67 @@
# Current TIP Sync State

Updated: 2026-05-09 16:00 UTC
Updated: 2026-05-09 16:05 UTC

## Newest Work

- MAGATAMA training live cleanup and TIP_LLM adoption closure on 2026-05-09:
  - operator requirement:
    - no local Mac Studio training may consume the full workstation by default
    - RunPod success must mean artifact exists, local import works, alias/version switches, smoke tests pass, and metadata is written back
    - stale RunPod jobs must not keep the UI in a fake "running" state
  - live cleanup completed:
    - cancelled stale RunPod job `83baffe9-d702-43fc-a2b0-bd5818b74059-e2` on old endpoint `ocnuj82cowe2ym`
    - copied local `tip_llm-last_run.json` back to Erik under `/root/magatama-llm/fine-tuning/`
    - appended remote training registry event `completed_and_adopted` for custom-worker job `dd35df4a-99f7-468f-8c9e-be19baa78338-e1`
    - live dashboard now reports `activeRun: null` for `tip_llm` instead of stale in-queue work
  - adopted model state:
    - active TIP_LLM alias is `tip-llm-v1`
    - release alias is `tip-llm-v1-r1`
    - source artifact is `renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14`
    - local smoke test returned exact `TIP_OK`
  - dashboard hardening:
    - stale active training detection now collapses registry rows by job/run and ignores terminal, expired, 404, or cancelled RunPod jobs
    - deployed patched `packages/dashboard/dist/server.js` and restarted `magatama-dashboard`
  - Mac Studio safety:
    - local training now defaults to `nice=+10`, BLAS/OpenMP thread caps of `4`, tokenizer parallelism off, and MPS high-watermark ratio `0.70`
    - full-speed local training requires explicit `MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1`
  - live verification:
    - `tip_llm` reports `modelVersion=tip-llm-v1-r1`, `lastRegistryRunStatus=completed_and_adopted`, `activeRun=null`
    - `fo_blogllm` still uses its lane-specific pool and active provider `ollama:fo-blog-v7`
  - open:
    - run the same hardened custom-worker end-to-end path for `magatamallm` and the next `fo_blogllm` version
    - keep Gitea/proxmox mirror work as a separate infrastructure closure item

- ATGBICS deterministic special-case backfill on 2026-05-09:
  - precheck:
    - after the explicit URL evidence pass, ATGBICS still had `139` near-complete rows
    - `32` matched safe protocol/product-class cases:
      - loopback/test modules
      - 10GBASE-T / RJ45 copper
      - 10GBASE-LRM
      - BX60 / BXD-60 / BXU-60
      - CWDM 10G 60km
      - CSR rows
  - DB correction:
    - loopback/test modules -> `N/A` reach/fiber/wavelength, `Loopback / Test Module`
    - 10GBASE-T/RJ45 -> `30m`, `Copper`, `N/A`
    - LRM -> `220m`, `MMF`, `1310`
    - BX60 -> `60km`, `SMF`, directional BiDi wavelength evidence
    - CWDM 10G 60 -> `60km`, `SMF`, source wavelength
    - CSR -> `400m`, `MMF`, `850`
  - result:
    - `32` ATGBICS rows detail-verified
    - `32` additional rows promoted to fully verified
    - ATGBICS near-complete missing details reduced from `139` to `107`
    - global `details_verified=12030`
    - global `fully_verified=10753`
  - health:
    - public TIP health stayed `healthy`
    - load status `ok`
    - memory used `12%`
  - truth:
    - remaining ATGBICS rows need detail-page extraction; they are mostly generic OEM/part-number pages where URL slug does not encode the reach

- ATGBICS explicit URL evidence backfill on 2026-05-09:
  - precheck:
    - ATGBICS had `485` price+image+URL-complete rows still lacking detail verification

@@ -0,0 +1,40 @@
# ATGBICS Deterministic Special-Case Backfill - 2026-05-09

## Precheck

- After the explicit URL evidence pass, ATGBICS still had `139` near-complete rows
- `32` matched safe protocol/product-class cases:
  - loopback/test modules
  - 10GBASE-T / RJ45 copper
  - 10GBASE-LRM
  - BX60 / BXD-60 / BXU-60
  - CWDM 10G 60km
  - CSR rows

## DB Correction

- Loopback/test modules -> `N/A` reach/fiber/wavelength, `Loopback / Test Module`
- 10GBASE-T/RJ45 -> `30m`, `Copper`, `N/A`
- LRM -> `220m`, `MMF`, `1310`
- BX60 -> `60km`, `SMF`, directional BiDi wavelength evidence
- CWDM 10G 60 -> `60km`, `SMF`, source wavelength
- CSR -> `400m`, `MMF`, `850`

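The corrections listed under DB Correction can be read as a small deterministic rule table. The following is a minimal sketch under stated assumptions, not the backfill script itself: the `Detail` type, the `classify` function, the `label` field, and the regular expressions are all hypothetical, and the only values taken from the log are the reach, fiber, and wavelength entries. For BX60 and CWDM rows the log says the wavelength comes from source evidence, so the sketch declines to classify when no wavelength is supplied rather than guessing one.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Detail:
    reach: str
    fiber: str
    wavelength: str
    label: Optional[str] = None  # hypothetical field for the product-class label


# Rules whose values are fully fixed by the product class (values from the log).
# The title patterns are assumptions about how product titles are written.
FIXED_RULES = [
    (re.compile(r"loopback|test module", re.I),
     Detail("N/A", "N/A", "N/A", "Loopback / Test Module")),
    (re.compile(r"10GBASE-T\b|RJ45", re.I), Detail("30m", "Copper", "N/A")),
    (re.compile(r"\bLRM\b", re.I), Detail("220m", "MMF", "1310")),
    (re.compile(r"\bCSR4?\b", re.I), Detail("400m", "MMF", "850")),
]

# Rules where reach and fiber are fixed but the wavelength must come from the source.
EVIDENCE_RULES = [
    re.compile(r"\bBX[DU]?-?60\b", re.I),  # BX60 / BXD-60 / BXU-60
    re.compile(r"CWDM.*60\s*km", re.I),    # CWDM 10G 60km
]


def classify(title: str, source_wavelength: Optional[str] = None) -> Optional[Detail]:
    """Return corrected details for a safe special case, or None to leave the row alone."""
    for pattern, detail in FIXED_RULES:
        if pattern.search(title):
            return detail
    for pattern in EVIDENCE_RULES:
        if pattern.search(title):
            if source_wavelength is None:
                return None  # no wavelength evidence: do not guess
            return Detail("60km", "SMF", source_wavelength)
    return None  # not a special case; needs detail-page extraction
```

Returning `None` for anything that does not match keeps the pass conservative: rows outside the listed product classes stay untouched.
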
## Result

- `32` ATGBICS rows detail-verified
- `32` additional rows promoted to fully verified
- ATGBICS near-complete missing details reduced from `139` to `107`
- Global `details_verified=12030`
- Global `fully_verified=10753`

## Health

- TIP public health stayed `healthy`
- Load status stayed `ok`
- Memory used `12%`

## Truth Policy

Remaining ATGBICS rows need detail-page extraction. They are mostly generic OEM/part-number pages where the URL slug does not encode the reach.

73
sync/history/2026-05-09-magatama-training-live-cleanup.md
Normal file
@@ -0,0 +1,73 @@
# MAGATAMA Training Live Cleanup and TIP_LLM Adoption Closure

Date: 2026-05-09

## Context

MAGATAMA training automation previously treated RunPod `COMPLETED` as too strong a success signal even when the expected model artifact was not visible or imported. The UI also kept stale RunPod jobs visible as active training. The operator also required Mac Studio local training to stay throttled so normal workstation use remains possible.

## Completed

- Adopted the custom-worker TIP_LLM artifact locally:
  - artifact: `renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14`
  - active alias: `tip-llm-v1`
  - release alias: `tip-llm-v1-r1`
  - live smoke: prompt "Reply with exactly TIP_OK" returned `TIP_OK`
- Copied the local TIP_LLM last-run metadata back to Erik:
  - source: `/Users/renefichtmueller/magatama-llm/fine-tuning/tip_llm-last_run.json`
  - target: `/root/magatama-llm/fine-tuning/tip_llm-last_run.json`
- Appended a remote registry event marking the real successful custom-worker run as `completed_and_adopted`:
  - job: `dd35df4a-99f7-468f-8c9e-be19baa78338-e1`
  - run id: `tip_llm-2026-05-09T13-16-14`
  - endpoint: `0rmkf28w2g5gip`
- Cancelled stale old-endpoint work that kept the UI confused:
  - endpoint: `ocnuj82cowe2ym`
  - job: `83baffe9-d702-43fc-a2b0-bd5818b74059-e2`
  - final status: `CANCELLED`
- Hardened dashboard active-run detection:
  - collapses registry rows by job/run key
  - ignores terminal, stale, cancelled, expired, 404, and otherwise non-active RunPod jobs
  - passes the dynamic lane endpoint into active-run lookup
  - deployed patched dashboard server bundle and restarted `magatama-dashboard`
- Hardened local Mac Studio training defaults:
  - `nice=+10`
  - `OMP_NUM_THREADS=4`
  - `MKL_NUM_THREADS=4`
  - `OPENBLAS_NUM_THREADS=4`
  - `VECLIB_MAXIMUM_THREADS=4`
  - `NUMEXPR_NUM_THREADS=4`
  - `TOKENIZERS_PARALLELISM=false`
  - `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.70`
  - full unthrottled local training now requires explicit `MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1`

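The Mac Studio throttling defaults in the Completed list can be illustrated with a small launcher helper. This is a sketch under stated assumptions, not the actual MAGATAMA launcher: the function name `build_local_train_command` is hypothetical, and the sketch assumes the caps are applied unconditionally unless the opt-out flag is set, which the log does not confirm. The variable names and values are the ones listed above.

```python
from typing import Dict, List, Mapping, Tuple

# Environment caps taken from the list above.
THROTTLE_ENV = {
    "OMP_NUM_THREADS": "4",
    "MKL_NUM_THREADS": "4",
    "OPENBLAS_NUM_THREADS": "4",
    "VECLIB_MAXIMUM_THREADS": "4",
    "NUMEXPR_NUM_THREADS": "4",
    "TOKENIZERS_PARALLELISM": "false",
    "PYTORCH_MPS_HIGH_WATERMARK_RATIO": "0.70",
}
NICE_INCREMENT = 10  # corresponds to nice=+10
OPT_OUT_FLAG = "MAGATAMA_LOCAL_TRAIN_UNTHROTTLED"


def build_local_train_command(
    train_cmd: List[str], base_env: Mapping[str, str]
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a local training run, throttled unless opted out."""
    env = dict(base_env)
    if env.get(OPT_OUT_FLAG) == "1":
        # Explicit opt-out: run the command unchanged at full speed.
        return list(train_cmd), env
    # Assumption: caps overwrite any values already present in the environment.
    env.update(THROTTLE_ENV)
    # Prefix with `nice -n 10` so the scheduler deprioritizes the training process.
    return ["nice", "-n", str(NICE_INCREMENT), *train_cmd], env
```

The returned pair can be passed to `subprocess.run(argv, env=env)`; a caller would typically start from `os.environ` as `base_env`.
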
## Live Verification

- `tip_llm` live status:
  - active provider: `ollama:tip-llm-v1`
  - model version: `tip-llm-v1-r1`
  - last registry status: `completed_and_adopted`
  - active run: `null`
  - last training timestamp: `2026-05-09T14:48:24Z`
- `fo_blogllm` live status:
  - active provider: `ollama:fo-blog-v7`
  - lane-specific source: `/opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
  - current pool: `17322` train, `1926` eval, `19267` total

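The `active run: null` result for `tip_llm` follows from the hardened active-run detection described under Completed: registry rows are collapsed per job/run key, and only genuinely active jobs on the lane's current endpoint count. A minimal sketch of that logic might look as follows. The row field names (`job_id`, `run_id`, `lane`, `endpoint`, `status`) and both function names are hypothetical, and the sketch assumes the registry is append-only with the newest row last. It uses an allowlist of RunPod's in-flight statuses so that anything else, including unknown or 404-style states, is treated as non-active.

```python
from typing import Dict, Iterable, List, Mapping, Optional

# Assumption: only these RunPod job statuses mean a job is still in flight.
ACTIVE_STATUSES = {"IN_QUEUE", "IN_PROGRESS"}

Row = Mapping[str, str]


def collapse_rows(rows: Iterable[Row]) -> List[Row]:
    """Keep only the most recent registry row per job/run key."""
    latest: Dict[str, Row] = {}
    for row in rows:  # assumed oldest first, so later rows overwrite earlier ones
        key = row.get("job_id") or row.get("run_id") or ""
        latest[key] = row
    return list(latest.values())


def find_active_run(rows: Iterable[Row], lane: str, endpoint: str) -> Optional[Row]:
    """Return the active run for a lane on its current endpoint, or None."""
    for row in collapse_rows(rows):
        if row.get("lane") != lane or row.get("endpoint") != endpoint:
            continue  # other lanes and old endpoints never count as active
        if row.get("status", "").upper() in ACTIVE_STATUSES:
            return row
    return None
```

With this shape, a job that was queued and later cancelled collapses to its final `CANCELLED` row, so it can no longer hold the UI in a running state.
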
## Decisions

- A training run is not successful unless all gates pass:
  - dataset prepared from the lane's own pool
  - RunPod job completes
  - expected artifact exists
  - artifact imports locally
  - Ollama alias/version is switched
  - smoke tests pass
  - metadata and registry are written back
- Mac Studio local training stays throttled by default.
- RunPod Serverless can stay, but the generic managed Axolotl endpoint is not trustworthy for adoption unless it publishes artifacts. The custom MAGATAMA worker path is the reliable path.

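The gate list in the first decision can be expressed as an ordered checklist in which a run only counts as successful when every gate passes. The sketch below is illustrative only: the gate identifiers and the `first_failed_gate` function are hypothetical names, and each check is supplied by the caller as a zero-argument callable, since the log does not describe how the individual checks are implemented.

```python
from typing import Callable, Mapping, Optional

# Gates in the order given in the decision above (identifiers are hypothetical).
GATES = [
    "dataset_prepared_from_lane_pool",
    "runpod_job_completed",
    "artifact_exists",
    "artifact_imports_locally",
    "ollama_alias_switched",
    "smoke_tests_pass",
    "metadata_and_registry_written",
]


def first_failed_gate(checks: Mapping[str, Callable[[], bool]]) -> Optional[str]:
    """Run the gates in order; return the first failing gate, or None on success."""
    for gate in GATES:
        check = checks.get(gate)
        # A missing check counts as a failure, so a gate cannot be skipped silently.
        if check is None or not check():
            return gate
    return None


def run_is_successful(checks: Mapping[str, Callable[[], bool]]) -> bool:
    """A run is successful only if no gate fails."""
    return first_failed_gate(checks) is None
```

Stopping at the first failure reflects that later gates depend on earlier ones: a RunPod job that reports `COMPLETED` without a visible artifact fails at `artifact_exists` and is never reported as adopted.
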
## Open

- Repeat the hardened custom-worker end-to-end path for `magatamallm`.
- Repeat the hardened custom-worker end-to-end path for the next `fo_blogllm` version.
- Mirror the Gitea learning pools between hosted Gitea and Proxmox Gitea as a separate infrastructure task.