transceiver-db/sync/CURRENT.md
2026-05-07 11:04:22 +02:00

# Current TIP Sync State
Updated: 2026-05-07 08:05 UTC
## Newest Work
- MAGATAMA RunPod custom worker preparation continued on 2026-05-07:
- the pending sync handoff was committed and **successfully pushed to Gitea**:
- commit:
- `2a35761 sync: record runpod managed endpoint root cause`
- MAGATAMA repo now includes an explicit helper for building/publishing the custom RunPod worker image:
- `magatama/scripts/runpod_worker_publish.sh`
- new package script:
- `pnpm runpod:worker:publish`
- helper behavior:
- expects:
- `RUNPOD_WORKER_IMAGE`
- supports:
- `GHCR_USERNAME`
- `GHCR_TOKEN`
- `RUNPOD_WORKER_TAG`
- `RUNPOD_WORKER_PUSH_MODE=push|load`
- prints the exact next environment variables required on Erik after image publication:
- `RUNPOD_WORKER_KIND=custom-magatama`
- `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
- `magatama/packages/fine-tuner/RUNPOD.md` was extended so the full automation target is now documented end-to-end:
- lane pool sync
- RunPod dataset URL bundle
- custom worker training
- adapter upload
- local adoption
- smoke tests
- release alias minting
- active alias switch
- Erik infrastructure truth was rechecked:
- `docker` exists:
- `/usr/bin/docker`
- `docker buildx` exists:
- `github.com/docker/buildx v0.33.0`
- **no docker registry login/config** is currently present on Erik:
- `~/.docker/config.json` absent
- interpretation:
- Erik can build images
- but cannot yet push a public/private worker image to GHCR/Docker Hub without credentials or a pre-authenticated registry path
- the missing custom worker files were synced live to Erik:
- `/opt/magatama/packages/fine-tuner/Dockerfile.runpod`
- `/opt/magatama/packages/fine-tuner/RUNPOD.md`
- a real remote worker image build was then attempted on Erik:
- image tag requested:
- `magatama-runpod-worker:test`
- build truth:
- base `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04` pulled successfully
- Python dependencies for the worker installed successfully
- build reached:
- `COPY train_cuda.py runpod_handler.py ./`
- `exporting to image`
- however:
- final image was **not yet visible** in `docker images`
- therefore the build still needs one more clean verification pass before being treated as green
- current operational conclusion:
- MAGATAMA training pools, lane separation, signed dataset URL path, and local adoption API are ready
- the final blocking step remains infrastructure:
- publish the custom worker image to a registry RunPod can consume
- create/switch the endpoint
- then set on Erik:
- `RUNPOD_WORKER_KIND=custom-magatama`
- `RUNPOD_ENDPOINT_ID=<custom endpoint id>`
- once that is done, MAGATAMA's already-prepared code path can finally perform:
- train
- verify artifact
- adopt locally
- smoke-test
- bump version
- switch alias
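The environment contract of the publish helper described above can be sketched as a small validator. This is a minimal illustration, not the content of `runpod_worker_publish.sh`: the function names, the `push` default mode, the `latest` default tag, and the rule that both GHCR variables must be set together are assumptions.

```python
ALLOWED_PUSH_MODES = {"push", "load"}


def resolve_publish_config(env):
    """Validate the helper's environment variables and return a plain dict."""
    image = env.get("RUNPOD_WORKER_IMAGE", "").strip()
    if not image:
        raise ValueError("RUNPOD_WORKER_IMAGE is required")
    # Assumption: "push" is the default mode, "latest" the default tag.
    mode = env.get("RUNPOD_WORKER_PUSH_MODE", "push")
    if mode not in ALLOWED_PUSH_MODES:
        raise ValueError("RUNPOD_WORKER_PUSH_MODE must be push or load")
    has_user = bool(env.get("GHCR_USERNAME"))
    has_token = bool(env.get("GHCR_TOKEN"))
    # Assumption: registry credentials only make sense as a pair.
    if has_user != has_token:
        raise ValueError("GHCR_USERNAME and GHCR_TOKEN must be set together")
    return {
        "image": image,
        "tag": env.get("RUNPOD_WORKER_TAG", "latest"),
        "mode": mode,
        "registry_login": has_user and has_token,
    }


def next_env_hint(endpoint_id="<custom-endpoint>"):
    """Return the variables that must be set on Erik after publication."""
    return [
        "RUNPOD_WORKER_KIND=custom-magatama",
        f"RUNPOD_ENDPOINT_ID={endpoint_id}",
    ]
```

With no registry credentials present (the current state on Erik), `registry_login` comes back `False`, which matches the observation that only `load` mode is usable until a login exists.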
- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
- Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
- `magatama/packages/dashboard/public/index-v2.html`
- real behavior now:
- if graph node maps to a real finding, open the existing ticket/finding drawer
- if node is only synthetic, show an explicit warning instead of doing nothing
- deployed to:
- `/opt/magatama/packages/dashboard/public/index-v2.html`
- `pm2 restart magatama-dashboard` executed
- local Mac train API truth rechecked:
- `GET http://127.0.0.1:3214/health`
- returns `status = ok`
- service is idle/reachable, not broken
- RunPod heartbeat/UI stream issue was fixed live:
- dashboard server now emits keepalive progress messages during:
- long `IN_PROGRESS` phases
- post-`COMPLETED` artifact verification loops
- deployed live to Erik dashboard
- direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
- tiny 1-step `tip_llm` canary job:
- `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
- observed raw status sequence:
- `IN_QUEUE`
- `IN_PROGRESS`
- `COMPLETED`
- **critical truth**:
- `/status/{job}` returned no `output`
- `/stream/{job}` returned:
- `{"status":"COMPLETED","stream":[]}`
- interpretation:
- the currently configured endpoint is the managed Axolotl serverless endpoint
- it does not return a programmatically adoptable artifact reference to MAGATAMA
- this is why all lanes keep ending in:
- `completed_without_model_artifact`
- Erik secrets reality rechecked:
- `/opt/magatama/secrets/hf-token` exists and is readable by the running process
- therefore the current failure is **not** caused by a missing HF token on Erik
- root cause now considered confirmed:
- the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
- but not sufficient for MAGATAMA's required full automation:
- train
- return explicit artifact
- adopt locally
- smoke-test
- create new release alias
- switch active alias
- code path for the correct architecture is now prepared:
- `magatama/packages/fine-tuner/runpod_handler.py`
- `magatama/packages/fine-tuner/train_cuda.py`
- `magatama/packages/fine-tuner/requirements-runpod.txt`
- `magatama/packages/dashboard/src/server.ts`
- what changed in that path:
- custom RunPod worker now accepts:
- `target_model`
- `credentials.hf_token`
- training script now:
- trains lane-specific bundle
- uploads the resulting adapter folder to Hugging Face
- returns `adapter_repo_id`
- dashboard custom-worker submit path now includes:
- `run_id`
- `target_model`
- HF credential pass-through for the worker
- dashboard error text is now explicit:
- if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
- live deployment status:
- updated dashboard server was rebuilt and deployed to Erik
- updated custom worker source files were synced into Erik repo state
- however:
- the currently active RunPod endpoint is still the managed Axolotl endpoint
- the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
- operational conclusion:
- training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
- the final missing infrastructure step is:
- build/publish `packages/fine-tuner/Dockerfile.runpod`
- create/use a custom RunPod serverless endpoint for `runpod_handler.py`
- set:
- `RUNPOD_WORKER_KIND=custom-magatama`
- `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
- only then can MAGATAMA honestly achieve:
- automatic training
- automatic artifact return
- automatic adoption
- automatic version bump
- automatic alias switch after smoke tests
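The canary finding above (a `COMPLETED` job with no `output`) can be expressed as a small classifier. This is a sketch under assumptions: it assumes the custom worker's `adapter_repo_id` would arrive inside the `output` object of the status payload, and the labels `adoptable` and `not_terminal_success` are hypothetical; only `completed_without_model_artifact` is taken from the registry states recorded here.

```python
def classify_runpod_result(status_payload):
    """Map a raw /status/{job} payload to an adoption-relevant label."""
    if status_payload.get("status") != "COMPLETED":
        return "not_terminal_success"  # hypothetical label
    output = status_payload.get("output")
    # Assumption: the worker's return value is exposed under "output".
    repo_id = output.get("adapter_repo_id") if isinstance(output, dict) else None
    if repo_id:
        return "adoptable"  # hypothetical label
    return "completed_without_model_artifact"
```

Feeding it the payload observed from the managed Axolotl endpoint, `{"status": "COMPLETED", "stream": []}`, yields `completed_without_model_artifact`.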
## Active Policy
- Put coordination notes and handoffs in this `sync/` folder and push to Gitea.
- Check sibling project sync folders first when context may span repos.
- Use TIPLLM only for TIP crawler/robot planning and extraction feedback.
- Write robot/crawler experience into the Gitea-backed TIPLLM training pool.
- Keep Erik safe: no heavy crawler waves or uncontrolled Playwright/discovery jobs on Erik.
- Use Proxmox/Pi workers for crawl load.
## Cross-Repo Sync
Claude Code also created a Gitea sync handoff in the LLM Gateway repo:
- Repo: `rene/llm-gateway`
- Path: `sync/`
- Commit shown by Claude: `e272105 sync: add chat handoff + context scaffolding for Codex integration (2026-04-29)`
- Gitea path: `http://192.168.178.196:3000/rene/llm-gateway/src/main/sync/`
When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infrastructure, read both:
- `transceiver-db/sync/CURRENT.md`
- `llm-gateway/sync/CURRENT.md`
## Latest Work
- RunPod/MAGATAMA training live follow-up on 2026-05-07:
- latest `magatamallm` serverless run verified on Erik:
- job id:
- `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`
- registry truth in:
- `/opt/magatama/training-data/model-registry/training-runs.json`
- observed states:
- `submitted`
- then `completed_without_model_artifact`
- exact recorded warning:
- `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.` (English: "RunPod reported COMPLETED, but the expected Hugging Face model repo was not found.")
- interpretation:
- dataset build and RunPod submit are working
- the worker still does not return a verifiable adoptable model artifact
- this is a real training return-path failure, not just a cosmetic UI issue
- local training API truth rechecked:
- `GET http://127.0.0.1:3214/health`
- service responds with:
- `status = ok`
- `service = magatama-train-api`
- `running = false`
- `pid = null`
- meaning:
- API is healthy/reachable
- currently idle
- ready for adoption/import calls once a valid RunPod artifact exists
- one UI bug in the training modal was fixed live:
- root cause:
- during long `IN_PROGRESS` and post-`COMPLETED` artifact verification phases, MAGATAMA sent no heartbeat for too long
- browser/proxy could then terminate the stream and surface only:
- `network error`
- even though Erik had already written the more truthful registry state
- fix:
- `magatama/packages/dashboard/src/server.ts`
- added server-sent heartbeat messages while:
- RunPod status remains unchanged
- Hugging Face / artifact propagation checks are still running
- concrete live strings now deployed in Erik dashboard server:
- `⏳ RunPod arbeitet weiter (...)` (English: "RunPod is still working (...)")
- `⏳ Prüfe Modellartefakt ...` (English: "Checking model artifact ...")
- deployment:
- rebuilt dashboard
- rsynced `packages/dashboard/dist/server.js` to Erik
- restarted `pm2 magatama-dashboard`
- remote `server.js` verified to contain heartbeat strings
- expected operator effect:
- future training runs should no longer collapse into a late generic `network error` while RunPod/adoption checks are still active
- the UI should stay alive long enough to show the real terminal result: `completed_and_adopted`, `completed_without_model_artifact`, or a worker/adoption failure
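The heartbeat fix above can be illustrated with a small generator: it emits a status event whenever the polled RunPod status changes and a keepalive event during quiet stretches. This is a Python sketch of the idea only; the real change lives in `server.ts`, and the function names, the event names, and the `heartbeat_every` parameter are assumptions.

```python
def stream_events(polled_statuses, heartbeat_every=3):
    """Yield ("status", s) on each change and ("heartbeat", s) while unchanged."""
    last = None
    quiet_polls = 0
    for status in polled_statuses:
        if status != last:
            last = status
            quiet_polls = 0
            yield ("status", status)
            continue
        quiet_polls += 1
        # Emit a keepalive every N unchanged polls so proxies keep the stream open.
        if quiet_polls % heartbeat_every == 0:
            yield ("heartbeat", status)


def sse(event, data):
    """Format one server-sent event frame (event name plus single-line data)."""
    return f"event: {event}\ndata: {data}\n\n"
```

The same loop shape would apply to the post-`COMPLETED` artifact verification phase, with the artifact check result taking the place of the polled status.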
- MAGATAMA live follow-up on 2026-05-07:
- local Mac training API was rechecked after the lane-specific automation changes.
- current live truth:
- LaunchAgent `org.fichtmueller.magatama-train-api` is present and running
- process listens on `*:3214`
- localhost health now responds when checked outside sandbox restrictions:
- `GET http://127.0.0.1:3214/health`
- response:
- `status = ok`
- `service = magatama-train-api`
- `running = false`
- `pid = null`
- `updated_at = 2026-05-07T04:14:23Z`
- interpretation:
- the training API itself is healthy and reachable
- it is currently idle, not broken
- the actual next proof point must come from a fresh lane run that writes lane-specific `*-last_run.json`
- live Attack Paths UI bug was fixed and deployed to Erik:
- root cause:
- the `Open Fix Guidance` button inside the attack-path side panel only triggered a dummy toast and never opened a real finding/ticket detail
- fix:
- `magatama/packages/dashboard/public/index-v2.html`
- new helper:
- `openFixGuidanceForNode(nodeId)`
- behavior:
- if the clicked graph node maps to a real finding ID, MAGATAMA now opens the existing ticket/finding detail drawer via `openTicket(id)`
- if the node is only a synthetic path node with no backing finding, MAGATAMA now shows an explicit warning instead of pretending to open guidance
- live deployment:
- updated `index-v2.html` was rsynced to:
- `/opt/magatama/packages/dashboard/public/index-v2.html`
- `pm2 restart magatama-dashboard` executed on Erik
- deployed file on Erik verified with:
- `openFixGuidanceForNode`
- `Open Fix Guidance`
- operator consequence:
- Attack Paths no longer contain a placebo `Open Fix Guidance` action
- clicking it should now open the actual MAGATAMA finding/ticket guidance path when the graph node represents a real finding
- MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
- target lanes:
- `magatamallm`
- `fo_blogllm`
- `tip_llm`
- core root cause confirmed:
- RunPod dataset refresh / lane export already worked
- RunPod jobs often reached `COMPLETED`
- but model adoption/version truth still depended on a single shared:
- `~/magatama-llm/fine-tuning/last_run.json`
- this made lane status and successful return/adoption ambiguous across models
- the training modal could also collapse late stream/adoption failures into a generic `network error`
- local code fixes now in place:
- `magatama/packages/fine-tuner/training_api.py`
- lane-specific last-run files added:
- `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
- `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
- `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
- legacy `last_run.json` remains only as backward-compatible mirror for `magatamallm`
- successful RunPod adoption now creates:
- a release alias per lane, e.g. `<active-alias>-rN`
- active alias switching sequence is now:
- candidate model imported
- smoke-tested
- release alias created
- stable active alias repointed to that release alias
- adoption report now includes:
- `version_counter`
- `release_alias`
- `magatama/packages/fine-tuner/train.py`
- local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
- `magatama/packages/dashboard/src/server.ts`
- `/api/llm/status` now reads lane-specific last-run metadata first
- `release_alias` is preferred as visible model version when present
- RunPod SSE catch now distinguishes:
- real generic training failure
- `COMPLETED` but no artifact / failed adoption
- the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
- `magatama/packages/dashboard/public/index-v2.html`
- training modal now suppresses misleading late generic `network error` if the server already emitted a terminal training status
- if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
- if the backend reports completed without artifact, completed without HF model, or completed but adoption failed, the modal now shows that exact reason
- local verification:
- `python3 -m py_compile` passed for:
- `training_api.py`
- `train.py`
- dashboard build passed:
- `pnpm -C packages/dashboard build`
- current operational blocker:
- live deployment to Erik was **not yet completed in this step**
- direct SSH checks returned:
- `Connection refused`
- then `Operation timed out`
- because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
- `tip_llm`
- `fo_blogllm`
- practical consequence:
- the code path is now prepared for full automation:
- pull from lane-specific training pool
- train on RunPod
- verify artifact existence
- adopt locally
- create new release alias/version
- repoint stable active alias
- show truthful status in UI
- but the current live Erik run still needs redeploy + verification once SSH is reachable again
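The lane-specific last-run files described above, including the legacy mirror for `magatamallm`, can be sketched as follows. This is a minimal illustration and not the code in `training_api.py`: the function names and the JSON layout of the report are assumptions; the file names follow the `<lane>-last_run.json` pattern listed in this note.

```python
import json
from pathlib import Path

LANES = ("magatamallm", "fo_blogllm", "tip_llm")


def last_run_path(base_dir, lane):
    """Return the lane-specific last-run file inside the fine-tuning directory."""
    if lane not in LANES:
        raise ValueError(f"unknown lane: {lane}")
    return Path(base_dir) / f"{lane}-last_run.json"


def write_last_run(base_dir, lane, report):
    """Write the lane report; mirror it to last_run.json only for magatamallm."""
    path = last_run_path(base_dir, lane)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(report, indent=2)
    path.write_text(payload, encoding="utf-8")
    if lane == "magatamallm":
        # Backward-compatible mirror for readers that still expect the shared file.
        (path.parent / "last_run.json").write_text(payload, encoding="utf-8")
    return path
```

Because only `magatamallm` touches the shared `last_run.json`, a `tip_llm` or `fo_blogllm` run can no longer overwrite another lane's status.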
- MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
- result:
- the lane export / dataset refresh worked
- a new locally adopted MagatamaLLM model did **not** land
- active MAGATAMA provider remains the older alias:
- `ollama:magatama-coder:latest`
- live/public evidence:
- `GET https://magatama.fichtmueller.org/api/llm/status`
- `activeProvider = ollama:magatama-coder:latest`
- `autoFixProvider = ollama:magatama-coder:latest`
- `training.lastTrainingAt = 2026-05-06T22:43:20Z`
- `training.modelVersion = magatama-coder:latest`
- `training.activeRun = null`
- this means the UI timestamp currently reflects the latest dataset/training-state update, not proof of a newly adopted local model.
- local Mac evidence:
- `ollama list` still shows:
- `magatama-coder:latest` → modified `3 weeks ago`
- `magatama-llm-v2-0:latest` → modified `11 days ago`
- no newer Magatama candidate/import alias appeared locally
- registry/adoption evidence:
- Erik lane manifest exists and is fresh:
- `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
- `generatedAt = 2026-05-06T22:45:15.944Z`
- `train = 15679`
- `eval = 1743`
- `total = 17422`
- but Erik had no populated local adoption/registry state files in:
- `/opt/magatama/training-data/model-registry/models.json`
- `/opt/magatama/training-data/model-registry/runs.json`
- `/opt/magatama/training-data/model-registry/active.json`
- `/opt/magatama/data/llm-status.json`
- local repo only had historical `training-data/model-registry/training-runs.json`
- historical run evidence:
- recent `magatamallm` training-run records still show:
- `submitted`
- then `not_found_after_submit`
- or other non-adopted / worker-failure states
- there is still no verified `completed_and_adopted` proof for a new MagatamaLLM local model.
- operational conclusion:
- current truth:
- dataset/lane preparation works
- local model adoption is still the missing step
- MAGATAMA does **not** currently know of any model newer than the already active `magatama-coder:latest` alias
- next fix block remains:
- make RunPod/local completion count only when adoption succeeds
- persist adoption report + model registry state
- update active alias and version only after smoke-tested import succeeds
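The "adoption counts only when it succeeds" rule above can be sketched as a gated sequence in which the stable alias is touched last. This is an illustration under assumptions: `import_model`, `smoke_test`, and `point_alias` are hypothetical callbacks, the report keys other than `version_counter` and `release_alias` are invented for the example, and the counter is assumed to increase by one per adoption.

```python
def adopt_candidate(candidate, active_alias, version_counter,
                    import_model, smoke_test, point_alias):
    """Import, smoke-test, mint a release alias, then repoint the stable alias."""
    report = {"candidate": candidate, "adopted": False}
    if not import_model(candidate):
        report["reason"] = "import_failed"
        return report
    if not smoke_test(candidate):
        report["reason"] = "smoke_test_failed"
        return report
    # Assumption: release aliases follow <active-alias>-rN with N incremented by 1.
    new_version = version_counter + 1
    release = f"{active_alias}-r{new_version}"
    point_alias(release, candidate)     # create the release alias
    point_alias(active_alias, release)  # repoint the stable alias last
    report.update(adopted=True, release_alias=release, version_counter=new_version)
    return report
```

Any failure before the final two calls leaves the currently active alias untouched, which is the property the registry state needs in order to stay truthful.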
- MAGATAMA Switchblade port intelligence is now truly flowing end-to-end on 2026-05-06:
- live root cause:
- Switchblade itself already had the rich SG350 data (`description`, LLDP neighbor, peer port, octets), but MAGATAMA still showed mostly flat port chips.
- verified live on Erik:
- the real Switchblade runtime is the PM2 app `switchblade` under `/opt/switchblade-app`, not the older `/opt/switchblade` tree.
- `GET http://127.0.0.1:3000/api/discovery/snmp` for `192.168.178.2` already returned rich rows such as:
- `GigabitEthernet3` → description `Aruba-1830-UNUSED`, neighbor `VN46KYC0G0`, peer port `11`
- `GigabitEthernet5` → description `Tashi-204`, neighbor `fritz.box`, peer `LAN:1`
- `GigabitEthernet25` → description `to Cisco Business 220 Series`, neighbor `Switch39688E`, peer `gi9`
- the remaining loss point was MAGATAMA's own Switchblade sync/persistence path.
- MAGATAMA sync hardening:
- `scripts/switchblade_live_sync.ts`
- now prefers live SNMP discovery data when it is richer than `/api/devices/<ip>`
- now maps `description`, `peerDevice`, `peerPort`, `connectedHost`, `inOctets`, `outOctets` into rack device ports
- added optional debug snapshot dump support via `SWITCHBLADE_DEBUG_SNAPSHOT_FILE`
- sanitizes unreadable peer-port strings and drops synthetic high-index numeric pseudo-ports
- verified with a forced live run on Erik:
- `Top of Rack Switch` now exports `28` real SG350 ports into the rack snapshot instead of the earlier flattened/odd set
- sample verified payloads before POST:
- port 3 → `Aruba-1830-UNUSED` / `VN46KYC0G0` / `11`
- port 5 → `Tashi-204` / `fritz.box` / `LAN:1`
- port 25 → `to Cisco Business 220 Series` / `Switch39688E` / `gi9`
- MAGATAMA core hardening:
- `packages/core/src/routes/health-types.ts`
- `SwitchbladePortSnapshot` now preserves:
- `description`
- `vlan`
- `macCount`
- `peerDevice`
- `peerPort`
- `connectedHost`
- `transceiver`
- `inOctets`
- `outOctets`
- `packages/core/src/routes/health-support.ts`
- `normalizeSwitchbladePort()` now keeps those additional port fields instead of silently truncating them
- rebuilt locally and re-rsynced the new `packages/core/dist` to Erik
- dashboard/UI hardening:
- `packages/dashboard/public/index-v2.html`
- port chips already had custom tooltip support; now they also carry native `title=` fallback text
- this reduces the old “question mark / unclear hover” problem in browsers that do not immediately show the custom bubble
- live public verification after deploy:
- `GET https://magatama.fichtmueller.org/api/switchblade/snapshot`
- now contains enriched SG350 rack-port records with:
- `description`
- `peerDevice`
- `peerPort`
- `connectedHost`
- `inOctets`
- `outOctets`
- public snapshot timestamp verified:
- `receivedAt = 2026-05-06T22:51:59.247Z`
- `Top of Rack Switch` in the public snapshot now exposes meaningful peer/use-case data instead of only flat status counters
- operator impact:
- MAGATAMA can now answer the actual operational question per port:
- what is on this port
- what is it talking to
- what does the link look like
- this is now grounded in Switchblade live SNMP/LLDP data, not guesswork.
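The two ideas in the sync hardening above, keeping the extra port fields and preferring the richer source, can be sketched like this. It is a Python illustration rather than the TypeScript in `switchblade_live_sync.ts` or `normalizeSwitchbladePort()`: the `name` key, the function names, and the "count non-empty fields" richness measure are assumptions; the preserved field names are the ones listed for `SwitchbladePortSnapshot`.

```python
PORT_FIELDS = (
    "description", "vlan", "macCount", "peerDevice", "peerPort",
    "connectedHost", "transceiver", "inOctets", "outOctets",
)


def normalize_port(raw):
    """Keep the known port fields instead of silently dropping them."""
    port = {"name": raw.get("name")}  # "name" is a hypothetical key
    for field in PORT_FIELDS:
        value = raw.get(field)
        if value not in (None, ""):
            port[field] = value
    return port


def richness(rows):
    """Count populated port fields across all rows (hypothetical measure)."""
    return sum(len(normalize_port(row)) - 1 for row in rows)


def pick_richer(device_rows, snmp_rows):
    """Prefer live SNMP discovery rows only when they carry more detail."""
    return snmp_rows if richness(snmp_rows) > richness(device_rows) else device_rows
```

A zero counter such as `inOctets = 0` is kept on purpose, since it is a real reading and not a missing value.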
- TIP/Blog lane separation was materially corrected on 2026-05-06:
- root cause:
- `TIP_LLM` was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
- local inspection showed the old TIP export had `6250` train rows, of which `6087` still matched blog/writer patterns.
- dataset builder and Gitea sync were hardened:
- `scripts/runpod_dataset_builder.ts`
- added strict `tipDatasetAllowed(...)`
- `TIP_LLM` now rejects blog-shaped source rows at dataset-build time
- `TIP_LLM` now rejects blog-like `system`, `user`, and markdown-article `assistant` patterns
- registry fallback for `TIP_LLM` now only uses lane-compatible datasets
- `scripts/sync_gitea_training_pool.ts`
- canonical TIP pool refresh now uses the stricter lane-alignment rules
- redundant `merged.jsonl` copies for `fo_blogllm` and `tip_llm` are no longer rewritten, to avoid local disk exhaustion from duplicate lane artifacts
- local disk issue encountered and fixed:
- full refresh failed with `ENOSPC` while writing `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
- redundant lane `merged` artifacts for `fo_blogllm` and `tip_llm` were truncated and the sync script was changed to stop recreating them
- free disk space recovered from `377Mi` to `17Gi`
- locally verified after rebuild:
- `TIP_LLM` RunPod export:
- `train = 233`
- `eval = 26`
- `total = 259`
- `blog/writer matches = 0`
- first TIP rows now use the correct TIP system prompt:
- `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
- corrected artifacts and scripts were synced to Erik and `pnpm training:refresh-all` was rerun there.
- live verified on Erik/public API:
- `magatamallm`
- `datasetSource = url`
- `collectedExamples = 15679`
- `evalExamples = 1743`
- `totalExamples = 17422`
- `newSinceLastTraining = 15679`
- `fo_blogllm`
- `datasetSource = url`
- `collectedExamples = 17322`
- `evalExamples = 1926`
- `totalExamples = 19254`
- `neverTrained = true`
- `tip_llm`
- `datasetSource = url`
- `collectedExamples = 231`
- `evalExamples = 26`
- `totalExamples = 257`
- `neverTrained = true`
- operational conclusion:
- lane-specific dataset truth is now real on Erik.
- `TIP_LLM` is no longer silently borrowing the FO_Blog behavior lane.
- the next remaining hard problem is now RunPod artifact adoption/validation, not lane contamination.
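The lane filter described above can be sketched as a row-level predicate. This is an illustration only: the real rules live in `tipDatasetAllowed(...)` in `scripts/runpod_dataset_builder.ts`, and the regular expression, the "two or more headings means article" heuristic, and the row keys `system`, `user`, and `assistant` as plain strings are assumptions.

```python
import re

# Hypothetical patterns; the actual builder rules are not reproduced here.
BLOG_PROMPT = re.compile(r"\b(blog|writer|article)\b", re.IGNORECASE)
HEADING = re.compile(r"^#{1,3} ", re.MULTILINE)


def looks_like_markdown_article(text):
    """Treat text with two or more markdown headings as article-shaped."""
    return len(HEADING.findall(text)) >= 2


def tip_dataset_allowed(row):
    """Reject rows whose prompts or answers look blog/writer-shaped."""
    if BLOG_PROMPT.search(row.get("system", "")):
        return False
    if BLOG_PROMPT.search(row.get("user", "")):
        return False
    if looks_like_markdown_article(row.get("assistant", "")):
        return False
    return True
```

Applying such a predicate at dataset-build time, not after export, is what keeps blog-shaped rows out of the `TIP_LLM` train and eval splits in the first place.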
- MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
- dashboard and core were rebuilt locally and redeployed to Erik.
- live processes restarted successfully:
- `magatama-dashboard`
- `magatama`
- public `api/llm/status` now shows the true lane-export totals for `magatamallm`:
- `collectedExamples = 15620`
- `effectiveExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`
- `newSinceLastTraining = 15620`
- root cause for the stale `1097` display:
- the RunPod start SSE path still logged the legacy deduplicated `fixes.jsonl` corpus.
- this was changed so RunPod launches no longer present the legacy `1097` count as the active training truth.
- after dataset refresh the UI now emits the lane manifest totals instead.
- RunPod completion handling was hardened:
- worker `COMPLETED` is no longer trusted blindly.
- MAGATAMA now scans RunPod worker logs for real training failures (`Traceback`, `SyntaxError`, non-zero exit, etc.) before treating the run as successful.
- if the worker logs show a hidden failure, MAGATAMA records this as `completed_with_worker_failure` instead of pretending the run succeeded.
- public findings state remains currently empty:
- `GET /api/findings?limit=1` returned `{"findings":[],"total":0}`
- this is now rendered with an explicit empty-state row instead of a visually blank table.
- Attack Paths empty-state is now intentionally explicit rather than looking broken.
- Frontend cache and scope handling were hardened:
- cache version bumped to `2026-05-06b`
- stale legacy `magatama_api_cache:*` entries are cleared
- per-endpoint TTLs added
- invalid or empty scope selections are normalized instead of silently leaving the UI in misleading empty views
- Switchblade rack port hover was materially improved:
- port chips now carry `data-tooltip`
- custom tooltip CSS is live on Erik
- the old browser-native “question mark only” behavior should be replaced by a readable hover bubble
- Changelog self-healing was added in core:
- stale cached changelog data older than 6h now forces a rebuild from git history
- verified live via dashboard proxy on Erik:
- `generatedAt = 2026-05-06T15:18:42.708Z`
- latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
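The worker-log check from the completion hardening above can be sketched as a pattern scan that runs before `COMPLETED` is believed. This is a minimal illustration: the exit-code pattern and the label `completed_pending_artifact_check` are assumptions; `Traceback`, `SyntaxError`, and `completed_with_worker_failure` come from this note.

```python
import re

FAILURE_PATTERNS = (
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\bSyntaxError\b"),
    # Hypothetical wording for a non-zero exit; adjust to the worker's real logs.
    re.compile(r"exit(?:ed)?(?: with)?(?: code| status)? [1-9]\d*", re.IGNORECASE),
)


def final_state(runpod_status, worker_log):
    """Downgrade COMPLETED when the worker log shows a hidden training failure."""
    if runpod_status != "COMPLETED":
        return runpod_status.lower()
    if any(pattern.search(worker_log) for pattern in FAILURE_PATTERNS):
        return "completed_with_worker_failure"
    # Hypothetical label: artifact verification still has to follow.
    return "completed_pending_artifact_check"
```

A clean log therefore does not mean success on its own; it only clears the run for the artifact and adoption checks that follow.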
- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
- root cause:
- the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
- dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
- the training modal now refreshes per selected lane and rewrites:
- title
- runtime label
- pool path
- counts
- dataset source
- MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
- `RUNPOD_DATASET_SOURCE=url`
- `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
- `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
- `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
- live verified on Erik after restart:
- `fo_blogllm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
- `train = 28`
- `eval = 4`
- `total = 32`
- `tip_llm`
- `datasetSource = url`
- `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
- `train = 36`
- `eval = 4`
- `total = 40`
- `magatamallm`
- remains on lane-export counts (`15620 / 1736 / 17356`)
- operator impact:
- no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
- every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
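The per-lane dataset source variables and the lane-aware status call above can be sketched as two helpers. This is an illustration under assumptions: the function names are hypothetical, and the precedence (a lane-specific variable wins over the global `RUNPOD_DATASET_SOURCE`) is assumed, since this note only records that all four variables are set to `url`.

```python
from urllib.parse import urlencode


def dataset_source_for(lane, env):
    """Resolve the dataset source for one lane from environment variables."""
    lane_key = f"RUNPOD_DATASET_SOURCE_{lane.upper()}"
    # Assumption: the lane-specific value overrides the global default.
    return env.get(lane_key) or env.get("RUNPOD_DATASET_SOURCE")


def status_url(base_url, lane):
    """Build the lane-aware status URL used by the training modal."""
    return f"{base_url.rstrip('/')}/api/llm/status?{urlencode({'lane': lane})}"
```

For example, `status_url("https://magatama.fichtmueller.org", "tip_llm")` produces a URL ending in `/api/llm/status?lane=tip_llm`, which is the lane-scoped form the modal needs to stop showing the `magatamallm` pool for every lane.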
- MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
- the RunPod serverless training start failure was not a RunPod outage.
- root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
- Codex synced the full local `magatama/scripts/` tree to Erik, added a safe fallback in `scripts/model_registry_build.ts`, and synced the local `training-data/model-registry/` directory.
- verified on Erik:
- `pnpm training:refresh-all` now succeeds.
- fresh dataset totals after dedupe:
- `magatamallm`: `92,742` raw → `17,356` effective (`15,620 train / 1,736 eval`)
- `fo_blogllm`: `32` total (`28 train / 4 eval`)
- `tip_llm`: `40` total (`36 train / 4 eval`)
- important nuance:
- Codex did **not** execute the final Hugging Face publish step from Erik in this chat.
- local/script/build failures are fixed; external dataset publish still depends on the selected dataset source and explicit publish intent.
- MAGATAMA Attack Paths UX is no longer a misleading blank panel:
- the page now distinguishes between:
- no live attack paths
- historical fallback paths
- empty selected scope (`0 assets in scope`)
- when a user narrows the scope to a rack/location with zero scoped assets, the graph explicitly says so instead of looking broken.
- live dashboard HTML on Erik now contains:
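- `Im aktuellen Scope liegen 0 Assets.` (English: "There are 0 assets in the current scope.")
- `Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann.` (English: "Widen the location or datacenter / rack so that MAGATAMA can show correlatable assets and paths.")
- `Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer.` (English: "Without open multi-step correlations, the graph view intentionally stays empty.")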
- MAGATAMA code/training hardening was extended:
- `scripts/test_runpod_adapter.py` no longer loads tokenizer/model with `trust_remote_code=True`.
- `scripts/ollama_adapter_bridge.py` no longer loads tokenizer/model with `trust_remote_code=True`.
- this removed the live CODE finding around `HuggingFace trust_remote_code` on Erik.
- Atlas exposure logic was tightened to stop reopening noisy LAN management findings:
- generic `atlas-exposure` findings now only stay operationally open for exposure that is meaningful enough to track as a finding.
- internal RFC1918 management/service ports discovered by the broad atlas scan are no longer promoted into open Guard findings just because they exist on the LAN.
- host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
- after rebuild + deploy + health sync:
- live Postgres open findings returned to `0`.
- Follow-up hardening on the same block:
- the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
- dataset preparation now distinguishes:
- local `training:refresh-all` failure
- optional Hugging Face publish failure
- URL-based dataset mode with no external publish required
- the training SSE flow now explicitly tells the operator whether RunPod is using:
- Hugging Face dataset source
- or MAGATAMA URL-bundle dataset source
- this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
- follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
- MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
- payloads were aligned more closely with the official Axolotl serverless schema:
- `model_type=AutoModelForCausalLM`
- `tokenizer_type=AutoTokenizer`
- dataset `split: train`
- optimizer `adamw_torch_fused`
- verified full run attempt:
- job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
- disappeared as `not_found_after_submit` (`404 job not found`)
- verified canary after payload fix:
- job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
- immediately materialized as `IN_QUEUE`
- then still disappeared on later reconcile as `not_found_after_submit`
- current conclusion:
- the old MAGATAMA bug is fixed.
- the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
- operational rule:
- do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
- only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.
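The operational rule above can be written down as a predicate over the states observed for one job. This is a sketch: the function name and the idea of passing the observed states as a list are assumptions, and it treats `COMPLETED` as the only terminal state of interest because that is the one recorded in this note.

```python
def run_is_trustworthy(observed_states, has_artifact_evidence):
    """Apply the rule: IN_PROGRESS, or a terminal state backed by an artifact."""
    if "IN_PROGRESS" in observed_states:
        return True
    # `submitted` or a brief IN_QUEUE alone never counts as proof.
    return "COMPLETED" in observed_states and bool(has_artifact_evidence)
```

For the canary above, the sequence `submitted`, `IN_QUEUE`, `not_found_after_submit` evaluates to untrusted, which matches how that job was recorded.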
- follow-up training count fix on 2026-05-06 corrected the Training UI source-of-truth:
- MAGATAMA still showed `1097` because the dashboard was counting the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
- dashboard now prefers `training-data/runpod/magatamallm/manifest.json` for the visible MagatamaLLM training count.
- synced current lane export to Erik and restarted `magatama-dashboard`.
- verified public API now returns:
- `collectedExamples = 1367`
- `effectiveExamples = 1367`
- `evalExamples = 152`
- `totalExamples = 1519`
- `newSinceLastTraining = 1367`
- if the browser still shows `1097`, treat it as stale cached UI and hard reload.
- MAGATAMA was repaired end-to-end to a clean operational baseline:
- live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
- open findings were reduced all the way to `0` in Postgres.
- false-positive Proxmox baseline findings were removed by teaching the audit to treat internal-only management ports and default-only rpcbind exposure as acceptable for this host.
- code scanner false positives from generated/report artifacts remain excluded.
- Live MAGATAMA protection/runtime state after the 2026-05-06 remediation:
- `open findings: 0`
- `queueExecuting: 0`
- `queueBlocked: 0`
- `queueFailed: 0`
- public `/api/health` returns `status: ok`
- public `/api/active-resolvers` returns:
- `MAGATAMA Core: working`
- `MagatamaLLM: working`
- `Claude (secondary): working`
- `Codex (secondary/manual): idle`
- `Copilot (secondary/manual): idle`
- Important resolver truth fix on 2026-05-06:
- live `codex_enabled=false` in MAGATAMA settings was causing Codex to show as a broken resolver.
- dashboard logic was updated so disabled Codex/Copilot now show truthfully as `idle` with `In MAGATAMA settings disabled`, instead of pretending there is a runtime outage.
- the local codex bridge on Erik is reachable but currently reports `auth_required`; do not treat that as a production outage while Codex is intentionally disabled in settings.
- Remaining real operational gap after findings hit zero:
- MAGATAMA still knows about more assets than it actively collects telemetry from.
- last public protection proof showed:
- `knownAssets: 79`
- `hostsWithTelemetry: 27`
- `assetsWithoutTelemetry: 52`
- these are currently inventory/discovery-only assets, not open findings, but they remain the next real coverage expansion area.
- MAGATAMA cross-repo state from the same chat is now synced into this handoff:
- Compliance framework cards in MAGATAMA are clickable and open per-framework requirement details.
- MAGATAMA training status was corrected so `New Since Last Training` no longer falsely shows `0`.
- Live verified/deduped MAGATAMA training state after the fix:
- `collectedExamples: 49`
- `rawExamples: 58`
- `duplicateExamples: 9`
- `effectiveExamples: 49`
- `newSinceLastTraining: 49`
- MAGATAMA now filters training metrics to verified/trainable examples only.
- Failed/escalated MAGATAMA remediation records should go to `errors.jsonl`, not the main `fixes.jsonl`, so the next MagatamaLLM run does not train on junk.
- Gitea-backed training pool remains the default target for training writes.
- MAGATAMA coverage-gap and training-integrity hardening on 2026-05-06:
- the earlier `49` medium `atlas-coverage-gap` findings were traced to Atlas treating inventory-only and discovery-only assets as operational protection failures.
- core logic was tightened so Atlas coverage findings now open only for managed operational assets:
- exposure-backed assets
- explicit non-auto owner
- configured telemetry expectation
- critical/high criticality
- infrastructure metadata or managed infra device types
- loopback and passive reference/inventory assets no longer reopen noisy guard findings.
- local build succeeded, the new core dist was deployed to Erik, and the first post-deploy guard scan resolved stale findings.
- live Postgres state after deploy: `open findings = 0`.
- training integrity bug was fixed in `packages/core/src/learning/fix-tracking.ts`:
- verified fixes now append to `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
- failed/escalated/report-only runs now belong in `errors.jsonl`
- two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
- atlas coverage scope hardening
- training path integrity fix
- corpus cleanup + dedupe was executed afterward:
- pre-dedupe backup kept locally as:
- `magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
- resulting verified corpus:
- `fixes.jsonl = 1,368` unique verified training rows
- resulting failure corpus:
- `errors.jsonl = 4` tracked failed/escalated rows
- integrity report now exists at:
- `magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json`
- latest integrity totals:
- `scanned: 1368`
- `verified: 1368`
- `movedToErrors: 4`
- `parseErrors: 0`
- `invalidVerifiedFlag: 0`
- Complete Codex chat sync was added:
- `sync/history/2026-04-29-codex-complete-chat-sync.md`
- captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
- confirms no secrets were written into sync.
- confirms TIP crawler/robot planning remains TIPLLM-only.
- confirms Erik remains controller/light `erik-safe` only, with heavy crawler work assigned to Proxmox/Pi workers.
- Codex sync-start confirmation was added:
- `sync/history/2026-04-29-codex-sync-start-confirmation.md`
- confirms Codex read this TIP handoff, checked the sibling LLM Gateway handoff, and is treating `sync/` as binding.
- no code changes, crawler jobs, queue waves, PM2 restarts, or Erik load were initiated during this confirmation.
- Codex follow-up on 2026-04-29 clarified the active BlogLLM model:
- TIP shows `fo-blog-v7`, but this is not a normal Ollama GGUF manifest.
- It is a local Adapter Bridge / Mac Studio model backed by the RunPod-trained PEFT adapter:
`/Users/renefichtmueller/Desktop/Claude Code/magatama/training-data/runpod/pod-runs/2026-04-25-fo-tip/final/adapters/fo_blogllm/final-adapter`
- Bridge definition:
`/Users/renefichtmueller/Desktop/Claude Code/magatama/scripts/ollama_adapter_bridge.py`
- TIP API default:
`packages/api/src/llm/client.ts` uses `OLLAMA_LLM_MODEL || "fo-blog-v7"`.
- `fo-blog-v8` remains the next training candidate, not the currently active TIP BlogLLM model.
- Full Codex session handoff was added:
- `sync/history/2026-04-29-codex-full-session-handoff.md`
- covers TIP verification, product image/detail crawling, Blog Engine Hot Topics, TIPLLM robots, training pool, Erik status, and cross-repo sync.
- Added a verification robot controller:
- `packages/scraper/src/robots/verification-robots.ts`
- command: `npm run robots:verification -w packages/scraper -- --status`
- Added TIPLLM robot experience writing:
- `packages/scraper/src/crawler-llm/training-data-writer.ts`
- writes raw robot audit rows and SFT records.
- Added Gitea training pool import to TIP learning-pool build:
- `scripts/tip-learning-pool-build.ts`
- imports `TIP_TRAINING_REPO/qa-pairs/*.jsonl` into the `tip_llm` lane.
- Added docs:
- `docs/TIP_SELFLEARNING_WORKFLOW.md`
- Added package script:
- `packages/scraper/package.json`
- `robots:verification`
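
The Atlas coverage scope rule from the 2026-05-06 hardening above (coverage findings open only for managed operational assets, never for loopback or passive inventory assets) can be sketched as a predicate. This is a minimal sketch, not the code in core: the `Asset` shape, every field name, and the `MANAGED_INFRA_TYPES` values are assumptions for illustration, and treating the five listed criteria as alternatives (any one qualifies) is also an assumption.

```typescript
// Hypothetical asset shape; field names are illustrative, not the real core schema.
interface Asset {
  isLoopback: boolean;
  hasExposure: boolean;      // exposure-backed asset
  owner: string;             // "auto" means no explicit owner was assigned (assumed convention)
  expectsTelemetry: boolean; // configured telemetry expectation
  criticality: "low" | "medium" | "high" | "critical";
  deviceType: string;
  hasInfraMetadata: boolean;
}

// Assumed list of managed infra device types.
const MANAGED_INFRA_TYPES: string[] = ["switch", "router", "firewall", "hypervisor"];

// True when the asset matches at least one "managed operational" criterion.
function isManagedOperationalAsset(a: Asset): boolean {
  if (a.isLoopback) return false; // loopback assets never qualify
  return (
    a.hasExposure ||
    a.owner !== "auto" ||
    a.expectsTelemetry ||
    a.criticality === "critical" ||
    a.criticality === "high" ||
    a.hasInfraMetadata ||
    MANAGED_INFRA_TYPES.indexOf(a.deviceType) !== -1
  );
}

// A coverage finding opens only when a managed asset has no telemetry.
function shouldOpenCoverageFinding(a: Asset, hasTelemetry: boolean): boolean {
  return !hasTelemetry && isManagedOperationalAsset(a);
}
```

Under these assumptions, an inventory-only asset with default owner and low criticality never opens a finding, which matches the intent of keeping discovery-only assets out of the guard findings.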
## Gitea Training Pool
- Existing local clone: `/tmp/tip-training-data`
- Gitea repo: `rene/tip-training-data`
- Latest pushed training commit:
- `f1c83f8 crawl: add robot-status training records [2026-04-29T20:11:24.091Z]`
- First robot experience record was written to:
- `/tmp/tip-training-data/qa-pairs/robot-control-high.jsonl`
- `/tmp/tip-training-data/robot-experiences/2026-04-29.jsonl`
## MAGATAMA Training / Operations State
- Relevant local repo:
- `/Users/renefichtmueller/Desktop/Claude Code/magatama`
- Latest confirmed live MAGATAMA findings state:
- `open findings: 0` on `2026-05-06`
- Latest confirmed live resolver state:
- `Codex` and `Copilot` intentionally `idle/disabled`
- not a runtime outage, but a settings choice until gateway/bridge auth is intentionally re-enabled
- Latest confirmed live MAGATAMA training metric after dashboard fix:
- `newSinceLastTraining: 49`
- Meaning:
- the old `0` was incorrect.
- the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
- Latest corpus integrity state after cleanup:
- operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
- `1368` unique verified rows
- `4` live failure/escalation rows in `errors.jsonl`
- do not confuse raw historical volume with real trainable signal.
- Important training integrity rule:
- report-only or failed/escalated records must not be treated as verified training fixes.
- keep them separated from the main verified training corpus.
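
The integrity rule above can be sketched as a small routing function. This is an illustration only, not the contents of `packages/core/src/learning/fix-tracking.ts`: the `RemediationRecord` shape and the `Outcome` values are assumptions, while the two file names come from this handoff.

```typescript
// Hypothetical outcome labels; the real tracker may use different names.
type Outcome = "verified" | "failed" | "escalated" | "report-only";

interface RemediationRecord {
  id: string;
  outcome: Outcome;
}

// Only verified fixes belong in the main training corpus.
function corpusFileFor(r: RemediationRecord): "fixes.jsonl" | "errors.jsonl" {
  return r.outcome === "verified" ? "fixes.jsonl" : "errors.jsonl";
}

// Split records into JSONL lines per target file without touching the disk.
function partition(records: RemediationRecord[]): { fixes: string[]; errors: string[] } {
  const fixes: string[] = [];
  const errors: string[] = [];
  for (const r of records) {
    const line = JSON.stringify(r);
    if (corpusFileFor(r) === "fixes.jsonl") {
      fixes.push(line);
    } else {
      errors.push(line);
    }
  }
  return { fixes, errors };
}
```

Keeping the decision in one pure function makes it easy to re-run over historic rows when scrubbing pollution out of `fixes.jsonl`.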
## Erik Status
- Synced TIPLLM robot/training code to `/opt/tip`.
- Did not start crawler jobs.
- Did not enqueue robot waves.
- Did not restart PM2 services.
- Remote scraper TypeScript build is passing after removing two stale misplaced remote-only duplicate files:
- `/opt/tip/packages/scraper/src/scrapers/scheduler.ts`
- `/opt/tip/packages/scraper/src/vendor-discovery-crawler.ts`
- `tip-api` and `tip-scraper-daemon` are online.
- Shared Erik note from the same chat:
- MAGATAMA dashboard/core were redeployed during compliance/training fixes.
- TIP crawler policy remains unchanged: Erik is controller/light runner only, not heavy crawl execution host.
## Last Live Verification Snapshot
From 2026-04-29:
- Total transceivers: `13,546`
- Price verified: `7,250`
- Image verified: `7,025`
- Details verified: `6,243`
- Fully verified: `5,812`
- Last price observation: `2026-04-29 19:15:53 UTC`
- Last stock observation: `2026-04-29 19:15:56 UTC`
## Latest MAGATAMA Training / RunPod Truth
Confirmed on `2026-05-06`:
- Lane-specific training pools are now materially separated and no longer all fall back to `magatamallm`.
- Live Erik dashboard API now reports:
- `magatamallm`
- `1367 train`
- `152 eval`
- `1519 total`
- `newSinceLastTraining = 1367`
- `fo_blogllm`
- `17353 train`
- `1929 eval`
- `19282 total`
- `newSinceLastTraining = 17353`
- active local model resolves to `fo-blog-v7`
- `tip_llm`
- `6482 train`
- `721 eval`
- `7203 total`
- `newSinceLastTraining = 6482`
- target active model is `tip-llm-v1`, but this model is not yet present locally in Ollama
- Result:
- the `1097` previously shown for every lane was stale and wrong.
- selected lane now controls its own manifest, model label, and training counts.
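
The per-lane figures above are internally consistent: in every lane, train plus eval equals total. A quick sanity check along these lines can catch a stale or mixed-up manifest. The `LaneCounts` shape and the function names are assumptions for illustration; the numbers are the ones reported above.

```typescript
// Hypothetical shape for one lane's counts; not the real manifest schema.
interface LaneCounts {
  lane: string;
  train: number;
  evalCount: number;
  total: number;
}

const lanes: LaneCounts[] = [
  { lane: "magatamallm", train: 1367, evalCount: 152, total: 1519 },
  { lane: "fo_blogllm", train: 17353, evalCount: 1929, total: 19282 },
  { lane: "tip_llm", train: 6482, evalCount: 721, total: 7203 },
];

// Names of lanes whose train + eval does not add up to total.
function inconsistentLanes(rows: LaneCounts[]): string[] {
  return rows.filter((r) => r.train + r.evalCount !== r.total).map((r) => r.lane);
}

// Fraction of a lane's examples held out for evaluation.
function evalShare(r: LaneCounts): number {
  return r.evalCount / r.total;
}
```

With the reported numbers, `inconsistentLanes(lanes)` is empty and each lane's eval share comes out at roughly 10%.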
### Gitea-backed Pool Materialization
- `magatamallm` Gitea pool remains canonical and populated.
- `fo_blogllm` and `tip_llm` Gitea-backed pool folders were previously almost empty; they are now materialized from the local RunPod lane exports.
- Lane manifests and JSONL exports now exist under:
- `training-data/gitea-learning-pool/fo_blogllm/`
- `training-data/gitea-learning-pool/tip_llm/`
### RunPod Completion Hardening
- MAGATAMA dashboard code now treats RunPod `COMPLETED` as success only after:
1. target model artifact is referenced
2. local Mac training API adopts/imports the artifact
3. lane-specific smoke tests pass
4. active Ollama alias is updated
- New local adoption endpoint is:
- `POST /adopt-runpod-model`
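
The four-step completion rule above can be sketched as an ordered gate check. This is a hedged illustration, not the dashboard code: the `CompletionEvidence` shape and the gate names are assumptions, and the sketch simply reports the first gate that has not been satisfied.

```typescript
// Hypothetical evidence record; field names are illustrative.
interface CompletionEvidence {
  runpodStatus: string;        // status string reported by RunPod
  artifactReferenced: boolean; // 1. target model artifact is referenced
  artifactAdopted: boolean;    // 2. local Mac training API adopted/imported it
  smokeTestsPassed: boolean;   // 3. lane-specific smoke tests passed
  aliasUpdated: boolean;       // 4. active Ollama alias was updated
}

type Gate =
  | "runpod_completed"
  | "artifact_referenced"
  | "artifact_adopted"
  | "smoke_tests_passed"
  | "alias_updated";

// Returns the first gate that is not satisfied, or null when all gates pass.
function firstFailedGate(e: CompletionEvidence): Gate | null {
  const gates: Array<[Gate, boolean]> = [
    ["runpod_completed", e.runpodStatus === "COMPLETED"],
    ["artifact_referenced", e.artifactReferenced],
    ["artifact_adopted", e.artifactAdopted],
    ["smoke_tests_passed", e.smokeTestsPassed],
    ["alias_updated", e.aliasUpdated],
  ];
  for (const [name, ok] of gates) {
    if (!ok) return name;
  }
  return null;
}

// A run counts as a success only when every gate passes.
function isSuccessfulRun(e: CompletionEvidence): boolean {
  return firstFailedGate(e) === null;
}
```

Reporting the first failed gate, rather than a bare boolean, keeps the dashboard able to say where in the adoption pipeline a run stopped.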
### Mac Training API State
- The old LaunchAgent on Mac Studio was still serving the legacy training API from:
- `~/magatama-llm/service/training_api.py`
- It has now been upgraded in place so Erik sees the new adoption-capable API.
- Verified from Erik:
- `http://192.168.178.213:3214/health` returns the new service
- it now exposes `register_script` pointing into the MAGATAMA repo
- `POST /adopt-runpod-model` exists and rejects unauthenticated requests with `401`, proving the route is live
### Still Outstanding
- A fully successful end-to-end RunPod fine-tune has not yet been re-verified after the new adoption pipeline was wired in. Such a run needs:
- real worker success
- real artifact
- successful local Ollama import
- active alias switch
- smoke-test proof
- Latest live proof run on `2026-05-06`:
- job id: `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1`
- materialized correctly
- reached `IN_PROGRESS`
- then `COMPLETED`
- but RunPod `status/{job}` returned no `output` object, no model artifact reference, and no Hugging Face repo result
- current MAGATAMA handling now correctly classifies this as `completed_without_model_artifact`, not as success
- `tip-llm-v1` is still not installed locally in Ollama.
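
The proof run above, together with the earlier rule not to trust a brief `IN_QUEUE`, suggests a classification along the following lines. This is a sketch under stated assumptions: the `StatusResponse` shape and the `modelArtifact` field are hypothetical, and only `not_found_after_submit` and `completed_without_model_artifact` are labels taken from this handoff; the other labels are made up for illustration.

```typescript
// Hypothetical, simplified view of a job status response.
interface StatusResponse {
  status: string;
  output?: { modelArtifact?: string } | null; // assumed field name
}

// Pass null when the status lookup no longer finds the job.
function classifyRun(resp: StatusResponse | null): string {
  if (resp === null) return "not_found_after_submit";
  switch (resp.status) {
    case "IN_QUEUE":
      return "queued_untrusted"; // accepted, but not yet proof of a usable run
    case "IN_PROGRESS":
      return "running";
    case "COMPLETED":
      return resp.output?.modelArtifact
        ? "completed_with_artifact"
        : "completed_without_model_artifact";
    default:
      return "other_or_unknown";
  }
}

// Only a running job or a completed job with artifact evidence is trusted.
function isTrustworthy(label: string): boolean {
  return label === "running" || label === "completed_with_artifact";
}
```

Under this sketch, job `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1` would land in `completed_without_model_artifact`, because its status response carried no `output` object.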
### Pulso AI Recommendation
- Keep a shared network/transceiver/switch core corpus with TIP.
- Do not collapse `Pulso AI` into the same instruction lane as `TIP_LLM`.
- Recommended split:
- `TIP_LLM`
- research
- crawler / scraper / robot planning
- vendor / firmware / issue extraction
- `Pulso AI`
- product responses
- support
- diagnostics
- operator explanation layer
## Safe Next Steps
1. Clone or pull Gitea `origin` on laptop/Claude Code.
2. Read this folder first.
3. For BlogLLM work, treat `fo-blog-v7` as Adapter Bridge / PEFT adapter, not as a `~/.ollama` GGUF model.
4. Also read `llm-gateway/sync/CURRENT.md` when work touches shared Erik infrastructure, LLM routing, bridges, auth, TIPLLM, or crawler orchestration.
5. For TIP robot/crawler planning, use TIPLLM only. Do not route this lane through external AI providers.
6. When training pools or model stats look suspicious, prefer verified-only counts and check whether failed/escalated rows polluted the corpus.
7. For MAGATAMA-adjacent work, keep writing learnings back into the Gitea-backed pool and avoid training on report-only pseudo-fixes.
8. If testing robots, start with dry runs only:
```bash
npm run robots:verification -w packages/scraper -- --status
npm run robots:verification -w packages/scraper -- --tipllm-plan --limit=3
npm run robots:verification -w packages/scraper -- --enqueue=details-fast-lane --profile=erik-safe --dry-run
```
9. Only dispatch real crawl work after deciding the target host:
- Erik: `erik-safe`, tiny batches only.
- Pi: `pi-fetch`.
- Proxmox: `proxmox-heavy`.
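
The host-to-profile mapping in step 9 can be pinned down in code so a dry run never pairs a host with the wrong profile. This is a minimal sketch: the `Host` type and the helper names are assumptions, and the argument list only recombines the flags already shown in the dry-run commands of step 8.

```typescript
type Host = "erik" | "pi" | "proxmox";

// Profile names as listed in this handoff.
const PROFILE_FOR_HOST: Record<Host, string> = {
  erik: "erik-safe",
  pi: "pi-fetch",
  proxmox: "proxmox-heavy",
};

function profileForHost(host: Host): string {
  return PROFILE_FOR_HOST[host];
}

// Builds npm arguments for a dry-run enqueue only; real dispatch is out of scope here.
function dryRunArgs(host: Host, lane: string): string[] {
  return [
    "run",
    "robots:verification",
    "-w",
    "packages/scraper",
    "--",
    `--enqueue=${lane}`,
    `--profile=${profileForHost(host)}`,
    "--dry-run",
  ];
}
```

For example, `dryRunArgs("erik", "details-fast-lane")` reproduces the third dry-run command from step 8 as an argument list for `npm`.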
## Dirty Worktree Note
There are existing uncommitted changes outside `sync/`. Some are Codex work from this session, some appear pre-existing or from earlier Claude/Codex work. Do not blindly revert them. Review `git status --short` before committing broader changes.
## Latest Sync Commits
- `6c42ca7 docs: add shared agent sync handoff`
- `8e7c5aa docs: link llm-gateway sync handoff`
- Pending after this update:
- watch whether any future guard exposure findings are genuine operational issues or new false positives.
- if failures still appear inside `fixes.jsonl`, scrub historic pollution and backfill `errors.jsonl`.