transceiver-db/sync/CURRENT.md

# Current TIP Sync State

Updated: 2026-05-07 08:05 UTC

## Newest Work

- MAGATAMA RunPod custom worker preparation continued on 2026-05-07:
  - the pending sync handoff was committed and **successfully pushed to Gitea**:
    - commit:
      - `2a35761 sync: record runpod managed endpoint root cause`
  - MAGATAMA repo now includes an explicit helper for building/publishing the custom RunPod worker image:
    - `magatama/scripts/runpod_worker_publish.sh`
    - new package script:
      - `pnpm runpod:worker:publish`
    - helper behavior:
      - expects:
        - `RUNPOD_WORKER_IMAGE`
      - supports:
        - `GHCR_USERNAME`
        - `GHCR_TOKEN`
        - `RUNPOD_WORKER_TAG`
        - `RUNPOD_WORKER_PUSH_MODE=push|load`
      - prints the exact next environment variables required on Erik after image publication:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
  - `magatama/packages/fine-tuner/RUNPOD.md` was extended so the full automation target is now documented end-to-end:
    - lane pool sync
    - RunPod dataset URL bundle
    - custom worker training
    - adapter upload
    - local adoption
    - smoke tests
    - release alias minting
    - active alias switch
  - Erik infrastructure truth was rechecked:
    - `docker` exists:
      - `/usr/bin/docker`
    - `docker buildx` exists:
      - `github.com/docker/buildx v0.33.0`
    - **no docker registry login/config** is currently present on Erik:
      - `~/.docker/config.json` absent
    - interpretation:
      - Erik can build images
      - but cannot yet push a public/private worker image to GHCR/Docker Hub without credentials or a pre-authenticated registry path
  - the missing custom worker files were synced live to Erik:
    - `/opt/magatama/packages/fine-tuner/Dockerfile.runpod`
    - `/opt/magatama/packages/fine-tuner/RUNPOD.md`
  - a real remote worker image build was then attempted on Erik:
    - image tag requested:
      - `magatama-runpod-worker:test`
    - build truth:
      - base `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04` pulled successfully
      - Python dependencies for the worker installed successfully
      - build reached:
        - `COPY train_cuda.py runpod_handler.py ./`
        - `exporting to image`
    - however:
      - final image was **not yet visible** in `docker images`
      - therefore the build still needs one more clean verification pass before being treated as green
  - current operational conclusion:
    - MAGATAMA training pools, lane separation, signed dataset URL path, and local adoption API are ready
    - the final blocking step remains infrastructure:
      - publish the custom worker image to a registry RunPod can consume
      - create/switch the endpoint
      - then set on Erik:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom endpoint id>`
    - once that is done, MAGATAMA's already-prepared code path can finally perform:
      - train
      - verify artifact
      - adopt locally
      - smoke-test
      - bump version
      - switch alias
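
The environment contract of the publish helper described above can be sketched as a small validation step. This is only an illustration in Python, not the contents of `runpod_worker_publish.sh`: the function names, the default push mode, and the default tag are assumptions, since the notes only list which variables are expected or supported.

```python
ALLOWED_PUSH_MODES = {"push", "load"}


def resolve_publish_config(env):
    """Validate the publish environment and return a plain config dict."""
    image = env.get("RUNPOD_WORKER_IMAGE", "").strip()
    if not image:
        raise ValueError("RUNPOD_WORKER_IMAGE is required")

    # Assumption: "push" as the default mode; the real helper's default is not recorded here.
    mode = env.get("RUNPOD_WORKER_PUSH_MODE", "push").strip()
    if mode not in ALLOWED_PUSH_MODES:
        raise ValueError(f"RUNPOD_WORKER_PUSH_MODE must be push or load, got {mode!r}")

    return {
        "image": image,
        # Assumption: "latest" as the fallback tag.
        "tag": env.get("RUNPOD_WORKER_TAG", "latest"),
        "mode": mode,
        # Registry login is only possible when both GHCR values are present.
        "can_login": bool(env.get("GHCR_USERNAME")) and bool(env.get("GHCR_TOKEN")),
    }


def next_environment_hints():
    """Variables to set on Erik after the image has been published."""
    return [
        "RUNPOD_WORKER_KIND=custom-magatama",
        "RUNPOD_ENDPOINT_ID=<custom-endpoint>",
    ]
```

With `RUNPOD_WORKER_PUSH_MODE=load` and no GHCR credentials, this sketch still returns a usable config, which mirrors the current Erik state: builds are possible, registry pushes are not.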

- MAGATAMA RunPod training return-path deep dive on 2026-05-07:
  - Attack Paths `Open Fix Guidance` placebo button was fixed live on Erik:
    - `magatama/packages/dashboard/public/index-v2.html`
    - real behavior now:
      - if graph node maps to a real finding, open the existing ticket/finding drawer
      - if node is only synthetic, show an explicit warning instead of doing nothing
    - deployed to:
      - `/opt/magatama/packages/dashboard/public/index-v2.html`
    - `pm2 restart magatama-dashboard` executed
  - local Mac train API truth rechecked:
    - `GET http://127.0.0.1:3214/health`
    - returns `status = ok`
    - service is idle/reachable, not broken
  - RunPod heartbeat/UI stream issue was fixed live:
    - dashboard server now emits keepalive progress messages during:
      - long `IN_PROGRESS` phases
      - post-`COMPLETED` artifact verification loops
    - deployed live to Erik dashboard
  - direct raw RunPod status canary against the current endpoint (`dheii186pfcuq7`) was executed:
    - tiny 1-step `tip_llm` canary job:
      - `33434e85-3cc1-4dea-9043-83c315aaeb9c-e2`
    - observed raw status sequence:
      - `IN_QUEUE`
      - `IN_PROGRESS`
      - `COMPLETED`
    - **critical truth**:
      - `/status/{job}` returned no `output`
      - `/stream/{job}` returned:
        - `{"status":"COMPLETED","stream":[]}`
    - interpretation:
      - the currently configured endpoint is the managed Axolotl serverless endpoint
      - it does not return a programmatically adoptable artifact reference to MAGATAMA
      - this is why all lanes keep ending in:
        - `completed_without_model_artifact`
  - Erik secrets reality rechecked:
    - `/opt/magatama/secrets/hf-token` exists and is readable by the running process
    - therefore the current failure is **not** caused by a missing HF token on Erik
  - root cause now considered confirmed:
    - the **managed Axolotl serverless endpoint** is acceptable for queueing/running a fine-tune
    - but not sufficient for MAGATAMA's required full automation:
      - train
      - return explicit artifact
      - adopt locally
      - smoke-test
      - create new release alias
      - switch active alias
  - code path for the correct architecture is now prepared:
    - `magatama/packages/fine-tuner/runpod_handler.py`
    - `magatama/packages/fine-tuner/train_cuda.py`
    - `magatama/packages/fine-tuner/requirements-runpod.txt`
    - `magatama/packages/dashboard/src/server.ts`
  - what changed in that path:
    - custom RunPod worker now accepts:
      - `target_model`
      - `credentials.hf_token`
    - training script now:
      - trains lane-specific bundle
      - uploads the resulting adapter folder to Hugging Face
      - returns `adapter_repo_id`
    - dashboard custom-worker submit path now includes:
      - `run_id`
      - `target_model`
      - HF credential pass-through for the worker
    - dashboard error text is now explicit:
      - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the `custom-magatama` worker
  - live deployment status:
    - updated dashboard server was rebuilt and deployed to Erik
    - updated custom worker source files were synced into Erik repo state
    - however:
      - the currently active RunPod endpoint is still the managed Axolotl endpoint
      - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
  - operational conclusion:
    - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
    - the final missing infrastructure step is:
      - build/publish `packages/fine-tuner/Dockerfile.runpod`
      - create/use a custom RunPod serverless endpoint for `runpod_handler.py`
      - set:
        - `RUNPOD_WORKER_KIND=custom-magatama`
        - `RUNPOD_ENDPOINT_ID=<custom-endpoint>`
    - only then can MAGATAMA honestly achieve:
      - automatic training
      - automatic artifact return
      - automatic adoption
      - automatic version bump
      - automatic alias switch after smoke tests
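
The canary finding above (a `COMPLETED` status with no `output`) can be expressed as a small classification step. This is a minimal sketch, not the code in `server.ts`: it assumes the custom worker would place `adapter_repo_id` inside the `output` object of the status payload, and the labels `pending`, `failed`, and `artifact_returned` are hypothetical. Only `completed_without_model_artifact` is a state name taken from these notes.

```python
def classify_runpod_result(status_payload):
    """Map a raw /status/{job} payload to (label, adapter_repo_id_or_None)."""
    status = status_payload.get("status")
    if status in ("IN_QUEUE", "IN_PROGRESS"):
        return ("pending", None)
    if status != "COMPLETED":
        return ("failed", None)

    # Assumption: the custom worker returns {"output": {"adapter_repo_id": "..."}}.
    output = status_payload.get("output")
    repo_id = output.get("adapter_repo_id") if isinstance(output, dict) else None
    if repo_id:
        return ("artifact_returned", repo_id)

    # COMPLETED without an adoptable reference: the managed-endpoint case.
    return ("completed_without_model_artifact", None)
```

A payload such as `{"status": "COMPLETED"}` lands in the last branch, which is the behavior observed on the managed Axolotl endpoint.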

## Active Policy

- Put coordination notes and handoffs in this `sync/` folder and push to Gitea.
- Check sibling project sync folders first when context may span repos.
- Use TIPLLM only for TIP crawler/robot planning and extraction feedback.
- Write robot/crawler experience into the Gitea-backed TIPLLM training pool.
- Keep Erik safe: no heavy crawler waves or uncontrolled Playwright/discovery jobs on Erik.
- Use Proxmox/Pi workers for crawl load.

## Cross-Repo Sync

Claude Code also created a Gitea sync handoff in the LLM Gateway repo:

- Repo: `rene/llm-gateway`
- Path: `sync/`
- Commit shown by Claude: `e272105 sync: add chat handoff + context scaffolding for Codex integration (2026-04-29)`
- Gitea path: `http://192.168.178.196:3000/rene/llm-gateway/src/main/sync/`

When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infrastructure, read both:

- `transceiver-db/sync/CURRENT.md`
- `llm-gateway/sync/CURRENT.md`

## Latest Work

- RunPod/MAGATAMA training live follow-up on 2026-05-07:
  - latest `magatamallm` serverless run verified on Erik:
    - job id:
      - `ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2`
    - registry truth in:
      - `/opt/magatama/training-data/model-registry/training-runs.json`
    - observed states:
      - `submitted`
      - then `completed_without_model_artifact`
    - exact recorded warning:
      - `RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.`
      - English: "RunPod reported COMPLETED, but the expected HuggingFace model repo was not found."
  - interpretation:
    - dataset build and RunPod submit are working
    - the worker still does not return a verifiable adoptable model artifact
    - this is a real training return-path failure, not just a cosmetic UI issue
  - local training API truth rechecked:
    - `GET http://127.0.0.1:3214/health`
    - service responds with:
      - `status = ok`
      - `service = magatama-train-api`
      - `running = false`
      - `pid = null`
    - meaning:
      - API is healthy/reachable
      - currently idle
      - ready for adoption/import calls once a valid RunPod artifact exists
  - one UI bug in the training modal was fixed live:
    - root cause:
      - during long `IN_PROGRESS` and post-`COMPLETED` artifact verification phases, MAGATAMA sent no heartbeat for too long
      - browser/proxy could then terminate the stream and surface only:
        - `network error`
      - even though Erik had already written the more truthful registry state
    - fix:
      - `magatama/packages/dashboard/src/server.ts`
      - added server-sent heartbeat messages while:
        - RunPod status remains unchanged
        - Hugging Face / artifact propagation checks are still running
      - concrete live strings now deployed in Erik dashboard server:
        - `⏳ RunPod arbeitet weiter (...)` (English: "RunPod is still working (...)")
        - `⏳ Prüfe Modellartefakt ...` (English: "Checking model artifact ...")
    - deployment:
      - rebuilt dashboard
      - rsynced `packages/dashboard/dist/server.js` to Erik
      - restarted `pm2 magatama-dashboard`
      - remote `server.js` verified to contain heartbeat strings
  - expected operator effect:
    - future training runs should no longer collapse into a late generic `network error` while RunPod/adoption checks are still active
    - the UI should stay alive long enough to show the real terminal result, one of:
      - `completed_and_adopted`
      - `completed_without_model_artifact`
      - a worker/adoption failure
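
The heartbeat fix described above follows a simple pattern: forward every status change, and emit a keepalive when the status has stayed the same for several polls. The sketch below shows that pattern in Python under stated assumptions. The real implementation lives in `server.ts`; the function name, the event tuples, and the `heartbeat_every` parameter are hypothetical.

```python
def with_heartbeats(observations, heartbeat_every=3):
    """Yield ("status", s) on each change and ("heartbeat", s) while s stays unchanged.

    `observations` is any iterable of polled status strings.
    """
    last = None
    unchanged_polls = 0
    for status in observations:
        if status != last:
            last = status
            unchanged_polls = 0
            yield ("status", status)
        else:
            unchanged_polls += 1
            # Emit a keepalive every N unchanged polls so the stream never goes silent.
            if unchanged_polls % heartbeat_every == 0:
                yield ("heartbeat", status)
```

The same loop shape would apply to the post-`COMPLETED` artifact verification phase, with the polled value being the artifact check result instead of the RunPod status.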

- MAGATAMA live follow-up on 2026-05-07:
  - local Mac training API was rechecked after the lane-specific automation changes.
  - current live truth:
    - LaunchAgent `org.fichtmueller.magatama-train-api` is present and running
    - process listens on `*:3214`
    - localhost health now responds when checked outside sandbox restrictions:
      - `GET http://127.0.0.1:3214/health`
      - response:
        - `status = ok`
        - `service = magatama-train-api`
        - `running = false`
        - `pid = null`
        - `updated_at = 2026-05-07T04:14:23Z`
      - interpretation:
        - the training API itself is healthy and reachable
        - it is currently idle, not broken
        - the actual next proof point must come from a fresh lane run that writes lane-specific `*-last_run.json`
  - live Attack Paths UI bug was fixed and deployed to Erik:
    - root cause:
      - the `Open Fix Guidance` button inside the attack-path side panel only triggered a dummy toast and never opened a real finding/ticket detail
    - fix:
      - `magatama/packages/dashboard/public/index-v2.html`
      - new helper:
        - `openFixGuidanceForNode(nodeId)`
      - behavior:
        - if the clicked graph node maps to a real finding ID, MAGATAMA now opens the existing ticket/finding detail drawer via `openTicket(id)`
        - if the node is only a synthetic path node with no backing finding, MAGATAMA now shows an explicit warning instead of pretending to open guidance
    - live deployment:
      - updated `index-v2.html` was rsynced to:
        - `/opt/magatama/packages/dashboard/public/index-v2.html`
      - `pm2 restart magatama-dashboard` executed on Erik
      - deployed file on Erik verified with:
        - `openFixGuidanceForNode`
        - `Open Fix Guidance`
  - operator consequence:
    - Attack Paths no longer contain a placebo “Open Fix Guidance” action
    - clicking it should now open the actual MAGATAMA finding/ticket guidance path when the graph node represents a real finding

- MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
  - target lanes:
    - `magatamallm`
    - `fo_blogllm`
    - `tip_llm`
  - core root cause confirmed:
    - RunPod dataset refresh / lane export already worked
    - RunPod jobs often reached `COMPLETED`
    - but model adoption/version truth still depended on a single shared:
      - `~/magatama-llm/fine-tuning/last_run.json`
    - this made lane status and successful return/adoption ambiguous across models
    - the training modal could also collapse late stream/adoption failures into a generic `network error`
  - local code fixes now in place:
    - `magatama/packages/fine-tuner/training_api.py`
      - lane-specific last-run files added:
        - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
        - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
        - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
      - legacy `last_run.json` remains only as a backward-compatible mirror for `magatamallm`
      - successful RunPod adoption now creates:
        - a release alias per lane, e.g. `<active-alias>-rN`
      - active alias switching sequence is now:
        - candidate model imported
        - smoke-tested
        - release alias created
        - stable active alias repointed to that release alias
      - adoption report now includes:
        - `version_counter`
        - `release_alias`
    - `magatama/packages/fine-tuner/train.py`
      - local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
    - `magatama/packages/dashboard/src/server.ts`
      - `/api/llm/status` now reads lane-specific last-run metadata first
      - `release_alias` is preferred as visible model version when present
      - RunPod SSE catch now distinguishes:
        - real generic training failure
        - `COMPLETED` but no artifact / failed adoption
      - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
    - `magatama/packages/dashboard/public/index-v2.html`
      - training modal now suppresses misleading late generic `network error` if the server already emitted a terminal training status
      - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
      - if the backend reports one of the following, the modal now shows that exact reason:
        - completed without artifact
        - completed without HF model
        - completed but adoption failed
  - local verification:
    - `python3 -m py_compile` passed for:
      - `training_api.py`
      - `train.py`
    - dashboard build passed:
      - `pnpm -C packages/dashboard build`
  - current operational blocker:
    - live deployment to Erik was **not yet completed in this step**
    - direct SSH checks returned:
      - `Connection refused`
      - then `Operation timed out`
    - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
      - `tip_llm`
      - `fo_blogllm`
  - practical consequence:
    - the code path is now prepared for full automation:
      - pull from lane-specific training pool
      - train on RunPod
      - verify artifact existence
      - adopt locally
      - create new release alias/version
      - repoint stable active alias
      - show truthful status in UI
    - but the current live Erik run still needs redeploy + verification once SSH is reachable again
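
The lane-specific file naming and the `<active-alias>-rN` release alias scheme above can be sketched as follows. This is an illustration only, not the code in `training_api.py`: the function names are hypothetical, and it assumes that `version_counter` increases by one per successful adoption and that a failed smoke test leaves counter and alias untouched.

```python
from pathlib import PurePosixPath

LANES = ("magatamallm", "fo_blogllm", "tip_llm")
BASE_DIR = PurePosixPath("~/magatama-llm/fine-tuning")


def last_run_path(lane):
    """Return the lane-specific last-run file, e.g. tip_llm-last_run.json."""
    if lane not in LANES:
        raise ValueError(f"unknown lane: {lane!r}")
    return BASE_DIR / f"{lane}-last_run.json"


def release_alias(active_alias, version_counter):
    """Build the per-lane release alias in the <active-alias>-rN form."""
    return f"{active_alias}-r{version_counter}"


def adopt(candidate, active_alias, version_counter, smoke_test):
    """Mint a release alias only after the imported candidate passes the smoke test."""
    if not smoke_test(candidate):
        return {"adopted": False, "version_counter": version_counter, "release_alias": None}
    new_counter = version_counter + 1  # assumption: one increment per adoption
    return {
        "adopted": True,
        "version_counter": new_counter,
        "release_alias": release_alias(active_alias, new_counter),
    }
```

The returned dict mirrors the two fields the adoption report now carries (`version_counter`, `release_alias`); repointing the stable active alias would happen after this step.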

- MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
  - result:
    - the lane export / dataset refresh worked
    - a new locally adopted MagatamaLLM model did **not** land
    - active MAGATAMA provider remains the older alias:
      - `ollama:magatama-coder:latest`
  - live/public evidence:
    - `GET https://magatama.fichtmueller.org/api/llm/status`
      - `activeProvider = ollama:magatama-coder:latest`
      - `autoFixProvider = ollama:magatama-coder:latest`
      - `training.lastTrainingAt = 2026-05-06T22:43:20Z`
      - `training.modelVersion = magatama-coder:latest`
      - `training.activeRun = null`
    - this means the UI timestamp currently reflects the latest dataset/training-state update, not proof of a newly adopted local model.
  - local Mac evidence:
    - `ollama list` still shows:
      - `magatama-coder:latest` → modified `3 weeks ago`
      - `magatama-llm-v2-0:latest` → modified `11 days ago`
    - no newer Magatama candidate/import alias appeared locally
  - registry/adoption evidence:
    - Erik lane manifest exists and is fresh:
      - `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
      - `generatedAt = 2026-05-06T22:45:15.944Z`
      - `train = 15679`
      - `eval = 1743`
      - `total = 17422`
    - but Erik had no populated local adoption/registry state files in:
      - `/opt/magatama/training-data/model-registry/models.json`
      - `/opt/magatama/training-data/model-registry/runs.json`
      - `/opt/magatama/training-data/model-registry/active.json`
      - `/opt/magatama/data/llm-status.json`
    - local repo only had historical `training-data/model-registry/training-runs.json`
  - historical run evidence:
    - recent `magatamallm` training-run records still show:
      - `submitted`
      - then `not_found_after_submit`
      - or other non-adopted / worker-failure states
    - there is still no verified `completed_and_adopted` proof for a new MagatamaLLM local model.
  - operational conclusion:
    - current truth:
      - dataset/lane preparation works
      - local model adoption is still the missing step
      - MAGATAMA does **not** currently know of any model newer than the already active `magatama-coder:latest` alias
    - next fix block remains:
      - make RunPod/local completion count only when adoption succeeds
      - persist adoption report + model registry state
      - update active alias and version only after smoke-tested import succeeds

- MAGATAMA Switchblade port intelligence is now truly flowing end-to-end on 2026-05-06:
  - live root cause:
    - Switchblade itself already had the rich SG350 data (`description`, LLDP neighbor, peer port, octets), but MAGATAMA was still showing mostly flat port chips.
    - verified live on Erik:
      - the real Switchblade runtime is the PM2 app `switchblade` under `/opt/switchblade-app`, not the older `/opt/switchblade` tree.
      - `GET http://127.0.0.1:3000/api/discovery/snmp` for `192.168.178.2` already returned rich rows such as:
        - `GigabitEthernet3` → description `Aruba-1830-UNUSED`, neighbor `VN46KYC0G0`, peer port `11`
        - `GigabitEthernet5` → description `Tashi-204`, neighbor `fritz.box`, peer `LAN:1`
        - `GigabitEthernet25` → description `to Cisco Business 220 Series`, neighbor `Switch39688E`, peer `gi9`
    - the remaining loss point was MAGATAMA’s own Switchblade sync/persistence path.
  - MAGATAMA sync hardening:
    - `scripts/switchblade_live_sync.ts`
      - now prefers live SNMP discovery data when it is richer than `/api/devices/<ip>`
      - now maps `description`, `peerDevice`, `peerPort`, `connectedHost`, `inOctets`, `outOctets` into rack device ports
      - added optional debug snapshot dump support via `SWITCHBLADE_DEBUG_SNAPSHOT_FILE`
      - sanitizes unreadable peer-port strings and drops synthetic high-index numeric pseudo-ports
    - verified with a forced live run on Erik:
      - `Top of Rack Switch` now exports `28` real SG350 ports into the rack snapshot instead of the earlier flattened/odd set
      - sample verified payloads before POST:
        - port 3 → `Aruba-1830-UNUSED` / `VN46KYC0G0` / `11`
        - port 5 → `Tashi-204` / `fritz.box` / `LAN:1`
        - port 25 → `to Cisco Business 220 Series` / `Switch39688E` / `gi9`
  - MAGATAMA core hardening:
    - `packages/core/src/routes/health-types.ts`
      - `SwitchbladePortSnapshot` now preserves:
        - `description`
        - `vlan`
        - `macCount`
        - `peerDevice`
        - `peerPort`
        - `connectedHost`
        - `transceiver`
        - `inOctets`
        - `outOctets`
    - `packages/core/src/routes/health-support.ts`
      - `normalizeSwitchbladePort()` now keeps those additional port fields instead of silently truncating them
    - rebuilt locally and re-rsynced the new `packages/core/dist` to Erik
  - dashboard/UI hardening:
    - `packages/dashboard/public/index-v2.html`
      - port chips already had custom tooltip support; now they also carry native `title=` fallback text
      - this reduces the old “question mark / unclear hover” problem in browsers that do not immediately show the custom bubble
  - live public verification after deploy:
    - `GET https://magatama.fichtmueller.org/api/switchblade/snapshot`
      - now contains enriched SG350 rack-port records with:
        - `description`
        - `peerDevice`
        - `peerPort`
        - `connectedHost`
        - `inOctets`
        - `outOctets`
      - public snapshot timestamp verified:
        - `receivedAt = 2026-05-06T22:51:59.247Z`
    - `Top of Rack Switch` in the public snapshot now exposes meaningful peer/use-case data instead of only flat status counters
  - operator impact:
    - MAGATAMA can now answer the actual operational question per port:
      - what is on this port
      - what is it talking to
      - what does the link look like
    - this is now grounded in Switchblade live SNMP/LLDP data, not guesswork.
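
The two ideas behind the sync and core hardening above (prefer the richer source row, and keep the extra port fields instead of truncating them) can be sketched in Python. The real code is TypeScript in `switchblade_live_sync.ts` and `health-support.ts`; the function names, the `name` key, and the "count populated fields" richness heuristic are assumptions for illustration.

```python
PRESERVED_FIELDS = (
    "description", "vlan", "macCount", "peerDevice", "peerPort",
    "connectedHost", "transceiver", "inOctets", "outOctets",
)


def richness(row):
    """Count how many of the preserved fields are actually populated."""
    return sum(1 for field in PRESERVED_FIELDS if row.get(field) not in (None, ""))


def pick_richer(device_row, snmp_row):
    """Prefer the SNMP discovery row only when it carries more populated fields."""
    return snmp_row if richness(snmp_row) > richness(device_row) else device_row


def normalize_port(raw):
    """Keep the port name plus every populated preserved field."""
    port = {"name": raw.get("name")}
    for field in PRESERVED_FIELDS:
        value = raw.get(field)
        if value not in (None, ""):
            port[field] = value
    return port
```

For the `GigabitEthernet3` example above, a flat device row loses against an SNMP row carrying `description`, `peerDevice`, and `peerPort`, and all three values survive normalization.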

- TIP/Blog lane separation was materially corrected on 2026-05-06:
  - root cause:
    - `TIP_LLM` was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
    - local inspection showed the old TIP export had `6250` train rows, of which `6087` still matched blog/writer patterns.
  - dataset builder and Gitea sync were hardened:
    - `scripts/runpod_dataset_builder.ts`
      - added strict `tipDatasetAllowed(...)`
      - `TIP_LLM` now rejects blog-shaped source rows at dataset-build time
      - `TIP_LLM` now rejects blog-like `system`, `user`, and markdown-article `assistant` patterns
      - registry fallback for `TIP_LLM` now only uses lane-compatible datasets
    - `scripts/sync_gitea_training_pool.ts`
      - canonical TIP pool refresh now uses the stricter lane-alignment rules
      - redundant `merged.jsonl` copies for `fo_blogllm` and `tip_llm` are no longer rewritten, to avoid local disk exhaustion from duplicate lane artifacts
  - local disk issue encountered and fixed:
    - full refresh failed with `ENOSPC` while writing `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
    - redundant lane `merged` artifacts for `fo_blogllm` and `tip_llm` were truncated and the sync script was changed to stop recreating them
    - free disk space returned from `377Mi` to `17Gi`
  - locally verified after rebuild:
    - `TIP_LLM` RunPod export:
      - `train = 233`
      - `eval = 26`
      - `total = 259`
      - `blog/writer matches = 0`
    - first TIP rows now use the correct TIP system prompt:
      - `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
  - corrected artifacts and scripts were synced to Erik and `pnpm training:refresh-all` was rerun there.
  - live verified on Erik/public API:
    - `magatamallm`
      - `datasetSource = url`
      - `collectedExamples = 15679`
      - `evalExamples = 1743`
      - `totalExamples = 17422`
      - `newSinceLastTraining = 15679`
    - `fo_blogllm`
      - `datasetSource = url`
      - `collectedExamples = 17322`
      - `evalExamples = 1926`
      - `totalExamples = 19254`
      - `neverTrained = true`
    - `tip_llm`
      - `datasetSource = url`
      - `collectedExamples = 231`
      - `evalExamples = 26`
      - `totalExamples = 257`
      - `neverTrained = true`
  - operational conclusion:
    - lane-specific dataset truth is now real on Erik.
    - `TIP_LLM` is no longer silently borrowing the FO_Blog behavior lane.
    - the next remaining hard problem is now RunPod artifact adoption/validation, not lane contamination.
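
The lane filter described above rejects blog-shaped rows at dataset-build time. The sketch below shows the general shape of such a check in Python. It is not `tipDatasetAllowed(...)` itself: the regular expressions and the `system` / `user` / `assistant` row layout are assumptions, because the actual rules are not recorded in these notes.

```python
import re

# Hypothetical blog/writer markers; the real pattern list is not documented here.
BLOG_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"\bblog\b", r"\bwriter\b", r"\barticle\b")]

# Hypothetical marker for a markdown-article answer: a level 1 or 2 heading line.
MARKDOWN_ARTICLE = re.compile(r"^#{1,2} .+", re.MULTILINE)


def tip_row_allowed(row):
    """Return False for rows that look like blog/writer training examples."""
    system = row.get("system", "")
    user = row.get("user", "")
    assistant = row.get("assistant", "")
    if any(p.search(system) or p.search(user) for p in BLOG_PATTERNS):
        return False
    if MARKDOWN_ARTICLE.search(assistant):
        return False
    return True


def filter_tip_rows(rows):
    """Keep only lane-compatible rows and report how many were rejected."""
    kept = [row for row in rows if tip_row_allowed(row)]
    return kept, len(rows) - len(kept)
```

A filter of this kind is deliberately strict: it would rather drop a borderline row than let writer-shaped behavior leak into the TIP lane.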

- MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
  - dashboard and core were rebuilt locally and redeployed to Erik.
  - live processes restarted successfully:
    - `magatama-dashboard`
    - `magatama`
  - public `api/llm/status` now shows the true lane-export totals for `magatamallm`:
    - `collectedExamples = 15620`
    - `effectiveExamples = 15620`
    - `evalExamples = 1736`
    - `totalExamples = 17356`
    - `newSinceLastTraining = 15620`
  - root cause for the stale `1097` display:
    - the RunPod start SSE path still logged the legacy deduplicated `fixes.jsonl` corpus.
    - this was changed so RunPod launches no longer present the legacy `1097` count as the active training truth.
    - after dataset refresh the UI now emits the lane manifest totals instead.
  - RunPod completion handling was hardened:
    - worker `COMPLETED` is no longer trusted blindly.
    - MAGATAMA now scans RunPod worker logs for real training failures (`Traceback`, `SyntaxError`, non-zero exit, etc.) before treating the run as successful.
    - if the worker logs show a hidden failure, MAGATAMA records this as `completed_with_worker_failure` instead of pretending the run succeeded.
  - public findings state remains currently empty:
    - `GET /api/findings?limit=1` returned `{"findings":[],"total":0}`
    - this is now rendered with an explicit empty-state row instead of a visually blank table.
  - Attack Paths empty-state is now intentionally explicit rather than looking broken.
  - Frontend cache and scope handling were hardened:
    - cache version bumped to `2026-05-06b`
    - stale legacy `magatama_api_cache:*` entries are cleared
    - per-endpoint TTLs added
    - invalid or empty scope selections are normalized instead of silently leaving the UI in misleading empty views
  - Switchblade rack port hover was materially improved:
    - port chips now carry `data-tooltip`
    - custom tooltip CSS is live on Erik
    - the old browser-native “question mark only” behavior should be replaced by a readable hover bubble
  - Changelog self-healing was added in core:
    - stale cached changelog data older than 6h now forces a rebuild from git history
    - verified live via dashboard proxy on Erik:
      - `generatedAt = 2026-05-06T15:18:42.708Z`
      - latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
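
The completion hardening above (do not trust worker `COMPLETED` blindly) amounts to scanning the worker log for failure markers before accepting the run. The following Python sketch illustrates that check. The marker regexes and the plain `completed` label are assumptions; only `completed_with_worker_failure` and the marker examples (`Traceback`, `SyntaxError`, non-zero exit) come from these notes.

```python
import re

FAILURE_MARKERS = (
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\bSyntaxError\b"),
    # Hypothetical wording for a non-zero exit line; real worker logs may differ.
    re.compile(r"exit(?:ed with)? (?:code|status) (?!0\b)\d+", re.IGNORECASE),
)


def classify_completed_run(worker_log):
    """Downgrade a COMPLETED run when the worker log shows a hidden failure."""
    for marker in FAILURE_MARKERS:
        if marker.search(worker_log):
            return "completed_with_worker_failure"
    return "completed"
```

A log ending in `exit code 0` passes, while any traceback or non-zero exit line flips the result to `completed_with_worker_failure`.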

- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
  - root cause:
    - the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
  - dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
  - the training modal now refreshes per selected lane and rewrites:
    - title
    - runtime label
    - pool path
    - counts
    - dataset source
  - MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
    - `RUNPOD_DATASET_SOURCE=url`
    - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
    - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
    - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
  - live verified on Erik after restart:
    - `fo_blogllm`
      - `datasetSource = url`
      - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
      - `train = 28`
      - `eval = 4`
      - `total = 32`
    - `tip_llm`
      - `datasetSource = url`
      - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
      - `train = 36`
      - `eval = 4`
      - `total = 40`
    - `magatamallm`
      - remains on lane-export counts (`15620 / 1736 / 17356`)
  - operator impact:
    - no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
    - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
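
The per-lane environment variables above suggest a lookup order: lane-specific value first, then the global `RUNPOD_DATASET_SOURCE`. The sketch below shows that resolution in Python. The fallback order is an assumption (the notes only list the variables that were set), and the function names are hypothetical; the manifest path pattern is taken from the verified `collectionsPath` values.

```python
def dataset_source_for_lane(lane, env):
    """Resolve the dataset source for a lane, or None when nothing is configured."""
    lane_key = f"RUNPOD_DATASET_SOURCE_{lane.upper()}"
    # Assumption: a lane-specific value overrides the global one.
    return env.get(lane_key) or env.get("RUNPOD_DATASET_SOURCE") or None


def manifest_path(lane):
    """Lane export manifest location as verified on Erik."""
    return f"/opt/magatama/training-data/runpod/{lane}/manifest.json"
```

For example, with only `RUNPOD_DATASET_SOURCE=url` set, every lane resolves to `url`, which matches the "URL dataset mode for all lanes" configuration.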

- MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
  - the RunPod serverless training start failure was not a RunPod outage.
  - root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
  - Codex synced the full local `magatama/scripts/` tree to Erik, added a safe fallback in `scripts/model_registry_build.ts`, and synced the local `training-data/model-registry/` directory.
  - verified on Erik:
    - `pnpm training:refresh-all` now succeeds.
    - fresh dataset totals after dedupe:
      - `magatamallm`: `92,742` raw → `17,356` effective (`15,620 train / 1,736 eval`)
      - `fo_blogllm`: `32` total (`28 train / 4 eval`)
      - `tip_llm`: `40` total (`36 train / 4 eval`)
  - important nuance:
    - Codex did **not** execute the final Hugging Face publish step from Erik in this chat.
    - local/script/build failures are fixed; external dataset publish still depends on the selected dataset source and explicit publish intent.
- MAGATAMA Attack Paths UX is no longer a misleading blank panel:
  - the page now distinguishes between:
    - no live attack paths
    - historical fallback paths
    - empty selected scope (`0 assets in scope`)
  - when a user narrows the scope to a rack/location with zero scoped assets, the graph explicitly says so instead of looking broken.
  - live dashboard HTML on Erik now contains:
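    - `Im aktuellen Scope liegen 0 Assets.` (English: "There are 0 assets in the current scope.")
    - `Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann.` (English: "Widen the location or datacenter / rack so that MAGATAMA can display correlatable assets and paths.")
    - `Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer.` (English: "Without open multi-stage correlations, the graph view intentionally stays empty.")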
- MAGATAMA code/training hardening was extended:
  - `scripts/test_runpod_adapter.py` no longer loads tokenizer/model with `trust_remote_code=True`.
  - `scripts/ollama_adapter_bridge.py` no longer loads tokenizer/model with `trust_remote_code=True`.
  - this removed the live CODE finding around `HuggingFace trust_remote_code` on Erik.
- Atlas exposure logic was tightened to stop reopening noisy LAN management findings:
  - generic `atlas-exposure` findings now only stay operationally open for exposure that is meaningful enough to track as a finding.
  - internal RFC1918 management/service ports discovered by the broad atlas scan are no longer promoted into open Guard findings just because they exist on the LAN.
  - host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
  - after rebuild + deploy + health sync:
    - live Postgres open findings returned to `0`.
- Follow-up hardening on the same block:
  - the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
  - dataset preparation now distinguishes:
    - local `training:refresh-all` failure
    - optional Hugging Face publish failure
    - URL-based dataset mode with no external publish required
  - the training SSE flow now explicitly tells the operator whether RunPod is using:
    - Hugging Face dataset source
    - or MAGATAMA URL-bundle dataset source
  - this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
  - follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
    - MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
    - payloads were aligned more closely with the official Axolotl serverless schema:
      - `model_type=AutoModelForCausalLM`
      - `tokenizer_type=AutoTokenizer`
      - dataset `split: train`
      - optimizer `adamw_torch_fused`
    - verified full run attempt:
      - job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
      - disappeared as `not_found_after_submit` (`404 job not found`)
    - verified canary after payload fix:
      - job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
      - immediately materialized as `IN_QUEUE`
      - then still disappeared on later reconcile as `not_found_after_submit`
    - current conclusion:
      - the old MAGATAMA bug is fixed.
      - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
    - operational rule:
      - do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
      - only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.
  - follow-up training count fix on 2026-05-06 corrected the Training UI source-of-truth:
    - MAGATAMA had still shown `1097` because the dashboard was counting the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
    - dashboard now prefers `training-data/runpod/magatamallm/manifest.json` for the visible MagatamaLLM training count.
    - synced current lane export to Erik and restarted `magatama-dashboard`.
    - verified public API now returns:
      - `collectedExamples = 1367`
      - `effectiveExamples = 1367`
      - `evalExamples = 152`
      - `totalExamples = 1519`
      - `newSinceLastTraining = 1367`
    - if the browser still shows `1097`, treat it as stale cached UI and hard reload.
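
The operational rule for serverless jobs stated above (do not trust `submitted` or a brief `IN_QUEUE`) can be sketched as a small classifier over one `/status/{jobId}` poll result. This is a minimal illustration, not the MAGATAMA implementation: the function name, the `has_artifact` flag, and the labels `trusted_running`, `trusted_completed`, and `unproven` are hypothetical. Only `IN_QUEUE`, `IN_PROGRESS`, `COMPLETED`, `not_found_after_submit`, and `completed_without_model_artifact` come from this handoff.

```python
from typing import Optional


def classify_submitted_job(
    http_status: int,
    job_status: Optional[str],
    has_artifact: bool = False,
) -> str:
    """Classify one poll of `/status/{jobId}` for a previously submitted job.

    `http_status` is the HTTP status code of the poll, `job_status` is the
    status string from the response body (None when there is no body), and
    `has_artifact` says whether artifact evidence was found (hypothetical flag).
    """
    if http_status == 404:
        # The job was accepted by `/run` but no longer exists.
        return "not_found_after_submit"
    if job_status == "IN_PROGRESS":
        return "trusted_running"
    if job_status == "COMPLETED":
        # A terminal state only counts when artifact evidence exists.
        if has_artifact:
            return "trusted_completed"
        return "completed_without_model_artifact"
    # `submitted`, `IN_QUEUE`, or anything else is not yet proof of a usable run.
    return "unproven"
```

Under this sketch, the canary described above would read as `unproven` while it sat in `IN_QUEUE` and as `not_found_after_submit` on the later reconcile.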

- MAGATAMA was repaired end-to-end to a clean operational baseline:
  - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
  - open findings were reduced all the way to `0` in Postgres.
  - false-positive Proxmox baseline findings were removed by teaching the audit to treat internal-only management ports and default-only rpcbind exposure as acceptable for this host.
  - code scanner false positives from generated/report artifacts remain excluded.
- Live MAGATAMA protection/runtime state after the 2026-05-06 remediation:
  - `open findings: 0`
  - `queueExecuting: 0`
  - `queueBlocked: 0`
  - `queueFailed: 0`
  - public `/api/health` returns `status: ok`
  - public `/api/active-resolvers` returns:
    - `MAGATAMA Core: working`
    - `MagatamaLLM: working`
    - `Claude (secondary): working`
    - `Codex (secondary/manual): idle`
    - `Copilot (secondary/manual): idle`
- Important resolver truth fix on 2026-05-06:
  - live `codex_enabled=false` in MAGATAMA settings was causing Codex to show as a broken resolver.
  - dashboard logic was updated so disabled Codex/Copilot now show truthfully as `idle` with `In MAGATAMA settings disabled`, instead of pretending there is a runtime outage.
  - the local codex bridge on Erik is reachable but currently reports `auth_required`; do not treat that as a production outage while Codex is intentionally disabled in settings.
- Remaining real operational gap after findings hit zero:
  - MAGATAMA still knows about more assets than it actively collects telemetry from.
  - last public protection proof showed:
    - `knownAssets: 79`
    - `hostsWithTelemetry: 27`
    - `assetsWithoutTelemetry: 52`
  - these are currently inventory/discovery-only assets, not open findings, but they remain the next real coverage expansion area.
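
The resolver truth fix described above (a disabled resolver shows as `idle`, not as an outage) can be sketched as follows. This is an assumption-laden illustration rather than the dashboard code: the function name, the `ResolverView` type, and the bridge status value `ok` are hypothetical. The `idle`, `working`, and `auth_required` values and the `In MAGATAMA settings disabled` label are taken from this handoff.

```python
from typing import NamedTuple


class ResolverView(NamedTuple):
    state: str   # what the dashboard shows
    reason: str  # short operator-facing explanation


def resolver_view(enabled_in_settings: bool, bridge_status: str) -> ResolverView:
    """Derive the displayed resolver state from settings and bridge status."""
    if not enabled_in_settings:
        # Disabled on purpose: the bridge status is irrelevant and must not
        # be reported as a runtime outage.
        return ResolverView("idle", "In MAGATAMA settings disabled")
    if bridge_status == "ok":  # hypothetical healthy value
        return ResolverView("working", "")
    # Enabled but unhealthy: surface the bridge status as the reason.
    return ResolverView("broken", bridge_status)
```

With `codex_enabled=false`, a bridge that reports `auth_required` still yields `idle` under this sketch, which matches the rule that this is not a production outage.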

- MAGATAMA cross-repo state from the same chat is now synced into this handoff:
  - Compliance framework cards in MAGATAMA are clickable and open per-framework requirement details.
  - MAGATAMA training status was corrected so `New Since Last Training` no longer falsely shows `0`.
  - Live verified/deduped MAGATAMA training state after the fix:
    - `collectedExamples: 49`
    - `rawExamples: 58`
    - `duplicateExamples: 9`
    - `effectiveExamples: 49`
    - `newSinceLastTraining: 49`
  - MAGATAMA now filters training metrics to verified/trainable examples only.
  - Failed/escalated MAGATAMA remediation records should go to `errors.jsonl`, not the main `fixes.jsonl`, so the next MagatamaLLM run does not train on junk.
  - Gitea-backed training pool remains the default target for training writes.
- MAGATAMA coverage-gap and training-integrity hardening on 2026-05-06:
  - the earlier `49` medium `atlas-coverage-gap` findings were traced to Atlas treating inventory-only and discovery-only assets as operational protection failures.
  - core logic was tightened so Atlas coverage findings now open only for managed operational assets:
    - exposure-backed assets
    - explicit non-auto owner
    - configured telemetry expectation
    - critical/high criticality
    - infrastructure metadata or managed infra device types
  - loopback and passive reference/inventory assets no longer reopen noisy guard findings.
  - local build succeeded, the new core dist was deployed to Erik, and the first post-deploy guard scan resolved stale findings.
  - live Postgres state after deploy: `open findings = 0`.
  - training integrity bug was fixed in `packages/core/src/learning/fix-tracking.ts`:
    - verified fixes now append to `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
    - failed/escalated/report-only runs now belong in `errors.jsonl`
  - two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
    - atlas coverage scope hardening
    - training path integrity fix
  - corpus cleanup + dedupe was executed afterward:
    - pre-dedupe backup kept locally as:
      - `magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
    - resulting verified corpus:
      - `fixes.jsonl = 1,368` unique verified training rows
    - resulting failure corpus:
      - `errors.jsonl = 4` tracked failed/escalated rows
    - integrity report now exists at:
      - `magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json`
    - latest integrity totals:
      - `scanned: 1368`
      - `verified: 1368`
      - `movedToErrors: 4`
      - `parseErrors: 0`
      - `invalidVerifiedFlag: 0`
- Complete Codex chat sync was added:
  - `sync/history/2026-04-29-codex-complete-chat-sync.md`
  - captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
  - confirms no secrets were written into sync.
  - confirms TIP crawler/robot planning remains TIPLLM-only.
  - confirms Erik remains controller/light `erik-safe` only, with heavy crawler work assigned to Proxmox/Pi workers.
- Codex sync-start confirmation was added:
  - `sync/history/2026-04-29-codex-sync-start-confirmation.md`
  - confirms Codex read this TIP handoff, checked the sibling LLM Gateway handoff, and is treating `sync/` as binding.
  - no code changes, crawler jobs, queue waves, PM2 restarts, or Erik load were initiated during this confirmation.
- Codex follow-up on 2026-04-29 clarified the active BlogLLM model:
  - TIP shows `fo-blog-v7`, but this is not a normal Ollama GGUF manifest.
  - It is a local Adapter Bridge / Mac Studio model backed by the RunPod-trained PEFT adapter:
    `/Users/renefichtmueller/Desktop/Claude Code/magatama/training-data/runpod/pod-runs/2026-04-25-fo-tip/final/adapters/fo_blogllm/final-adapter`
  - Bridge definition:
    `/Users/renefichtmueller/Desktop/Claude Code/magatama/scripts/ollama_adapter_bridge.py`
  - TIP API default:
    `packages/api/src/llm/client.ts` uses `OLLAMA_LLM_MODEL || "fo-blog-v7"`.
  - `fo-blog-v8` remains the next training candidate, not the currently active TIP BlogLLM model.
- Full Codex session handoff was added:
  - `sync/history/2026-04-29-codex-full-session-handoff.md`
  - covers TIP verification, product image/detail crawling, Blog Engine Hot Topics, TIPLLM robots, training pool, Erik status, and cross-repo sync.
- Added a verification robot controller:
  - `packages/scraper/src/robots/verification-robots.ts`
  - command: `npm run robots:verification -w packages/scraper -- --status`
- Added TIPLLM robot experience writing:
  - `packages/scraper/src/crawler-llm/training-data-writer.ts`
  - writes raw robot audit rows and SFT records.
- Added Gitea training pool import to TIP learning-pool build:
  - `scripts/tip-learning-pool-build.ts`
  - imports `TIP_TRAINING_REPO/qa-pairs/*.jsonl` into the `tip_llm` lane.
- Added docs:
  - `docs/TIP_SELFLEARNING_WORKFLOW.md`
- Added package script:
  - `packages/scraper/package.json`
  - `robots:verification`
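
The Gitea training pool import mentioned above reads `qa-pairs/*.jsonl` into the `tip_llm` lane. The real logic lives in the TypeScript script `scripts/tip-learning-pool-build.ts`; the Python below is only a sketch of the idea. The function name, the returned dictionary shape, and the choice to skip blank lines are assumptions, and no record schema is implied.

```python
import json
from pathlib import Path
from typing import Dict, List


def import_qa_pairs(repo_root: Path, lane: str = "tip_llm") -> Dict[str, List[dict]]:
    """Collect every record from `<repo_root>/qa-pairs/*.jsonl` under one lane."""
    records: List[dict] = []
    # Sort for a deterministic import order across runs.
    for path in sorted((repo_root / "qa-pairs").glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    records.append(json.loads(line))
    return {lane: records}
```

A call such as `import_qa_pairs(Path("/tmp/tip-training-data"))` would target the local clone listed under "Gitea Training Pool".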

## Gitea Training Pool

- Existing local clone: `/tmp/tip-training-data`
- Gitea repo: `rene/tip-training-data`
- Latest pushed training commit:
  - `f1c83f8 crawl: add robot-status training records [2026-04-29T20:11:24.091Z]`
- First robot experience record was written to:
  - `/tmp/tip-training-data/qa-pairs/robot-control-high.jsonl`
  - `/tmp/tip-training-data/robot-experiences/2026-04-29.jsonl`

## MAGATAMA Training / Operations State

- Relevant local repo:
  - `/Users/renefichtmueller/Desktop/Claude Code/magatama`
- Latest confirmed live MAGATAMA findings state:
  - `open findings: 0` on `2026-05-06`
- Latest confirmed live resolver state:
  - `Codex` and `Copilot` intentionally `idle/disabled`
  - not a runtime outage, but a settings choice until gateway/bridge auth is intentionally re-enabled
- Earlier confirmed live MAGATAMA training metric after the dashboard fix (superseded by the lane-specific counts under "Latest MAGATAMA Training / RunPod Truth"):
  - `newSinceLastTraining: 49`
- Meaning:
  - the old `0` was incorrect.
  - the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
- Latest corpus integrity state after cleanup:
  - operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
    - `1368` unique verified rows
    - `4` live failure/escalation rows in `errors.jsonl`
  - do not confuse raw historical volume with real trainable signal.
- Important training integrity rule:
  - report-only or failed/escalated records must not be treated as verified training fixes.
  - keep them separated from the main verified training corpus.
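
The integrity rule above can be sketched as a routing step that keeps verified fixes apart from failed, escalated, or report-only rows and drops exact duplicates. This is a minimal sketch, not the logic in `packages/core/src/learning/fix-tracking.ts`: the record fields `verified` and `status`, the status strings, and the duplicate key (the full serialized record) are all assumptions made for illustration.

```python
import json
from typing import Iterable, List, Tuple

# Hypothetical status values for rows that must never train the model.
NON_TRAINABLE = {"failed", "escalated", "report-only"}


def split_corpus(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (fixes, errors) for writing to fixes.jsonl and errors.jsonl.

    Rows that are neither verified nor non-trainable are left out of both
    lists in this sketch.
    """
    fixes: List[dict] = []
    errors: List[dict] = []
    seen = set()
    for record in records:
        if record.get("status") in NON_TRAINABLE:
            errors.append(record)
        elif record.get("verified") is True:
            # Deduplicate verified rows on their canonical JSON form.
            key = json.dumps(record, sort_keys=True)
            if key not in seen:
                seen.add(key)
                fixes.append(record)
    return fixes, errors
```

Checking the non-trainable status before the verified flag means a row marked both `verified` and `escalated` still lands in the error corpus, which is the conservative reading of the rule.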

## Erik Status

- Synced TIPLLM robot/training code to `/opt/tip`.
- Did not start crawler jobs.
- Did not enqueue robot waves.
- Did not restart PM2 services.
- Remote scraper TypeScript build is passing after removing two stale misplaced remote-only duplicate files:
  - `/opt/tip/packages/scraper/src/scrapers/scheduler.ts`
  - `/opt/tip/packages/scraper/src/vendor-discovery-crawler.ts`
- `tip-api` and `tip-scraper-daemon` are online.
- Shared Erik note from the same chat:
  - MAGATAMA dashboard/core were redeployed during compliance/training fixes.
  - TIP crawler policy remains unchanged: Erik is controller/light runner only, not heavy crawl execution host.

## Last Live Verification Snapshot

From 2026-04-29:

- Total transceivers: `13,546`
- Price verified: `7,250`
- Image verified: `7,025`
- Details verified: `6,243`
- Fully verified: `5,812`
- Last price observation: `2026-04-29 19:15:53 UTC`
- Last stock observation: `2026-04-29 19:15:56 UTC`

## Latest MAGATAMA Training / RunPod Truth

Confirmed on `2026-05-06`:

- Lane-specific training pools are now materially separated and no longer all fall back to `magatamallm`.
- Live Erik dashboard API now reports:
  - `magatamallm`
    - `1367 train`
    - `152 eval`
    - `1519 total`
    - `newSinceLastTraining = 1367`
  - `fo_blogllm`
    - `17353 train`
    - `1929 eval`
    - `19282 total`
    - `newSinceLastTraining = 17353`
    - active local model resolves to `fo-blog-v7`
  - `tip_llm`
    - `6482 train`
    - `721 eval`
    - `7203 total`
    - `newSinceLastTraining = 6482`
    - target active model is `tip-llm-v1`, but this model is not yet present locally in Ollama
- Result:
  - previous `1097` everywhere was stale / wrong.
  - selected lane now controls its own manifest, model label, and training counts.
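
As a quick consistency check on the lane counts listed above, the sketch below verifies that `train + eval` equals `total` for each lane and computes the eval share. The numbers are copied from this section; the dictionary layout and function names are hypothetical and do not describe the real manifest schema.

```python
from typing import Dict, List

# Counts as reported by the live Erik dashboard API on 2026-05-06.
LANE_COUNTS: Dict[str, Dict[str, int]] = {
    "magatamallm": {"train": 1367, "eval": 152, "total": 1519},
    "fo_blogllm": {"train": 17353, "eval": 1929, "total": 19282},
    "tip_llm": {"train": 6482, "eval": 721, "total": 7203},
}


def inconsistent_lanes(counts: Dict[str, Dict[str, int]]) -> List[str]:
    """Return the lanes whose train + eval does not add up to total."""
    return [
        lane
        for lane, c in counts.items()
        if c["train"] + c["eval"] != c["total"]
    ]


def eval_share(lane_counts: Dict[str, int]) -> float:
    """Fraction of a lane's total examples held out for evaluation."""
    return lane_counts["eval"] / lane_counts["total"]
```

All three lanes add up, and the eval share works out to about 10% in each lane.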

### Gitea-backed Pool Materialization

- `magatamallm` Gitea pool remains canonical and populated.
- `fo_blogllm` and `tip_llm` Gitea-backed pool folders were previously almost empty; they are now materialized from the local RunPod lane exports.
- Lane manifests and JSONL exports now exist under:
  - `training-data/gitea-learning-pool/fo_blogllm/`
  - `training-data/gitea-learning-pool/tip_llm/`

### RunPod Completion Hardening

- MAGATAMA dashboard code now treats RunPod `COMPLETED` as success only after:
  1. target model artifact is referenced
  2. local Mac training API adopts/imports the artifact
  3. lane-specific smoke tests pass
  4. active Ollama alias is updated
- New local adoption endpoint is:
  - `POST /adopt-runpod-model`
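
The four-step completion rule above can be sketched as an ordered gate check that reports the first missing piece of evidence. This is an illustrative sketch, not the dashboard code: the evidence keys and every outcome label except `completed_without_model_artifact` (which this handoff uses elsewhere) are hypothetical.

```python
from typing import Mapping

# (evidence key, outcome when that evidence is missing), in required order.
GATES = (
    ("artifact_referenced", "completed_without_model_artifact"),
    ("artifact_adopted", "adoption_failed"),
    ("smoke_tests_passed", "smoke_tests_failed"),
    ("alias_updated", "alias_not_updated"),
)


def completion_outcome(runpod_status: str, evidence: Mapping[str, bool]) -> str:
    """Decide whether a RunPod job counts as a successful training run."""
    if runpod_status != "COMPLETED":
        return "not_completed"
    for key, failure in GATES:
        if not evidence.get(key, False):
            # Stop at the first gate that lacks proof.
            return failure
    return "success"
```

A job that reports `COMPLETED` with no output object fails the first gate, so `completion_outcome("COMPLETED", {})` yields `completed_without_model_artifact` rather than success.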

### Mac Training API State

- The old LaunchAgent on Mac Studio was still serving the legacy training API from:
  - `~/magatama-llm/service/training_api.py`
- It has now been upgraded in place so Erik sees the new adoption-capable API.
- Verified from Erik:
  - `http://192.168.178.213:3214/health` returns the new service
  - it now exposes `register_script` pointing into the MAGATAMA repo
  - `POST /adopt-runpod-model` exists and rejects unauthenticated requests with `401`, proving the route is live
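
The route check above (an unauthenticated `POST /adopt-runpod-model` returning `401`) could be reproduced with a small probe like the one below. This is a sketch under assumptions: the empty JSON body, the function names, and the interpretation labels are hypothetical, and the real service's request format and auth scheme are not described in this handoff.

```python
import json
import urllib.error
import urllib.request


def probe_adopt_route(base_url: str, timeout: float = 5.0) -> int:
    """POST an empty JSON body without credentials; return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/adopt-runpod-model",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions; the code is what we want.
        return error.code


def interpret_probe(status_code: int) -> str:
    """Map the probe's status code to a coarse verdict."""
    if status_code == 401:
        return "route_live_auth_required"
    if status_code == 404:
        return "route_missing"
    # Anything else, including 2xx without credentials, needs a closer look.
    return "unexpected"
```

From Erik, `interpret_probe(probe_adopt_route("http://192.168.178.213:3214"))` would be expected to report `route_live_auth_required` if the state above still holds.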

### Still Outstanding

- A fully successful end-to-end RunPod fine-tune has not yet been re-verified after the new adoption pipeline was wired in. That proof still needs:
  - real worker success
  - real artifact
  - successful local Ollama import
  - active alias switch
  - smoke-test proof
- Latest live proof run on `2026-05-06`:
  - job id: `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1`
  - materialized correctly
  - reached `IN_PROGRESS`
  - then `COMPLETED`
  - but RunPod `/status/{jobId}` returned no `output` object, no model artifact reference, and no Hugging Face repo result
  - current MAGATAMA handling now correctly classifies this as `completed_without_model_artifact`, not as success
- `tip-llm-v1` (the target model for the `tip_llm` lane) is still not installed locally in Ollama.

### Pulso AI Recommendation

- Keep a shared network/transceiver/switch core corpus with TIP.
- Do not collapse `Pulso AI` into the same instruction lane as `TIP_LLM`.
- Recommended split:
  - `TIP_LLM`
    - research
    - crawler / scraper / robot planning
    - vendor / firmware / issue extraction
  - `Pulso AI`
    - product responses
    - support
    - diagnostics
    - operator explanation layer

## Safe Next Steps

1. Clone or pull Gitea `origin` on laptop/Claude Code.
2. Read this folder first.
3. For BlogLLM work, treat `fo-blog-v7` as Adapter Bridge / PEFT adapter, not as a `~/.ollama` GGUF model.
4. Also read `llm-gateway/sync/CURRENT.md` when work touches shared Erik infrastructure, LLM routing, bridges, auth, TIPLLM, or crawler orchestration.
5. For TIP robot/crawler planning, use TIPLLM only. Do not route this lane through external AI providers.
6. When training pools or model stats look suspicious, prefer verified-only counts and check whether failed/escalated rows polluted the corpus.
7. For MAGATAMA-adjacent work, keep writing learnings back into the Gitea-backed pool and avoid training on report-only pseudo-fixes.
8. If testing robots, start with dry runs only:

```bash
npm run robots:verification -w packages/scraper -- --status
npm run robots:verification -w packages/scraper -- --tipllm-plan --limit=3
npm run robots:verification -w packages/scraper -- --enqueue=details-fast-lane --profile=erik-safe --dry-run
```

9. Only dispatch real crawl work after deciding the target host:
   - Erik: `erik-safe`, tiny batches only.
   - Pi: `pi-fetch`.
   - Proxmox: `proxmox-heavy`.

## Dirty Worktree Note

There are existing uncommitted changes outside `sync/`. Some are Codex work from this session, some appear pre-existing or from earlier Claude/Codex work. Do not blindly revert them. Review `git status --short` before committing broader changes.

## Latest Sync Commits

- `6c42ca7 docs: add shared agent sync handoff`
- `8e7c5aa docs: link llm-gateway sync handoff`
- Pending after this update:
  - watch whether any future guard exposure findings are genuine operational issues or new false positives.
  - if failures still appear inside `fixes.jsonl`, scrub historic pollution and backfill `errors.jsonl`.