# Current TIP Sync State
Updated: 2026-05-07 01:16 UTC
## Active Policy
- Put coordination notes and handoffs in this `sync/` folder and push to Gitea.
- Check sibling project sync folders first when context may span repos.
- Use TIPLLM only for TIP crawler/robot planning and extraction feedback.
- Write robot/crawler experience into the Gitea-backed TIPLLM training pool.
- Keep Erik safe: no heavy crawler waves or uncontrolled Playwright/discovery jobs on Erik.
- Use Proxmox/Pi workers for crawl load.
## Cross-Repo Sync
Claude Code also created a Gitea sync handoff in the LLM Gateway repo:
- Repo: `rene/llm-gateway`
- Path: `sync/`
- Commit shown by Claude: `e272105 sync: add chat handoff + context scaffolding for Codex integration (2026-04-29)`
- Gitea path: `http://192.168.178.196:3000/rene/llm-gateway/src/main/sync/`
When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infrastructure, read both:
- `transceiver-db/sync/CURRENT.md`
- `llm-gateway/sync/CURRENT.md`
## Latest Work
- MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
  - result:
    - the lane export / dataset refresh worked
    - a new locally adopted MagatamaLLM model did **not** land
    - active MAGATAMA provider remains the older alias:
      - `ollama:magatama-coder:latest`
  - live/public evidence:
    - `GET https://magatama.fichtmueller.org/api/llm/status`
      - `activeProvider = ollama:magatama-coder:latest`
      - `autoFixProvider = ollama:magatama-coder:latest`
      - `training.lastTrainingAt = 2026-05-06T22:43:20Z`
      - `training.modelVersion = magatama-coder:latest`
      - `training.activeRun = null`
    - this means the UI timestamp currently reflects the latest dataset/training-state update, not proof of a newly adopted local model.
  - local Mac evidence:
    - `ollama list` still shows:
      - `magatama-coder:latest` → modified `3 weeks ago`
      - `magatama-llm-v2-0:latest` → modified `11 days ago`
    - no newer Magatama candidate/import alias appeared locally
  - registry/adoption evidence:
    - Erik lane manifest exists and is fresh:
      - `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
      - `generatedAt = 2026-05-06T22:45:15.944Z`
      - `train = 15679`
      - `eval = 1743`
      - `total = 17422`
    - but Erik had no populated local adoption/registry state files in:
      - `/opt/magatama/training-data/model-registry/models.json`
      - `/opt/magatama/training-data/model-registry/runs.json`
      - `/opt/magatama/training-data/model-registry/active.json`
      - `/opt/magatama/data/llm-status.json`
    - local repo only had historical `training-data/model-registry/training-runs.json`
  - historical run evidence:
    - recent `magatamallm` training-run records still show:
      - `submitted`
      - then `not_found_after_submit`
      - or other non-adopted / worker-failure states
    - there is still no verified `completed_and_adopted` proof for a new MagatamaLLM local model.
  - operational conclusion:
    - current truth:
      - dataset/lane preparation works
      - local model adoption is still the missing step
      - MAGATAMA does **not** currently know more than the already active `magatama-coder:latest` alias
    - next fix block remains:
      - make RunPod/local completion count only when adoption succeeds
      - persist adoption report + model registry state
      - update active alias and version only after smoke-tested import succeeds
- MAGATAMA Switchblade port intelligence is now truly flowing end-to-end on 2026-05-06:
  - live root cause:
    - Switchblade itself already had the rich SG350 data (`description`, LLDP neighbor, peer port, octets), but MAGATAMA had still shown mostly flat port chips.
  - verified live on Erik:
    - the real Switchblade runtime is the PM2 app `switchblade` under `/opt/switchblade-app`, not the older `/opt/switchblade` tree.
    - `GET http://127.0.0.1:3000/api/discovery/snmp` for `192.168.178.2` already returned rich rows such as:
      - `GigabitEthernet3` → description `Aruba-1830-UNUSED`, neighbor `VN46KYC0G0`, peer port `11`
      - `GigabitEthernet5` → description `Tashi-204`, neighbor `fritz.box`, peer `LAN:1`
      - `GigabitEthernet25` → description `to Cisco Business 220 Series`, neighbor `Switch39688E`, peer `gi9`
    - the remaining loss point was MAGATAMA's own Switchblade sync/persistence path.
  - MAGATAMA sync hardening:
    - `scripts/switchblade_live_sync.ts`
      - now prefers live SNMP discovery data when it is richer than `/api/devices/<ip>`
      - now maps `description`, `peerDevice`, `peerPort`, `connectedHost`, `inOctets`, `outOctets` into rack device ports
      - added optional debug snapshot dump support via `SWITCHBLADE_DEBUG_SNAPSHOT_FILE`
      - sanitizes unreadable peer-port strings and drops synthetic high-index numeric pseudo-ports
    - verified with a forced live run on Erik:
      - `Top of Rack Switch` now exports `28` real SG350 ports into the rack snapshot instead of the earlier flattened/odd set
      - sample verified payloads before POST:
        - port 3 → `Aruba-1830-UNUSED` / `VN46KYC0G0` / `11`
        - port 5 → `Tashi-204` / `fritz.box` / `LAN:1`
        - port 25 → `to Cisco Business 220 Series` / `Switch39688E` / `gi9`
  - MAGATAMA core hardening:
    - `packages/core/src/routes/health-types.ts`
      - `SwitchbladePortSnapshot` now preserves:
        - `description`
        - `vlan`
        - `macCount`
        - `peerDevice`
        - `peerPort`
        - `connectedHost`
        - `transceiver`
        - `inOctets`
        - `outOctets`
    - `packages/core/src/routes/health-support.ts`
      - `normalizeSwitchbladePort()` now keeps those additional port fields instead of silently truncating them
    - rebuilt locally and re-rsynced the new `packages/core/dist` to Erik
  - dashboard/UI hardening:
    - `packages/dashboard/public/index-v2.html`
      - port chips already had custom tooltip support; now they also carry native `title=` fallback text
      - this reduces the old “question mark / unclear hover” problem in browsers that do not immediately show the custom bubble
  - live public verification after deploy:
    - `GET https://magatama.fichtmueller.org/api/switchblade/snapshot`
      - now contains enriched SG350 rack-port records with:
        - `description`
        - `peerDevice`
        - `peerPort`
        - `connectedHost`
        - `inOctets`
        - `outOctets`
    - public snapshot timestamp verified:
      - `receivedAt = 2026-05-06T22:51:59.247Z`
    - `Top of Rack Switch` in the public snapshot now exposes meaningful peer/use-case data instead of only flat status counters
  - operator impact:
    - MAGATAMA can now answer the actual operational question per port:
      - what is on this port
      - what is it talking to
      - what does the link look like
    - this is now grounded in Switchblade live SNMP/LLDP data, not guesswork.
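
The field-preserving normalization described for `normalizeSwitchbladePort()` can be sketched roughly as follows. This is a minimal illustration, not the actual code in `health-support.ts`: the names `PortSnapshotSketch`, `normalizePortSketch`, `asText`, and `asCount`, the `name`/`status` fields, and all field types are assumptions; only the enriched field names are taken from the notes above.

```typescript
// Hypothetical port shape: enriched field names come from the notes above,
// the types and the `name`/`status` fields are assumptions.
interface PortSnapshotSketch {
  name: string;
  status: string;
  description: string | undefined;
  vlan: number | undefined;
  macCount: number | undefined;
  peerDevice: string | undefined;
  peerPort: string | undefined;
  connectedHost: string | undefined;
  transceiver: string | undefined;
  inOctets: number | undefined;
  outOctets: number | undefined;
}

// Keep a value only if it is a non-empty string.
const asText = (v: unknown): string | undefined =>
  typeof v === "string" && v.trim() !== "" ? v.trim() : undefined;

// Keep a value only if it is a finite number.
const asCount = (v: unknown): number | undefined =>
  typeof v === "number" && Number.isFinite(v) ? v : undefined;

// Copy every enriched field through instead of returning only name/status.
function normalizePortSketch(raw: Record<string, unknown>): PortSnapshotSketch {
  return {
    name: asText(raw["name"]) ?? "unknown",
    status: asText(raw["status"]) ?? "unknown",
    description: asText(raw["description"]),
    vlan: asCount(raw["vlan"]),
    macCount: asCount(raw["macCount"]),
    peerDevice: asText(raw["peerDevice"]),
    peerPort: asText(raw["peerPort"]),
    connectedHost: asText(raw["connectedHost"]),
    transceiver: asText(raw["transceiver"]),
    inOctets: asCount(raw["inOctets"]),
    outOctets: asCount(raw["outOctets"]),
  };
}

// Usage with the port 5 values quoted above (the status value is made up).
const port5 = normalizePortSketch({
  name: "GigabitEthernet5",
  status: "up",
  description: "Tashi-204",
  peerDevice: "fritz.box",
  peerPort: "LAN:1",
});
console.log(port5.description, port5.peerDevice, port5.peerPort);
```

Listing every field explicitly means a field added later is dropped visibly at the type level rather than silently at runtime.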
- TIP/Blog lane separation was materially corrected on 2026-05-06:
  - root cause:
    - `TIP_LLM` was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
    - local inspection showed the old TIP export had `6250` train rows, of which `6087` still matched blog/writer patterns.
  - dataset builder and Gitea sync were hardened:
    - `scripts/runpod_dataset_builder.ts`
      - added strict `tipDatasetAllowed(...)`
      - `TIP_LLM` now rejects blog-shaped source rows at dataset-build time
      - `TIP_LLM` now rejects blog-like `system`, `user`, and markdown-article `assistant` patterns
      - registry fallback for `TIP_LLM` now only uses lane-compatible datasets
    - `scripts/sync_gitea_training_pool.ts`
      - canonical TIP pool refresh now uses the stricter lane-alignment rules
      - redundant `merged.jsonl` copies for `fo_blogllm` and `tip_llm` are no longer rewritten, to avoid local disk exhaustion from duplicate lane artifacts
  - local disk issue encountered and fixed:
    - full refresh failed with `ENOSPC` while writing `training-data/gitea-learning-pool/tip_llm/merged.jsonl`
    - redundant lane `merged` artifacts for `fo_blogllm` and `tip_llm` were truncated and the sync script was changed to stop recreating them
    - free disk space returned from `377Mi` to `17Gi`
  - locally verified after rebuild:
    - `TIP_LLM` RunPod export:
      - `train = 233`
      - `eval = 26`
      - `total = 259`
      - `blog/writer matches = 0`
    - first TIP rows now use the correct TIP system prompt:
      - `You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...`
  - corrected artifacts and scripts were synced to Erik and `pnpm training:refresh-all` was rerun there.
  - live verified on Erik/public API:
    - `magatamallm`
      - `datasetSource = url`
      - `collectedExamples = 15679`
      - `evalExamples = 1743`
      - `totalExamples = 17422`
      - `newSinceLastTraining = 15679`
    - `fo_blogllm`
      - `datasetSource = url`
      - `collectedExamples = 17322`
      - `evalExamples = 1926`
      - `totalExamples = 19254`
      - `neverTrained = true`
    - `tip_llm`
      - `datasetSource = url`
      - `collectedExamples = 231`
      - `evalExamples = 26`
      - `totalExamples = 257`
      - `neverTrained = true`
  - operational conclusion:
    - lane-specific dataset truth is now real on Erik.
    - `TIP_LLM` is no longer silently borrowing the FO_Blog behavior lane.
    - the next remaining hard problem is now RunPod artifact adoption/validation, not lane contamination.
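
A lane filter in the spirit of `tipDatasetAllowed(...)` might look like the sketch below. The real rules in `scripts/runpod_dataset_builder.ts` are not reproduced in this handoff, so the row shape `LaneRow`, the function name `tipRowAllowedSketch`, and all three regular expressions are illustrative assumptions.

```typescript
// Hypothetical row shape for a chat-style training example.
interface LaneRow {
  source: string | undefined; // origin file or dataset name, if known
  system: string;
  user: string;
  assistant: string;
}

// Assumed patterns, for illustration only; the real rule set is not shown here.
const BLOG_SOURCE = /blog|writer/i;
const BLOG_PROMPT = /\b(blog post|article|linkedin|ghost)\b/i;
const MARKDOWN_ARTICLE = /^#\s+\S/m; // a markdown H1 line anywhere in the answer

// Returns true only for rows that look like TIP research/extraction data.
function tipRowAllowedSketch(row: LaneRow): boolean {
  if (row.source !== undefined && BLOG_SOURCE.test(row.source)) return false;
  if (BLOG_PROMPT.test(row.system) || BLOG_PROMPT.test(row.user)) return false;
  if (MARKDOWN_ARTICLE.test(row.assistant)) return false;
  return true;
}

// Usage: filter a mixed pool and count what was rejected.
const pool: LaneRow[] = [
  {
    source: "tip_llm/robot-control.jsonl",
    system: "You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems",
    user: "List the verification gaps for this vendor.",
    assistant: "Price verified, image missing, details missing.",
  },
  {
    source: "fo_blogllm/export.jsonl",
    system: "You write long-form posts.",
    user: "Write a blog post about LPO.",
    assistant: "# Why LPO matters\n\nIntro paragraph.",
  },
];
const kept = pool.filter(tipRowAllowedSketch);
console.log(`kept ${kept.length}, rejected ${pool.length - kept.length}`);
```

Rejecting at build time, as the notes describe, keeps contaminated rows out of the export instead of relying on later cleanup.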
- MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
  - dashboard and core were rebuilt locally and redeployed to Erik.
  - live processes restarted successfully:
    - `magatama-dashboard`
    - `magatama`
  - public `api/llm/status` now shows the true lane-export totals for `magatamallm`:
    - `collectedExamples = 15620`
    - `effectiveExamples = 15620`
    - `evalExamples = 1736`
    - `totalExamples = 17356`
    - `newSinceLastTraining = 15620`
  - root cause for the stale `1097` display:
    - the RunPod start SSE path still logged the legacy deduplicated `fixes.jsonl` corpus.
    - this was changed so RunPod launches no longer present the legacy `1097` count as the active training truth.
    - after dataset refresh the UI now emits the lane manifest totals instead.
  - RunPod completion handling was hardened:
    - worker `COMPLETED` is no longer trusted blindly.
    - MAGATAMA now scans RunPod worker logs for real training failures (`Traceback`, `SyntaxError`, non-zero exit, etc.) before treating the run as successful.
    - if the worker logs show a hidden failure, MAGATAMA records this as `completed_with_worker_failure` instead of pretending the run succeeded.
  - public findings state remains currently empty:
    - `GET /api/findings?limit=1` returned `{"findings":[],"total":0}`
    - this is now rendered with an explicit empty-state row instead of a visually blank table.
  - Attack Paths empty-state is now intentionally explicit rather than looking broken.
  - Frontend cache and scope handling were hardened:
    - cache version bumped to `2026-05-06b`
    - stale legacy `magatama_api_cache:*` entries are cleared
    - per-endpoint TTLs added
    - invalid or empty scope selections are normalized instead of silently leaving the UI in misleading empty views
  - Switchblade rack port hover was materially improved:
    - port chips now carry `data-tooltip`
    - custom tooltip CSS is live on Erik
    - the old browser-native “question mark only” behavior should be replaced by a readable hover bubble
  - Changelog self-healing was added in core:
    - stale cached changelog data older than 6h now forces a rebuild from git history
    - verified live via dashboard proxy on Erik:
      - `generatedAt = 2026-05-06T15:18:42.708Z`
      - latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
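
The RunPod completion hardening above (scanning worker logs before trusting `COMPLETED`) could be approximated as shown here. The marker list and the names `WorkerOutcome`, `FAILURE_MARKERS`, and `classifyWorkerLogs` are assumptions for illustration, as is the plain `completed` label; only `completed_with_worker_failure` and the `Traceback` / `SyntaxError` / non-zero-exit examples come from the notes.

```typescript
// "completed" is a placeholder label; "completed_with_worker_failure" is the
// state named in the notes above.
type WorkerOutcome = "completed" | "completed_with_worker_failure";

// Assumed failure markers; the real scanner may check more or different patterns.
const FAILURE_MARKERS: RegExp[] = [
  /Traceback \(most recent call last\)/,
  /SyntaxError/,
  /exit(ed)? (with )?(code|status) [1-9]\d*/i, // any non-zero exit code
];

// A run reported as COMPLETED is downgraded if any log line matches a marker.
function classifyWorkerLogs(logLines: string[]): WorkerOutcome {
  const failed = logLines.some((line) =>
    FAILURE_MARKERS.some((marker) => marker.test(line)),
  );
  return failed ? "completed_with_worker_failure" : "completed";
}

// Usage with made-up log lines.
const outcome = classifyWorkerLogs([
  "loading dataset",
  "Traceback (most recent call last):",
  "process exited with code 1",
]);
console.log(outcome);
```

Matching line by line keeps the check cheap and makes it easy to report which line triggered the downgrade, should that be needed later.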
- MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
  - root cause:
    - the training modal always fetched `/api/llm/status` without a lane, so `FO_BlogLLM` and `TIP_LLM` still showed the `magatamallm` pool.
  - dashboard/server were updated so `/api/llm/status?lane=...` is now truly lane-aware.
  - the training modal now refreshes per selected lane and rewrites:
    - title
    - runtime label
    - pool path
    - counts
    - dataset source
  - MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via `ecosystem.config.cjs`:
    - `RUNPOD_DATASET_SOURCE=url`
    - `RUNPOD_DATASET_SOURCE_MAGATAMALLM=url`
    - `RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url`
    - `RUNPOD_DATASET_SOURCE_TIP_LLM=url`
  - live verified on Erik after restart:
    - `fo_blogllm`
      - `datasetSource = url`
      - `collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json`
      - `train = 28`
      - `eval = 4`
      - `total = 32`
    - `tip_llm`
      - `datasetSource = url`
      - `collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json`
      - `train = 36`
      - `eval = 4`
      - `total = 40`
    - `magatamallm`
      - remains on lane-export counts (`15620 / 1736 / 17356`)
  - operator impact:
    - no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
    - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing `magatamallm`.
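
How a lane might resolve its dataset source and manifest path from the environment variables listed above can be sketched as follows. The precedence (per-lane variable first, then the global one) and the Hugging Face default when nothing is set are assumptions, as are the names `EnvSketch`, `resolveDatasetSource`, and `laneManifestPath`; the variable names and the manifest path pattern are taken from the notes.

```typescript
type DatasetSourceSketch = "url" | "huggingface";
type EnvSketch = Record<string, string | undefined>;

// Assumed precedence: RUNPOD_DATASET_SOURCE_<LANE> overrides RUNPOD_DATASET_SOURCE.
// Assumed default: anything other than "url" falls back to "huggingface".
function resolveDatasetSource(lane: string, env: EnvSketch): DatasetSourceSketch {
  const laneKey = `RUNPOD_DATASET_SOURCE_${lane.toUpperCase()}`;
  const raw = (env[laneKey] ?? env["RUNPOD_DATASET_SOURCE"] ?? "").trim().toLowerCase();
  return raw === "url" ? "url" : "huggingface";
}

// Path pattern as seen in the verified `collectionsPath` values above.
function laneManifestPath(lane: string): string {
  return `/opt/magatama/training-data/runpod/${lane}/manifest.json`;
}

// Usage: the env object is passed in, so the same function works in tests
// and with `process.env` on the server.
const erikEnv: EnvSketch = {
  RUNPOD_DATASET_SOURCE: "url",
  RUNPOD_DATASET_SOURCE_TIP_LLM: "url",
};
console.log(resolveDatasetSource("tip_llm", erikEnv), laneManifestPath("tip_llm"));
```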
- MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
  - the RunPod serverless training start failure was not a RunPod outage.
  - root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
  - Codex synced the full local `magatama/scripts/` tree to Erik, added a safe fallback in `scripts/model_registry_build.ts`, and synced the local `training-data/model-registry/` directory.
  - verified on Erik:
    - `pnpm training:refresh-all` now succeeds.
    - fresh dataset totals after dedupe:
      - `magatamallm`: `92,742` raw → `17,356` effective (`15,620 train / 1,736 eval`)
      - `fo_blogllm`: `32` total (`28 train / 4 eval`)
      - `tip_llm`: `40` total (`36 train / 4 eval`)
  - important nuance:
    - Codex did **not** execute the final Hugging Face publish step from Erik in this chat.
    - local/script/build failures are fixed; external dataset publish still depends on the selected dataset source and explicit publish intent.
  - MAGATAMA Attack Paths UX is no longer a misleading blank panel:
    - the page now distinguishes between:
      - no live attack paths
      - historical fallback paths
      - empty selected scope (`0 assets in scope`)
    - when a user narrows the scope to a rack/location with zero scoped assets, the graph explicitly says so instead of looking broken.
    - live dashboard HTML on Erik now contains these German UI strings:
      - `Im aktuellen Scope liegen 0 Assets.` (English: "There are 0 assets in the current scope.")
      - `Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann.` (English: "Widen the location or datacenter / rack so MAGATAMA can display correlatable assets and paths.")
      - `Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer.` (English: "Without open multi-step correlations, the graph view intentionally stays empty.")
  - MAGATAMA code/training hardening was extended:
    - `scripts/test_runpod_adapter.py` no longer loads tokenizer/model with `trust_remote_code=True`.
    - `scripts/ollama_adapter_bridge.py` no longer loads tokenizer/model with `trust_remote_code=True`.
    - this removed the live CODE finding around `HuggingFace trust_remote_code` on Erik.
  - Atlas exposure logic was tightened to stop reopening noisy LAN management findings:
    - generic `atlas-exposure` findings now only stay operationally open for exposure that is meaningful enough to track as a finding.
    - internal RFC1918 management/service ports discovered by the broad atlas scan are no longer promoted into open Guard findings just because they exist on the LAN.
    - host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
  - after rebuild + deploy + health sync:
    - live Postgres open findings returned to `0`.
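
The Atlas exposure rule above hinges on recognizing RFC 1918 private addresses (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`). A minimal sketch of that check is given below; the `AtlasExposureSketch` shape, its `externallyReachable` flag, and the function names are hypothetical and only illustrate the idea that a LAN-only port does not by itself keep a finding open.

```typescript
// Parse a dotted-quad IPv4 string into four octets, or undefined if malformed.
function parseIpv4(ip: string): number[] | undefined {
  const parts = ip.split(".");
  if (parts.length !== 4) return undefined;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  return octets.every((o) => o >= 0 && o <= 255) ? octets : undefined;
}

// True for 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
function isRfc1918(ip: string): boolean {
  const octets = parseIpv4(ip);
  if (octets === undefined) return false;
  const a = octets[0] ?? -1;
  const b = octets[1] ?? -1;
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Hypothetical exposure record; `externallyReachable` is an assumed signal.
interface AtlasExposureSketch {
  ip: string;
  port: number;
  externallyReachable: boolean;
}

// Keep a finding open only if the port is reachable from outside
// or the address is not private LAN space.
function shouldStayOpen(e: AtlasExposureSketch): boolean {
  return e.externallyReachable || !isRfc1918(e.ip);
}

console.log(shouldStayOpen({ ip: "192.168.178.2", port: 22, externallyReachable: false }));
```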
- Follow-up hardening on the same block:
  - the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
  - dataset preparation now distinguishes:
    - local `training:refresh-all` failure
    - optional Hugging Face publish failure
    - URL-based dataset mode with no external publish required
  - the training SSE flow now explicitly tells the operator whether RunPod is using:
    - Hugging Face dataset source
    - or MAGATAMA URL-bundle dataset source
  - this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
- follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
  - MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
  - payloads were aligned more closely with the official Axolotl serverless schema:
    - `model_type=AutoModelForCausalLM`
    - `tokenizer_type=AutoTokenizer`
    - dataset `split: train`
    - optimizer `adamw_torch_fused`
  - verified full run attempt:
    - job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
    - disappeared as `not_found_after_submit` (`404 job not found`)
  - verified canary after payload fix:
    - job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
    - immediately materialized as `IN_QUEUE`
    - then still disappeared on later reconcile as `not_found_after_submit`
  - current conclusion:
    - the old MAGATAMA bug is fixed.
    - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
  - operational rule:
    - do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
    - only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.
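
The operational rule above can be expressed as a small decision function. This is a sketch under assumptions: `StatusObservation` is a hypothetical summary of one `/status/{jobId}` poll, `hasArtifactEvidence` stands in for whatever artifact check MAGATAMA performs, and the `untrusted` / `trusted` labels are illustrative; `not_found_after_submit`, `IN_QUEUE`, `IN_PROGRESS`, and `COMPLETED` are the names used in the notes.

```typescript
// Hypothetical summary of a single status poll for one job id.
interface StatusObservation {
  httpStatus: number;          // HTTP status of the status request
  status: string | undefined;  // job status string, if the job was found
  hasArtifactEvidence: boolean; // assumed flag: output/artifact reference seen
}

type RunTrust = "not_found_after_submit" | "untrusted" | "trusted";

// `submitted` or a brief IN_QUEUE never counts as proof of a usable run.
function assessRun(obs: StatusObservation): RunTrust {
  if (obs.httpStatus === 404) return "not_found_after_submit";
  if (obs.status === "IN_PROGRESS") return "trusted";
  if (obs.status === "COMPLETED" && obs.hasArtifactEvidence) return "trusted";
  return "untrusted";
}

// Usage: the canary pattern described above, queued first and gone on reconcile.
const firstPoll = assessRun({ httpStatus: 200, status: "IN_QUEUE", hasArtifactEvidence: false });
const reconcile = assessRun({ httpStatus: 404, status: undefined, hasArtifactEvidence: false });
console.log(firstPoll, reconcile);
```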
- follow-up training count fix on 2026-05-06 corrected the Training UI source-of-truth:
  - MAGATAMA had still shown `1097` because the dashboard was counting the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
  - dashboard now prefers `training-data/runpod/magatamallm/manifest.json` for the visible MagatamaLLM training count.
  - synced current lane export to Erik and restarted `magatama-dashboard`.
  - verified public API now returns:
    - `collectedExamples = 1367`
    - `effectiveExamples = 1367`
    - `evalExamples = 152`
    - `totalExamples = 1519`
    - `newSinceLastTraining = 1367`
  - if the browser still shows `1097`, treat it as stale cached UI and hard reload.
- MAGATAMA was repaired end-to-end to a clean operational baseline:
  - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
  - open findings were reduced all the way to `0` in Postgres.
  - false-positive Proxmox baseline findings were removed by teaching the audit to treat internal-only management ports and default-only rpcbind exposure as acceptable for this host.
  - code scanner false positives from generated/report artifacts remain excluded.
- Live MAGATAMA protection/runtime state after the 2026-05-06 remediation:
  - `open findings: 0`
  - `queueExecuting: 0`
  - `queueBlocked: 0`
  - `queueFailed: 0`
  - public `/api/health` returns `status: ok`
  - public `/api/active-resolvers` returns:
    - `MAGATAMA Core: working`
    - `MagatamaLLM: working`
    - `Claude (secondary): working`
    - `Codex (secondary/manual): idle`
    - `Copilot (secondary/manual): idle`
- Important resolver truth fix on 2026-05-06:
  - live `codex_enabled=false` in MAGATAMA settings was causing Codex to show as a broken resolver.
  - dashboard logic was updated so disabled Codex/Copilot now show truthfully as `idle` with `In MAGATAMA settings disabled`, instead of pretending there is a runtime outage.
  - the local codex bridge on Erik is reachable but currently reports `auth_required`; do not treat that as a production outage while Codex is intentionally disabled in settings.
- Remaining real operational gap after findings hit zero:
  - MAGATAMA still knows more assets than it actively telemeters.
  - last public protection proof showed:
    - `knownAssets: 79`
    - `hostsWithTelemetry: 27`
    - `assetsWithoutTelemetry: 52`
  - these are currently inventory/discovery-only assets, not open findings, but they remain the next real coverage expansion area.
- MAGATAMA cross-repo state from the same chat is now synced into this handoff:
  - Compliance framework cards in MAGATAMA are clickable and open per-framework requirement details.
  - MAGATAMA training status was corrected so `New Since Last Training` no longer falsely shows `0`.
  - Live verified/deduped MAGATAMA training state after the fix:
    - `collectedExamples: 49`
    - `rawExamples: 58`
    - `duplicateExamples: 9`
    - `effectiveExamples: 49`
    - `newSinceLastTraining: 49`
  - MAGATAMA now filters training metrics to verified/trainable examples only.
  - Failed/escalated MAGATAMA remediation records should go to `errors.jsonl`, not the main `fixes.jsonl`, so the next MagatamaLLM run does not train on junk.
  - Gitea-backed training pool remains the default target for training writes.
- MAGATAMA coverage-gap and training-integrity hardening on 2026-05-06:
  - the earlier `49` medium `atlas-coverage-gap` findings were traced to Atlas treating inventory-only and discovery-only assets as operational protection failures.
  - core logic was tightened so Atlas coverage findings now open only for managed operational assets:
    - exposure-backed assets
    - explicit non-auto owner
    - configured telemetry expectation
    - critical/high criticality
    - infrastructure metadata or managed infra device types
  - loopback and passive reference/inventory assets no longer reopen noisy guard findings.
  - local build succeeded, the new core dist was deployed to Erik, and the first post-deploy guard scan resolved stale findings.
  - live Postgres state after deploy: `open findings = 0`.
  - training integrity bug was fixed in `packages/core/src/learning/fix-tracking.ts`:
    - verified fixes now append to `training-data/gitea-learning-pool/magatamallm/fixes.jsonl`
    - failed/escalated/report-only runs now belong in `errors.jsonl`
  - two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
    - atlas coverage scope hardening
    - training path integrity fix
  - corpus cleanup + dedupe was executed afterward:
    - pre-dedupe backup kept locally as:
      - `magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl`
    - resulting verified corpus:
      - `fixes.jsonl = 1,368` unique verified training rows
    - resulting failure corpus:
      - `errors.jsonl = 4` tracked failed/escalated rows
    - integrity report now exists at:
      - `magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json`
    - latest integrity totals:
      - `scanned: 1368`
      - `verified: 1368`
      - `movedToErrors: 4`
      - `parseErrors: 0`
      - `invalidVerifiedFlag: 0`
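
The routing and dedupe rule above (verified fixes into `fixes.jsonl`, everything else into `errors.jsonl`, duplicates dropped) can be sketched like this. The record shape `CorpusRecord`, the `outcome` values, and the dedupe key (prompt plus completion) are assumptions; the actual logic in `fix-tracking.ts` and the cleanup script may key and classify differently.

```typescript
// Hypothetical record shape; field names are illustrative.
type FixOutcome = "fixed" | "failed" | "escalated" | "report_only";

interface CorpusRecord {
  prompt: string;
  completion: string;
  verified: boolean;
  outcome: FixOutcome;
}

interface CorpusSplit {
  fixes: CorpusRecord[];   // rows destined for fixes.jsonl
  errors: CorpusRecord[];  // rows destined for errors.jsonl
  duplicates: number;      // verified rows dropped as exact repeats
}

// Route each record, then drop exact duplicates among the verified fixes.
function splitCorpus(records: CorpusRecord[]): CorpusSplit {
  const seen = new Set<string>();
  const result: CorpusSplit = { fixes: [], errors: [], duplicates: 0 };
  for (const rec of records) {
    if (!rec.verified || rec.outcome !== "fixed") {
      result.errors.push(rec);
      continue;
    }
    // Assumed dedupe key: identical prompt and completion.
    const key = JSON.stringify([rec.prompt, rec.completion]);
    if (seen.has(key)) {
      result.duplicates += 1;
      continue;
    }
    seen.add(key);
    result.fixes.push(rec);
  }
  return result;
}

// Usage with made-up rows.
const sampleSplit = splitCorpus([
  { prompt: "p1", completion: "c1", verified: true, outcome: "fixed" },
  { prompt: "p1", completion: "c1", verified: true, outcome: "fixed" },
  { prompt: "p2", completion: "c2", verified: false, outcome: "escalated" },
  { prompt: "p3", completion: "c3", verified: true, outcome: "report_only" },
]);
console.log(sampleSplit.fixes.length, sampleSplit.errors.length, sampleSplit.duplicates);
```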
- Complete Codex chat sync was added:
  - `sync/history/2026-04-29-codex-complete-chat-sync.md`
  - captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
  - confirms no secrets were written into sync.
  - confirms TIP crawler/robot planning remains TIPLLM-only.
  - confirms Erik remains controller/light `erik-safe` only, with heavy crawler work assigned to Proxmox/Pi workers.
- Codex sync-start confirmation was added:
  - `sync/history/2026-04-29-codex-sync-start-confirmation.md`
  - confirms Codex read this TIP handoff, checked the sibling LLM Gateway handoff, and is treating `sync/` as binding.
  - no code changes, crawler jobs, queue waves, PM2 restarts, or Erik load were initiated during this confirmation.
- Codex follow-up on 2026-04-29 clarified the active BlogLLM model:
  - TIP shows `fo-blog-v7`, but this is not a normal Ollama GGUF manifest.
  - It is a local Adapter Bridge / Mac Studio model backed by the RunPod-trained PEFT adapter:
    `/Users/renefichtmueller/Desktop/Claude Code/magatama/training-data/runpod/pod-runs/2026-04-25-fo-tip/final/adapters/fo_blogllm/final-adapter`
  - Bridge definition:
    `/Users/renefichtmueller/Desktop/Claude Code/magatama/scripts/ollama_adapter_bridge.py`
  - TIP API default:
    `packages/api/src/llm/client.ts` uses `OLLAMA_LLM_MODEL || "fo-blog-v7"`.
  - `fo-blog-v8` remains the next training candidate, not the currently active TIP BlogLLM model.
- Full Codex session handoff was added:
  - `sync/history/2026-04-29-codex-full-session-handoff.md`
  - covers TIP verification, product image/detail crawling, Blog Engine Hot Topics, TIPLLM robots, training pool, Erik status, and cross-repo sync.
- Added a verification robot controller:
  - `packages/scraper/src/robots/verification-robots.ts`
  - command: `npm run robots:verification -w packages/scraper -- --status`
- Added TIPLLM robot experience writing:
  - `packages/scraper/src/crawler-llm/training-data-writer.ts`
  - writes raw robot audit rows and SFT records.
- Added Gitea training pool import to TIP learning-pool build:
  - `scripts/tip-learning-pool-build.ts`
  - imports `TIP_TRAINING_REPO/qa-pairs/*.jsonl` into the `tip_llm` lane.
- Added docs:
  - `docs/TIP_SELFLEARNING_WORKFLOW.md`
- Added package script:
  - `packages/scraper/package.json`
  - `robots:verification`
## Gitea Training Pool
- Existing local clone: `/tmp/tip-training-data`
- Gitea repo: `rene/tip-training-data`
- Latest pushed training commit:
  - `f1c83f8 crawl: add robot-status training records [2026-04-29T20:11:24.091Z]`
- First robot experience record was written to:
  - `/tmp/tip-training-data/qa-pairs/robot-control-high.jsonl`
  - `/tmp/tip-training-data/robot-experiences/2026-04-29.jsonl`
## MAGATAMA Training / Operations State
- Relevant local repo:
  - `/Users/renefichtmueller/Desktop/Claude Code/magatama`
- Latest confirmed live MAGATAMA findings state:
  - `open findings: 0` on `2026-05-06`
- Latest confirmed live resolver state:
  - `Codex` and `Copilot` intentionally `idle/disabled`
  - not a runtime outage, but a settings choice until gateway/bridge auth is intentionally re-enabled
- Latest confirmed live MAGATAMA training metric after dashboard fix:
  - `newSinceLastTraining: 49`
- Meaning:
  - the old `0` was incorrect.
  - the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
- Latest corpus integrity state after cleanup:
  - operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
    - `1368` unique verified rows
    - `4` live failure/escalation rows in `errors.jsonl`
  - do not confuse raw historical volume with real trainable signal.
- Important training integrity rule:
  - report-only or failed/escalated records must not be treated as verified training fixes.
  - keep them separated from the main verified training corpus.
## Erik Status
- Synced TIPLLM robot/training code to `/opt/tip`.
- Did not start crawler jobs.
- Did not enqueue robot waves.
- Did not restart PM2 services.
- Remote scraper TypeScript build is passing after removing two stale misplaced remote-only duplicate files:
  - `/opt/tip/packages/scraper/src/scrapers/scheduler.ts`
  - `/opt/tip/packages/scraper/src/vendor-discovery-crawler.ts`
- `tip-api` and `tip-scraper-daemon` are online.
- Shared Erik note from the same chat:
  - MAGATAMA dashboard/core were redeployed during compliance/training fixes.
  - TIP crawler policy remains unchanged: Erik is controller/light runner only, not heavy crawl execution host.
## Last Live Verification Snapshot
From 2026-04-29:
- Total transceivers: `13,546`
- Price verified: `7,250`
- Image verified: `7,025`
- Details verified: `6,243`
- Fully verified: `5,812`
- Last price observation: `2026-04-29 19:15:53 UTC`
- Last stock observation: `2026-04-29 19:15:56 UTC`
## Latest MAGATAMA Training / RunPod Truth
Confirmed on `2026-05-06`:
- Lane-specific training pools are now materially separated and no longer all fall back to `magatamallm`.
- Live Erik dashboard API now reports:
  - `magatamallm`
    - `1367 train`
    - `152 eval`
    - `1519 total`
    - `newSinceLastTraining = 1367`
  - `fo_blogllm`
    - `17353 train`
    - `1929 eval`
    - `19282 total`
    - `newSinceLastTraining = 17353`
    - active local model resolves to `fo-blog-v7`
  - `tip_llm`
    - `6482 train`
    - `721 eval`
    - `7203 total`
    - `newSinceLastTraining = 6482`
    - target active model is `tip-llm-v1`, but this model is not yet present locally in Ollama
- Result:
  - the previous `1097` shown everywhere was stale / wrong.
  - selected lane now controls its own manifest, model label, and training counts.
### Gitea-backed Pool Materialization
- `magatamallm` Gitea pool remains canonical and populated.
- `fo_blogllm` and `tip_llm` Gitea-backed pool folders were previously almost empty; they are now materialized from the local RunPod lane exports.
- Lane manifests and JSONL exports now exist under:
  - `training-data/gitea-learning-pool/fo_blogllm/`
  - `training-data/gitea-learning-pool/tip_llm/`
### RunPod Completion Hardening
- MAGATAMA dashboard code now treats RunPod `COMPLETED` as success only after:
  1. target model artifact is referenced
  2. local Mac training API adopts/imports the artifact
  3. lane-specific smoke tests pass
  4. active Ollama alias is updated
- New local adoption endpoint is:
  - `POST /adopt-runpod-model`
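
The four-step gate above amounts to a conjunction: a `COMPLETED` run counts as success only if every step holds. A minimal sketch follows. `AdoptionEvidence` and its field names are hypothetical, and `completed_not_adopted` is a placeholder label; `completed_and_adopted` and `completed_without_model_artifact` are the state names used elsewhere in this handoff.

```typescript
// Hypothetical evidence collected for a run that RunPod reports as COMPLETED.
interface AdoptionEvidence {
  artifactRef: string | undefined; // step 1: target model artifact reference
  importedLocally: boolean;        // step 2: Mac training API adopted/imported it
  smokeTestsPassed: boolean;       // step 3: lane-specific smoke tests
  aliasUpdated: boolean;           // step 4: active Ollama alias switched
}

type CompletedRunState =
  | "completed_and_adopted"
  | "completed_without_model_artifact"
  | "completed_not_adopted"; // placeholder name for "artifact exists, adoption incomplete"

// Success requires all four steps; a missing artifact is reported separately.
function classifyCompletedRun(e: AdoptionEvidence): CompletedRunState {
  if (e.artifactRef === undefined || e.artifactRef === "") {
    return "completed_without_model_artifact";
  }
  if (e.importedLocally && e.smokeTestsPassed && e.aliasUpdated) {
    return "completed_and_adopted";
  }
  return "completed_not_adopted";
}

// Usage: a run that completed but returned no output object.
const proofRun = classifyCompletedRun({
  artifactRef: undefined,
  importedLocally: false,
  smokeTestsPassed: false,
  aliasUpdated: false,
});
console.log(proofRun);
```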
### Mac Training API State
- The old LaunchAgent on Mac Studio was still serving the legacy training API from:
  - `~/magatama-llm/service/training_api.py`
- It has now been upgraded in place so Erik sees the new adoption-capable API.
- Verified from Erik:
  - `http://192.168.178.213:3214/health` returns the new service
  - it now exposes `register_script` pointing into the MAGATAMA repo
  - `POST /adopt-runpod-model` exists and rejects unauthenticated requests with `401`, proving the route is live
### Still Outstanding
- A fully successful end-to-end RunPod fine-tune has not yet been re-verified after the new adoption pipeline was wired in. Such a run needs:
  - real worker success
  - real artifact
  - successful local Ollama import
  - active alias switch
  - smoke-test proof
- Latest live proof run on `2026-05-06`:
  - job id: `2112a7ab-68c2-4411-a44f-6edb7ad377df-e1`
  - materialized correctly
  - reached `IN_PROGRESS`
  - then `COMPLETED`
  - but RunPod `status/{job}` returned no `output` object, no model artifact reference, and no Hugging Face repo result
  - current MAGATAMA handling now correctly classifies this as `completed_without_model_artifact`, not as success
- `tip-llm-v1` is still not installed locally in Ollama.
### Pulso AI Recommendation
- Keep a shared network/transceiver/switch core corpus with TIP.
- Do not collapse `Pulso AI` into the same instruction lane as `TIP_LLM`.
- Recommended split:
  - `TIP_LLM`
    - research
    - crawler / scraper / robot planning
    - vendor / firmware / issue extraction
  - `Pulso AI`
    - product responses
    - support
    - diagnostics
    - operator explanation layer
## Safe Next Steps
1. Clone or pull Gitea `origin` on laptop/Claude Code.
2. Read this folder first.
3. For BlogLLM work, treat `fo-blog-v7` as Adapter Bridge / PEFT adapter, not as a `~/.ollama` GGUF model.
4. Also read `llm-gateway/sync/CURRENT.md` when work touches shared Erik infrastructure, LLM routing, bridges, auth, TIPLLM, or crawler orchestration.
5. For TIP robot/crawler planning, use TIPLLM only. Do not route this lane through external AI providers.
6. When training pools or model stats look suspicious, prefer verified-only counts and check whether failed/escalated rows polluted the corpus.
7. For MAGATAMA-adjacent work, keep writing learnings back into the Gitea-backed pool and avoid training on report-only pseudo-fixes.
8. If testing robots, start with dry runs only:
```bash
npm run robots:verification -w packages/scraper -- --status
npm run robots:verification -w packages/scraper -- --tipllm-plan --limit=3
npm run robots:verification -w packages/scraper -- --enqueue=details-fast-lane --profile=erik-safe --dry-run
```
9. Only dispatch real crawl work after deciding the target host:
- Erik: `erik-safe`, tiny batches only.
- Pi: `pi-fetch`.
- Proxmox: `proxmox-heavy`.
## Dirty Worktree Note
There are existing uncommitted changes outside `sync/`. Some are Codex work from this session, some appear pre-existing or from earlier Claude/Codex work. Do not blindly revert them. Review `git status --short` before committing broader changes.
## Latest Sync Commits
- `6c42ca7 docs: add shared agent sync handoff`
- `8e7c5aa docs: link llm-gateway sync handoff`
- Pending after this update:
  - watch whether any future guard exposure findings are genuine operational issues or new false positives.
  - if failures still appear inside `fixes.jsonl`, scrub historic pollution and backfill `errors.jsonl`.