rene/transceiver-db


Rene Fichtmueller e6f98c89bd sync: record magatama runpod adoption and lane truth

2026-05-06 20:23:53 +02:00


Current TIP Sync State

Updated: 2026-05-06 15:48 UTC

Active Policy

Put coordination notes and handoffs in this sync/ folder and push to Gitea.
Check sibling project sync folders first when context may span repos.
Use TIPLLM only for TIP crawler/robot planning and extraction feedback.
Write robot/crawler experience into the Gitea-backed TIPLLM training pool.
Keep Erik safe: no heavy crawler waves or uncontrolled Playwright/discovery jobs on Erik.
Use Proxmox/Pi workers for crawl load.

Cross-Repo Sync

Claude Code also created a Gitea sync handoff in the LLM Gateway repo:

Repo: rene/llm-gateway
Path: sync/
Commit shown by Claude: e272105 sync: add chat handoff + context scaffolding for Codex integration (2026-04-29)
Gitea path: http://192.168.178.196:3000/rene/llm-gateway/src/main/sync/

When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infrastructure, read both:

transceiver-db/sync/CURRENT.md
llm-gateway/sync/CURRENT.md

Latest Work

MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
- dashboard and core were rebuilt locally and redeployed to Erik.
- live processes restarted successfully:
  - magatama-dashboard
  - magatama
- public api/llm/status now shows the true lane-export totals for magatamallm:
  - collectedExamples = 15620
  - effectiveExamples = 15620
  - evalExamples = 1736
  - totalExamples = 17356
  - newSinceLastTraining = 15620
- root cause for the stale 1097 display:
  - the RunPod start SSE path still logged the legacy deduplicated fixes.jsonl corpus.
  - this was changed so RunPod launches no longer present the legacy 1097 count as the active training truth.
  - after dataset refresh the UI now emits the lane manifest totals instead.
- RunPod completion handling was hardened:
  - worker COMPLETED is no longer trusted blindly.
  - MAGATAMA now scans RunPod worker logs for real training failures (Traceback, SyntaxError, non-zero exit, etc.) before treating the run as successful.
  - if the worker logs show a hidden failure, MAGATAMA records this as completed_with_worker_failure instead of pretending the run succeeded.
- public findings state remains currently empty:
  - GET /api/findings?limit=1 returned {"findings":[],"total":0}
  - this is now rendered with an explicit empty-state row instead of a visually blank table.
- Attack Paths empty-state is now intentionally explicit rather than looking broken.
- Frontend cache and scope handling were hardened:
  - cache version bumped to 2026-05-06b
  - stale legacy magatama_api_cache:* entries are cleared
  - per-endpoint TTLs added
  - invalid or empty scope selections are normalized instead of silently leaving the UI in misleading empty views
- Switchblade rack port hover was materially improved:
  - port chips now carry data-tooltip
  - custom tooltip CSS is live on Erik
  - the old browser-native “question mark only” behavior should be replaced by a readable hover bubble
- Changelog self-healing was added in core:
  - stale cached changelog data older than 6h now forces a rebuild from git history
  - verified live via dashboard proxy on Erik:
    - generatedAt = 2026-05-06T15:18:42.708Z
    - latest visible entries include 2026-04-30 items again instead of appearing frozen at 30.05
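
The RunPod completion hardening in this entry can be sketched as follows. This is a minimal Python illustration, not the MAGATAMA implementation: the function name `classify_run` and the regular expressions are assumptions; only the marker types (Traceback, SyntaxError, non-zero exit) and the `completed_with_worker_failure` state come from the notes above.

```python
import re

# Hypothetical failure markers. The handoff names Traceback, SyntaxError and
# non-zero exit codes; the exact patterns MAGATAMA scans for are not shown.
FAILURE_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\bSyntaxError\b"),
    re.compile(r"exit(?:ed)? (?:with )?(?:code|status) (?!0\b)\d+", re.IGNORECASE),
]


def classify_run(worker_status: str, log_lines: list[str]) -> str:
    """Return the recorded run state instead of trusting COMPLETED blindly."""
    if worker_status != "COMPLETED":
        return worker_status.lower()
    for line in log_lines:
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            # The worker reported success, but its logs show a hidden failure.
            return "completed_with_worker_failure"
    return "completed"
```

A run only counts as `completed` here when the worker status and the logs agree; any other worker status is passed through unchanged.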
MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
- root cause:
  - the training modal always fetched /api/llm/status without a lane, so FO_BlogLLM and TIP_LLM still showed the magatamallm pool.
- dashboard/server were updated so /api/llm/status?lane=... is now truly lane-aware.
- the training modal now refreshes per selected lane and rewrites:
  - title
  - runtime label
  - pool path
  - counts
  - dataset source
- MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via ecosystem.config.cjs:
  - RUNPOD_DATASET_SOURCE=url
  - RUNPOD_DATASET_SOURCE_MAGATAMALLM=url
  - RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url
  - RUNPOD_DATASET_SOURCE_TIP_LLM=url
- live verified on Erik after restart:
  - fo_blogllm
    - datasetSource = url
    - collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json
    - train = 28
    - eval = 4
    - total = 32
  - tip_llm
    - datasetSource = url
    - collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json
    - train = 36
    - eval = 4
    - total = 40
  - magatamallm
    - remains on lane-export counts (15620 / 1736 / 17356)
- operator impact:
  - no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
  - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing magatamallm.
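
A minimal sketch of how a per-lane dataset source could be resolved from the environment variables listed above. The lookup order (lane-specific variable first, then `RUNPOD_DATASET_SOURCE`) and the `huggingface` default are assumptions; the handoff only lists the variables and their `url` values.

```python
import os
from typing import Mapping, Optional


def dataset_source(lane: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the dataset source for a lane such as 'tip_llm'."""
    env = os.environ if env is None else env
    lane_key = "RUNPOD_DATASET_SOURCE_" + lane.upper()
    # Assumed precedence: lane override, then global setting, then a default.
    return env.get(lane_key) or env.get("RUNPOD_DATASET_SOURCE") or "huggingface"
```

For example, `dataset_source("fo_blogllm", {"RUNPOD_DATASET_SOURCE_FO_BLOGLLM": "url"})` returns `"url"`.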
MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
- the RunPod serverless training start failure was not a RunPod outage.
- root cause was missing training scripts on Erik (training_full_refresh.ts and related helpers were absent under /opt/magatama/scripts).
- Codex synced the full local magatama/scripts/ tree to Erik, added a safe fallback in scripts/model_registry_build.ts, and synced the local training-data/model-registry/ directory.
- verified on Erik:
  - pnpm training:refresh-all now succeeds.
  - fresh dataset totals after dedupe:
    - magatamallm: 92,742 raw → 17,356 effective (15,620 train / 1,736 eval)
    - fo_blogllm: 32 total (28 train / 4 eval)
    - tip_llm: 40 total (36 train / 4 eval)
- important nuance:
  - Codex did not execute the final Hugging Face publish step from Erik in this chat.
  - local/script/build failures are fixed; external dataset publish still depends on the selected dataset source and explicit publish intent.
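
The train/eval totals above are all consistent with holding out one tenth of the effective rows for eval, rounded up. That is an observation from the reported numbers, not a confirmed description of what `training_full_refresh.ts` does; the helper name `split_counts` is hypothetical.

```python
def split_counts(total: int) -> tuple[int, int]:
    """Return (train, eval) assuming eval = ceil(total / 10)."""
    eval_count = (total + 9) // 10  # integer ceiling of total / 10
    return total - eval_count, eval_count
```

This reproduces 15,620 / 1,736 for 17,356 rows, 28 / 4 for 32 rows, and 36 / 4 for 40 rows, which makes it a quick sanity check when a lane's counts look suspicious.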
MAGATAMA Attack Paths UX is no longer a misleading blank panel:
- the page now distinguishes between:
  - no live attack paths
  - historical fallback paths
  - empty selected scope (0 assets in scope)
- when a user narrows the scope to a rack/location with zero scoped assets, the graph explicitly says so instead of looking broken.
- live dashboard HTML on Erik now contains:
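  - Im aktuellen Scope liegen 0 Assets. (English: "There are 0 assets in the current scope.")
  - Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann. (English: "Expand the location or datacenter / rack so that MAGATAMA can display correlatable assets and paths.")
  - Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer. (English: "Without open multi-stage correlations, the graph view intentionally stays empty.")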
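
The three-way distinction described for the Attack Paths page could look roughly like this. The state names and the precedence (empty scope is checked first) are assumptions made for illustration; the dashboard code itself is not shown in this handoff.

```python
def attack_path_view_state(scoped_assets: int, live_paths: int, historical_paths: int) -> str:
    """Pick an explicit view state so an empty graph never looks broken."""
    if scoped_assets == 0:
        # Scope narrowed to a rack/location that contains no assets.
        return "empty_scope"
    if live_paths > 0:
        return "live_paths"
    if historical_paths > 0:
        return "historical_fallback"
    return "no_live_paths"
```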
MAGATAMA code/training hardening was extended:
- scripts/test_runpod_adapter.py no longer loads tokenizer/model with trust_remote_code=True.
- scripts/ollama_adapter_bridge.py no longer loads tokenizer/model with trust_remote_code=True.
- this removed the live CODE finding around HuggingFace trust_remote_code on Erik.
Atlas exposure logic was tightened to stop reopening noisy LAN management findings:
- generic atlas-exposure findings now only stay operationally open for exposure that is meaningful enough to track as a finding.
- internal RFC1918 management/service ports discovered by the broad atlas scan are no longer promoted into open Guard findings just because they exist on the LAN.
- host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
- after rebuild + deploy + health sync:
  - live Postgres open findings returned to 0.
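
A simplified sketch of the RFC1918 filter described above, using Python's standard `ipaddress` module. The function names are hypothetical, and the real Atlas logic also weighs whether an exposure is meaningful enough to track, which is not modeled here.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address: str) -> bool:
    """True if the address lies in one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


def keep_as_open_finding(address: str) -> bool:
    """Assumed rule: LAN-only management ports are not promoted to open findings."""
    return not is_rfc1918(address)
```

The explicit network list is used instead of `is_private`, because `is_private` also covers loopback, link-local, and other reserved ranges.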
Follow-up hardening on the same block:
- the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
- dataset preparation now distinguishes:
  - local training:refresh-all failure
  - optional Hugging Face publish failure
  - URL-based dataset mode with no external publish required
- the training SSE flow now explicitly tells the operator whether RunPod is using:
  - Hugging Face dataset source
  - or MAGATAMA URL-bundle dataset source
- this avoids the misleading "RunPod not reachable" wording when the actual failure is in dataset preparation.
- follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
  - MAGATAMA submit logic now verifies that a RunPod job really exists under /status/{jobId} instead of trusting /run.
  - payloads were aligned more closely with the official Axolotl serverless schema:
    - model_type=AutoModelForCausalLM
    - tokenizer_type=AutoTokenizer
    - dataset split: train
    - optimizer adamw_torch_fused
  - verified full run attempt:
    - job id 9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2
    - disappeared as not_found_after_submit (404 job not found)
  - verified canary after payload fix:
    - job id a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2
    - immediately materialized as IN_QUEUE
    - then still disappeared on later reconcile as not_found_after_submit
  - current conclusion:
    - the old MAGATAMA bug is fixed.
    - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
  - operational rule:
    - do not treat submitted or a brief IN_QUEUE as proof of a usable serverless training run.
    - only trust the run once it reaches IN_PROGRESS or a durable terminal state with artifact evidence.
- follow-up training count fix on 2026-05-06 corrected the Training UI source-of-truth:
  - MAGATAMA had still shown 1097 because the dashboard was counting the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
  - dashboard now prefers training-data/runpod/magatamallm/manifest.json for the visible MagatamaLLM training count.
  - synced current lane export to Erik and restarted magatama-dashboard.
  - verified public API now returns:
    - collectedExamples = 1367
    - effectiveExamples = 1367
    - evalExamples = 152
    - totalExamples = 1519
    - newSinceLastTraining = 1367
  - if the browser still shows 1097, treat it as stale cached UI and hard reload.
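
The operational rule on RunPod job states in this entry can be expressed as a small predicate. This is a sketch under assumptions: the function name and the set of terminal states are illustrative, and `not_found_after_submit` is the MAGATAMA-side label from the notes above rather than a RunPod status.

```python
# Assumed terminal states for illustration; only COMPLETED, IN_QUEUE and
# IN_PROGRESS are named in this handoff.
DURABLE_TERMINAL_STATES = {"COMPLETED", "FAILED"}


def run_is_trustworthy(status: str, has_artifact: bool = False) -> bool:
    """Trust a run only at IN_PROGRESS or a terminal state with artifact evidence."""
    if status == "IN_PROGRESS":
        return True
    if status in DURABLE_TERMINAL_STATES:
        return has_artifact
    # "submitted", a brief IN_QUEUE, or not_found_after_submit prove nothing.
    return False
```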
MAGATAMA was repaired end-to-end to a clean operational baseline:
- live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
- open findings were reduced all the way to 0 in Postgres.
- false-positive Proxmox baseline findings were removed by teaching the audit to treat internal-only management ports and default-only rpcbind exposure as acceptable for this host.
- code scanner false positives from generated/report artifacts remain excluded.
Live MAGATAMA protection/runtime state after the 2026-05-06 remediation:
- open findings: 0
- queueExecuting: 0
- queueBlocked: 0
- queueFailed: 0
- public /api/health returns status: ok
- public /api/active-resolvers returns:
  - MAGATAMA Core: working
  - MagatamaLLM: working
  - Claude (secondary): working
  - Codex (secondary/manual): idle
  - Copilot (secondary/manual): idle
Important resolver truth fix on 2026-05-06:
- live codex_enabled=false in MAGATAMA settings was causing Codex to show as a broken resolver.
- dashboard logic was updated so disabled Codex/Copilot now show truthfully as idle with the note "In MAGATAMA settings disabled", instead of pretending there is a runtime outage.
- the local codex bridge on Erik is reachable but currently reports auth_required; do not treat that as a production outage while Codex is intentionally disabled in settings.
Remaining real operational gap after findings hit zero:
- MAGATAMA still knows more assets than it actively telemeters.
- last public protection proof showed:
  - knownAssets: 79
  - hostsWithTelemetry: 27
  - assetsWithoutTelemetry: 52
- these are currently inventory/discovery-only assets, not open findings, but they remain the next real coverage expansion area.
MAGATAMA cross-repo state from the same chat is now synced into this handoff:
- Compliance framework cards in MAGATAMA are clickable and open per-framework requirement details.
- MAGATAMA training status was corrected so "New Since Last Training" no longer falsely shows 0.
- Live verified/deduped MAGATAMA training state after the fix:
  - collectedExamples: 49
  - rawExamples: 58
  - duplicateExamples: 9
  - effectiveExamples: 49
  - newSinceLastTraining: 49
- MAGATAMA now filters training metrics to verified/trainable examples only.
- Failed/escalated MAGATAMA remediation records should go to errors.jsonl, not the main fixes.jsonl, so the next MagatamaLLM run does not train on junk.
- Gitea-backed training pool remains the default target for training writes.
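
The relationship between the counts above is rawExamples - duplicateExamples = effectiveExamples (58 - 9 = 49). A minimal dedupe sketch that would produce such counts follows; hashing the canonical JSON of each record is an assumption, since the handoff does not state which key MAGATAMA uses to detect duplicates.

```python
import hashlib
import json


def dedupe(records: list[dict]) -> tuple[list[dict], int]:
    """Return (unique_records, duplicate_count), keeping first occurrences."""
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        # Canonical JSON so key order does not create false "new" rows.
        canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique, len(records) - len(unique)
```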
MAGATAMA coverage-gap and training-integrity hardening on 2026-05-06:
- the earlier 49 medium atlas-coverage-gap findings were traced to Atlas treating inventory-only and discovery-only assets as operational protection failures.
- core logic was tightened so Atlas coverage findings now open only for managed operational assets:
  - exposure-backed assets
  - explicit non-auto owner
  - configured telemetry expectation
  - critical/high criticality
  - infrastructure metadata or managed infra device types
- loopback and passive reference/inventory assets no longer reopen noisy guard findings.
- local build succeeded, the new core dist was deployed to Erik, and the first post-deploy guard scan resolved stale findings.
- live Postgres state after deploy: open findings = 0.
- training integrity bug was fixed in packages/core/src/learning/fix-tracking.ts:
  - verified fixes now append to training-data/gitea-learning-pool/magatamallm/fixes.jsonl
  - failed/escalated/report-only runs now belong in errors.jsonl
- two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
  - atlas coverage scope hardening
  - training path integrity fix
- corpus cleanup + dedupe was executed afterward:
  - pre-dedupe backup kept locally as:
    - magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl
  - resulting verified corpus:
    - fixes.jsonl = 1,368 unique verified training rows
  - resulting failure corpus:
    - errors.jsonl = 4 tracked failed/escalated rows
  - integrity report now exists at:
    - magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json
  - latest integrity totals:
    - scanned: 1368
    - verified: 1368
    - movedToErrors: 4
    - parseErrors: 0
    - invalidVerifiedFlag: 0
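
The routing rule fixed in `fix-tracking.ts` (verified fixes to `fixes.jsonl`, failed/escalated/report-only runs to `errors.jsonl`) can be sketched in Python as below. The record fields `verified` and `outcome` and both function names are hypothetical; the real record schema is not shown in this handoff.

```python
import json
from pathlib import Path

# Outcomes that must never land in the verified training corpus.
NON_TRAINABLE_OUTCOMES = {"failed", "escalated", "report-only"}


def target_file(record: dict) -> str:
    """Choose the JSONL file for a remediation record."""
    if record.get("verified") is True and record.get("outcome") not in NON_TRAINABLE_OUTCOMES:
        return "fixes.jsonl"
    return "errors.jsonl"


def append_record(pool_dir: str, record: dict) -> Path:
    """Append one record as a JSON line to the file chosen by target_file."""
    path = Path(pool_dir) / target_file(record)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

Requiring `verified is True` rather than a truthy value means a malformed flag falls through to `errors.jsonl` instead of polluting the training corpus.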
Complete Codex chat sync was added:
- sync/history/2026-04-29-codex-complete-chat-sync.md
- captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
- confirms no secrets were written into sync.
- confirms TIP crawler/robot planning remains TIPLLM-only.
- confirms Erik remains a controller and light runner (erik-safe profile) only, with heavy crawler work assigned to Proxmox/Pi workers.
Codex sync-start confirmation was added:
- sync/history/2026-04-29-codex-sync-start-confirmation.md
- confirms Codex read this TIP handoff, checked the sibling LLM Gateway handoff, and is treating sync/ as binding.
- no code changes, crawler jobs, queue waves, PM2 restarts, or Erik load were initiated during this confirmation.
Codex follow-up on 2026-04-29 clarified the active BlogLLM model:
- TIP shows fo-blog-v7, but this is not a normal Ollama GGUF manifest.
- It is a local Adapter Bridge / Mac Studio model backed by the RunPod-trained PEFT adapter: /Users/renefichtmueller/Desktop/Claude Code/magatama/training-data/runpod/pod-runs/2026-04-25-fo-tip/final/adapters/fo_blogllm/final-adapter
- Bridge definition: /Users/renefichtmueller/Desktop/Claude Code/magatama/scripts/ollama_adapter_bridge.py
- TIP API default: packages/api/src/llm/client.ts uses OLLAMA_LLM_MODEL || "fo-blog-v7".
- fo-blog-v8 remains the next training candidate, not the currently active TIP BlogLLM model.
Full Codex session handoff was added:
- sync/history/2026-04-29-codex-full-session-handoff.md
- covers TIP verification, product image/detail crawling, Blog Engine Hot Topics, TIPLLM robots, training pool, Erik status, and cross-repo sync.
Added a verification robot controller:
- packages/scraper/src/robots/verification-robots.ts
- command: npm run robots:verification -w packages/scraper -- --status
Added TIPLLM robot experience writing:
- packages/scraper/src/crawler-llm/training-data-writer.ts
- writes raw robot audit rows and SFT records.
Added Gitea training pool import to TIP learning-pool build:
- scripts/tip-learning-pool-build.ts
- imports TIP_TRAINING_REPO/qa-pairs/*.jsonl into the tip_llm lane.
Added docs:
- docs/TIP_SELFLEARNING_WORKFLOW.md
Added package script:
- packages/scraper/package.json
- robots:verification

Gitea Training Pool

Existing local clone: /tmp/tip-training-data
Gitea repo: rene/tip-training-data
Latest pushed training commit:
- f1c83f8 crawl: add robot-status training records [2026-04-29T20:11:24.091Z]
First robot experience record was written to:
- /tmp/tip-training-data/qa-pairs/robot-control-high.jsonl
- /tmp/tip-training-data/robot-experiences/2026-04-29.jsonl

MAGATAMA Training / Operations State

Relevant local repo:
- /Users/renefichtmueller/Desktop/Claude Code/magatama
Latest confirmed live MAGATAMA findings state:
- open findings: 0 on 2026-05-06
Latest confirmed live resolver state:
- Codex and Copilot intentionally idle/disabled
- not a runtime outage, but a settings choice until gateway/bridge auth is intentionally re-enabled
Latest confirmed live MAGATAMA training metric after dashboard fix:
- newSinceLastTraining: 49
Meaning:
- the old 0 was incorrect.
- the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
Latest corpus integrity state after cleanup:
- operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
  - 1368 unique verified rows
  - 4 live failure/escalation rows in errors.jsonl
- do not confuse raw historical volume with real trainable signal.
Important training integrity rule:
- report-only or failed/escalated records must not be treated as verified training fixes.
- keep them separated from the main verified training corpus.

Erik Status

Synced TIPLLM robot/training code to /opt/tip.
Did not start crawler jobs.
Did not enqueue robot waves.
Did not restart PM2 services.
Remote scraper TypeScript build is passing after removing two stale misplaced remote-only duplicate files:
- /opt/tip/packages/scraper/src/scrapers/scheduler.ts
- /opt/tip/packages/scraper/src/vendor-discovery-crawler.ts
tip-api and tip-scraper-daemon are online.
Shared Erik note from the same chat:
- MAGATAMA dashboard/core were redeployed during compliance/training fixes.
- TIP crawler policy remains unchanged: Erik is controller/light runner only, not heavy crawl execution host.

Last Live Verification Snapshot

From 2026-04-29:

Total transceivers: 13,546
Price verified: 7,250
Image verified: 7,025
Details verified: 6,243
Fully verified: 5,812
Last price observation: 2026-04-29 19:15:53 UTC
Last stock observation: 2026-04-29 19:15:56 UTC

Latest MAGATAMA Training / RunPod Truth

Confirmed on 2026-05-06:

Lane-specific training pools are now materially separated and no longer all fall back to magatamallm.
Live Erik dashboard API now reports:
- magatamallm
  - 1367 train
  - 152 eval
  - 1519 total
  - newSinceLastTraining = 1367
- fo_blogllm
  - 17353 train
  - 1929 eval
  - 19282 total
  - newSinceLastTraining = 17353
  - active local model resolves to fo-blog-v7
- tip_llm
  - 6482 train
  - 721 eval
  - 7203 total
  - newSinceLastTraining = 6482
  - target active model is tip-llm-v1, but this model is not yet present locally in Ollama
Result:
- the previous display of 1097 for every lane was stale and wrong.
- selected lane now controls its own manifest, model label, and training counts.

Gitea-backed Pool Materialization

magatamallm Gitea pool remains canonical and populated.
fo_blogllm and tip_llm Gitea-backed pool folders were previously almost empty; they are now materialized from the local RunPod lane exports.
Lane manifests and JSONL exports now exist under:
- training-data/gitea-learning-pool/fo_blogllm/
- training-data/gitea-learning-pool/tip_llm/

RunPod Completion Hardening

MAGATAMA dashboard code now treats RunPod COMPLETED as success only after:
1. target model artifact is referenced
2. local Mac training API adopts/imports the artifact
3. lane-specific smoke tests pass
4. active Ollama alias is updated
New local adoption endpoint is:
- POST /adopt-runpod-model
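
The four success gates listed above can be modeled as an ordered checklist. This is an illustrative sketch only: the class, field, and function names are assumptions, and the actual dashboard code is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdoptionEvidence:
    artifact_referenced: bool   # 1. target model artifact is referenced
    artifact_adopted: bool      # 2. local Mac training API adopted/imported it
    smoke_tests_passed: bool    # 3. lane-specific smoke tests pass
    alias_updated: bool         # 4. active Ollama alias is updated


GATES = ("artifact_referenced", "artifact_adopted", "smoke_tests_passed", "alias_updated")


def first_unmet_gate(evidence: AdoptionEvidence) -> Optional[str]:
    """Return the first gate that is not satisfied, or None if all pass."""
    for gate in GATES:
        if not getattr(evidence, gate):
            return gate
    return None


def run_succeeded(worker_status: str, evidence: AdoptionEvidence) -> bool:
    """COMPLETED counts as success only when every gate has passed."""
    return worker_status == "COMPLETED" and first_unmet_gate(evidence) is None
```

Returning the first unmet gate gives the operator a concrete reason when a COMPLETED run is not accepted.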

Mac Training API State

The old LaunchAgent on Mac Studio was still serving the legacy training API from:
- ~/magatama-llm/service/training_api.py
It has now been upgraded in place so Erik sees the new adoption-capable API.
Verified from Erik:
- http://192.168.178.213:3214/health returns the new service
- it now exposes register_script pointing into the MAGATAMA repo
- POST /adopt-runpod-model exists and rejects unauthenticated requests with 401, proving the route is live

Still Outstanding

A fully successful end-to-end RunPod fine-tune has not yet been re-verified after the new adoption pipeline was wired in. Such a run requires:
- real worker success
- real artifact
- successful local Ollama import
- active alias switch
- smoke-test proof
tip-llm-v1 is still not installed locally in Ollama.

Pulso AI Recommendation

Keep a shared network/transceiver/switch core corpus with TIP.
Do not collapse Pulso AI into the same instruction lane as TIP_LLM.
Recommended split:
- TIP_LLM
  - research
  - crawler / scraper / robot planning
  - vendor / firmware / issue extraction
- Pulso AI
  - product responses
  - support
  - diagnostics
  - operator explanation layer

Safe Next Steps

Clone or pull Gitea origin on laptop/Claude Code.
Read this folder first.
For BlogLLM work, treat fo-blog-v7 as Adapter Bridge / PEFT adapter, not as a ~/.ollama GGUF model.
Also read llm-gateway/sync/CURRENT.md when work touches shared Erik infrastructure, LLM routing, bridges, auth, TIPLLM, or crawler orchestration.
For TIP robot/crawler planning, use TIPLLM only. Do not route this lane through external AI providers.
When training pools or model stats look suspicious, prefer verified-only counts and check whether failed/escalated rows polluted the corpus.
For MAGATAMA-adjacent work, keep writing learnings back into the Gitea-backed pool and avoid training on report-only pseudo-fixes.
If testing robots, start with dry runs only:

npm run robots:verification -w packages/scraper -- --status
npm run robots:verification -w packages/scraper -- --tipllm-plan --limit=3
npm run robots:verification -w packages/scraper -- --enqueue=details-fast-lane --profile=erik-safe --dry-run

Only dispatch real crawl work after deciding the target host:
- Erik: erik-safe, tiny batches only.
- Pi: pi-fetch.
- Proxmox: proxmox-heavy.
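
A small guard like the following could make the host decision explicit before any dispatch. The host-to-profile pairs are taken from the list above; the function name and the choice to reject unknown hosts are assumptions.

```python
# Host-to-profile mapping as listed in this handoff.
HOST_PROFILES = {
    "erik": "erik-safe",        # tiny batches only
    "pi": "pi-fetch",
    "proxmox": "proxmox-heavy",
}


def profile_for_host(host: str) -> str:
    """Return the crawl profile for a host, refusing unknown targets."""
    key = host.strip().lower()
    if key not in HOST_PROFILES:
        raise ValueError(f"no crawl profile defined for host {host!r}")
    return HOST_PROFILES[key]
```

Failing loudly on an unknown host keeps heavy crawl work from being dispatched to a default target by accident.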

Dirty Worktree Note

There are existing uncommitted changes outside sync/. Some are Codex work from this session, some appear pre-existing or from earlier Claude/Codex work. Do not blindly revert them. Review git status --short before committing broader changes.

Latest Sync Commits

6c42ca7 docs: add shared agent sync handoff
8e7c5aa docs: link llm-gateway sync handoff
Pending after this update:
- watch whether any future guard exposure findings are genuine operational issues or new false positives.
- if failures still appear inside fixes.jsonl, scrub historic pollution and backfill errors.jsonl.
