rene/transceiver-db

Rene Fichtmueller 41f5a403a5 sync: record magatama training recovery

2026-05-09 17:18:35 +02:00

Current TIP Sync State

Updated: 2026-05-09 15:14 UTC

Newest Work

MAGATAMA training pipeline recovery, TIP_LLM adoption and Mac Studio local throttle on 2026-05-09:
- operator requirement:
  - training success only counts after real artifact, local import, alias switch, smoke test and metadata write-back
  - RunPod COMPLETED alone is not sufficient
  - local Mac Studio training must not consume the whole workstation
- completed:
  - custom RunPod worker artifact renefichtmueller/magatama-tip-llm-tip-llm-2026-05-09t13-16-14 was adopted locally
  - active alias tip-llm-v1 now points to release alias tip-llm-v1-r1
  - local Ollama model tip-llm-v1 smoke-tested successfully with exact response TIP_OK
- hardened:
  - MAGATAMA train API venv dependencies installed
  - Ollama converter now falls back from HTTP API create to ollama create
  - Ollama binary path resolution fixed for service/LaunchAgent context
  - RunPod import script reuses valid GGUF artifacts and rejects stale failed conversions
  - smoke gate now supports an 80 percent minimum threshold to avoid blocking good adoptions on one brittle prompt
  - local training defaults now set nice=+10, OMP/MKL/OPENBLAS/VECLIB/NUMEXPR=4, TOKENIZERS_PARALLELISM=false, PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.70
  - full local throttle override requires explicit MAGATAMA_LOCAL_TRAIN_UNTHROTTLED=1
- source paths touched:
  - /Users/renefichtmueller/magatama-llm/service/training_api.py
  - /Users/renefichtmueller/magatama-llm/service/train.py
  - /Users/renefichtmueller/magatama-llm/service/register_runpod_ollama_model.py
  - /Users/renefichtmueller/magatama-llm/scripts/register_runpod_ollama_model.py
  - MAGATAMA repo equivalents under packages/fine-tuner/ and scripts/
  - LLM gateway converter under packages/fine-tuner/src/converter.py
- verification:
  - Python syntax checks passed
  - local train API reachable after restart
  - Ollama tags contain tip-llm-v1, tip-llm-v1-r1, and the imported candidate
  - final model smoke returned TIP_OK
- open:
  - repeat the hardened full end-to-end custom worker path for magatamallm and fo_blogllm
  - add TIP_LLM controller-policy examples: Erik light controller only; heavy crawlers on Proxmox/Pis
  - never mark training as successful unless artifact retrieval/import/smoke/adoption all pass
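
The success rule above can be sketched as a single gate. This is a minimal illustration, not the MAGATAMA implementation: the `TrainingRun` fields, `smoke_gate_ok` and `training_succeeded` are hypothetical names, and only the five required steps and the 80 percent smoke threshold come from the notes.

```python
from dataclasses import dataclass

SMOKE_MIN_PERCENT = 80  # "80 percent minimum threshold" from the notes above


@dataclass
class TrainingRun:
    # All field names are hypothetical; they mirror the operator requirement.
    remote_status: str        # e.g. RunPod "COMPLETED"; informational only
    artifact_retrieved: bool
    imported_locally: bool
    alias_switched: bool
    smoke_passed: int         # prompts that returned the expected answer
    smoke_total: int          # prompts in the smoke suite
    metadata_written: bool


def smoke_gate_ok(passed: int, total: int, min_percent: int = SMOKE_MIN_PERCENT) -> bool:
    """Pass when at least min_percent of the smoke prompts succeeded."""
    if total <= 0:
        return False  # no smoke evidence is never a pass
    return passed * 100 >= total * min_percent


def training_succeeded(run: TrainingRun) -> bool:
    """remote_status is deliberately not consulted: COMPLETED alone proves nothing."""
    return all((
        run.artifact_retrieved,
        run.imported_locally,
        run.alias_switched,
        smoke_gate_ok(run.smoke_passed, run.smoke_total),
        run.metadata_written,
    ))
```

With this shape, a run that reports `COMPLETED` remotely but has no retrieved artifact still evaluates to `False`, and one brittle prompt out of five does not block an otherwise good adoption.
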
ATGBICS Cable/AOC detail backfill on 2026-05-09:
- current ATGBICS near-complete state before pass:
  - 581 rows had price + image + product source URL but still lacked detail verification
  - 0 of those were core-complete optical rows
  - 101 had clear Cable/AOC/Copper/Twinax/Breakout hints
  - 22 had coherent/ZR/DCO/C-band hints and were left for a later source-specific coherent parser
- DB correction:
  - used deterministic length evidence from product URL / part text
  - updated 96 ATGBICS Cable/AOC rows with:
    - reach label/meters
    - cable/AOC/Copper classification
    - wavelengths=N/A for Copper/DAC/Twinax
    - source-backed details_verified
  - promoted 109 rows to fully_verified
- global result after pass:
  - details_verified=11562
  - fully_verified=10286
  - total products 17647
- health:
  - public TIP health: healthy
  - load status ok
  - memory used 13%
- truth:
  - repeated broad ATGBICS JSON runs are low-yield now
  - remaining ATGBICS gaps need targeted optical/coherent parsing, especially ZR/DCO/C-band/LAN-WDM and non-cable products missing reach/fiber
NADDOD infrastructure classification pass on 2026-05-09:
- root cause:
  - NADDOD remaining detail gaps were mostly not pluggable transceiver modules
  - examples included switches, ConnectX adapter cards, Quantum/Spectrum infrastructure and OSFP cage systems
- DB correction:
  - classified 18 NADDOD rows by source/title evidence:
    - switch/Quantum/Spectrum/ONIE/ports => Switch / Network Infrastructure
    - adapter/ConnectX => NIC / Adapter
  - used allowed data_confidence=scraped_unverified
  - added note: classified as non-transceiver infrastructure product by source/title evidence
  - marked details verified only when a source product URL existed
- result:
  - public health counters after pass:
    - details_verified=11466
    - fully_verified=10177
    - total products 17647
  - TIP health stayed healthy
  - load status ok
  - memory used 12%
- truth:
  - these rows should not be treated as 1:1 optical transceiver equivalents
  - they remain useful inventory/network infrastructure records, but need separate switch/NIC handling later
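
The keyword rules used in this pass can be sketched as a small lookup. This is an illustrative Python version, assuming the rules are evaluated in the order listed; `classify_infrastructure` and the returned dictionary keys other than `data_confidence` are hypothetical.

```python
import re
from typing import Optional

# Keyword rules from the notes; the evaluation order is an assumption.
RULES = [
    (re.compile(r"\b(?:switch|quantum|spectrum|onie|ports)\b", re.I),
     "Switch / Network Infrastructure"),
    (re.compile(r"\b(?:adapter|connectx)", re.I), "NIC / Adapter"),
]
NOTE = "classified as non-transceiver infrastructure product by source/title evidence"


def classify_infrastructure(title: str, product_url: Optional[str]) -> Optional[dict]:
    """Return classification fields, or None when the title has no infrastructure hint."""
    for pattern, category in RULES:
        if pattern.search(title):
            return {
                "category": category,
                "data_confidence": "scraped_unverified",
                "note": NOTE,
                # Details count as verified only when a source product URL exists.
                "details_verified": bool(product_url),
            }
    return None
```

Rows without any hint return `None` and stay untouched, which keeps ordinary transceiver rows out of the infrastructure categories.
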
QSFPTEK cable/AOC parser hardening and DB detail backfill on 2026-05-09:
- root cause:
  - QSFPTEK scraper parsed catalog rows but did not pass productUrl into findOrCreateScrapedTransceiver
  - generic leading cable lengths like 1m, 2m, 10m, 15m, 30m were not parsed
  - MFS/MCP AOC/DAC product families were not classified as cable/AOC products
- code hardened:
  - packages/scraper/src/scrapers/qsfptek.ts
    - parses generic m/km reach, including leading lengths
    - classifies MFS/AOC/active fiber as AOC Cable
    - classifies MCP/DAC/Copper/Twinax as Cable
    - writes productUrl into the DB upsert
    - sets Copper/DAC wavelength to N/A
    - adds safe optical family wavelength parsing for future catalog runs
- DB correction:
  - found 36 QSFPTEK rows missing details
  - 28 had deterministic leading length and source URL
  - updated those 28 with reach, cable/AOC classification and source-backed details
  - 8 additional rows became fully verified after promotion
- deployment:
  - synced patched QSFPTEK scraper to active /opt/tip
  - pnpm -C packages/scraper build passed
- truth:
  - QSFPTEK is now much closer to complete, but the remaining rows include long-reach 1G optics missing fiber/detail fields; these should be handled separately by source parsing, not guessed
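
The leading-length and family rules can be sketched as follows. The real logic lives in the TypeScript scraper `packages/scraper/src/scrapers/qsfptek.ts`; this Python version is an assumption-laden illustration, and the function names, regexes and example titles are hypothetical.

```python
import re
from typing import Optional, Tuple

# A length at the very start of the title, e.g. "10m ..." or "2km ...".
LEADING_LENGTH = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(km|m)\b", re.I)
AOC_HINTS = re.compile(r"\b(?:MFS\w*|AOC|active\s+fiber)\b", re.I)
CABLE_HINTS = re.compile(r"\b(?:MCP\w*|DAC|copper|twinax)\b", re.I)


def parse_leading_reach(title: str) -> Optional[Tuple[str, float]]:
    """Return (reach_label, reach_meters) for a leading length, else None."""
    m = LEADING_LENGTH.match(title)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2).lower()
    meters = value * 1000 if unit == "km" else value
    return f"{m.group(1)}{unit}", meters


def classify_cable(text: str) -> Optional[str]:
    """MFS/AOC/active fiber => AOC Cable; MCP/DAC/Copper/Twinax => Cable."""
    if AOC_HINTS.search(text):
        return "AOC Cable"
    if CABLE_HINTS.search(text):
        return "Cable"
    return None


def describe(title: str, part_number: str) -> dict:
    category = classify_cable(f"{part_number} {title}")
    reach = parse_leading_reach(title)
    return {
        "category": category,
        "reach_label": reach[0] if reach else None,
        "reach_meters": reach[1] if reach else None,
        # Only the Copper/DAC class is given wavelengths=N/A here.
        "wavelengths": "N/A" if category == "Cable" else None,
    }
```

A title such as `100G QSFP28 LR4 10km` yields no leading length and no cable class, so an optical row is left for source parsing instead of being guessed.
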
Copper/DAC reach/detail verification and comparable API semantics on 2026-05-09:
- purpose:
  - continue toward full TIP verification without inventing optical data
  - treat Copper/DAC/Twinax as cable products with wavelengths=N/A, not missing optical products
- DB correction:
  - found 467 Copper rows still missing reach label/meters
  - 342 had deterministic length evidence in part number or product URL
  - wrote reach_label, reach_meters, wavelengths=N/A, cable category and detail verification for those 342
  - corrected 78 ATGBICS OSFP cable rows that had been parsed as SFP
- code hardened:
  - packages/scraper/src/scrapers/atgbics.ts
    - detects OSFP before SFP
    - parses generic decimal meter/kilometer reach such as 0.5m, 1.5m, 2.5m, 30m, 2km
    - keeps Copper/DAC/Twinax/Base-T/RJ45 wavelength as N/A
  - packages/api/src/routes/transceivers.ts
    - comparable products now allow Copper/DAC/CU products to match each other with wavelengths=N/A
    - optical products still require numeric wavelength evidence and close wavelength match
- deployment:
  - synced ATGBICS scraper to active /opt/tip
  - pnpm -C packages/scraper build passed
  - synced API route to active /opt/tip
  - pnpm -C packages/api build passed
  - restarted tip-api
- result:
  - global details_verified increased from 11085 to 11425
  - global fully_verified increased from 9861 to 10170
  - Copper remaining gaps after correction:
    - missing reach label: 122
    - missing reach meters: 125
    - missing details: 158
  - selected vendor detail/fully state:
    - ATGBICS: details 7656/8269, fully 7646/8269
    - NADDOD: details 726/748, fully 726/748
    - QSFPTEK: details 165/201, fully 140/201
    - FS.COM: details 373/383, fully 300/383
    - Flexoptix: details 626/744, fully 622/744
    - GAO Tek: details 127/414, fully 2/414
- health:
  - public TIP health after restart: healthy
  - load status ok
  - memory used 13%
- truth:
  - this is real progress toward trustworthy complete data, not cosmetic flag setting
  - remaining gaps are now smaller targeted vendor/parser/source tasks; NADDOD and QSFPTEK are next high-yield targets
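
The OSFP-before-SFP fix comes down to match order, because `OSFP` and every `QSFP*` name contain `SFP` as a substring. A minimal Python sketch of that idea follows; the production code is TypeScript in `packages/scraper/src/scrapers/atgbics.ts`, and the name list, hint list and function names here are assumptions.

```python
from typing import Optional

# Specific names first: "OSFP" and every "QSFP*" must precede the bare "SFP*"
# entries because each of them contains "SFP" as a substring.
FORM_FACTORS = (
    "QSFP-DD", "OSFP", "QSFP112", "QSFP56", "QSFP28", "QSFP+", "QSFP",
    "SFP56", "SFP28", "SFP+", "SFP",
)
COPPER_HINTS = ("COPPER", "DAC", "TWINAX", "BASE-T", "RJ45")


def detect_form_factor(text: str) -> Optional[str]:
    """Return the first form factor from the ordered list found in the text."""
    upper = text.upper()
    for name in FORM_FACTORS:
        if name in upper:
            return name
    return None


def wavelength_for(text: str, parsed_wavelength: Optional[str]) -> Optional[str]:
    """Copper-class products carry wavelengths="N/A" rather than an empty value."""
    upper = text.upper()
    if any(hint in upper for hint in COPPER_HINTS):
        return "N/A"
    return parsed_wavelength
```

A naive `"SFP" in title` test is true for an OSFP cable title, which is the kind of misparse that put OSFP cable rows under SFP.
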
ATGBICS safe JSON rerun + Copper wavelength semantics on 2026-05-09:
- code hardened:
  - packages/scraper/src/scrapers/atgbics.ts
  - detects N/A wavelength for Copper/DAC/Twinax/Base-T/RJ45 products
  - detects safe optical protocol-family wavelengths:
    - CWDM4 => 1271,1291,1311,1331
    - SR/SR4/SR8/SRBD/VR/ESR/CSR => 850
    - DR/FR/LR/ER/PSM family => 1310
- deployment:
  - synced patched ATGBICS scraper source to active /opt/tip
  - pnpm -C packages/scraper build passed on Erik
- runtime:
  - ran one light ATGBICS Shopify products.json pass with nice -n 10
  - no Playwright/browser crawler
  - processed 7946 products
  - price updates 61
  - image observations/updates 7943
- observation:
  - ATGBICS verification counters did not move because the remaining highspeed wavelength gaps are mostly rows whose source keys are cable/coherent/variant cases that the current lightweight parser does not resolve
  - sample remaining rows include QSFP-DD ZR/C-band/coherent products and Copper/DAC rows
- DB truth correction:
  - Copper/DAC products do not have an optical wavelength and should not be counted as missing optical wavelength
  - set empty Copper wavelengths to N/A for 1044 rows
  - highspeed missing-wavelength count changed:
    - before Copper correction: 1908
    - after Copper correction: 1360
    - highspeed Copper missing: 0
    - remaining optical/non-Copper highspeed missing: 1220
- health:
  - public TIP health after run/update: healthy
  - load status ok
  - memory used 14%
- truth:
  - the ATGBICS JSON run was safe and confirmed current prices/images, but did not materially improve ATGBICS technical completeness yet
  - next ATGBICS work should be a targeted parser for product URL slug classes: ZR, DCO, C-band, LAN-WDM, CR8, breakout, and OSFP/QSFP-DD cable form-factor correction
DB-only highspeed wavelength evidence backfill on 2026-05-09:
- purpose:
  - improve product-level technical completeness and future 1:1 comparison quality without running a browser crawler on Erik
- method:
  - only used existing DB evidence from part numbers, standard names, notes and product URLs
  - only filled wavelengths when evidence was deterministic:
    - explicit 850nm, 1310nm, 1311nm, or 1550nm
    - MMF plus SR/SR4/SR8/SRBD/VR/ESR/CSR family => 850
    - SMF plus DR/FR/LR/ER/PSM family => 1310
    - SMF plus CWDM4 => 1271,1291,1311,1331
  - skipped ambiguous highspeed rows instead of inventing data
- updated rows:
  - 129 rows set to 1310
  - 40 rows set to 850
  - 18 rows set to 1271,1291,1311,1331
  - total updated: 187
- highspeed wavelength gap after update:
  - highspeed rows: 4438
  - still missing wavelengths: 1908
  - largest remaining gaps:
    - ATGBICS 663
    - NADDOD 419
    - Flexoptix 183
    - Eoptolink 141
    - FS.COM 114
    - QSFPTEK 97
- health:
  - public TIP health after update: healthy
  - load status ok
  - memory used 13%
- truth:
  - this was an evidence backfill, not a claim of full source verification
  - remaining wavelength gaps need vendor-specific parsers/crawlers or stronger source text
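
The deterministic rules from the method list can be sketched as one function that returns a value only when the evidence is unambiguous. This is an illustration under assumptions: `infer_wavelengths` and the regexes are hypothetical, and family tokens are matched in uppercase only so that lowercase URL fragments such as a `/fr/` locale segment do not count as an FR family hint.

```python
import re
from typing import Optional

EXPLICIT_NM = re.compile(r"\b(850|1310|1311|1550)\s*nm\b", re.I)
# Family tokens must not be glued to other letters ("FIBER" is not "ER").
MMF_FAMILY = re.compile(r"(?<![A-Za-z])(?:SRBD|ESR|CSR|SR|VR)\d*(?![A-Za-z])")
SMF_FAMILY = re.compile(r"(?<![A-Za-z])(?:DR|FR|LR|ER|PSM)\d*(?![A-Za-z])")
CWDM4 = re.compile(r"(?<![A-Za-z])CWDM4(?![A-Za-z0-9])")


def infer_wavelengths(evidence: str, fiber_type: Optional[str]) -> Optional[str]:
    """Return a wavelength string only for deterministic evidence, else None."""
    explicit = EXPLICIT_NM.search(evidence)
    if explicit:
        return explicit.group(1)
    fiber = (fiber_type or "").upper()
    if fiber == "SMF" and CWDM4.search(evidence):
        return "1271,1291,1311,1331"
    if fiber == "MMF" and MMF_FAMILY.search(evidence):
        return "850"
    if fiber == "SMF" and SMF_FAMILY.search(evidence):
        return "1310"
    return None  # ambiguous rows are skipped instead of invented
```

A family hint without matching fiber evidence returns `None`, which mirrors the rule of skipping ambiguous highspeed rows.
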
Strict active equivalence sweep + reach-meter backfill on 2026-05-09:
- follow-up after the FS.com QDD-2FR4-800G false-comparable correction
- audited all active approved/auto_approved equivalence matches for hard 1:1 risks:
  - breakout/AOC/DAC/cable class mismatch
  - known reach mismatch
  - known fiber mismatch
  - primary wavelength mismatch
  - missing core evidence on active matches
- found and rejected 16 active false positives:
  - Flexoptix 400G/100G pluggable optics that were matched to ATGBICS AOC/breakout products
  - Flexoptix Q.851HG.03 300m MMF incorrectly matched to 70m and 40km NADDOD rows
  - Flexoptix Q.854HG.01.P 100m MMF incorrectly matched to a 1m NADDOD row
- global reach-meter backfill:
  - 269 rows with km reach labels received numeric reach_meters
  - 131 rows with m reach labels received numeric reach_meters
  - remaining reach labels without meters are only N/A accessory/control rows, not distance products
- post-sweep active match risk counts:
  - active approved/auto_approved matches: 34051
  - breakout-class mismatches: 0
  - reach mismatches: 0
  - fiber mismatches: 0
  - wavelength mismatches: 0
  - missing core evidence: 0
- live counters after sweep:
  - equivalence queue: pending=0, approved=1987, auto_approved=32064, rejected=148382, due_research=0
  - product verification: total 17647, price 11557, image 11963, details 11085, fully 9861
- truth:
  - active equivalence matches now have no known hard 1:1 mismatches by DB evidence
  - this still does not mean every product row is fully enriched; remaining work is product-level vendor enrichment and source capture
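
The reach-meter backfill amounts to converting a `km` or `m` label into a number wherever `reach_meters` is empty or zero. A minimal sketch, assuming rows are plain dictionaries with `reach_label` and `reach_meters` keys; the function names are hypothetical and the real backfill ran against the database.

```python
import re
from typing import Optional

REACH_LABEL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(km|m)\s*$", re.I)


def reach_label_to_meters(label: Optional[str]) -> Optional[float]:
    """Convert labels such as "2km" or "300m" to meters; None for N/A or free text."""
    m = REACH_LABEL.match(label or "")
    if not m:
        return None
    value = float(m.group(1))
    return value * 1000 if m.group(2).lower() == "km" else value


def backfill_reach_meters(rows: list) -> int:
    """Fill reach_meters where it is missing or 0; return the number of updated rows."""
    updated = 0
    for row in rows:
        if row.get("reach_meters"):
            continue  # already has hard numeric evidence
        meters = reach_label_to_meters(row.get("reach_label"))
        if meters is not None:
            row["reach_meters"] = meters
            updated += 1
    return updated
```

Labels such as `N/A` do not parse and are left alone, consistent with accessory/control rows keeping no distance value.
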
FS.com QDD-2FR4-800G false comparable correction on 2026-05-09:
- operator spotted that the dashboard showed invalid comparable products for FS.com QDD-2FR4-800G
- wrong examples:
  - Flexoptix DQ.2A858HG.z: actually 800G QSFP-DD to 2x QSFP112 Breakout AOC, MMF, 1-30m, not a 2km SMF FR4 transceiver
  - NADDOD QDD-800LPO-2DR4: 500m, not 2km
- root cause:
  - FS.com QDD-2FR4-800G had reach_label=2km but reach_meters=0
  - API comparable-product SQL treated unknown reach as a wildcard, so non-1:1 products leaked into the dashboard comparison section
- live DB correction:
  - QDD-2FR4-800G
    - form_factor=QSFP-DD
    - speed=800G
    - speed_gbps=800
    - reach_label=2km
    - reach_meters=2000
    - fiber_type=SMF
    - wavelengths=1310
    - standard_name=800G QSFP-DD 2FR4
    - remains fully verified
- API correction:
  - packages/api/src/routes/transceivers.ts
    - comparable products now require hard reach evidence on both sides
    - reach ratio must be at least 0.85
    - fiber type must match exactly
    - primary wavelength must exist on both sides and be within 15nm
    - breakout/AOC/DAC/cable products can only compare to other breakout/AOC/DAC/cable products
    - QSFP-DD and QSFP-DD800 are treated as same form-factor family for 800G-class comparisons
- deployment:
  - copied API route to Erik
  - pnpm -C packages/api build passed on Erik
  - pm2 restart tip-api completed, tip-api online
- health:
  - public TIP health after restart: healthy, load ok, memory 13%
- truth:
  - DQ.2A858HG.z must never be shown as 1:1 comparable for QDD-2FR4-800G
  - a 500m NADDOD LPO/2DR4 product must not be shown as 2km comparable
  - unknown reach must never act as wildcard in final product comparison
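
The corrected comparable rules can be sketched as a predicate. The production rule is SQL/TypeScript in `packages/api/src/routes/transceivers.ts`; this Python version is an assumption-based illustration: the `Product` fields, `comparable` and the sample values other than the reach figures are hypothetical, and the form-factor and speed checks are taken from the hard criteria listed elsewhere in these notes.

```python
from dataclasses import dataclass
from typing import Optional

MIN_REACH_RATIO = 0.85
MAX_WAVELENGTH_DELTA_NM = 15.0


@dataclass
class Product:
    form_factor: str
    speed_gbps: int
    reach_meters: Optional[float]
    fiber_type: Optional[str]
    wavelengths: Optional[str]
    is_cable_class: bool = False  # breakout/AOC/DAC/cable


def family(form_factor: str) -> str:
    # QSFP-DD and QSFP-DD800 count as one form-factor family.
    return "QSFP-DD" if form_factor in {"QSFP-DD", "QSFP-DD800"} else form_factor


def primary_wavelength(wavelengths: Optional[str]) -> Optional[float]:
    """First numeric wavelength in a comma-separated list, else None."""
    if not wavelengths:
        return None
    try:
        return float(wavelengths.split(",")[0].strip())
    except ValueError:
        return None


def comparable(a: Product, b: Product) -> bool:
    if family(a.form_factor) != family(b.form_factor) or a.speed_gbps != b.speed_gbps:
        return False
    if a.is_cable_class != b.is_cable_class:
        return False
    if not a.reach_meters or not b.reach_meters:
        return False  # unknown reach (None or 0) is never a wildcard
    low, high = sorted((a.reach_meters, b.reach_meters))
    if low / high < MIN_REACH_RATIO:
        return False
    if not a.fiber_type or a.fiber_type != b.fiber_type:
        return False
    wa, wb = primary_wavelength(a.wavelengths), primary_wavelength(b.wavelengths)
    if wa is None or wb is None:
        return False
    return abs(wa - wb) <= MAX_WAVELENGTH_DELTA_NM
```

Under these rules a 500m product against a 2km product fails on the reach ratio (0.25), and a row with `reach_meters=0` matches nothing until its reach is corrected.
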
FS.com 1.6T DR8/2FR4 source correction on 2026-05-09:
- operator spotted that FS.com has two distinct 1.6T OSFP variants on the same family:
  - OSFP-DR8-1.6T-FL: 500m, DR8, SMF
  - OSFP-2FR4-1.6T-FL: 2km, 2FR4, SMF
- confirmed in TIP DB:
  - both FS.com variants exist as separate rows
  - OSFP-2FR4-1.6T-FL had reach_meters=0 even though the source and row label said 2km
  - OSFP-DR8-1.6T-FL had no wavelength, causing the deterministic equivalence worker to reject the otherwise correct 500m Flexoptix match
- live DB correction:
  - OSFP-DR8-1.6T-FL
    - speed=1.6T
    - speed_gbps=1600
    - reach_label=500m
    - reach_meters=500
    - fiber_type=SMF
    - wavelengths=1310
    - standard_name=1.6T OSFP DR8
    - fully verified remains true
  - OSFP-2FR4-1.6T-FL
    - speed=1.6T
    - speed_gbps=1600
    - reach_label=2km
    - reach_meters=2000
    - fiber_type=SMF
    - wavelengths=1310
    - standard_name=1.6T OSFP 2FR4
    - fully verified true
  - Flexoptix O.1316T.C.05.M
    - confirmed as 500m, SMF, 1.6T
    - standard_name=1.6T OSFP DR8
- equivalence correction:
  - approved only O.1316T.C.05.M ↔ OSFP-DR8-1.6T-FL
  - confidence 0.913
  - match basis: form factor, speed, reach, fiber, wavelength and source variant DR8/500m
  - OSFP-2FR4-1.6T-FL remains separate and is not linked to the 500m DR8 Flexoptix product
- scraper hardening:
  - packages/scraper/src/scrapers/fs-com.ts
    - recognizes German/decimal 1,6T and 1600G as 1.6T/1600
    - converts reach labels such as 2km into reach_meters=2000
    - updates stale speed labels when the numeric source speed matches the row
- build:
  - pnpm -C packages/scraper build passed on Erik
- truth:
  - there are definitely two separate FS.com variants
  - 500m DR8 is the correct equivalent for Flexoptix O.1316T.C.05.M
  - 2km FR4 is a separate DB product and must not be collapsed into the 500m match
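
The speed normalization for German decimal labels can be sketched like this. It is a Python illustration of what the notes describe for `packages/scraper/src/scrapers/fs-com.ts`; `normalize_speed`, the regex and the choice to label values of 1000 Gbps and above in `T` are assumptions.

```python
import re
from typing import Optional, Tuple

# A number (dot or comma decimal) followed by T or G, not glued to a preceding token.
SPEED = re.compile(r"(?<![\w.,])(\d+(?:[.,]\d+)?)\s*(T|G)\b", re.I)


def normalize_speed(text: str) -> Optional[Tuple[str, int]]:
    """Return (speed_label, speed_gbps), e.g. ("1.6T", 1600), or None."""
    m = SPEED.search(text)
    if not m:
        return None
    value = float(m.group(1).replace(",", "."))  # "1,6" is a German decimal
    gbps = round(value * 1000) if m.group(2).upper() == "T" else round(value)
    label = f"{gbps / 1000:g}T" if gbps >= 1000 else f"{gbps}G"
    return label, gbps
```

Both `1,6T` and `1600G` normalize to the same pair, so a stale speed label can be updated when the numeric source speed matches the row.
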
Targeted vendor verification push after equivalence revalidation on 2026-05-09:
- code improved:
  - NADDOD_DB_DETAIL_ONLY=1 mode verifies existing NADDOD rows with source URLs instead of rotating blindly through the full sitemap
  - NADDOD now extracts og:image, source product URLs, reach/fiber/wavelength from page evidence, AOC/DAC cable lengths, and DR/FR/SR/VR/XDR patterns
  - GAO Tek now writes product URLs and image evidence
  - Ascent Optics now writes product URLs and table image evidence
  - Eoptolink now writes product URLs, images, reach/wavelength evidence and corrects over-broad form-factor parsing by preferring title/slug evidence
- live low-load Erik runs:
  - GAO Tek static crawl:
    - 473 unique products processed
    - GAO Tek detail coverage improved from 41 to 126
    - no_url dropped to 0
  - Ascent Optics static/API crawl:
    - 253 catalog products processed
    - image coverage 235/305
    - detail coverage 213/305
  - Eoptolink static crawl:
    - 76 product-solution pages inspected
    - after parser correction, Eoptolink is 287/287 image and detail verified
  - NADDOD targeted DB-detail mode:
    - first targeted wave 200 pages
    - second wave 300 pages
    - closure wave 385 pages
    - special-case wave 83 pages
    - NADDOD moved from image=12, details=157, fully=0 or 1 (approximate) to:
      - total 748
      - price 744
      - image 742
      - details 659
      - competitor 744
      - fully 659
      - no URL 6
- global TIP counters after this push:
  - price verified 11557
  - image verified 11963
  - details verified 11018
  - fully verified 9794
  - total transceivers 17647
- health:
  - TIP stayed healthy
  - load status ok
  - memory used about 13%
- truth:
  - NADDOD is not 100% complete; remaining detail gaps include likely non-transceiver switch/NIC products and a smaller set of parser-special cases
  - OEM catalogs like Ascent and Eoptolink do not publish retail prices, so full verification cannot be forced honestly without price evidence
Immediate full TIP equivalence revalidation on 2026-05-09:
- operator requested all open TIP validation to be completed immediately and all product matches checked for true 1:1 equivalence
- live preflight:
  - equivalence queue: pending=0, approved=1986, auto_approved=32080, rejected=148367, due_research=0
  - active matches scheduled for future 30-day recheck: 34066
  - strict DB preflight over all active matches found:
    - matches missing recent price evidence: 0
    - hard technical mismatches: 0
    - missing critical 1:1 evidence: 0
  - hard criteria checked: form factor, speed, fiber type, reach ratio, primary wavelength and recent competitor price evidence
- action:
  - marked all 34066 active approved/auto_approved equivalences as due immediately
  - queued 18 existing PgBoss maintenance:re-research-equivalences jobs
  - used the existing DB-only TIP re-research worker; no browser crawler wave and no external AI
- result:
  - all 18/18 jobs completed
  - due_research=0
  - active_researched_today=34066
  - no automated-research rejections in this immediate pass
  - final equivalence queue: pending=0, approved=1986, auto_approved=32080, rejected=148367
  - transceiver verification counters after the pass:
    - competitor_verified=11470
    - price_verified=11557
    - image_verified=10711
    - details_verified=9929
    - fully_verified=9135
    - total transceivers 17647
- TIP health after run:
  - status healthy
  - load status ok
  - memory used 13%
  - API/DB connected
- truth:
  - the manual equivalence queue is empty and all active matches have just been rechecked by deterministic 1:1 evidence rules
  - this does not mean every product row in TIP is complete; largest product verification gaps remain vendor-specific crawler/enrichment work, especially ATGBICS, NADDOD, GAO Tek, Juniper/Cisco, Ascent/Eoptolink and other vendor/catalog rows
Crawlee integration/binding on 2026-05-09:
- operator asked to install, use and bind Crawlee/Crawlee-Python after priority evaluation
- pushed TIP commits:
  - 60531b6 feat: add crawlee python worker integration
  - 49f0871 chore: ignore crawlee python build artifacts
- TypeScript TIP core remains the production crawler core using crawlee and Playwright
- added scraper scripts:
  - pnpm -C packages/scraper scrape:fs:db-detail
  - pnpm -C packages/scraper scrape:fs:url-discovery
- added optional isolated Python worker:
  - packages/crawlee-python/
  - scripts/setup-crawlee-python-worker.sh
  - docs/TIP_CRAWLEE_RUNTIME.md
- Python worker policy:
  - Crawlee-Python is for Pi/Proxmox/residential side workers and extraction experiments
  - writes JSONL evidence only
  - no direct DB writes
  - no replacement for the TypeScript TIP scraper core
- smoke test:
  - installed crawlee==1.6.3 into /tmp/tip-crawlee-python-venv
  - ran tip_crawlee_worker against https://crawlee.dev
  - JSONL evidence output succeeded
Priority Crawlee evaluation + FS.com URL discovery on 2026-05-09:
- operator asked whether these repos help:
  - https://github.com/apify/crawlee
  - https://github.com/apify/crawlee-python
  - https://github.com/hiteshchoudhary/crawlee-project
- evaluation:
  - apify/crawlee is directly relevant and already in use in TIP via TypeScript PlaywrightCrawler
  - the benefit for TIP today comes not from adding Crawlee but from using it more deliberately:
    - bounded RequestQueues
    - stable uniqueKey
    - explicit retry/no-text classes
    - isolated storage directories
    - AutoscaledPool telemetry as safety signal
    - hard concurrency caps on Erik
  - apify/crawlee-python is useful for future isolated Pi/Proxmox workers, especially for Python-native extraction experiments, but should not replace the current TypeScript scraper core today
  - hiteshchoudhary/crawlee-project is a small community/demo project, useful as inspiration only; not a production dependency for TIP
- code improved:
  - packages/scraper/src/scrapers/fs-com.ts
    - added FS_URL_DISCOVERY_ONLY=1
    - maps existing FS-<numeric-id> rows without product_page_url to https://www.fs.com/de/products/<id>.html
    - carries targetTransceiverId through the crawler so verified source evidence updates the original row instead of creating duplicates
    - marks current FS.com product images verified for target rows
    - accepts deterministic H1/part/spec evidence for detail verification when FS.com does not expose a traditional spec table
- live runs on Erik:
  - URL discovery pilot:
    - target 20
    - scraped 19
    - failed 0
    - no-url rows dropped from 76 to 57
  - full URL discovery:
    - target 56
    - scraped 55
    - failed 1 (https://www.fs.com/de/products/229461.html, transient ERR_NETWORK_CHANGED)
    - no-url rows dropped to 2
  - DB reconciliation with improved detail evidence:
    - target 57
    - scraped 55
    - failed 0
    - new prices 41
    - stock observations 40
    - specs verified 55
  - pnpm -C packages/scraper build passed on Erik after the code change
- FS.com final state after URL discovery:
  - total rows: 383
  - price verified: 379
  - image verified: 374
  - details verified: 373
  - price+image+details: 373
  - fully verified: 205
  - missing URL: 2
  - missing image URL: 9
  - missing reach label: 4
  - missing fiber type: 9
  - HTML product-like rows:
    - total 373
    - image 372
    - details 371
    - complete 371
  - no-url rows:
    - Change
    - FS-229461
  - category rows: 4
- TIP health after run:
  - status healthy
  - load status ok
  - memory used 13%
  - global verified counters:
    - price 11557
    - image 10711
    - details 9929
    - fully 8526
- training pool:
  - pushed 4d9a11c crawl: add fscom url discovery learning record
- truth:
  - FS.com is still not 100% complete
  - honest current claim: 371/373 HTML product-like rows complete; remaining work is small and classifiable
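
The URL discovery mapping can be sketched as follows. The URL pattern and `targetTransceiverId` come from the notes; the Python function names and the row dictionary keys `id` and `part_number` are hypothetical, and the production code is TypeScript.

```python
import re
from typing import Iterator, Optional

FS_ID = re.compile(r"^FS-(\d+)$")


def fs_product_url(part_number: str) -> Optional[str]:
    """Map FS-<numeric-id> to its product page; anything else has no derivable URL."""
    m = FS_ID.match(part_number.strip())
    if not m:
        return None
    return f"https://www.fs.com/de/products/{m.group(1)}.html"


def discovery_targets(rows: list) -> Iterator[dict]:
    """Yield crawl requests for rows that lack product_page_url.

    targetTransceiverId travels with the request so evidence updates the
    original row instead of creating a duplicate.
    """
    for row in rows:
        if row.get("product_page_url"):
            continue
        url = fs_product_url(row["part_number"])
        if url:
            yield {"url": url, "targetTransceiverId": row["id"]}
```

A row whose part number is not of the `FS-<numeric-id>` form produces no request, so it stays a no-url row until it is classified by other means.
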
TIP FS.com / Fiberstore targeted verification push on 2026-05-09:
- operator requested FS.com/Fiberstore next, with all crawler/scraper/robot learnings written to the TIPLLM training pool and no external AI
- code improved:
  - packages/scraper/src/scrapers/fs-com.ts
    - added FS_DB_DETAIL_ONLY=1 mode to revalidate existing FS.COM product URLs directly from DB
    - avoids broad category/listing discovery while product URLs still need verification
    - detectReach() now handles comma thousands and decimal values
    - added deterministic detectFiberType() fallback from product name, part number and specs
    - scraper now writes productUrl into the transceiver row
    - detail verification source is now the actual FS.com product URL instead of the literal fs.com
- live Erik verification:
  - deployed scraper to /opt/tip
  - pnpm -C packages/scraper build passed on Erik after the change
  - ran four safe DB-detail-only Playwright batches:
    - batch 1: target 80, scraped 80, failed 0, new prices 17, stock 18, specs 24
    - batch 2: target 80, scraped 79, failed 0, new prices 6, stock 8, specs 23
    - batch 3: target 90, scraped 89, failed 0, new prices 21, stock 24, specs 47
    - batch 4 closure: target 42, scraped 42, failed 0, new prices 5, stock 3, specs 25
  - all runs used Playwright concurrency 1, nice -n 10, and no broad category crawl
  - Erik/TIP health after closure:
    - status: healthy
    - load status: ok
    - memory used: 13%
    - transceivers: 17647
    - vendors: 478
    - switches: 680
    - global verified counters:
      - price: 11557
      - image: 10636
      - details: 9816
      - fully: 8522
- FS.com before targeted detail batches:
  - total rows: 383
  - price verified: 379
  - image verified: 299
  - details verified: 108
  - price+image+details: 108
  - fully verified: 3
  - missing product URL: 76
  - missing image URL: 84
  - missing reach label: 9
  - missing fiber type: 323
  - HTML product-like complete rows: 106
- FS.com after closure:
  - total rows: 383
  - price verified: 379
  - image verified: 299
  - details verified: 260
  - price+image+details: 260
  - fully verified: 205
  - missing product URL: 76
  - missing image URL: 84
  - missing reach label: 9
  - missing fiber type: 123
  - HTML product-like rows:
    - total 299
    - price 299
    - image 282
    - details 258
    - complete 258
  - no-url rows:
    - total 76
    - price 76
    - image 15
    - details 0
  - category rows:
    - total 4
    - no verified signals
- interpretation / next strategy:
  - the DB-detail-only approach is now mostly exhausted
  - the fourth clean closure batch did not raise details_verified; it only nudged fully_verified from 199 to 205
  - do not keep repeating the same FS.com detail crawler on Erik
  - next FS.com work should be:
    - source-discovery/classification robot for the 76 no-url rows
    - parser/source diagnostics for the remaining 41 HTML product-like rows missing detail/fiber/image signals
    - likely separate handling for malformed or historical /de/de/products/... URLs and pages that return no useful text
- TIPLLM training pool:
  - all four FS.com batches were written and pushed to Gitea
  - latest training commits:
    - 28cac05 batch 1
    - a0a6be3 batch 2
    - 38736ae batch 3
    - 2c25bf3 closure batch
- important truth:
  - do not claim FS.com is complete
  - the honest current claim is: FS.com product-like coverage improved strongly, but 258/299 HTML product-like rows are complete and 76 no-url rows still need source discovery/classification
TIP Flexoptix completion push on 2026-05-09:
- operator said "feuer frei" (German for "fire at will") after confirming Flexoptix was not yet complete
- TIPLLM training pool was updated immediately with the truth rule:
  - not all Flexoptix products are complete
  - active catalog coverage must be separated from historical/extra DB rows
  - never claim 100% verification without exact counters and fresh source timestamps
- code improved:
  - packages/scraper/src/scrapers/flexoptix-catalog.ts
    - generic reach parsing now handles values such as 50 m, 1,000 m, decimal/range forms
    - wavelength parsing now handles multiple λ... nm values
    - product URL is now passed into findOrCreateScrapedTransceiver
  - packages/scraper/src/scrapers/flexoptix-detail-pages.ts
    - new targeted Flexoptix detail-page verifier
    - fetches only Flexoptix .html product pages with missing price/image/detail fields
    - parses static product page metadata:
      - title
      - description
      - og:image
      - product:price:amount
      - reach
      - fiber type
      - wavelengths
      - connector
      - standard name
    - writes only DB evidence from Flexoptix pages, no external AI
- live run results on Erik:
  - pnpm -C packages/scraper build passed
  - improved catalog run completed:
    - Total unique products after GraphQL: 615
    - Flexoptix Catalog Complete: 615 products, 0 prices
  - details improved from:
    - details_verified: 500
    - price+image+details: 496
    - fully_verified: 496
  - after catalog parser improvement:
    - details_verified: 606
    - price+image+details: 602
    - fully_verified: 602
  - detail verifier run:
    - target: 191 real .html product pages
    - fetched: 191
    - failed: 0
    - new/updated price observations: 177
    - images marked: 187
    - details marked: 185
  - after detail verifier and explicit BiDi correction:
    - total Flexoptix rows: 744
    - HTML product-like rows: 626
    - price verified: 626
    - image verified: 622
    - details verified: 626
    - price+image+details verified: 622
    - fully verified: 620
    - filter/category rows with no verification: 108
    - other non-product/generic rows with no verification: 10
- manual evidence correction:
  - four BiDi SFP products had 1,000 m in the Flexoptix title
  - updated from source evidence:
    - S.B1312.M.DIL
    - S.B1312.M.DL
    - S.B1512.M.DIL
    - S.B1512.M.DL
  - set:
    - reach_label=1000m
    - reach_meters=1000
    - fiber_type=MMF
    - details_verified=true
- remaining truth:
  - active/product-like Flexoptix rows are much closer to complete
  - not all 744 Flexoptix rows can honestly be 100% verified because 118 are filter/category/generic/non-product URLs rather than concrete product pages
  - remaining HTML product-like gaps after final source check:
    - 4 product-like rows without image verification because Flexoptix exposes only placeholder-flexoptix.jpg as og:image
    - 2 FLEXBOX/accessory-like rows were classified as Accessory, reach_label=N/A, details_verified=true
- operational note:
  - Erik SSH became unavailable with connection refused after the last verification checks
  - public TIP HTTPS still responded through Cloudflare
  - no further live commands were started after SSH refused
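
Reading static page metadata such as `og:image` and `product:price:amount` needs no browser. The sketch below uses only the Python standard library and is not the verifier in `packages/scraper/src/scrapers/flexoptix-detail-pages.ts`; `MetaParser`, `extract_evidence` and the result keys are hypothetical, while the placeholder file name comes from the notes.

```python
from decimal import Decimal, InvalidOperation
from html.parser import HTMLParser
from typing import Optional

PLACEHOLDER_IMAGE = "placeholder-flexoptix.jpg"


class MetaParser(HTMLParser):
    """Collect <meta property=...> / <meta name=...> content values."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            self.meta.setdefault(key, content)  # keep the first occurrence


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    try:
        return Decimal(raw) if raw else None
    except InvalidOperation:
        return None


def extract_evidence(html: str) -> dict:
    parser = MetaParser()
    parser.feed(html)
    image = parser.meta.get("og:image")
    return {
        "image_url": image,
        # A placeholder og:image is not image evidence.
        "image_verified": bool(image) and not image.endswith(PLACEHOLDER_IMAGE),
        "price": parse_price(parser.meta.get("product:price:amount")),
    }
```

Treating the placeholder as missing evidence matches the remaining gap noted above, where product-like rows stay without image verification when only the placeholder is exposed.
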
TIP Flexoptix price truth recheck on 2026-05-09:
- operator question:
  - are all Flexoptix prices, images and information present
  - are the Flexoptix prices 100% correct
- live truth:
  - total Flexoptix rows in TIP: 744
  - current Flexoptix catalog scraper finds: 615 active catalog products
  - price verified rows: 619
  - latest verified price observations: 615
  - image verified rows: 615
  - details verified rows: 500
  - price + image + details verified: 496
  - fully verified: 496
  - missing image URL: 129
  - missing reach label: 244
  - missing fiber type: 131
- important interpretation:
  - current active Flexoptix catalog price set is freshly rechecked
  - the full historical/extra Flexoptix table is not complete
  - therefore do not claim all 744 Flexoptix rows are complete
- code fix:
  - packages/scraper/src/utils/db.ts
  - unchanged price observations now refresh price_observations.verified_at = NOW()
  - unchanged product prices now refresh transceivers.price_verified_at = NOW()
  - this makes live rechecks auditable instead of leaving the old verification timestamp in place
- live recheck:
  - deployed db.ts to Erik
  - pnpm -C packages/scraper build passed
  - ran light Flexoptix catalog scraper on Erik with nice -n 10
  - result:
    - Total unique products after GraphQL: 615
    - Flexoptix Catalog Complete: 615 products, 0 prices
  - 0 prices means no changed price rows were inserted because content hashes matched
  - after timestamp fix, DB shows 615 latest verified Flexoptix price observations with verified_at in the last 10 minutes
- honest answer:
  - 615 active catalog prices are freshly source-confirmed by the Flexoptix scraper
  - no claim should be made that all 744 Flexoptix DB rows have complete price/image/detail coverage
  - no system should promise absolute 100% price truth forever because live vendor prices can change and may vary by account/currency/VAT/session; TIP should display last-source-verified timestamp
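
The timestamp refresh described under "code fix" above can be sketched as follows. This is a minimal Python illustration against an in-memory SQLite table, not the actual db.ts code. The table layout, the content_hash column and the helpers record_price and make_db are assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timezone


def make_db():
    """Create a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE price_observations ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, product_id TEXT, "
        "price REAL, content_hash TEXT, verified_at TEXT)"
    )
    return conn


def record_price(conn, product_id, price, content_hash, now=None):
    """Insert a row when the hash changed, otherwise refresh verified_at.

    Returns "inserted" or "refreshed". Hypothetical helper, not the db.ts API.
    """
    now = now or datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT id, content_hash FROM price_observations "
        "WHERE product_id = ? ORDER BY id DESC LIMIT 1",
        (product_id,),
    ).fetchone()
    if row is not None and row[1] == content_hash:
        # Unchanged price: no new row, but the recheck becomes auditable.
        conn.execute(
            "UPDATE price_observations SET verified_at = ? WHERE id = ?",
            (now, row[0]),
        )
        return "refreshed"
    conn.execute(
        "INSERT INTO price_observations "
        "(product_id, price, content_hash, verified_at) VALUES (?, ?, ?, ?)",
        (product_id, price, content_hash, now),
    )
    return "inserted"
```

Under these assumptions a recheck with matching hashes inserts nothing and only moves verified_at forward, which is consistent with a run that reports 0 prices while the verification timestamps are fresh.
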
MAGATAMA Atlas rematerialization / anti-auto-resolve hardening completed live on 2026-05-09:
- operator problem:
  - Atlas / Findings / Protection Proof had become dishonest again
  - raw files on Erik still contained:
    - 3 host audits
    - 32 live Atlas scan devices
  - but open findings had collapsed back to 0
  - Atlas UI therefore showed an implausibly clean state
- verified root cause:
  - packages/core/src/routes/health-builders.ts
    - buildProtectionProofResponse() read Atlas audits/snapshot but did not resync findings from those raw sources
  - packages/core/src/scheduler.ts
    - generic guard stale-auto-resolve treated Atlas-managed findings like ordinary scan findings
    - newly rematerialized Atlas findings were therefore cleared again almost immediately
- code fixed:
  - packages/core/src/routes/health-builders.ts
    - added readAtlasSnapshot()
    - added syncAtlasAuditFindings(...) + syncAtlasExposureFindings(...) via a new syncAtlasOperationalFindings(...) step
    - buildProtectionProofResponse() now re-materializes Atlas-managed findings from current raw files before building the proof response
  - packages/core/src/scheduler.ts
    - introduced ATLAS_MANAGED_FINDING_SOURCES
    - generic stale resolution now skips:
      - atlas-coverage-gap
      - atlas-exposure
      - atlas-host-audit
    - these sources are now left to their own verification-aware resolution logic
- live deployment on Erik:
  - rebuilt @magatama/core
  - synced:
    - /opt/magatama/packages/core/dist/routes/health-builders.js
    - /opt/magatama/packages/core/dist/scheduler.js
  - restarted PM2 service:
    - magatama
- live verification:
  - before fix:
    - Atlas raw files present:
      - audits: 3
      - devices: 32
    - DB open findings: 0
  - after authenticated /api/protection-proof rebuild:
    - DB open findings: 28
    - public /api/findings?limit=5 now shows real open Atlas findings again
    - public /api/protection-proof now reports:
      - knownAssets: 57
      - hostsWithTelemetry: 22
      - assetsWithoutTelemetry: 35
      - auditedHosts: 3
      - queueBlocked: 28
      - switchbladeAssets: 5
      - switchbladeRacks: 1
      - switchbladeNmsNodes: 5
- operational truth now:
  - Atlas and Findings are no longer silently wiped clean by the generic stale resolver
  - the remaining open state is again honest:
    - most current open findings are atlas-coverage-gap
    - they reflect missing live telemetry on known inventory/discovery assets
- operator note:
  - browser cache / old UI state may still temporarily show the earlier empty Atlas
  - hard refresh is required:
    - Cmd + Shift + R
- important honest remainder:
  - this closes the biggest Atlas truthfulness regression
  - it does not yet solve every backend truth issue
  - still pending:
    - lane-specific RunPod artifact adoption / automatic version switch
    - deeper Atlas policy refinement for which inventory-only assets should stay actionable vs informational
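
The scheduler guard from this entry can be sketched roughly as below. It is a Python illustration of the idea, not the scheduler.ts code. The three source names and ATLAS_MANAGED_FINDING_SOURCES come from the entry; the Finding shape, the stale flag, the function name and the generic-scan source used in testing are assumptions.

```python
from dataclasses import dataclass

# Source names as listed in the entry above.
ATLAS_MANAGED_FINDING_SOURCES = frozenset(
    {"atlas-coverage-gap", "atlas-exposure", "atlas-host-audit"}
)


@dataclass
class Finding:
    # Hypothetical minimal shape, for illustration only.
    id: str
    source: str
    status: str = "open"
    stale: bool = False


def auto_resolve_stale(findings):
    """Resolve stale open findings, skipping Atlas-managed sources.

    Returns the ids that were resolved.
    """
    resolved = []
    for finding in findings:
        if finding.status != "open" or not finding.stale:
            continue
        if finding.source in ATLAS_MANAGED_FINDING_SOURCES:
            # Left to the Atlas verification-aware resolution logic.
            continue
        finding.status = "resolved"
        resolved.append(finding.id)
    return resolved
```

The point of the guard is that a rematerialized Atlas finding stays open even if the generic resolver considers it stale.
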
TIP automated equivalence research / manual queue cleanup completed on 2026-05-09:
- operator intent:
  - products should be researched well enough that they do not need manual equivalence validation
  - Erik must not be stressed by crawler-heavy work
  - TIPLLM-only policy for crawler/robot research remains in force
- root cause found:
  - approve-all approved low-confidence equivalences and only marked them for later re-research
  - the re-research worker mostly checked whether a competitor still had a recent price
  - it did not re-evaluate hard technical equivalence evidence such as reach, wavelength, fiber type, speed and form factor
- code changed:
  - packages/api/src/routes/review.ts
    - approve-all now approves only confidence >= 0.73
    - weak pending rows stay pending and are queued for automated research instead of being marked approved
    - needs_research stats/listing now includes pending research rows
    - added POST /api/review/run-research
  - packages/scraper/src/scheduler.ts
    - added deterministic equivalence research evaluator
    - rejects stale, technically contradictory, incomplete, or low-confidence matches automatically
    - confirms only matches with recent price plus matching form factor, speed, fiber type, wavelength and reach
    - confirmed matches are scheduled for a 30-day recheck
- live deployment:
  - synced changed files to Erik /opt/tip
  - pnpm -C packages/api build passed on Erik
  - pnpm -C packages/scraper build passed on Erik
  - restarted tip-api and tip-scraper-daemon
  - both processes are online
- data cleanup performed on live DB without heavy crawling:
  - pending + due re-research candidates processed: 144103
    - rejected fiber mismatch: 958
    - rejected reach mismatch: 82128
    - rejected missing reach evidence: 31151
    - rejected wavelength mismatch: 29865
    - rejected low confidence: 1
  - old approved rows audited:
    - kept/confirmed: 1986
    - rejected: 4000
  - old auto-approved rows audited:
    - kept/confirmed: 32080
    - rejected reach mismatch: 260
- final live equivalence status:
  - pending: 0
  - approved: 1986
  - auto_approved: 32080
  - rejected: 148367
  - due re-research now: 0
  - scheduled 30-day rechecks: 34066
- final verification counters after reconcile:
  - competitor_verified: 11137
  - fully_verified: 290
  - price_verified: 11549
  - image_verified: 10629
  - details_verified: 9538
- operational note:
  - no new crawler wave was started for this cleanup
  - the run used existing crawled specs/prices and strict deterministic product-evidence checks
  - next improvement should be targeted crawler enrichment for products rejected due to missing reach/details, preferably on Proxmox/Pi workers rather than Erik
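
A minimal sketch of the deterministic evaluator described under "code changed" above, in Python rather than the TypeScript of packages/scraper/src/scheduler.ts. The 0.73 threshold and the 30-day recheck come from the entry. The field names, the check order, the reason strings and the 30-day definition of a "recent" price are assumptions.

```python
from datetime import date, timedelta

MIN_CONFIDENCE = 0.73     # threshold named in the entry
RECHECK_DAYS = 30         # recheck interval named in the entry
MAX_PRICE_AGE_DAYS = 30   # assumption: the entry does not define "recent"

# Hypothetical field names for the five hard technical criteria.
SPEC_FIELDS = ("form_factor", "speed", "fiber_type", "wavelength_nm", "reach_m")


def evaluate(own, competitor, confidence, price_seen_on, today):
    """Return (decision, reason) for one candidate equivalence."""
    if confidence < MIN_CONFIDENCE:
        return ("rejected", "low confidence")
    if price_seen_on is None or (today - price_seen_on).days > MAX_PRICE_AGE_DAYS:
        return ("rejected", "stale price")
    for field in SPEC_FIELDS:
        ours, theirs = own.get(field), competitor.get(field)
        if ours is None or theirs is None:
            return ("rejected", f"missing {field} evidence")
        if ours != theirs:
            return ("rejected", f"{field} mismatch")
    return ("confirmed", f"recheck on {today + timedelta(days=RECHECK_DAYS)}")
```

Because every check is a plain comparison on already crawled specs and prices, a run like this needs no new crawler wave.
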
TIP Flexoptix + FS.com price/image revalidation completed on 2026-05-09:
- live root cause:
  - scraper runs had set transceivers.price_verified, but price_observations.is_verified stayed false
  - FS.com product image selector was stale and missed current .big_img / .big_img_m product images
- code fixed:
  - packages/scraper/src/utils/db.ts
    - new price observations and freshly re-confirmed unchanged observations now get is_verified = true and verified_at
    - price_verified_at is refreshed when price verification is confirmed
    - image verification now refreshes image_verified_at, image_verified_url, and image_scraped_at
    - existing records revalidate images whenever current scraper output contains an image URL
  - packages/scraper/src/scrapers/fs-com.ts
    - added TIP_FORCE_REVALIDATE
    - added FS_MAX_DETAIL_PAGES_PER_RUN
    - added FS_ONLY_MISSING_IMAGES
    - updated FS.com image extraction to prefer current resource.fs.com product images from .big_img_box, img.big_img, .big_img_m_active, .big_img_m, .small_img_active
    - rejects default/logo/general/icon/SVG image URLs
- live runs on Erik:
  - pnpm -C packages/scraper build passed on /opt/tip
  - Flexoptix catalog revalidation:
    - 615 products processed
    - 615 Flexoptix price observations marked verified
    - 605 Flexoptix images verified in the run window
  - FS.com full force revalidation:
    - 270 products discovered
    - 270 detail pages scraped
    - 0 failed detail requests
    - 17 new price observations in first full pass
    - 266 FS.com price observations marked verified after first pass
  - FS.com targeted missing-image revalidation:
    - 99 detail pages scraped
    - 0 failed detail requests
    - FS.com image-verified products increased from 207 to 299
    - FS.com verified price observations increased to 271 after targeted pass
- final checked counters:
  - Flexoptix:
    - products: 744
    - product price_verified: 619
    - product image_verified: 615
    - price observation rows: 1288
    - verified price observation rows: 615
  - FS.com:
    - products: 383
    - product price_verified: 379
    - product image_verified: 299
    - price observation rows: 818
    - verified price observation rows: 271
- operations:
  - tip-scraper-daemon restarted and is online
  - Erik remained stable; final load was about 2.16, 2.22, 2.47
  - CT115 / tip-scraper SSH did not respond quickly from this session, so it was not used
- TIPLLM training pool:
  - /tmp/tip-training-data was recloned from Gitea
  - crawler experience was written to:
    - robot-experiences/2026-05-09.jsonl
    - qa-pairs/robot-control-high.jsonl
  - pushed to Gitea commit:
    - 850083f crawl: add flexoptix fs revalidation learning record
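
The FS.com image rule from this entry (prefer resource.fs.com product images, reject default/logo/general/icon/SVG URLs) can be sketched as a small filter. This is a Python illustration, not the fs-com.ts code. The helper names, the substring matching and the example URLs used to exercise it are assumptions.

```python
from urllib.parse import urlparse

# Tokens taken from the rejection rule in the entry.
REJECT_TOKENS = ("default", "logo", "general", "icon")


def is_product_image(url):
    """Return True if the URL looks like a real product image."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    path = parsed.path.lower()
    if path.endswith(".svg"):
        return False
    # Plain substring match: simple, but it would also reject a path
    # that merely contains one of the tokens inside another word.
    return not any(token in path for token in REJECT_TOKENS)


def pick_image(candidates):
    """Return the first acceptable URL, preferring resource.fs.com hosts."""
    acceptable = [u for u in candidates if is_product_image(u)]
    preferred = [
        u for u in acceptable if urlparse(u).netloc.lower() == "resource.fs.com"
    ]
    return (preferred or acceptable or [None])[0]
```

Returning None instead of a placeholder keeps image_verified honest: a product without an acceptable image stays unverified.
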
MAGATAMA dashboard truthfulness / UX hardening on 2026-05-09:
- live /api/llm/status on MAGATAMA now publicly confirms the corrected magatamallm lane counts:
  - 15679 train / collected
  - 1743 eval
  - 17422 total
  - 15679 new since last training
- the Training page inconsistency was traced to a stale browser/static-cache path plus mixed UI sources
- dashboard static UI was updated and deployed live to Erik:
  - new cache version:
    - 2026-05-09a
  - Training Control now force-merges the visible summary with the live llmStatus.training payload so the page and modal cannot silently disagree on pair counts
- Switchblade network port UX was hardened:
  - hover detail remains
  - each port is now also clickable
  - click opens a real MAGATAMA-side detail modal with:
    - status
    - speed
    - description
    - peer device / peer port
    - connected host
    - VLAN
    - transceiver
    - in/out errors
    - octet counters
  - this was done because hover-only behavior was still presenting as broken / ambiguous for the operator
- direct live deployment truth on Erik:
  - /opt/magatama/packages/dashboard/public/index-v2.html now contains:
    - API_CACHE_VERSION = '2026-05-09a'
    - openSwitchbladePortModal
    - Ports · Hover = Nutzung / Status · Klick = Detail (German UI label: hover = usage / status, click = detail)
- important honest remainder:
  - this fixes the visible UI inconsistency and the broken/stale port interaction path
  - it does not yet complete the deeper backend truthfulness issue where Atlas/host-audit raw files can still show real issues while the live open-findings surface may be empty
  - that rematerialization / anti-auto-resolve backend block still needs a dedicated follow-up pass
Full cross-agent sync refresh on 2026-05-07:
- all current MAGATAMA/RunPod training automation findings from this chat were consolidated again into sync/
- latest confirmed truth:
  - sync/ commits successfully reached Gitea again
  - current pushed sync commits now include:
    - 2a35761 sync: record runpod managed endpoint root cause
    - 72d61ad sync: record custom runpod worker build prep
- operator requirement was reaffirmed:
  - all meaningful chat discoveries, decisions, blockers, and deployment truths must continue to be written back into sync/ so Claude, Codex, and the laptop stay aligned
- current MAGATAMA training automation truth remains:
  - lane-specific pools are separated and prepared
  - URL-bundle dataset path is in place
  - local adoption/smoke/version-switch code path is in place
  - but fully automatic RunPod return/adoption still depends on switching from the managed Axolotl endpoint to a custom MAGATAMA worker endpoint
- current infrastructure truth remains:
  - Erik can build Docker images
  - Erik has docker buildx
  - Erik currently has no docker registry login/config
  - therefore registry publication of the custom worker image is still the final missing operational prerequisite
- next required operator inputs for full closure:
  - either:
    - GHCR_USERNAME + GHCR_TOKEN
  - or:
    - Docker Hub repo + credentials
  - or:
    - an already approved container image destination
- once registry publication is possible, the exact remaining sequence is:
  - publish custom worker image
  - create/update RunPod endpoint to that image
  - set on Erik:
    - RUNPOD_WORKER_KIND=custom-magatama
    - RUNPOD_ENDPOINT_ID=<custom endpoint id>
  - restart MAGATAMA dashboard
  - run lane-specific canary training
  - verify:
    - artifact exists
    - local adoption succeeds
    - smoke tests pass
    - release alias increments
    - active lane alias switches automatically
MAGATAMA RunPod custom worker preparation continued on 2026-05-07:
- the pending sync handoff was committed and successfully pushed to Gitea:
  - commit:
    - 2a35761 sync: record runpod managed endpoint root cause
- MAGATAMA repo now includes an explicit helper for building/publishing the custom RunPod worker image:
  - magatama/scripts/runpod_worker_publish.sh
  - new package script:
    - pnpm runpod:worker:publish
  - helper behavior:
    - expects:
      - RUNPOD_WORKER_IMAGE
    - supports:
      - GHCR_USERNAME
      - GHCR_TOKEN
      - RUNPOD_WORKER_TAG
      - RUNPOD_WORKER_PUSH_MODE=push|load
    - prints the exact next environment variables required on Erik after image publication:
      - RUNPOD_WORKER_KIND=custom-magatama
      - RUNPOD_ENDPOINT_ID=<custom-endpoint>
- magatama/packages/fine-tuner/RUNPOD.md was extended so the full automation target is now documented end-to-end:
  - lane pool sync
  - RunPod dataset URL bundle
  - custom worker training
  - adapter upload
  - local adoption
  - smoke tests
  - release alias minting
  - active alias switch
- Erik infrastructure truth was rechecked:
  - docker exists:
    - /usr/bin/docker
  - docker buildx exists:
    - github.com/docker/buildx v0.33.0
  - no docker registry login/config is currently present on Erik:
    - ~/.docker/config.json absent
  - interpretation:
    - Erik can build images
    - but cannot yet push a public/private worker image to GHCR/Docker Hub without credentials or a pre-authenticated registry path
- the missing custom worker files were synced live to Erik:
  - /opt/magatama/packages/fine-tuner/Dockerfile.runpod
  - /opt/magatama/packages/fine-tuner/RUNPOD.md
- a real remote worker image build was then attempted on Erik:
  - image tag requested:
    - magatama-runpod-worker:test
  - build truth:
    - base runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 pulled successfully
    - Python dependencies for the worker installed successfully
    - build reached:
      - COPY train_cuda.py runpod_handler.py ./
      - exporting to image
  - however:
    - final image was not yet visible in docker images
    - therefore the build still needs one more clean verification pass before being treated as green
- current operational conclusion:
  - MAGATAMA training pools, lane separation, signed dataset URL path, and local adoption API are ready
  - the final blocking step remains infrastructure:
    - publish the custom worker image to a registry RunPod can consume
    - create/switch the endpoint
    - then set on Erik:
      - RUNPOD_WORKER_KIND=custom-magatama
      - RUNPOD_ENDPOINT_ID=<custom endpoint id>
  - once that is done, MAGATAMA's already-prepared code path can finally perform:
    - train
    - verify artifact
    - adopt locally
    - smoke-test
    - bump version
    - switch alias
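
The two environment variables named above can be checked before a canary run. A minimal Python sketch, assuming a hypothetical check_custom_worker_env helper; only the variable names and the custom-magatama value come from this entry.

```python
import os


def check_custom_worker_env(env=None):
    """Return a list of problems; an empty list means the switch is configured.

    Hypothetical preflight helper, not part of the MAGATAMA repo.
    """
    env = os.environ if env is None else env
    problems = []
    if env.get("RUNPOD_WORKER_KIND") != "custom-magatama":
        problems.append("RUNPOD_WORKER_KIND must be custom-magatama")
    if not env.get("RUNPOD_ENDPOINT_ID"):
        problems.append("RUNPOD_ENDPOINT_ID is not set")
    return problems
```
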
MAGATAMA RunPod training return-path deep dive on 2026-05-07:
- the placebo Open Fix Guidance button in Attack Paths was fixed live on Erik:
  - magatama/packages/dashboard/public/index-v2.html
  - real behavior now:
    - if graph node maps to a real finding, open the existing ticket/finding drawer
    - if node is only synthetic, show an explicit warning instead of doing nothing
  - deployed to:
    - /opt/magatama/packages/dashboard/public/index-v2.html
  - pm2 restart magatama-dashboard executed
- local Mac train API truth rechecked:
  - GET http://127.0.0.1:3214/health
  - returns status = ok
  - service is idle/reachable, not broken
- RunPod heartbeat/UI stream issue was fixed live:
  - dashboard server now emits keepalive progress messages during:
    - long IN_PROGRESS phases
    - post-COMPLETED artifact verification loops
  - deployed live to Erik dashboard
- direct raw RunPod status canary against the current endpoint (dheii186pfcuq7) was executed:
  - tiny 1-step tip_llm canary job:
    - 33434e85-3cc1-4dea-9043-83c315aaeb9c-e2
  - observed raw status sequence:
    - IN_QUEUE
    - IN_PROGRESS
    - COMPLETED
  - critical truth:
    - /status/{job} returned no output
    - /stream/{job} returned:
      - {"status":"COMPLETED","stream":[]}
  - interpretation:
    - the currently configured endpoint is the managed Axolotl serverless endpoint
    - it does not return a programmatically adoptable artifact reference to MAGATAMA
    - this is why all lanes keep ending in:
      - completed_without_model_artifact
- Erik secrets reality rechecked:
  - /opt/magatama/secrets/hf-token exists and is readable by the running process
  - therefore the current failure is not caused by a missing HF token on Erik
- root cause now considered confirmed:
  - the managed Axolotl serverless endpoint is acceptable for queueing/running a fine-tune
  - but not sufficient for MAGATAMA's required full automation:
    - train
    - return explicit artifact
    - adopt locally
    - smoke-test
    - create new release alias
    - switch active alias
- code path for the correct architecture is now prepared:
  - magatama/packages/fine-tuner/runpod_handler.py
  - magatama/packages/fine-tuner/train_cuda.py
  - magatama/packages/fine-tuner/requirements-runpod.txt
  - magatama/packages/dashboard/src/server.ts
- what changed in that path:
  - custom RunPod worker now accepts:
    - target_model
    - credentials.hf_token
  - training script now:
    - trains lane-specific bundle
    - uploads the resulting adapter folder to Hugging Face
    - returns adapter_repo_id
  - dashboard custom-worker submit path now includes:
    - run_id
    - target_model
    - HF credential pass-through for the worker
  - dashboard error text is now explicit:
    - if the managed Axolotl endpoint completes without an adoptable artifact, MAGATAMA says so plainly and points at the need for the custom-magatama worker
- live deployment status:
  - updated dashboard server was rebuilt and deployed to Erik
  - updated custom worker source files were synced into Erik repo state
  - BUT:
    - the currently active RunPod endpoint is still the managed Axolotl endpoint
    - the new full return-path logic will only become effective once the RunPod endpoint is switched to the custom MAGATAMA worker image
- operational conclusion:
  - training pool refresh, lane separation, submit flow, and local adoption API are now in good shape
  - the final missing infrastructure step is:
    - build/publish packages/fine-tuner/Dockerfile.runpod
    - create/use a custom RunPod serverless endpoint for runpod_handler.py
    - set:
      - RUNPOD_WORKER_KIND=custom-magatama
      - RUNPOD_ENDPOINT_ID=<custom-endpoint>
  - only then can MAGATAMA honestly achieve:
    - automatic training
    - automatic artifact return
    - automatic adoption
    - automatic version bump
    - automatic alias switch after smoke tests
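
The return-path distinction from this entry can be sketched as a small classifier. This is a Python illustration, not the server.ts logic. The state completed_without_model_artifact, the field adapter_repo_id and the raw payload shape come from the entry; the output key and the other returned labels are assumptions.

```python
def classify_runpod_result(status_payload):
    """Map a raw RunPod status payload to a return-path state.

    Assumes a custom worker would put adapter_repo_id under an "output"
    key; the managed endpoint in the canary returned no output at all.
    """
    status = status_payload.get("status")
    if status in ("IN_QUEUE", "IN_PROGRESS"):
        return "running"
    if status != "COMPLETED":
        return "worker_failure"
    output = status_payload.get("output")
    repo_id = output.get("adapter_repo_id") if isinstance(output, dict) else None
    if not repo_id:
        # COMPLETED alone is not proof of an adoptable model.
        return "completed_without_model_artifact"
    return "artifact_returned"
```

Even the artifact_returned case is not yet success in the sense of this log: adoption, smoke test and alias switch still have to follow.
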

Active Policy

Put coordination notes and handoffs in this sync/ folder and push to Gitea.
Check sibling project sync folders first when context may span repos.
Use TIPLLM only for TIP crawler/robot planning and extraction feedback.
Write robot/crawler experience into the Gitea-backed TIPLLM training pool.
Keep Erik safe: no heavy crawler waves or uncontrolled Playwright/discovery jobs on Erik.
Use Proxmox/Pi workers for crawl load.

Cross-Repo Sync

Claude Code also created a Gitea sync handoff in the LLM Gateway repo:

Repo: rene/llm-gateway
Path: sync/
Commit shown by Claude: e272105 sync: add chat handoff + context scaffolding for Codex integration (2026-04-29)
Gitea path: http://192.168.178.196:3000/rene/llm-gateway/src/main/sync/

When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infrastructure, read both:

transceiver-db/sync/CURRENT.md
llm-gateway/sync/CURRENT.md

Latest Work

RunPod/MAGATAMA training live follow-up on 2026-05-07:
- latest magatamallm serverless run verified on Erik:
  - job id:
    - ad003f90-3cf9-43f6-8960-bf6c1ea85097-e2
  - registry truth in:
    - /opt/magatama/training-data/model-registry/training-runs.json
  - observed states:
    - submitted
    - then completed_without_model_artifact
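  - exact recorded warning (German, verbatim):
    - RunPod meldete COMPLETED, aber das erwartete HuggingFace-Modellrepo wurde nicht gefunden.
    - in English: RunPod reported COMPLETED, but the expected HuggingFace model repo was not found.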
- interpretation:
  - dataset build and RunPod submit are working
  - the worker still does not return a verifiable adoptable model artifact
  - this is a real training return-path failure, not just a cosmetic UI issue
- local training API truth rechecked:
  - GET http://127.0.0.1:3214/health
  - service responds with:
    - status = ok
    - service = magatama-train-api
    - running = false
    - pid = null
  - meaning:
    - API is healthy/reachable
    - currently idle
    - ready for adoption/import calls once a valid RunPod artifact exists
- one UI bug in the training modal was fixed live:
  - root cause:
    - during long IN_PROGRESS and post-COMPLETED artifact verification phases, MAGATAMA sent no heartbeat for too long
    - browser/proxy could then terminate the stream and surface only:
      - network error
    - even though Erik had already written the more truthful registry state
  - fix:
    - magatama/packages/dashboard/src/server.ts
    - added server-sent heartbeat messages while:
      - RunPod status remains unchanged
      - Hugging Face / artifact propagation checks are still running
    - concrete live strings now deployed in Erik dashboard server (German, verbatim):
      - ⏳ RunPod arbeitet weiter (...) (in English: RunPod is still working)
      - ⏳ Prüfe Modellartefakt ... (in English: checking model artifact)
  - deployment:
    - rebuilt dashboard
    - rsynced packages/dashboard/dist/server.js to Erik
    - restarted pm2 magatama-dashboard
    - remote server.js verified to contain heartbeat strings
- expected operator effect:
  - future training runs should no longer collapse into a late generic network error while RunPod/adoption checks are still active
  - the UI should stay alive long enough to show the real terminal result:
    - completed_and_adopted
    - or
    - completed_without_model_artifact
    - or
    - worker/adoption failure
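
The heartbeat fix from this entry can be sketched as follows. This is a Python illustration of the keepalive idea, not the server.ts code. The poll-count interval and the function names are assumptions; the event framing follows the standard server-sent events text format.

```python
def stream_events(statuses, heartbeat_every=3):
    """Yield ("status", s) on every change and ("heartbeat", s) after
    each run of `heartbeat_every` unchanged polls."""
    last = None
    unchanged = 0
    for status in statuses:
        if status != last:
            last, unchanged = status, 0
            yield ("status", status)
        else:
            unchanged += 1
            if unchanged % heartbeat_every == 0:
                # Keeps browser and proxy from closing an idle stream.
                yield ("heartbeat", status)


def format_sse(event, data):
    """Serialize one event in server-sent events wire format."""
    return f"event: {event}\ndata: {data}\n\n"
```

The same loop applies to the post-COMPLETED phase: as long as artifact checks are still running, the stream keeps emitting heartbeats until a terminal state is written.
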
MAGATAMA live follow-up on 2026-05-07:
- local Mac training API was rechecked after the lane-specific automation changes.
- current live truth:
  - LaunchAgent org.fichtmueller.magatama-train-api is present and running
  - process listens on *:3214
  - localhost health now responds when checked outside sandbox restrictions:
    - GET http://127.0.0.1:3214/health
    - response:
      - status = ok
      - service = magatama-train-api
      - running = false
      - pid = null
      - updated_at = 2026-05-07T04:14:23Z
    - interpretation:
      - the training API itself is healthy and reachable
      - it is currently idle, not broken
      - the actual next proof point must come from a fresh lane run that writes lane-specific *-last_run.json
- live Attack Paths UI bug was fixed and deployed to Erik:
  - root cause:
    - the Open Fix Guidance button inside the attack-path side panel only triggered a dummy toast and never opened a real finding/ticket detail
  - fix:
    - magatama/packages/dashboard/public/index-v2.html
    - new helper:
      - openFixGuidanceForNode(nodeId)
    - behavior:
      - if the clicked graph node maps to a real finding ID, MAGATAMA now opens the existing ticket/finding detail drawer via openTicket(id)
      - if the node is only a synthetic path node with no backing finding, MAGATAMA now shows an explicit warning instead of pretending to open guidance
  - live deployment:
    - updated index-v2.html was rsynced to:
      - /opt/magatama/packages/dashboard/public/index-v2.html
    - pm2 restart magatama-dashboard executed on Erik
    - deployed file on Erik verified with:
      - openFixGuidanceForNode
      - Open Fix Guidance
- operator consequence:
  - Attack Paths no longer contain a placebo “Open Fix Guidance” action
  - clicking it should now open the actual MAGATAMA finding/ticket guidance path when the graph node represents a real finding
MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
- target lanes:
  - magatamallm
  - fo_blogllm
  - tip_llm
- core root cause confirmed:
  - RunPod dataset refresh / lane export already worked
  - RunPod jobs often reached COMPLETED
  - but model adoption/version truth still depended on a single shared:
    - ~/magatama-llm/fine-tuning/last_run.json
  - this made lane status and successful return/adoption ambiguous across models
  - the training modal could also collapse late stream/adoption failures into a generic network error
- local code fixes now in place:
  - magatama/packages/fine-tuner/training_api.py
    - lane-specific last-run files added:
      - ~/magatama-llm/fine-tuning/magatamallm-last_run.json
      - ~/magatama-llm/fine-tuning/fo_blogllm-last_run.json
      - ~/magatama-llm/fine-tuning/tip_llm-last_run.json
    - legacy last_run.json remains only as backward-compatible mirror for magatamallm
    - successful RunPod adoption now creates:
      - a release alias per lane, e.g. <active-alias>-rN
    - active alias switching sequence is now:
      - candidate model imported
      - smoke-tested
      - release alias created
      - stable active alias repointed to that release alias
    - adoption report now includes:
      - version_counter
      - release_alias
  - magatama/packages/fine-tuner/train.py
    - local metrics writing now also respects lane-specific last-run files via TRAINING_LANE
  - magatama/packages/dashboard/src/server.ts
    - /api/llm/status now reads lane-specific last-run metadata first
    - release_alias is preferred as visible model version when present
    - RunPod SSE catch now distinguishes:
      - real generic training failure
      - COMPLETED but no artifact / failed adoption
    - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
  - magatama/packages/dashboard/public/index-v2.html
    - training modal now suppresses misleading late generic network error if the server already emitted a terminal training status
    - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
    - if the backend reports one of:
      - completed without artifact
      - completed without HF model
      - completed but adoption failed
    - then the modal now shows that exact reason
- local verification:
  - python3 -m py_compile passed for:
    - training_api.py
    - train.py
  - dashboard build passed:
    - pnpm -C packages/dashboard build
- current operational blocker:
  - live deployment to Erik was not yet completed in this step
  - direct SSH checks returned:
    - Connection refused
    - then Operation timed out
  - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
    - tip_llm
    - fo_blogllm
- practical consequence:
  - the code path is now prepared for full automation:
    - pull from lane-specific training pool
    - train on RunPod
    - verify artifact existence
    - adopt locally
    - create new release alias/version
    - repoint stable active alias
    - show truthful status in UI
  - but the current live Erik run still needs redeploy + verification once SSH is reachable again
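
The lane-specific last-run files and the release alias step from this entry can be sketched as below. This is a Python illustration, not the training_api.py code. The lane names, the file naming, the <active-alias>-rN pattern and the report keys version_counter and release_alias come from the entry; the function names and the adopted key are assumptions.

```python
from pathlib import Path

LANES = ("magatamallm", "fo_blogllm", "tip_llm")
BASE_DIR = Path("~/magatama-llm/fine-tuning")  # left unexpanded for illustration


def last_run_path(lane):
    """Return the lane-specific last-run file instead of the shared one."""
    if lane not in LANES:
        raise ValueError(f"unknown lane: {lane}")
    return BASE_DIR / f"{lane}-last_run.json"


def adopt(active_alias, version_counter, smoke_ok):
    """Build an adoption report; a release alias exists only after smoke tests."""
    if not smoke_ok:
        # The active alias keeps pointing at the previous release.
        return {"adopted": False, "version_counter": version_counter, "release_alias": None}
    new_counter = version_counter + 1
    return {
        "adopted": True,
        "version_counter": new_counter,
        "release_alias": f"{active_alias}-r{new_counter}",
    }
```

Keeping the counter unchanged on a failed smoke test means a bad candidate never consumes a release number.
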
MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
- result:
  - the lane export / dataset refresh worked
  - a new locally adopted MagatamaLLM model did not land
  - active MAGATAMA provider remains the older alias:
    - ollama:magatama-coder:latest
- live/public evidence:
  - GET https://magatama.fichtmueller.org/api/llm/status
    - activeProvider = ollama:magatama-coder:latest
    - autoFixProvider = ollama:magatama-coder:latest
    - training.lastTrainingAt = 2026-05-06T22:43:20Z
    - training.modelVersion = magatama-coder:latest
    - training.activeRun = null
  - this means the UI timestamp currently reflects the latest dataset/training-state update, not proof of a newly adopted local model.
- local Mac evidence:
  - ollama list still shows:
    - magatama-coder:latest → modified 3 weeks ago
    - magatama-llm-v2-0:latest → modified 11 days ago
  - no newer Magatama candidate/import alias appeared locally
- registry/adoption evidence:
  - Erik lane manifest exists and is fresh:
    - /opt/magatama/training-data/runpod/magatamallm/manifest.json
    - generatedAt = 2026-05-06T22:45:15.944Z
    - train = 15679
    - eval = 1743
    - total = 17422
  - but Erik had no populated local adoption/registry state files in:
    - /opt/magatama/training-data/model-registry/models.json
    - /opt/magatama/training-data/model-registry/runs.json
    - /opt/magatama/training-data/model-registry/active.json
    - /opt/magatama/data/llm-status.json
  - local repo only had historical training-data/model-registry/training-runs.json
- historical run evidence:
  - recent magatamallm training-run records still show:
    - submitted
    - then not_found_after_submit
    - or other non-adopted / worker-failure states
  - there is still no verified “completed_and_adopted” proof for a new MagatamaLLM local model.
- operational conclusion:
  - current truth:
    - dataset/lane preparation works
    - local model adoption is still the missing step
    - MAGATAMA does not currently know more than the already active magatama-coder:latest alias
  - next fix block remains:
    - make RunPod/local completion count only when adoption succeeds
    - persist adoption report + model registry state
    - update active alias and version only after smoke-tested import succeeds
MAGATAMA Switchblade port intelligence is now truly flowing end-to-end on 2026-05-06:
- live root cause:
  - Switchblade itself already had the rich SG350 data (description, LLDP neighbor, peer port, octets), but MAGATAMA still showed mostly flat port chips.
  - verified live on Erik:
    - the real Switchblade runtime is the PM2 app switchblade under /opt/switchblade-app, not the older /opt/switchblade tree.
    - GET http://127.0.0.1:3000/api/discovery/snmp for 192.168.178.2 already returned rich rows such as:
      - GigabitEthernet3 → description Aruba-1830-UNUSED, neighbor VN46KYC0G0, peer port 11
      - GigabitEthernet5 → description Tashi-204, neighbor fritz.box, peer LAN:1
      - GigabitEthernet25 → description to Cisco Business 220 Series, neighbor Switch39688E, peer gi9
  - the remaining loss point was MAGATAMA’s own Switchblade sync/persistence path.
- MAGATAMA sync hardening:
  - scripts/switchblade_live_sync.ts
    - now prefers live SNMP discovery data when it is richer than /api/devices/<ip>
    - now maps description, peerDevice, peerPort, connectedHost, inOctets, outOctets into rack device ports
    - added optional debug snapshot dump support via SWITCHBLADE_DEBUG_SNAPSHOT_FILE
    - sanitizes unreadable peer-port strings and drops synthetic high-index numeric pseudo-ports
  - verified with a forced live run on Erik:
    - Top of Rack Switch now exports 28 real SG350 ports into the rack snapshot instead of the earlier flattened/odd set
    - sample verified payloads before POST:
      - port 3 → Aruba-1830-UNUSED / VN46KYC0G0 / 11
      - port 5 → Tashi-204 / fritz.box / LAN:1
      - port 25 → to Cisco Business 220 Series / Switch39688E / gi9
- MAGATAMA core hardening:
  - packages/core/src/routes/health-types.ts
    - SwitchbladePortSnapshot now preserves:
      - description
      - vlan
      - macCount
      - peerDevice
      - peerPort
      - connectedHost
      - transceiver
      - inOctets
      - outOctets
  - packages/core/src/routes/health-support.ts
    - normalizeSwitchbladePort() now keeps those additional port fields instead of silently truncating them
  - rebuilt locally and re-rsynced the new packages/core/dist to Erik
- dashboard/UI hardening:
  - packages/dashboard/public/index-v2.html
    - port chips already had custom tooltip support; now they also carry native title= fallback text
    - this reduces the old “question mark / unclear hover” problem in browsers that do not immediately show the custom bubble
- live public verification after deploy:
  - GET https://magatama.fichtmueller.org/api/switchblade/snapshot
    - now contains enriched SG350 rack-port records with:
      - description
      - peerDevice
      - peerPort
      - connectedHost
      - inOctets
      - outOctets
    - public snapshot timestamp verified:
      - receivedAt = 2026-05-06T22:51:59.247Z
  - Top of Rack Switch in the public snapshot now exposes meaningful peer/use-case data instead of only flat status counters
- operator impact:
  - MAGATAMA can now answer the actual operational question per port:
    - what is on this port
    - what is it talking to
    - what does the link look like
  - this is now grounded in Switchblade live SNMP/LLDP data, not guesswork.
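The port clean-up described above for scripts/switchblade_live_sync.ts (sanitizing unreadable peer-port strings and dropping synthetic high-index pseudo-ports) can be sketched roughly as follows. This is a Python illustration, not the TypeScript in the repo; the function names, the printable-ASCII rule and the MAX_REAL_PORT_INDEX threshold are assumptions.

```python
import re
from typing import Dict, List, Optional

# Assumption: purely numeric port names above this index are treated as synthetic.
MAX_REAL_PORT_INDEX = 1000


def sanitize_peer_port(raw: Optional[str]) -> Optional[str]:
    """Return a readable peer-port label, or None if the value is unusable."""
    if raw is None:
        return None
    text = raw.strip()
    # Keep only labels made of printable ASCII, e.g. "gi9" or "LAN:1".
    if not text or not re.fullmatch(r"[\x20-\x7e]+", text):
        return None
    return text


def is_pseudo_port(name: str) -> bool:
    """True for purely numeric port names with an implausibly high index."""
    return name.isdigit() and int(name) > MAX_REAL_PORT_INDEX


def clean_ports(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Drop pseudo-ports and normalize the peerPort field on the rest."""
    cleaned = []
    for row in rows:
        if is_pseudo_port(str(row.get("name", ""))):
            continue
        cleaned.append({**row, "peerPort": sanitize_peer_port(row.get("peerPort"))})
    return cleaned
```

All other fields on a row (description, peerDevice, inOctets and so on) pass through unchanged, which matches the intent of no longer truncating port data.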
TIP/Blog lane separation was materially corrected on 2026-05-06:
- root cause:
  - TIP_LLM was still ingesting blog-/writer-shaped rows from the canonical lane pool and shared transceiver corpora.
  - local inspection showed the old TIP export had 6250 train rows, of which 6087 still matched blog/writer patterns.
- dataset builder and Gitea sync were hardened:
  - scripts/runpod_dataset_builder.ts
    - added strict tipDatasetAllowed(...)
    - TIP_LLM now rejects blog-shaped source rows at dataset-build time
    - TIP_LLM now rejects blog-like system, user, and markdown-article assistant patterns
    - registry fallback for TIP_LLM now only uses lane-compatible datasets
  - scripts/sync_gitea_training_pool.ts
    - canonical TIP pool refresh now uses the stricter lane-alignment rules
    - redundant merged.jsonl copies for fo_blogllm and tip_llm are no longer rewritten, to avoid local disk exhaustion from duplicate lane artifacts
- local disk issue encountered and fixed:
  - full refresh failed with ENOSPC while writing training-data/gitea-learning-pool/tip_llm/merged.jsonl
  - redundant lane merged artifacts for fo_blogllm and tip_llm were truncated and the sync script was changed to stop recreating them
  - free disk space returned from 377Mi to 17Gi
- locally verified after rebuild:
  - TIP_LLM RunPod export:
    - train = 233
    - eval = 26
    - total = 259
    - blog/writer matches = 0
  - first TIP rows now use the correct TIP system prompt:
    - You are TIP_LLM, a research and market-intelligence analyst for transceivers, switches, and vendor ecosystems...
- corrected artifacts and scripts were synced to Erik and pnpm training:refresh-all was rerun there.
- live verified on Erik/public API:
  - magatamallm
    - datasetSource = url
    - collectedExamples = 15679
    - evalExamples = 1743
    - totalExamples = 17422
    - newSinceLastTraining = 15679
  - fo_blogllm
    - datasetSource = url
    - collectedExamples = 17322
    - evalExamples = 1926
    - totalExamples = 19254
    - neverTrained = true
  - tip_llm
    - datasetSource = url
    - collectedExamples = 231
    - evalExamples = 26
    - totalExamples = 257
    - neverTrained = true
- operational conclusion:
  - lane-specific dataset truth is now real on Erik.
  - TIP_LLM is no longer silently borrowing the FO_Blog behavior lane.
  - the next remaining hard problem is now RunPod artifact adoption/validation, not lane contamination.
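A minimal sketch of the kind of lane filter that tipDatasetAllowed(...) represents, written in Python for illustration. The real rules in scripts/runpod_dataset_builder.ts are not reproduced in this handoff, so the row shape (system / user / assistant), the keyword pattern and the heading-count heuristic are all assumptions.

```python
import re
from typing import Dict

# Assumption: illustrative keywords, not the patterns used by the real builder.
BLOG_PROMPT_PATTERN = re.compile(r"\b(blog|writer|article)\b", re.IGNORECASE)


def looks_like_markdown_article(text: str) -> bool:
    """Heuristic: two or more Markdown heading lines suggests an article body."""
    headings = [line for line in text.splitlines() if line.startswith("#")]
    return len(headings) >= 2


def tip_dataset_allowed(row: Dict[str, str]) -> bool:
    """Reject rows whose prompts or answer look blog-/writer-shaped."""
    if BLOG_PROMPT_PATTERN.search(row.get("system", "")):
        return False
    if BLOG_PROMPT_PATTERN.search(row.get("user", "")):
        return False
    if looks_like_markdown_article(row.get("assistant", "")):
        return False
    return True
```

Filtering at build time keeps the rejection in one place, so both the RunPod export and the registry fallback see the same lane-compatible rows.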
MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
- dashboard and core were rebuilt locally and redeployed to Erik.
- live processes restarted successfully:
  - magatama-dashboard
  - magatama
- public api/llm/status now shows the true lane-export totals for magatamallm:
  - collectedExamples = 15620
  - effectiveExamples = 15620
  - evalExamples = 1736
  - totalExamples = 17356
  - newSinceLastTraining = 15620
- root cause for the stale 1097 display:
  - the RunPod start SSE path still logged the legacy deduplicated fixes.jsonl corpus.
  - this was changed so RunPod launches no longer present the legacy 1097 count as the active training truth.
  - after dataset refresh the UI now emits the lane manifest totals instead.
- RunPod completion handling was hardened:
  - worker COMPLETED is no longer trusted blindly.
  - MAGATAMA now scans RunPod worker logs for real training failures (Traceback, SyntaxError, non-zero exit, etc.) before treating the run as successful.
  - if the worker logs show a hidden failure, MAGATAMA records this as completed_with_worker_failure instead of pretending the run succeeded.
- public findings state remains currently empty:
  - GET /api/findings?limit=1 returned {"findings":[],"total":0}
  - this is now rendered with an explicit empty-state row instead of a visually blank table.
- Attack Paths empty-state is now intentionally explicit rather than looking broken.
- Frontend cache and scope handling were hardened:
  - cache version bumped to 2026-05-06b
  - stale legacy magatama_api_cache:* entries are cleared
  - per-endpoint TTLs added
  - invalid or empty scope selections are normalized instead of silently leaving the UI in misleading empty views
- Switchblade rack port hover was materially improved:
  - port chips now carry data-tooltip
  - custom tooltip CSS is live on Erik
  - the old browser-native “question mark only” behavior should be replaced by a readable hover bubble
- Changelog self-healing was added in core:
  - stale cached changelog data older than 6h now forces a rebuild from git history
  - verified live via dashboard proxy on Erik:
    - generatedAt = 2026-05-06T15:18:42.708Z
    - latest visible entries include 2026-04-30 items again instead of appearing frozen at 30.05
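The RunPod completion hardening in this block (scanning worker logs before trusting COMPLETED) could look roughly like the sketch below. The marker list is illustrative; the handoff only names Traceback, SyntaxError and non-zero exit as examples. The label completed_logs_clean is a hypothetical placeholder, and any artifact or adoption checks sit outside this sketch.

```python
import re
from typing import List, Optional

# Assumption: illustrative failure markers, not the full list used by MAGATAMA.
FAILURE_MARKERS = [
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"\bSyntaxError\b"),
    # Matches e.g. "exit code 1" or "exited with code 137", but not "exit code 0".
    re.compile(r"exit(?:ed)?(?: with)?(?: code| status)? [1-9]\d*", re.IGNORECASE),
]


def find_worker_failure(log_lines: List[str]) -> Optional[str]:
    """Return the first log line that matches a failure marker, if any."""
    for line in log_lines:
        for marker in FAILURE_MARKERS:
            if marker.search(line):
                return line.strip()
    return None


def classify_run(worker_status: str, log_lines: List[str]) -> str:
    """Downgrade COMPLETED runs whose logs contain a hidden failure."""
    if worker_status != "COMPLETED":
        return worker_status.lower()
    if find_worker_failure(log_lines) is not None:
        return "completed_with_worker_failure"
    return "completed_logs_clean"  # hypothetical label
```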
MAGATAMA lane-specific training pools and RunPod dataset automation were finished on 2026-05-06:
- root cause:
  - the training modal always fetched /api/llm/status without a lane, so FO_BlogLLM and TIP_LLM still showed the magatamallm pool.
- dashboard/server were updated so /api/llm/status?lane=... is now truly lane-aware.
- the training modal now refreshes per selected lane and rewrites:
  - title
  - runtime label
  - pool path
  - counts
  - dataset source
- MAGATAMA dashboard env on Erik was switched to URL dataset mode for all lanes via ecosystem.config.cjs:
  - RUNPOD_DATASET_SOURCE=url
  - RUNPOD_DATASET_SOURCE_MAGATAMALLM=url
  - RUNPOD_DATASET_SOURCE_FO_BLOGLLM=url
  - RUNPOD_DATASET_SOURCE_TIP_LLM=url
- live verified on Erik after restart:
  - fo_blogllm
    - datasetSource = url
    - collectionsPath = /opt/magatama/training-data/runpod/fo_blogllm/manifest.json
    - train = 28
    - eval = 4
    - total = 32
  - tip_llm
    - datasetSource = url
    - collectionsPath = /opt/magatama/training-data/runpod/tip_llm/manifest.json
    - train = 36
    - eval = 4
    - total = 40
  - magatamallm
    - remains on lane-export counts (15620 / 1736 / 17356)
- operator impact:
  - no Hugging Face dataset publish is required anymore for MAGATAMA RunPod launches.
  - every supported LLM lane now points to its own local/Gitea-backed lane export instead of reusing magatamallm.
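One way the lane-specific RUNPOD_DATASET_SOURCE_* variables above might be resolved is sketched here. The precedence (lane-specific variable first, then the global RUNPOD_DATASET_SOURCE) and the "huggingface" default label are assumptions; the handoff does not state how the dashboard actually resolves them.

```python
import os
from typing import Mapping, Optional


def resolve_dataset_source(
    lane: str,
    env: Optional[Mapping[str, str]] = None,
    default: str = "huggingface",  # assumption: hypothetical default label
) -> str:
    """Pick the dataset source for a lane, preferring the lane-specific variable."""
    env = os.environ if env is None else env
    lane_key = f"RUNPOD_DATASET_SOURCE_{lane.upper()}"
    return env.get(lane_key) or env.get("RUNPOD_DATASET_SOURCE") or default
```

With the four variables listed above all set to url, every lane resolves to url regardless of which precedence is chosen.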
MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
- the RunPod serverless training start failure was not a RunPod outage.
- root cause was missing training scripts on Erik (training_full_refresh.ts and related helpers were absent under /opt/magatama/scripts).
- Codex synced the full local magatama/scripts/ tree to Erik, added a safe fallback in scripts/model_registry_build.ts, and synced the local training-data/model-registry/ directory.
- verified on Erik:
  - pnpm training:refresh-all now succeeds.
  - fresh dataset totals after dedupe:
    - magatamallm: 92,742 raw → 17,356 effective (15,620 train / 1,736 eval)
    - fo_blogllm: 32 total (28 train / 4 eval)
    - tip_llm: 40 total (36 train / 4 eval)
- important nuance:
  - Codex did not execute the final Hugging Face publish step from Erik in this chat.
  - local/script/build failures are fixed; external dataset publish still depends on the selected dataset source and explicit publish intent.
MAGATAMA Attack Paths UX is no longer a misleading blank panel:
- the page now distinguishes between:
  - no live attack paths
  - historical fallback paths
  - empty selected scope (0 assets in scope)
- when a user narrows the scope to a rack/location with zero scoped assets, the graph explicitly says so instead of looking broken.
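- live dashboard HTML on Erik now contains these German UI strings:
  - Im aktuellen Scope liegen 0 Assets. (English: The current scope contains 0 assets.)
  - Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann. (English: Widen the location or datacenter / rack so MAGATAMA can display correlatable assets and paths.)
  - Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer. (English: Without open multi-stage correlations, the graph view intentionally stays empty.)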
MAGATAMA code/training hardening was extended:
- scripts/test_runpod_adapter.py no longer loads tokenizer/model with trust_remote_code=True.
- scripts/ollama_adapter_bridge.py no longer loads tokenizer/model with trust_remote_code=True.
- this removed the live CODE finding around HuggingFace trust_remote_code on Erik.
Atlas exposure logic was tightened to stop reopening noisy LAN management findings:
- generic atlas-exposure findings now only stay operationally open for exposure that is meaningful enough to track as a finding.
- internal RFC1918 management/service ports discovered by the broad atlas scan are no longer promoted into open Guard findings just because they exist on the LAN.
- host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
- after rebuild + deploy + health sync:
  - live Postgres open findings returned to 0.
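The tightened exposure rule can be illustrated with a small check against the three RFC1918 ranges. This is a simplified sketch: the port set is an illustrative assumption, and the real atlas-exposure logic in core is not shown in this handoff.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Assumption: example management/service ports (SSH, SNMP, Proxmox web UI).
MANAGEMENT_PORTS = {22, 161, 8006}


def is_rfc1918(address: str) -> bool:
    """True if the address falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def should_open_exposure_finding(address: str, port: int) -> bool:
    """Internal management ports on private LAN addresses do not open a finding."""
    if is_rfc1918(address) and port in MANAGEMENT_PORTS:
        return False
    return True
```

The explicit network list is used instead of ipaddress's is_private attribute because is_private also covers loopback, link-local and other reserved ranges.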
Follow-up hardening on the same block:
- the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
- dataset preparation now distinguishes:
  - local training:refresh-all failure
  - optional Hugging Face publish failure
  - URL-based dataset mode with no external publish required
- the training SSE flow now explicitly tells the operator whether RunPod is using:
  - Hugging Face dataset source
  - or MAGATAMA URL-bundle dataset source
- this avoids the misleading “RunPod not reachable” wording when the actual failure is in dataset preparation.
- follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
  - MAGATAMA submit logic now verifies that a RunPod job really exists under /status/{jobId} instead of trusting /run.
  - payloads were aligned more closely with the official Axolotl serverless schema:
    - model_type=AutoModelForCausalLM
    - tokenizer_type=AutoTokenizer
    - dataset split: train
    - optimizer adamw_torch_fused
  - verified full run attempt:
    - job id 9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2
    - disappeared as not_found_after_submit (404 job not found)
  - verified canary after payload fix:
    - job id a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2
    - immediately materialized as IN_QUEUE
    - then still disappeared on later reconcile as not_found_after_submit
  - current conclusion:
    - the old MAGATAMA bug is fixed.
    - the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
  - operational rule:
    - do not treat submitted or a brief IN_QUEUE as proof of a usable serverless training run.
    - only trust the run once it reaches IN_PROGRESS or a durable terminal state with artifact evidence.
- follow-up training count fix on 2026-05-06 corrected the Training UI source-of-truth:
  - MAGATAMA still showed 1097 because the dashboard was counting the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
  - dashboard now prefers training-data/runpod/magatamallm/manifest.json for the visible MagatamaLLM training count.
  - synced current lane export to Erik and restarted magatama-dashboard.
  - verified public API now returns:
    - collectedExamples = 1367
    - effectiveExamples = 1367
    - evalExamples = 152
    - totalExamples = 1519
    - newSinceLastTraining = 1367
  - if the browser still shows 1097, treat it as stale cached UI and hard reload.
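The operational rule from the serverless verification above (do not trust submitted or a brief IN_QUEUE; trust only IN_PROGRESS or a durable terminal state with artifact evidence) can be expressed as a small classifier over successive /status/{jobId} observations. The status names in TERMINAL_STATES and every returned label except not_found_after_submit are assumptions made for this sketch.

```python
from typing import List, Optional

# Assumption: terminal status names as commonly reported by the serverless API.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def classify_job(observations: List[Optional[str]], has_artifact: bool = False) -> str:
    """Classify a job from its polled statuses; None stands for a 404 response."""
    if not observations:
        return "submitted_unverified"
    latest = observations[-1]
    if latest is None:
        # The job was accepted earlier but no longer exists on the endpoint.
        return "not_found_after_submit"
    if latest == "IN_PROGRESS":
        return "trusted_running"
    if latest in TERMINAL_STATES:
        return "trusted_terminal" if has_artifact else "terminal_without_artifact"
    return "queued_unverified"
```

Applied to the canary described above, the observation sequence IN_QUEUE followed by a 404 yields not_found_after_submit rather than anything resembling success.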
MAGATAMA was repaired end-to-end to a clean operational baseline:
- live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
- open findings were reduced all the way to 0 in Postgres.
- false-positive Proxmox baseline findings were removed by teaching the audit to treat internal-only management ports and default-only rpcbind exposure as acceptable for this host.
- code scanner false positives from generated/report artifacts remain excluded.
Live MAGATAMA protection/runtime state after the 2026-05-06 remediation:
- open findings: 0
- queueExecuting: 0
- queueBlocked: 0
- queueFailed: 0
- public /api/health returns status: ok
- public /api/active-resolvers returns:
  - MAGATAMA Core: working
  - MagatamaLLM: working
  - Claude (secondary): working
  - Codex (secondary/manual): idle
  - Copilot (secondary/manual): idle
Important resolver truth fix on 2026-05-06:
- live codex_enabled=false in MAGATAMA settings was causing Codex to show as a broken resolver.
- dashboard logic was updated so disabled Codex/Copilot now truthfully show as idle with the note “In MAGATAMA settings disabled”, instead of pretending there is a runtime outage.
- the local codex bridge on Erik is reachable but currently reports auth_required; do not treat that as a production outage while Codex is intentionally disabled in settings.
Remaining real operational gap after findings hit zero:
- MAGATAMA still knows more assets than it actively collects telemetry from.
- last public protection proof showed:
  - knownAssets: 79
  - hostsWithTelemetry: 27
  - assetsWithoutTelemetry: 52
- these are currently inventory/discovery-only assets, not open findings, but they remain the next real coverage expansion area.
MAGATAMA cross-repo state from the same chat is now synced into this handoff:
- Compliance framework cards in MAGATAMA are clickable and open per-framework requirement details.
- MAGATAMA training status was corrected so New Since Last Training no longer falsely shows 0.
- Live verified/deduped MAGATAMA training state after the fix:
  - collectedExamples: 49
  - rawExamples: 58
  - duplicateExamples: 9
  - effectiveExamples: 49
  - newSinceLastTraining: 49
- MAGATAMA now filters training metrics to verified/trainable examples only.
- Failed/escalated MAGATAMA remediation records should go to errors.jsonl, not the main fixes.jsonl, so the next MagatamaLLM run does not train on junk.
- Gitea-backed training pool remains the default target for training writes.
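The relationship between rawExamples, duplicateExamples and effectiveExamples above can be reproduced with a simple content-hash dedupe. This is a sketch under the assumption that two rows count as duplicates when their full JSON content is identical; the real dedupe key is not stated in this handoff.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def dedupe_examples(rows: Iterable[Dict[str, object]]) -> Dict[str, object]:
    """Drop exact-duplicate rows and report raw, duplicate and effective counts."""
    seen = set()
    unique: List[Dict[str, object]] = []
    raw = 0
    for row in rows:
        raw += 1
        # Assumption: a row's identity is the hash of its canonical JSON form.
        key = hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return {
        "rawExamples": raw,
        "duplicateExamples": raw - len(unique),
        "effectiveExamples": len(unique),
        "rows": unique,
    }
```

With 58 raw rows of which 9 repeat earlier rows, this reports 49 effective examples, matching the arithmetic in the state above.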
MAGATAMA coverage-gap and training-integrity hardening on 2026-05-06:
- the earlier 49 medium atlas-coverage-gap findings were traced to Atlas treating inventory-only and discovery-only assets as operational protection failures.
- core logic was tightened so Atlas coverage findings now open only for managed operational assets:
  - exposure-backed assets
  - explicit non-auto owner
  - configured telemetry expectation
  - critical/high criticality
  - infrastructure metadata or managed infra device types
- loopback and passive reference/inventory assets no longer reopen noisy guard findings.
- local build succeeded, the new core dist was deployed to Erik, and the first post-deploy guard scan resolved stale findings.
- live Postgres state after deploy: open findings = 0.
- training integrity bug was fixed in packages/core/src/learning/fix-tracking.ts:
  - verified fixes now append to training-data/gitea-learning-pool/magatamallm/fixes.jsonl
  - failed/escalated/report-only runs now belong in errors.jsonl
- two explicit Codex-written training entries were appended to the MAGATAMA Gitea-backed fixes corpus:
  - atlas coverage scope hardening
  - training path integrity fix
- corpus cleanup + dedupe was executed afterward:
  - pre-dedupe backup kept locally as:
    - magatama/training-data/gitea-learning-pool/magatamallm/fixes-pre-dedupe-20260506.jsonl
  - resulting verified corpus:
    - fixes.jsonl = 1,368 unique verified training rows
  - resulting failure corpus:
    - errors.jsonl = 4 tracked failed/escalated rows
  - integrity report now exists at:
    - magatama/training-data/gitea-learning-pool/magatamallm/corpus-integrity-report.json
  - latest integrity totals:
    - scanned: 1368
    - verified: 1368
    - movedToErrors: 4
    - parseErrors: 0
    - invalidVerifiedFlag: 0
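The tightened Atlas coverage scope described at the start of this block might be modeled as a predicate like the one below. Field names, the MANAGED_INFRA_TYPES set and the choice to treat the listed criteria as any-of are assumptions; the handoff lists the criteria but not how core combines them.

```python
from typing import Any, Dict

# Assumption: illustrative device types that count as managed infrastructure.
MANAGED_INFRA_TYPES = {"switch", "router", "firewall", "hypervisor"}


def opens_coverage_finding(asset: Dict[str, Any]) -> bool:
    """Decide whether a missing-telemetry asset should open atlas-coverage-gap."""
    # Loopback and passive reference/inventory assets never open a finding.
    if asset.get("loopback") or asset.get("passiveReference"):
        return False
    owner = asset.get("owner")
    reasons = [
        bool(asset.get("exposureBacked")),
        owner is not None and owner != "auto",
        bool(asset.get("telemetryExpected")),
        asset.get("criticality") in {"critical", "high"},
        bool(asset.get("infraMetadata")) or asset.get("deviceType") in MANAGED_INFRA_TYPES,
    ]
    return any(reasons)
```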
Complete Codex chat sync was added:
- sync/history/2026-04-29-codex-complete-chat-sync.md
- captures Ghost/blog updates, LinkedIn voice preferences, LPO/AI-fabric blog edits, Rest-Is-Not-Laziness scheduling replacement, and security notes.
- confirms no secrets were written into sync.
- confirms TIP crawler/robot planning remains TIPLLM-only.
- confirms Erik remains controller/light erik-safe only, with heavy crawler work assigned to Proxmox/Pi workers.
Codex sync-start confirmation was added:
- sync/history/2026-04-29-codex-sync-start-confirmation.md
- confirms Codex read this TIP handoff, checked the sibling LLM Gateway handoff, and is treating sync/ as binding.
- no code changes, crawler jobs, queue waves, PM2 restarts, or Erik load were initiated during this confirmation.
Codex follow-up on 2026-04-29 clarified the active BlogLLM model:
- TIP shows fo-blog-v7, but this is not a normal Ollama GGUF manifest.
- It is a local Adapter Bridge / Mac Studio model backed by the RunPod-trained PEFT adapter: /Users/renefichtmueller/Desktop/Claude Code/magatama/training-data/runpod/pod-runs/2026-04-25-fo-tip/final/adapters/fo_blogllm/final-adapter
- Bridge definition: /Users/renefichtmueller/Desktop/Claude Code/magatama/scripts/ollama_adapter_bridge.py
- TIP API default: packages/api/src/llm/client.ts uses OLLAMA_LLM_MODEL || "fo-blog-v7".
- fo-blog-v8 remains the next training candidate, not the currently active TIP BlogLLM model.
Full Codex session handoff was added:
- sync/history/2026-04-29-codex-full-session-handoff.md
- covers TIP verification, product image/detail crawling, Blog Engine Hot Topics, TIPLLM robots, training pool, Erik status, and cross-repo sync.
Added a verification robot controller:
- packages/scraper/src/robots/verification-robots.ts
- command: npm run robots:verification -w packages/scraper -- --status
Added TIPLLM robot experience writing:
- packages/scraper/src/crawler-llm/training-data-writer.ts
- writes raw robot audit rows and SFT records.
Added Gitea training pool import to TIP learning-pool build:
- scripts/tip-learning-pool-build.ts
- imports TIP_TRAINING_REPO/qa-pairs/*.jsonl into the tip_llm lane.
Added docs:
- docs/TIP_SELFLEARNING_WORKFLOW.md
Added package script:
- packages/scraper/package.json
- robots:verification

Gitea Training Pool

Existing local clone: /tmp/tip-training-data
Gitea repo: rene/tip-training-data
Latest pushed training commit:
- f1c83f8 crawl: add robot-status training records [2026-04-29T20:11:24.091Z]
First robot experience record was written to:
- /tmp/tip-training-data/qa-pairs/robot-control-high.jsonl
- /tmp/tip-training-data/robot-experiences/2026-04-29.jsonl

MAGATAMA Training / Operations State

Relevant local repo:
- /Users/renefichtmueller/Desktop/Claude Code/magatama
Latest confirmed live MAGATAMA findings state:
- open findings: 0 on 2026-05-06
Latest confirmed live resolver state:
- Codex and Copilot intentionally idle/disabled
- not a runtime outage, but a settings choice until gateway/bridge auth is intentionally re-enabled
Latest confirmed live MAGATAMA training metric after dashboard fix:
- newSinceLastTraining: 49
Meaning:
- the old 0 was incorrect.
- the currently visible trainable MAGATAMA corpus is based on verified and deduplicated examples only.
Latest corpus integrity state after cleanup:
- operational Gitea-backed MAGATAMA training corpus is now much smaller but cleaner:
  - 1368 unique verified rows
  - 4 live failure/escalation rows in errors.jsonl
- do not confuse raw historical volume with real trainable signal.
Important training integrity rule:
- report-only or failed/escalated records must not be treated as verified training fixes.
- keep them separated from the main verified training corpus.
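The integrity rule above amounts to a routing decision per record. A minimal sketch follows, assuming each record carries a verified flag and a status field; both field names and the "fixed" status value are hypothetical, and the real logic lives in packages/core/src/learning/fix-tracking.ts.

```python
import json
from typing import Dict, Iterable, List

VERIFIED_FILE = "fixes.jsonl"
ERRORS_FILE = "errors.jsonl"


def target_file(record: Dict[str, object]) -> str:
    """Only verified, successfully fixed records belong in the training corpus."""
    # Assumption: "verified" and "status" are illustrative field names.
    if record.get("verified") is True and record.get("status") == "fixed":
        return VERIFIED_FILE
    return ERRORS_FILE


def route_records(records: Iterable[Dict[str, object]]) -> Dict[str, List[str]]:
    """Group records into JSONL lines per target file without touching disk."""
    routed: Dict[str, List[str]] = {VERIFIED_FILE: [], ERRORS_FILE: []}
    for record in records:
        routed[target_file(record)].append(json.dumps(record, sort_keys=True))
    return routed
```

Defaulting to errors.jsonl means an unknown or missing status can never land in the verified corpus by accident.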

Erik Status

Synced TIPLLM robot/training code to /opt/tip.
Did not start crawler jobs.
Did not enqueue robot waves.
Did not restart PM2 services.
Remote scraper TypeScript build is passing after removing two stale misplaced remote-only duplicate files:
- /opt/tip/packages/scraper/src/scrapers/scheduler.ts
- /opt/tip/packages/scraper/src/vendor-discovery-crawler.ts
tip-api and tip-scraper-daemon are online.
Shared Erik note from the same chat:
- MAGATAMA dashboard/core were redeployed during compliance/training fixes.
- TIP crawler policy remains unchanged: Erik is controller/light runner only, not heavy crawl execution host.

Last Live Verification Snapshot

From 2026-04-29:

Total transceivers: 13,546
Price verified: 7,250
Image verified: 7,025
Details verified: 6,243
Fully verified: 5,812
Last price observation: 2026-04-29 19:15:53 UTC
Last stock observation: 2026-04-29 19:15:56 UTC

Latest MAGATAMA Training / RunPod Truth

Confirmed on 2026-05-06:

Lane-specific training pools are now materially separated and no longer all fall back to magatamallm.
Live Erik dashboard API now reports:
- magatamallm
  - 1367 train
  - 152 eval
  - 1519 total
  - newSinceLastTraining = 1367
- fo_blogllm
  - 17353 train
  - 1929 eval
  - 19282 total
  - newSinceLastTraining = 17353
  - active local model resolves to fo-blog-v7
- tip_llm
  - 6482 train
  - 721 eval
  - 7203 total
  - newSinceLastTraining = 6482
  - target active model is tip-llm-v1, but this model is not yet present locally in Ollama
Result:
- the previous 1097 shown for every lane was stale / wrong.
- selected lane now controls its own manifest, model label, and training counts.

Gitea-backed Pool Materialization

magatamallm Gitea pool remains canonical and populated.
fo_blogllm and tip_llm Gitea-backed pool folders were previously almost empty; they are now materialized from the local RunPod lane exports.
Lane manifests and JSONL exports now exist under:
- training-data/gitea-learning-pool/fo_blogllm/
- training-data/gitea-learning-pool/tip_llm/

RunPod Completion Hardening

MAGATAMA dashboard code now treats RunPod COMPLETED as success only after:
1. target model artifact is referenced
2. local Mac training API adopts/imports the artifact
3. lane-specific smoke tests pass
4. active Ollama alias is updated
New local adoption endpoint is:
- POST /adopt-runpod-model
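The four-step gate above can be sketched as an ordered check over adoption evidence. The evidence keys and the intermediate completed_blocked_at_* labels are hypothetical; completed_without_model_artifact and completed_and_adopted are the status names used elsewhere in this handoff.

```python
from typing import Dict

# Ordered to mirror the four conditions listed above.
REQUIRED_STEPS = [
    "artifact_referenced",
    "artifact_imported",
    "smoke_tests_passed",
    "alias_updated",
]


def classify_completed_run(evidence: Dict[str, bool]) -> str:
    """Map a worker-COMPLETED run to a status based on adoption evidence."""
    if not evidence.get("artifact_referenced"):
        return "completed_without_model_artifact"
    for step in REQUIRED_STEPS[1:]:
        if not evidence.get(step):
            # Hypothetical label naming the first step that lacks proof.
            return f"completed_blocked_at_{step}"
    return "completed_and_adopted"
```

Checking the steps in order means the reported status always names the earliest missing proof, which is the next thing an operator has to fix.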

Mac Training API State

The old LaunchAgent on Mac Studio was still serving the legacy training API from:
- ~/magatama-llm/service/training_api.py
It has now been upgraded in place so Erik sees the new adoption-capable API.
Verified from Erik:
- http://192.168.178.213:3214/health returns the new service
- it now exposes register_script pointing into the MAGATAMA repo
- POST /adopt-runpod-model exists and rejects unauthenticated requests with 401, proving the route is live

Still Outstanding

A fully successful end-to-end RunPod fine-tune has not yet been re-verified after the new adoption pipeline was wired in. That proof requires:
- real worker success
- real artifact
- successful local Ollama import
- active alias switch
- smoke-test proof
Latest live proof run on 2026-05-06:
- job id: 2112a7ab-68c2-4411-a44f-6edb7ad377df-e1
- materialized correctly
- reached IN_PROGRESS
- then COMPLETED
- but RunPod status/{job} returned no output object, no model artifact reference, and no Hugging Face repo result
- current MAGATAMA handling now correctly classifies this as completed_without_model_artifact, not as success
tip-llm-v1 is still not installed locally in Ollama.

Pulso AI Recommendation

Keep a shared network/transceiver/switch core corpus with TIP.
Do not collapse Pulso AI into the same instruction lane as TIP_LLM.
Recommended split:
- TIP_LLM
  - research
  - crawler / scraper / robot planning
  - vendor / firmware / issue extraction
- Pulso AI
  - product responses
  - support
  - diagnostics
  - operator explanation layer

Safe Next Steps

Clone or pull Gitea origin on laptop/Claude Code.
Read this folder first.
For BlogLLM work, treat fo-blog-v7 as Adapter Bridge / PEFT adapter, not as a ~/.ollama GGUF model.
Also read llm-gateway/sync/CURRENT.md when work touches shared Erik infrastructure, LLM routing, bridges, auth, TIPLLM, or crawler orchestration.
For TIP robot/crawler planning, use TIPLLM only. Do not route this lane through external AI providers.
When training pools or model stats look suspicious, prefer verified-only counts and check whether failed/escalated rows polluted the corpus.
For MAGATAMA-adjacent work, keep writing learnings back into the Gitea-backed pool and avoid training on report-only pseudo-fixes.
If testing robots, start with dry runs only:

npm run robots:verification -w packages/scraper -- --status
npm run robots:verification -w packages/scraper -- --tipllm-plan --limit=3
npm run robots:verification -w packages/scraper -- --enqueue=details-fast-lane --profile=erik-safe --dry-run

Only dispatch real crawl work after deciding the target host:
- Erik: erik-safe, tiny batches only.
- Pi: pi-fetch.
- Proxmox: proxmox-heavy.

Dirty Worktree Note

There are existing uncommitted changes outside sync/. Some are Codex work from this session, some appear pre-existing or from earlier Claude/Codex work. Do not blindly revert them. Review git status --short before committing broader changes.

Latest Sync Commits

6c42ca7 docs: add shared agent sync handoff
8e7c5aa docs: link llm-gateway sync handoff
bba48d3 sync: record magatama atlas rematerialization fix
fd29bee sync: record magatama atlas fallback and port detail live fixes
8b42077 sync: refresh cross-agent chat handoff
Pending after this update:
- watch whether any future guard exposure findings are genuine operational issues or new false positives.
- if failures still appear inside fixes.jsonl, scrub historic pollution and backfill errors.jsonl.

2026-05-09 Addendum — Live Atlas + Lane Registry Truth

Atlas / Findings

MAGATAMA Atlas was not actually empty; the public UI could still look blank while live proof data already showed:
- knownAssets: 57
- hostsWithTelemetry: 22
- assetsWithoutTelemetry: 35
- auditedHosts: 3
- queueBlocked: 28
Root causes fixed live:
1. packages/core/src/routes/health-builders.ts
  - Atlas audits / exposure now rematerialize operational findings before proof rendering.
2. packages/core/src/scheduler.ts
  - generic stale auto-resolve no longer auto-closes:
    - atlas-coverage-gap
    - atlas-exposure
    - atlas-host-audit
3. packages/dashboard/public/index-v2.html
  - if proof data is temporarily empty or stale, Atlas now derives a fallback proof model from the current snapshot so the top cards do not render as blank.
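The scheduler change in item 2 can be illustrated as an exemption set consulted before the generic stale check. The finding field names (rule, lastSeen), the seven-day threshold and the code-scan rule name in the usage example are assumptions; only the three exempt rule names come from this handoff.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Rule names taken from the list above; these are never auto-closed as stale.
AUTO_RESOLVE_EXEMPT_RULES = {"atlas-coverage-gap", "atlas-exposure", "atlas-host-audit"}

# Assumption: illustrative staleness window.
STALE_AFTER = timedelta(days=7)


def should_auto_resolve(finding: Dict[str, object], now: datetime) -> bool:
    """Generic stale auto-resolve that skips Atlas-owned finding types."""
    if finding["rule"] in AUTO_RESOLVE_EXEMPT_RULES:
        return False
    return now - finding["lastSeen"] > STALE_AFTER


now = datetime(2026, 5, 9, tzinfo=timezone.utc)
month_old = now - timedelta(days=30)
print(should_auto_resolve({"rule": "atlas-exposure", "lastSeen": month_old}, now))
print(should_auto_resolve({"rule": "code-scan", "lastSeen": month_old}, now))
```

Leaving the Atlas types to their own rematerialization logic avoids the earlier situation where the proof view and the findings list disagreed.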
Live public verification after deploy:
- /api/protection-proof shows non-zero Atlas truth again.
- /api/findings?limit=10 shows open atlas-coverage-gap findings again.

Training / Lane Registry

The public training status is now honest for the current live state:
- magatamallm
  - datasetSource: url
  - collectionsPath: /opt/magatama/training-data/runpod/magatamallm/manifest.json
  - 15679 train
  - 1743 eval
  - 17422 total
  - lastRegistryRunStatus: completed_without_model_artifact
- fo_blogllm
  - lane registry rebuilt on Erik
  - lastRunStatus: completed_without_model_artifact
- tip_llm
  - lane registry rebuilt on Erik
  - lastRunStatus: completed_without_model_artifact
scripts/model_registry_build.ts now compiles per-lane metadata from:
- lane datasets
- lane RunPod manifests
- training-runs.json
The live compiled registry on Erik is no longer all-null; it exposes:
- activeModel
- version
- lastRunId
- lastRunStatus
- datasetSource
- collectionsPath

Still Outstanding

Full automatic training is still blocked by the managed RunPod Axolotl endpoint:
- jobs reach COMPLETED
- but no adoptable artifact is returned
- therefore MAGATAMA correctly records:
  - completed_without_model_artifact
That means:
- no new model version can be truthfully activated yet
- no Ollama alias switch should happen yet
Remaining real blocker:
- move to custom-magatama RunPod worker with explicit adapter/model artifact publication.
