sync: record fscom targeted verification push
parent f64dbf7b6b
commit 7ddc439315
@@ -1,9 +1,96 @@
# Current TIP Sync State

Updated: 2026-05-09 07:34 UTC
Updated: 2026-05-09 09:18 UTC

## Newest Work

- TIP FS.com / Fiberstore targeted verification push on 2026-05-09:
  - operator requested FS.com/Fiberstore next, with all crawler/scraper/robot learnings written to the TIPLLM training pool and no external AI
  - code improved:
    - `packages/scraper/src/scrapers/fs-com.ts`
      - added `FS_DB_DETAIL_ONLY=1` mode to revalidate existing FS.COM product URLs directly from DB
        - avoids broad category/listing discovery while product URLs still need verification
      - `detectReach()` now handles comma thousands and decimal values
      - added deterministic `detectFiberType()` fallback from product name, part number and specs
      - scraper now writes `productUrl` into the transceiver row
      - detail verification source is now the actual FS.com product URL instead of the literal `fs.com`
  - live Erik verification:
    - deployed scraper to `/opt/tip`
    - `pnpm -C packages/scraper build` passed on Erik after the change
    - ran four safe DB-detail-only Playwright batches:
      - batch 1: target `80`, scraped `80`, failed `0`, new prices `17`, stock `18`, specs `24`
      - batch 2: target `80`, scraped `79`, failed `0`, new prices `6`, stock `8`, specs `23`
      - batch 3: target `90`, scraped `89`, failed `0`, new prices `21`, stock `24`, specs `47`
      - batch 4 closure: target `42`, scraped `42`, failed `0`, new prices `5`, stock `3`, specs `25`
    - all runs used Playwright concurrency `1`, `nice -n 10`, and no broad category crawl
  - Erik/TIP health after closure:
    - status: `healthy`
    - load status: `ok`
    - memory used: `13%`
    - transceivers: `17647`
    - vendors: `478`
    - switches: `680`
    - global verified counters:
      - price: `11557`
      - image: `10636`
      - details: `9816`
      - fully: `8522`
  - FS.com before targeted detail batches:
    - total rows: `383`
    - price verified: `379`
    - image verified: `299`
    - details verified: `108`
    - price+image+details: `108`
    - fully verified: `3`
    - missing product URL: `76`
    - missing image URL: `84`
    - missing reach label: `9`
    - missing fiber type: `323`
    - HTML product-like complete rows: `106`
  - FS.com after closure:
    - total rows: `383`
    - price verified: `379`
    - image verified: `299`
    - details verified: `260`
    - price+image+details: `260`
    - fully verified: `205`
    - missing product URL: `76`
    - missing image URL: `84`
    - missing reach label: `9`
    - missing fiber type: `123`
    - HTML product-like rows:
      - total `299`
      - price `299`
      - image `282`
      - details `258`
      - complete `258`
    - no-url rows:
      - total `76`
      - price `76`
      - image `15`
      - details `0`
    - category rows:
      - total `4`
      - no verified signals
  - interpretation / next strategy:
    - the DB-detail-only approach is now mostly exhausted
    - the fourth clean closure batch did not raise `details_verified`; it only nudged `fully_verified` from `199` to `205`
    - do not keep repeating the same FS.com detail crawler on Erik
    - next FS.com work should be:
      - source-discovery/classification robot for the `76` no-url rows
      - parser/source diagnostics for the remaining `41` HTML product-like rows missing detail/fiber/image signals
      - likely separate handling for malformed or historical `/de/de/products/...` URLs and pages that return no useful text
  - TIPLLM training pool:
    - all four FS.com batches were written and pushed to Gitea
    - latest training commits:
      - `28cac05` batch 1
      - `a0a6be3` batch 2
      - `38736ae` batch 3
      - `2c25bf3` closure batch
  - important truth:
    - do not claim FS.com is complete
    - the honest current claim is: FS.com product-like coverage improved strongly, but only `258/299` HTML product-like rows are complete and `76` no-url rows still need source discovery/classification

- TIP Flexoptix completion push on 2026-05-09:
  - operator said "feuer frei" (German for "fire at will", i.e. go ahead) after confirming Flexoptix was not yet complete
  - TIPLLM training pool was updated immediately with the truth rule:

101
sync/history/2026-05-09-fscom-targeted-verification-push.md
Normal file
@@ -0,0 +1,101 @@
# FS.com / Fiberstore Targeted Verification Push

Date: 2026-05-09

## Intent

Continue TIP data completion for FS.com/Fiberstore after Flexoptix. The operator requested that price, image and product information be researched deeply enough to avoid manual validation, while keeping Erik safe and writing every crawler/scraper/robot learning into the TIPLLM training pool.

## Code Changed

- `packages/scraper/src/scrapers/fs-com.ts`
  - added `FS_DB_DETAIL_ONLY=1`
    - targets existing FS.COM DB product URLs with missing verification signals
    - avoids broad category discovery while known product URLs still need work
  - improved reach parsing for comma/decimal values
  - added deterministic fiber type fallback from product name, part number and specs
  - writes product URL to `transceivers.product_page_url`
  - stores the real FS.com product URL as detail verification source
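
The reach parsing change listed above can be illustrated with a minimal sketch. The function name `parseReachMeters`, the regular expression and the unit handling are assumptions for illustration only; the actual `detectReach()` code in `fs-com.ts` is not shown in this record.

```typescript
// Hypothetical sketch, not the real detectReach() from fs-com.ts.
// Parses a reach such as "10km", "1,000 m" or "0.5 km" into meters.
function parseReachMeters(text: string): number | null {
  // Integer part with optional comma thousands groups, an optional
  // decimal part, then a unit of "km" or "m" ending at a word boundary.
  const match = /(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?\s*(km|m)\b/i.exec(text);
  if (!match) return null;

  // Drop thousands separators before converting to a number.
  const integerPart = match[1].replace(/,/g, "");
  const value = Number.parseFloat(integerPart + (match[2] ?? ""));
  if (!Number.isFinite(value)) return null;

  const meters = match[3].toLowerCase() === "km" ? value * 1000 : value;
  return Math.round(meters);
}

console.log(parseReachMeters("SFP+ 1310nm 10km")); // 10000
console.log(parseReachMeters("reach 1,000 m"));    // 1000
console.log(parseReachMeters("0.5 km"));           // 500
```

The word boundary after the unit keeps wavelengths such as `850nm` from being read as a reach. A European decimal comma (for example `12,5 km`) is not handled by this sketch.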
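
A deterministic fiber type fallback of the kind described in the list above could look like the following sketch. The type names, the field names and the rule list are assumptions for illustration; the real fallback in `fs-com.ts` may use different signals and a different order.

```typescript
type FiberType = "SMF" | "MMF";

// Hypothetical input shape; field names are illustrative only.
interface FiberHints {
  name: string;
  partNumber: string;
  specs: Record<string, string>;
}

// Ordered rules: the first matching rule wins, so the same input always
// yields the same result. Explicit keywords come before optic suffixes.
const FIBER_RULES: ReadonlyArray<readonly [RegExp, FiberType]> = [
  [/\b(?:smf|single[- ]?mode|os[12])\b/i, "SMF"],
  [/\b(?:mmf|multi[- ]?mode|om[1-5])\b/i, "MMF"],
  [/\b(?:sr\d*|sx)\b/i, "MMF"], // assumed: short-reach optics use multimode
  [/\b(?:lr|er|zr)\d*\b/i, "SMF"], // assumed: long-reach optics use single-mode
];

// Checks product name first, then part number, then spec values.
function detectFiberTypeFallback(hints: FiberHints): FiberType | null {
  const sources = [hints.name, hints.partNumber, ...Object.values(hints.specs)];
  for (const source of sources) {
    for (const [pattern, fiberType] of FIBER_RULES) {
      if (pattern.test(source)) return fiberType;
    }
  }
  return null;
}

console.log(
  detectFiberTypeFallback({
    name: "Generic SFP+ module",
    partNumber: "SFP-10G-LR",
    specs: {},
  }),
); // SMF
```

Returning `null` when no rule matches keeps the row countable as "missing fiber type" instead of guessing.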

## Live Runs

All runs were on Erik with:

- Playwright concurrency `1`
- `nice -n 10`
- no broad category crawl
- DB-detail-only mode
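
The run constraints above (DB-detail-only targets, one page at a time) can be sketched as follows. The row shape, the function names and the selection criteria are assumptions for illustration; only the `FS_DB_DETAIL_ONLY=1` flag comes from this record.

```typescript
// Hypothetical row shape; the real schema is not shown in this record.
interface FsRow {
  id: number;
  productUrl: string | null;
  detailsVerified: boolean;
  fiberType: string | null;
}

// True when the DB-detail-only flag is set. The environment is passed in
// (for example process.env) so the function stays easy to test.
function isDbDetailOnly(env: Record<string, string | undefined>): boolean {
  return env["FS_DB_DETAIL_ONLY"] === "1";
}

// Pick rows that already have a product URL but still lack a signal.
// Rows without a URL are left for a separate discovery step.
function selectDetailTargets(rows: FsRow[], limit: number): FsRow[] {
  return rows
    .filter((r) => r.productUrl !== null && (!r.detailsVerified || r.fiberType === null))
    .slice(0, limit);
}

// Visit targets strictly one after another (concurrency 1).
// A visit that resolves to false counts as neither scraped nor failed.
async function runSequentially(
  targets: FsRow[],
  visit: (row: FsRow) => Promise<boolean>,
): Promise<{ scraped: number; failed: number }> {
  let scraped = 0;
  let failed = 0;
  for (const row of targets) {
    try {
      if (await visit(row)) scraped += 1;
    } catch {
      failed += 1;
    }
  }
  return { scraped, failed };
}
```

The `nice -n 10` priority is a process-level setting and is not modeled in this sketch.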

Batch results:

- Batch 1: target `80`, scraped `80`, failed `0`, new prices `17`, stock `18`, specs `24`
- Batch 2: target `80`, scraped `79`, failed `0`, new prices `6`, stock `8`, specs `23`
- Batch 3: target `90`, scraped `89`, failed `0`, new prices `21`, stock `24`, specs `47`
- Batch 4 closure: target `42`, scraped `42`, failed `0`, new prices `5`, stock `3`, specs `25`

`pnpm -C packages/scraper build` passed on Erik after the scraper change.

## FS.com Counters

Before:

- total rows: `383`
- price verified: `379`
- image verified: `299`
- details verified: `108`
- price+image+details: `108`
- fully verified: `3`
- missing URL: `76`
- missing image URL: `84`
- missing reach label: `9`
- missing fiber type: `323`
- HTML product-like complete: `106`

After closure:

- total rows: `383`
- price verified: `379`
- image verified: `299`
- details verified: `260`
- price+image+details: `260`
- fully verified: `205`
- missing URL: `76`
- missing image URL: `84`
- missing reach label: `9`
- missing fiber type: `123`
- HTML product-like rows: `299`
- HTML product-like complete: `258`
- no-url rows: `76`
- category rows: `4`

TIP health after closure:

- status: `healthy`
- load status: `ok`
- memory used: `13%`
- transceivers: `17647`
- vendors: `478`
- switches: `680`
- fully verified globally: `8522`

## Training Pool

FS.com batches were written to `/tmp/tip-training-data` and pushed to Gitea.

Training pool commits:

- `28cac05 crawl: add fscom db detail batch learning record`
- `a0a6be3 crawl: add fscom db detail batch 2 learning record`
- `38736ae crawl: add fscom db detail batch 3 learning record`
- `2c25bf3 crawl: add fscom db detail closure learning record`

## Next

Do not repeat the same DB-detail-only FS.com crawler on Erik. The fourth clean closure batch did not increase `details_verified`, so the remaining gaps need a different strategy:

- source-discovery/classification for `76` no-url rows
- parser/source diagnostics for the remaining `41` HTML product-like rows missing details or fiber/image signals
- explicit classification for `4` category rows
- likely cleanup of historical/malformed `/de/de/products/...` URLs and no-text pages

Truth rule: do not claim FS.com is complete. Current honest status is `258/299` HTML product-like rows complete and `205/383` fully verified overall.
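
The figures in the truth rule can be re-derived from the "After closure" counters. This is a small arithmetic sketch; the object and function names are illustrative, and the numbers are copied from this record.

```typescript
// Counter values copied from the "After closure" list in this record.
const afterClosure = {
  totalRows: 383,
  fullyVerified: 205,
  htmlProductLike: 299,
  htmlProductLikeComplete: 258,
};

// Share of `part` in `whole` as a percentage string with one decimal place.
function percent(part: number, whole: number): string {
  return ((part / whole) * 100).toFixed(1) + "%";
}

// HTML product-like rows that are still incomplete.
const incompleteHtmlRows =
  afterClosure.htmlProductLike - afterClosure.htmlProductLikeComplete;

console.log(incompleteHtmlRows); // 41
console.log(percent(afterClosure.htmlProductLikeComplete, afterClosure.htmlProductLike)); // 86.3%
console.log(percent(afterClosure.fullyVerified, afterClosure.totalRows)); // 53.5%
```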