sync: record crawlee evaluation and fscom url discovery

2026-05-09 14:00:37 +02:00 · 2026-05-09 14:00:37 +02:00 · 6ee10bf301
commit 6ee10bf301
parent 3d79f6b8e0
2 changed files with 196 additions and 1 deletions
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,9 +1,86 @@
 # Current TIP Sync State

-Updated: 2026-05-09 09:18 UTC
+Updated: 2026-05-09 11:59 UTC

 ## Newest Work

+- Priority Crawlee evaluation + FS.com URL discovery on 2026-05-09:
+  - operator asked whether these repos help:
+    - `https://github.com/apify/crawlee`
+    - `https://github.com/apify/crawlee-python`
+    - `https://github.com/hiteshchoudhary/crawlee-project`
+  - evaluation:
+    - `apify/crawlee` is directly relevant and already in use in TIP via TypeScript `PlaywrightCrawler`
+    - current TIP benefit is not adding Crawlee, but using Crawlee more deliberately:
+      - bounded RequestQueues
+      - stable `uniqueKey`
+      - explicit retry/no-text classes
+      - isolated storage directories
+      - AutoscaledPool telemetry as safety signal
+      - hard concurrency caps on Erik
+    - `apify/crawlee-python` is useful for future isolated Pi/Proxmox workers, especially for Python-native extraction experiments, but should not replace the current TypeScript scraper core today
+    - `hiteshchoudhary/crawlee-project` is a small community/demo project, useful as inspiration only; not a production dependency for TIP
+  - code improved:
+    - `packages/scraper/src/scrapers/fs-com.ts`
+      - added `FS_URL_DISCOVERY_ONLY=1`
+      - maps existing `FS-<numeric-id>` rows without `product_page_url` to `https://www.fs.com/de/products/<id>.html`
+      - carries `targetTransceiverId` through the crawler so verified source evidence updates the original row instead of creating duplicates
+      - marks current FS.com product images verified for target rows
+      - accepts deterministic H1/part/spec evidence for detail verification when FS.com does not expose a traditional spec table
+  - live runs on Erik:
+    - URL discovery pilot:
+      - target `20`
+      - scraped `19`
+      - failed `0`
+      - no-url rows dropped from `76` to `57`
+    - full URL discovery:
+      - target `56`
+      - scraped `55`
+      - failed `1` (`https://www.fs.com/de/products/229461.html`, transient `ERR_NETWORK_CHANGED`)
+      - no-url rows dropped to `2`
+    - DB reconciliation with improved detail evidence:
+      - target `57`
+      - scraped `55`
+      - failed `0`
+      - new prices `41`
+      - stock observations `40`
+      - specs verified `55`
+    - `pnpm -C packages/scraper build` passed on Erik after the code change
+  - FS.com final state after URL discovery:
+    - total rows: `383`
+    - price verified: `379`
+    - image verified: `374`
+    - details verified: `373`
+    - price+image+details: `373`
+    - fully verified: `205`
+    - missing URL: `2`
+    - missing image URL: `9`
+    - missing reach label: `4`
+    - missing fiber type: `9`
+    - HTML product-like rows:
+      - total `373`
+      - image `372`
+      - details `371`
+      - complete `371`
+    - no-url rows:
+      - `Change`
+      - `FS-229461`
+    - category rows: `4`
+  - TIP health after run:
+    - status `healthy`
+    - load status `ok`
+    - memory used `13%`
+    - global verified counters:
+      - price `11557`
+      - image `10711`
+      - details `9929`
+      - fully `8526`
+  - training pool:
+    - pushed `4d9a11c crawl: add fscom url discovery learning record`
+  - truth:
+    - FS.com is still not 100% complete
+    - honest current claim: `371/373` HTML product-like rows complete; remaining work is small and classifiable
+
 - TIP FS.com / Fiberstore targeted verification push on 2026-05-09:
  - operator requested FS.com/Fiberstore next, with all crawler/scraper/robot learnings written to the TIPLLM training pool and no external AI
  - code improved:
--- a/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md
+++ b/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md
@@ -0,0 +1,118 @@
+# Crawlee Evaluation and FS.com URL Discovery
+
+Date: 2026-05-09
+
+## Question
+
+Operator asked with highest priority whether these repositories help TIP:
+
+- `https://github.com/apify/crawlee`
+- `https://github.com/apify/crawlee-python`
+- `https://github.com/hiteshchoudhary/crawlee-project`
+
+## Evaluation
+
+`apify/crawlee` helps directly, but TIP already uses it in the TypeScript scraper stack. The priority is to harden our current usage rather than introduce a new crawler framework.
+
+Best immediate Crawlee practices for TIP:
+
+- keep per-vendor bounded runs
+- use stable `uniqueKey`/target IDs so retries do not create duplicate rows
+- keep Crawlee storage directories isolated per vendor/run class
+- record no-text and max-retry URLs as a separate retry class
+- use AutoscaledPool telemetry as a safety signal
+- keep Erik at low concurrency and move heavier work to Pi/Proxmox workers
+
+`apify/crawlee-python` is useful for future isolated worker experiments on Pi/Proxmox, especially where Python extraction libraries help. It should not replace the current TypeScript crawler core today.
+
+`hiteshchoudhary/crawlee-project` is a small community/demo app, not a production building block for TIP.
+
+## Code
+
+Changed:
+
+- `packages/scraper/src/scrapers/fs-com.ts`
+
+Added:
+
+- `FS_URL_DISCOVERY_ONLY=1`
+- target row propagation with `targetTransceiverId`
+- image verification for target rows
+- H1/part/spec deterministic detail verification when FS.com lacks a spec table
+
+## Live Runs
+
+URL discovery pilot:
+
+- target `20`
+- scraped `19`
+- failed `0`
+- no-url rows: `76` -> `57`
+
+Full URL discovery:
+
+- target `56`
+- scraped `55`
+- failed `1`
+- failed URL: `https://www.fs.com/de/products/229461.html`
+- no-url rows: `57` -> `2`
+
+DB reconciliation:
+
+- target `57`
+- scraped `55`
+- failed `0`
+- new prices `41`
+- stock observations `40`
+- specs verified `55`
+
+Build:
+
+- `pnpm -C packages/scraper build` passed on Erik
+
+## FS.com Final State
+
+- total rows: `383`
+- price verified: `379`
+- image verified: `374`
+- details verified: `373`
+- price+image+details: `373`
+- fully verified: `205`
+- missing URL: `2`
+- missing image URL: `9`
+- missing reach label: `4`
+- missing fiber type: `9`
+- HTML product-like rows: `373`
+- HTML product-like complete: `371`
+- no-url rows: `2`
+- category rows: `4`
+
+Remaining no-url rows:
+
+- `Change`
+- `FS-229461`
+
+TIP health after run:
+
+- status: `healthy`
+- load status: `ok`
+- memory used: `13%`
+- global image verified: `10711`
+- global details verified: `9929`
+- global fully verified: `8526`
+
+## Training Pool
+
+Pushed:
+
+- `4d9a11c crawl: add fscom url discovery learning record`
+
+## Next
+
+Do not claim FS.com is 100% complete yet. Remaining work:
+
+- classify `Change`
+- retry or classify `FS-229461`
+- classify 4 category rows
+- close the remaining gaps: 9 missing image URLs and 9 missing fiber types
+- then move to next high-value competitor with the same bounded Crawlee pattern
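
The `FS-<numeric-id>` to product-URL mapping and the `targetTransceiverId` propagation recorded in this commit can be sketched roughly as below. This is a minimal illustration, not the code in `fs-com.ts`: the `TargetRow` shape, the function names `fsProductUrl` and `buildDiscoveryRequests`, and the `fs-target:` key prefix are assumptions made for the example. Only the URL pattern and the `targetTransceiverId` name come from the record, and `url`, `uniqueKey`, and `userData` are the request fields Crawlee accepts.

```typescript
// Hypothetical row shape; the real TIP schema is not shown in this record.
interface TargetRow {
  id: number;                    // database row id (assumed numeric)
  key: string;                   // e.g. "FS-229461" or "Change"
  productPageUrl: string | null; // null for "no-url" rows
}

// Plain object using the request fields Crawlee accepts.
interface DiscoveryRequest {
  url: string;
  uniqueKey: string;
  userData: { targetTransceiverId: number };
}

// Map "FS-<numeric-id>" to the German product page.
// Anything that does not match the pattern returns null.
function fsProductUrl(key: string): string | null {
  const match = /^FS-(\d+)$/.exec(key.trim());
  return match ? `https://www.fs.com/de/products/${match[1]}.html` : null;
}

// Build one request per row that lacks a URL and has a mappable key.
function buildDiscoveryRequests(rows: TargetRow[]): DiscoveryRequest[] {
  const requests: DiscoveryRequest[] = [];
  for (const row of rows) {
    if (row.productPageUrl !== null) continue; // URL already known
    const url = fsProductUrl(row.key);
    if (url === null) continue; // not mappable, needs manual classification
    requests.push({
      url,
      // Stable key derived from the row id, so a retry of the same row
      // is deduplicated by the request queue instead of enqueued twice.
      uniqueKey: `fs-target:${row.id}`,
      // Carried through the crawl so the handler can update the
      // original row rather than insert a new one.
      userData: { targetTransceiverId: row.id },
    });
  }
  return requests;
}
```

Under these assumptions, the resulting array could be passed to a crawler's `run()` call, and the request handler would read `request.userData.targetTransceiverId` to decide which row to update. A row such as `Change` produces no request, which matches its status as a remaining no-url row that needs classification.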