From 6ee10bf3011285171bbae5cea07807894a010901 Mon Sep 17 00:00:00 2001
From: Rene Fichtmueller
Date: Sat, 9 May 2026 14:00:37 +0200
Subject: [PATCH] sync: record crawlee evaluation and fscom url discovery

---
 sync/CURRENT.md                               |  79 +++++++++++-
 ...wlee-evaluation-and-fscom-url-discovery.md | 118 ++++++++++++++++++
 2 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md

diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index 61e6beb..333e78b 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,9 +1,86 @@
 # Current TIP Sync State
 
-Updated: 2026-05-09 09:18 UTC
+Updated: 2026-05-09 11:59 UTC
 
 ## Newest Work
 
+- Priority Crawlee evaluation + FS.com URL discovery on 2026-05-09:
+  - operator asked whether these repos help:
+    - `https://github.com/apify/crawlee`
+    - `https://github.com/apify/crawlee-python`
+    - `https://github.com/hiteshchoudhary/crawlee-project`
+  - evaluation:
+    - `apify/crawlee` is directly relevant and already in use in TIP via TypeScript `PlaywrightCrawler`
+    - current TIP benefit is not adding Crawlee, but using Crawlee more deliberately:
+      - bounded RequestQueues
+      - stable `uniqueKey`
+      - explicit retry/no-text classes
+      - isolated storage directories
+      - AutoscaledPool telemetry as safety signal
+      - hard concurrency caps on Erik
+    - `apify/crawlee-python` is useful for future isolated Pi/Proxmox workers, especially for Python-native extraction experiments, but should not replace the current TypeScript scraper core today
+    - `hiteshchoudhary/crawlee-project` is a small community/demo project, useful as inspiration only; not a production dependency for TIP
+  - code improved:
+    - `packages/scraper/src/scrapers/fs-com.ts`
+      - added `FS_URL_DISCOVERY_ONLY=1`
+      - maps existing `FS-` rows without `product_page_url` to `https://www.fs.com/de/products/.html`
+      - carries `targetTransceiverId` through the crawler so verified source evidence updates the original row instead of creating duplicates
+      - marks current FS.com product images verified for target rows
+      - accepts deterministic H1/part/spec evidence for detail verification when FS.com does not expose a traditional spec table
+  - live runs on Erik:
+    - URL discovery pilot:
+      - target `20`
+      - scraped `19`
+      - failed `0`
+      - no-url rows dropped from `76` to `57`
+    - full URL discovery:
+      - target `56`
+      - scraped `55`
+      - failed `1` (`https://www.fs.com/de/products/229461.html`, transient `ERR_NETWORK_CHANGED`)
+      - no-url rows dropped to `2`
+    - DB reconciliation with improved detail evidence:
+      - target `57`
+      - scraped `55`
+      - failed `0`
+      - new prices `41`
+      - stock observations `40`
+      - specs verified `55`
+    - `pnpm -C packages/scraper build` passed on Erik after the code change
+  - FS.com final state after URL discovery:
+    - total rows: `383`
+    - price verified: `379`
+    - image verified: `374`
+    - details verified: `373`
+    - price+image+details: `373`
+    - fully verified: `205`
+    - missing URL: `2`
+    - missing image URL: `9`
+    - missing reach label: `4`
+    - missing fiber type: `9`
+    - HTML product-like rows:
+      - total `373`
+      - image `372`
+      - details `371`
+      - complete `371`
+    - no-url rows:
+      - `Change`
+      - `FS-229461`
+    - category rows: `4`
+  - TIP health after run:
+    - status `healthy`
+    - load status `ok`
+    - memory used `13%`
+    - global verified counters:
+      - price `11557`
+      - image `10711`
+      - details `9929`
+      - fully `8526`
+  - training pool:
+    - pushed `4d9a11c crawl: add fscom url discovery learning record`
+  - truth:
+    - FS.com is still not 100% complete
+    - honest current claim: `371/373` HTML product-like rows complete; remaining work is small and classifiable
+
 - TIP FS.com / Fiberstore targeted verification push on 2026-05-09:
   - operator requested FS.com/Fiberstore next, with all crawler/scraper/robot learnings written to the TIPLLM training pool and no external AI
   - code improved:
diff --git a/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md b/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md
new file mode 100644
index 0000000..f85b10c
--- /dev/null
+++ b/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md
@@ -0,0 +1,118 @@
+# Crawlee Evaluation and FS.com URL Discovery
+
+Date: 2026-05-09
+
+## Question
+
+Operator asked with highest priority whether these repositories help TIP:
+
+- `https://github.com/apify/crawlee`
+- `https://github.com/apify/crawlee-python`
+- `https://github.com/hiteshchoudhary/crawlee-project`
+
+## Evaluation
+
+`apify/crawlee` helps directly, but TIP already uses it in the TypeScript scraper stack. The priority is to harden our current usage rather than introduce a new crawler framework.
+
+Best immediate Crawlee practices for TIP:
+
+- keep per-vendor bounded runs
+- use stable `uniqueKey`/target IDs so retries do not create duplicate rows
+- keep Crawlee storage directories isolated per vendor/run class
+- record no-text and max-retry URLs as a separate retry class
+- use AutoscaledPool telemetry as a safety signal
+- keep Erik at low concurrency and move heavier work to Pi/Proxmox workers
+
+`apify/crawlee-python` is useful for future isolated worker experiments on Pi/Proxmox, especially where Python extraction libraries help. It should not replace the current TypeScript crawler core today.
+
+`hiteshchoudhary/crawlee-project` is a small community/demo app, not a production building block for TIP.
+
+## Code
+
+Changed:
+
+- `packages/scraper/src/scrapers/fs-com.ts`
+
+Added:
+
+- `FS_URL_DISCOVERY_ONLY=1`
+- target row propagation with `targetTransceiverId`
+- image verification for target rows
+- H1/part/spec deterministic detail verification when FS.com lacks a spec table
+
+## Live Runs
+
+URL discovery pilot:
+
+- target `20`
+- scraped `19`
+- failed `0`
+- no-url rows: `76` -> `57`
+
+Full URL discovery:
+
+- target `56`
+- scraped `55`
+- failed `1`
+- failed URL: `https://www.fs.com/de/products/229461.html`
+- no-url rows: `57` -> `2`
+
+DB reconciliation:
+
+- target `57`
+- scraped `55`
+- failed `0`
+- new prices `41`
+- stock observations `40`
+- specs verified `55`
+
+Build:
+
+- `pnpm -C packages/scraper build` passed on Erik
+
+## FS.com Final State
+
+- total rows: `383`
+- price verified: `379`
+- image verified: `374`
+- details verified: `373`
+- price+image+details: `373`
+- fully verified: `205`
+- missing URL: `2`
+- missing image URL: `9`
+- missing reach label: `4`
+- missing fiber type: `9`
+- HTML product-like rows: `373`
+- HTML product-like complete: `371`
+- no-url rows: `2`
+- category rows: `4`
+
+Remaining no-url rows:
+
+- `Change`
+- `FS-229461`
+
+TIP health after run:
+
+- status: `healthy`
+- load status: `ok`
+- memory used: `13%`
+- global image verified: `10711`
+- global details verified: `9929`
+- global fully verified: `8526`
+
+## Training Pool
+
+Pushed:
+
+- `4d9a11c crawl: add fscom url discovery learning record`
+
+## Next
+
+Do not claim FS.com is 100% complete yet. Remaining work:
+
+- classify `Change`
+- retry or classify `FS-229461`
+- classify 4 category rows
+- close 9 image/fiber gaps
+- then move to next high-value competitor with the same bounded Crawlee pattern
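
Reviewer note: the URL mapping and target-row propagation recorded in this patch might look roughly like the sketch below. It is not the code in `fs-com.ts`. The row shape, the `buildDiscoveryRequests` helper, the `limit` parameter, and the `fs-url-discovery:` key prefix are all hypothetical. The URL pattern is inferred from the one concrete pair in the notes (`FS-229461` and `https://www.fs.com/de/products/229461.html`). The `url` / `uniqueKey` / `userData` fields mirror Crawlee's request options, but the sketch imports nothing from Crawlee so it runs on its own.

```typescript
// Hypothetical row shape: only `product_page_url` and the `FS-` prefix appear in the notes.
interface TransceiverRow {
  id: number; // assumed primary key of the existing row
  partNumber: string; // assumed field name, e.g. "FS-229461"
  productPageUrl: string | null;
}

// Plain object with the fields Crawlee accepts for a request.
interface DiscoveryRequest {
  url: string;
  uniqueKey: string;
  userData: { targetTransceiverId: number };
}

// Assumption: the digits after "FS-" are the product ID in the page URL.
const FS_PART = /^FS-(\d+)$/;

function buildDiscoveryRequests(
  rows: TransceiverRow[],
  limit: number, // hard bound on the run, so a pilot can stay small
): DiscoveryRequest[] {
  const requests: DiscoveryRequest[] = [];
  const seen = new Set<string>();

  for (const row of rows) {
    if (requests.length >= limit) break;
    if (row.productPageUrl) continue; // already has a URL, nothing to discover

    const match = FS_PART.exec(row.partNumber);
    if (!match) continue; // unmappable rows (such as `Change`) stay for manual classification

    // Keyed on the row, not the URL, so a retry maps back to the same row.
    const uniqueKey = `fs-url-discovery:${row.id}`;
    if (seen.has(uniqueKey)) continue;
    seen.add(uniqueKey);

    requests.push({
      url: `https://www.fs.com/de/products/${match[1]}.html`,
      uniqueKey,
      userData: { targetTransceiverId: row.id },
    });
  }
  return requests;
}

// Usage: a bounded pilot over rows that still lack a URL.
const pilot = buildDiscoveryRequests(
  [
    { id: 1, partNumber: "FS-229461", productPageUrl: null },
    { id: 2, partNumber: "Change", productPageUrl: null },
  ],
  20,
);
console.log(pilot.map((r) => r.url));
```

Objects of this shape could be passed to a Crawlee crawler's `run()` or `addRequests()`, with `maxConcurrency` and `maxRequestsPerCrawl` set low on the crawler. A request handler could then read `request.userData.targetTransceiverId` to update the original row rather than insert a new one. Whether `fs-com.ts` is wired this way is not stated in the patch.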