Crawlee Evaluation and FS.com URL Discovery
Date: 2026-05-09
Question
Operator asked with highest priority whether these repositories help TIP:
- https://github.com/apify/crawlee
- https://github.com/apify/crawlee-python
- https://github.com/hiteshchoudhary/crawlee-project
Evaluation
apify/crawlee helps directly, but TIP already uses it in the TypeScript scraper stack. The priority is to harden our current usage rather than introduce a new crawler framework.
Best immediate Crawlee practices for TIP:
- keep per-vendor bounded runs
- use stable uniqueKey/target IDs so retries do not create duplicate rows
- keep Crawlee storage directories isolated per vendor/run class
- record no-text and max-retry URLs as a separate retry class
- use AutoscaledPool telemetry as a safety signal
- keep Erik at low concurrency and move heavier work to Pi/Proxmox workers
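The stable-key and isolated-storage practices above can be sketched roughly as follows. This is a minimal illustration, not TIP's actual code: `TargetRow`, `toRequest`, and `storageDirFor` are hypothetical names, and the `vendor:targetId` key format is an assumption. Only `url`, `uniqueKey`, and `userData` correspond to fields that Crawlee accepts on a request.

```typescript
// Hypothetical row shape; field names are assumptions for illustration.
interface TargetRow {
  vendor: string;              // e.g. "fs-com"
  targetTransceiverId: string; // ID of the row the crawl is meant to fill
  url: string;
}

// Mirrors the subset of Crawlee request options used here.
interface RequestSketch {
  url: string;
  uniqueKey: string;
  userData: { vendor: string; targetTransceiverId: string };
}

// Derive the key from vendor + target ID instead of the URL, so a retry of
// the same target (even with a slightly different URL) maps to the same key.
function toRequest(row: TargetRow): RequestSketch {
  return {
    url: row.url,
    uniqueKey: `${row.vendor}:${row.targetTransceiverId}`,
    userData: { vendor: row.vendor, targetTransceiverId: row.targetTransceiverId },
  };
}

// One storage directory per vendor and run class (e.g. "url-discovery"),
// so queues and datasets from different runs never mix.
function storageDirFor(baseDir: string, vendor: string, runClass: string): string {
  const safe = (s: string): string => s.toLowerCase().replace(/[^a-z0-9-]+/g, "-");
  return `${baseDir}/${safe(vendor)}/${safe(runClass)}`;
}
```

The resulting directory could then be handed to Crawlee's storage configuration, for example through the CRAWLEE_STORAGE_DIR environment variable, before the crawler for that vendor starts.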
apify/crawlee-python is useful for future isolated worker experiments on Pi/Proxmox, especially where Python extraction libraries help. It should not replace the current TypeScript crawler core today.
hiteshchoudhary/crawlee-project is a small community/demo app, not a production building block for TIP.
Code
Changed:
- packages/scraper/src/scrapers/fs-com.ts
Added:
- FS_URL_DISCOVERY_ONLY=1
- target row propagation with targetTransceiverId
- image verification for target rows
- H1/part/spec deterministic detail verification when FS.com lacks a spec table
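As a rough illustration of the additions above, the sketch below shows one way a discovery-only flag and a deterministic H1 check could look. It is not the code in fs-com.ts: `isUrlDiscoveryOnly`, `h1MatchesTarget`, the token-matching rule, and the sample heading in the usage note are all assumptions. Only the variable name FS_URL_DISCOVERY_ONLY comes from the change itself.

```typescript
type Env = Record<string, string | undefined>;

// Discovery-only mode is on only when the flag is exactly "1".
function isUrlDiscoveryOnly(env: Env): boolean {
  return env["FS_URL_DISCOVERY_ONLY"] === "1";
}

// Lowercase and collapse every non-alphanumeric run to a single space,
// so "SFP-10G-SR" and "sfp 10g sr" compare equal.
const normalize = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

// Assumed rule: the page counts as verified when the H1 contains the part
// number and every expected spec token as whole words. No fuzzy scoring,
// so the same input always yields the same verdict.
function h1MatchesTarget(h1: string, partNumber: string, specTokens: string[]): boolean {
  const haystack = ` ${normalize(h1)} `;
  return [partNumber, ...specTokens].every((token) => {
    const needle = normalize(token);
    return needle.length > 0 && haystack.includes(` ${needle} `);
  });
}
```

In a Node.js scraper the flag would typically be read with `isUrlDiscoveryOnly(process.env)`. With a made-up heading such as "Cisco SFP-10G-SR Compatible 10GBASE-SR SFP+ 850nm 300m", the check accepts part "SFP-10G-SR" with tokens "850nm" and "300m", and rejects a "1310nm" token.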
Live Runs
URL discovery pilot:
- target: 20
- scraped: 19
- failed: 0
- no-url rows: 76->57
Full URL discovery:
- target: 56
- scraped: 55
- failed: 1
- failed URL: https://www.fs.com/de/products/229461.html
- no-url rows: 57->2
DB reconciliation:
- target: 57
- scraped: 55
- failed: 0
- new prices: 41
- stock observations: 40
- specs verified: 55
Build:
- pnpm -C packages/scraper build passed on Erik
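The counts from the pilot and full URL discovery runs can be cross-checked with a small reconciliation helper. This is a sketch only: `RunSummary` and `reconcile` are hypothetical names, and the idea that each scraped target resolves exactly one no-url row is an assumption, not something the run output states.

```typescript
interface RunSummary {
  target: number;
  scraped: number;
  failed: number;
  noUrlBefore: number;
  noUrlAfter: number;
}

// resolved:    how many no-url rows the run closed
// unaccounted: targets that were neither scraped nor failed
function reconcile(run: RunSummary): { resolved: number; unaccounted: number } {
  return {
    resolved: run.noUrlBefore - run.noUrlAfter,
    unaccounted: run.target - run.scraped - run.failed,
  };
}

// Figures copied from the two discovery runs above.
const pilot: RunSummary = { target: 20, scraped: 19, failed: 0, noUrlBefore: 76, noUrlAfter: 57 };
const full: RunSummary = { target: 56, scraped: 55, failed: 1, noUrlBefore: 57, noUrlAfter: 2 };

const pilotCheck = reconcile(pilot);
const fullCheck = reconcile(full);
```

For both runs the drop in no-url rows equals the scraped count. The pilot leaves one target that is neither scraped nor failed, which may be worth a separate look.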
FS.com Final State
- total rows: 383
- price verified: 379
- image verified: 374
- details verified: 373
- price+image+details: 373
- fully verified: 205
- missing URL: 2
- missing image URL: 9
- missing reach label: 4
- missing fiber type: 9
- HTML product-like rows: 373
- HTML product-like complete: 371
- no-url rows: 2
- category rows: 4
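To put the final-state counts in proportion, a coverage helper along these lines could be used. `coveragePct` and `fsComFinal` are hypothetical names, and the counts are copied from the list above.

```typescript
// Counts copied from the FS.com final state.
const fsComFinal = {
  totalRows: 383,
  priceVerified: 379,
  imageVerified: 374,
  detailsVerified: 373,
  fullyVerified: 205,
};

// Percentage of rows in a given state, rounded to one decimal place.
function coveragePct(verified: number, total: number): string {
  if (total <= 0) throw new Error("total must be positive");
  return ((verified / total) * 100).toFixed(1);
}

const priceCoverage = coveragePct(fsComFinal.priceVerified, fsComFinal.totalRows);
const detailsCoverage = coveragePct(fsComFinal.detailsVerified, fsComFinal.totalRows);
const fullCoverage = coveragePct(fsComFinal.fullyVerified, fsComFinal.totalRows);
```

Price and details coverage come out in the high nineties, while fully verified rows are only a little over half of the total.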
Remaining no-url rows:
- Change
- FS-229461
TIP health after run:
- status: healthy
- load status: ok
- memory used: 13%
- global image verified: 10711
- global details verified: 9929
- global fully verified: 8526
Training Pool
Pushed:
4d9a11c crawl: add fscom url discovery learning record
Next
Do not claim FS.com is 100% complete yet. Remaining work:
- classify Change
- retry or classify FS-229461
- classify 4 category rows
- close 9 image/fiber gaps
- then move to next high-value competitor with the same bounded Crawlee pattern