transceiver-db/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md
2026-05-09 14:00:37 +02:00

Crawlee Evaluation and FS.com URL Discovery

Date: 2026-05-09

Question

The operator asked, as the highest priority, whether these repositories help TIP:

  • https://github.com/apify/crawlee
  • https://github.com/apify/crawlee-python
  • https://github.com/hiteshchoudhary/crawlee-project

Evaluation

apify/crawlee helps directly, but TIP already uses it in the TypeScript scraper stack. The priority is to harden our current usage rather than introduce a new crawler framework.

Best immediate Crawlee practices for TIP:

  • keep per-vendor bounded runs
  • use stable uniqueKey/target IDs so retries do not create duplicate rows
  • keep Crawlee storage directories isolated per vendor/run class
  • record no-text and max-retry URLs as a separate retry class
  • use AutoscaledPool telemetry as a safety signal
  • keep Erik at low concurrency and move heavier work to Pi/Proxmox workers
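
The stable-key practice above can be sketched as follows. This is a minimal illustration, not the code in fs-com.ts: the TargetRow shape and the key format are assumptions, and only the url, uniqueKey, and userData field names mirror a Crawlee request. Crawlee's request queue deduplicates on uniqueKey, so deriving the key from the target identity rather than the URL keeps a retried URL variant on the same row.

```typescript
// Hypothetical target row shape; field names are assumptions, not TIP's schema.
interface TargetRow {
  vendor: string;
  targetTransceiverId: string;
  url: string;
}

// Mirrors the url / uniqueKey / userData fields of a Crawlee request.
interface CrawlRequest {
  url: string;
  uniqueKey: string;
  userData: { vendor: string; targetTransceiverId: string };
}

// Build the key from the target identity instead of the URL, so a retry with
// a changed query string or locale path still maps to the same row.
function toRequest(row: TargetRow): CrawlRequest {
  const uniqueKey = `${row.vendor.toLowerCase()}:${row.targetTransceiverId}`;
  return {
    url: row.url,
    uniqueKey,
    userData: { vendor: row.vendor, targetTransceiverId: row.targetTransceiverId },
  };
}

// Drop requests whose uniqueKey was already seen, keeping the first occurrence.
function dedupeRequests(requests: CrawlRequest[]): CrawlRequest[] {
  const seen = new Set<string>();
  const out: CrawlRequest[] = [];
  for (const req of requests) {
    if (seen.has(req.uniqueKey)) continue;
    seen.add(req.uniqueKey);
    out.push(req);
  }
  return out;
}
```

The explicit dedupeRequests step is only there to make the effect visible without a queue; with a real Crawlee request queue the same uniqueKey would be skipped on enqueue.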

apify/crawlee-python is useful for future isolated worker experiments on Pi/Proxmox, especially where Python extraction libraries help. It should not replace the current TypeScript crawler core today.

hiteshchoudhary/crawlee-project is a small community/demo app, not a production building block for TIP.

Code

Changed:

  • packages/scraper/src/scrapers/fs-com.ts

Added:

  • FS_URL_DISCOVERY_ONLY=1
  • target row propagation with targetTransceiverId
  • image verification for target rows
  • deterministic detail verification from the H1, part number, and spec text when an FS.com page lacks a spec table
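
One way the deterministic detail check could work is sketched below. It is an assumption about the approach, not the logic in fs-com.ts: the PageText and ExpectedRow shapes, the normalization rule, and the whole-token matching are all hypothetical. The idea is that a row counts as verified only when the part number appears in the H1 and every expected spec token appears somewhere in the page text, with no fuzzy scoring involved.

```typescript
// Hypothetical inputs: text extracted from the product page and the expected row.
interface PageText {
  h1: string;
  body: string;
}

interface ExpectedRow {
  partNumber: string;
  specTokens: string[];
}

// Lowercase and collapse every non-alphanumeric run to a single space, so
// case, hyphens, and spacing differences do not affect the comparison.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// True when `needle` occurs in `haystack` on whole-token boundaries.
function containsTokens(haystack: string, needle: string): boolean {
  return ` ${haystack} `.includes(` ${needle} `);
}

// Deterministic check: same inputs always give the same verdict.
function verifyDetails(page: PageText, row: ExpectedRow) {
  const h1 = normalize(page.h1);
  const all = normalize(`${page.h1} ${page.body}`);
  const part = normalize(row.partNumber);
  const partInH1 = part.length > 0 && containsTokens(h1, part);
  const missingTokens = row.specTokens.filter(
    (token) => !containsTokens(all, normalize(token)),
  );
  return { verified: partInH1 && missingTokens.length === 0, partInH1, missingTokens };
}
```

Returning the missing tokens alongside the verdict makes a failed row explainable, which helps when deciding whether to retry it or classify it by hand.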

Live Runs

URL discovery pilot:

  • target 20
  • scraped 19
  • failed 0
  • no-url rows: 76 -> 57

Full URL discovery:

  • target 56
  • scraped 55
  • failed 1
  • failed URL: https://www.fs.com/de/products/229461.html
  • no-url rows: 57 -> 2
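
The two discovery runs above can be cross-checked with a little arithmetic. The sketch below assumes that each scraped target resolves exactly one no-url row, which the record does not state explicitly; the DiscoveryRun shape and the checkRun helper are hypothetical names.

```typescript
// Counters as reported for a URL discovery run.
interface DiscoveryRun {
  target: number;
  scraped: number;
  failed: number;
  noUrlBefore: number;
  noUrlAfter: number;
}

// Compare reported counters against each other.
// Assumption: one scraped target removes exactly one no-url row.
function checkRun(run: DiscoveryRun) {
  const unaccounted = run.target - run.scraped - run.failed;
  const expectedNoUrlAfter = run.noUrlBefore - run.scraped;
  return {
    unaccounted,
    expectedNoUrlAfter,
    noUrlConsistent: expectedNoUrlAfter === run.noUrlAfter,
  };
}

const pilot = checkRun({ target: 20, scraped: 19, failed: 0, noUrlBefore: 76, noUrlAfter: 57 });
const full = checkRun({ target: 56, scraped: 55, failed: 1, noUrlBefore: 57, noUrlAfter: 2 });
```

Under that assumption both runs are consistent on the no-url counts (76 - 19 = 57 and 57 - 55 = 2). The pilot leaves one target that is neither scraped nor failed; the record does not say how that target was classified.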

DB reconciliation:

  • target 57
  • scraped 55
  • failed 0
  • new prices 41
  • stock observations 40
  • specs verified 55

Build:

  • pnpm -C packages/scraper build passed on Erik

FS.com Final State

  • total rows: 383
  • price verified: 379
  • image verified: 374
  • details verified: 373
  • price+image+details: 373
  • fully verified: 205
  • missing URL: 2
  • missing image URL: 9
  • missing reach label: 4
  • missing fiber type: 9
  • HTML product-like rows: 373
  • HTML product-like complete: 371
  • no-url rows: 2
  • category rows: 4

Remaining no-url rows:

  • Change
  • FS-229461

TIP health after run:

  • status: healthy
  • load status: ok
  • memory used: 13%
  • global image verified: 10711
  • global details verified: 9929
  • global fully verified: 8526

Training Pool

Pushed:

  • 4d9a11c crawl: add fscom url discovery learning record

Next

Do not claim FS.com is 100% complete yet. Remaining work:

  • classify Change
  • retry or classify FS-229461
  • classify 4 category rows
  • close the image URL and fiber type gaps (9 rows each)
  • then move to the next high-value competitor with the same bounded Crawlee pattern
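
The "do not claim 100% complete" rule can be expressed as a simple gate over the gap counters from the final state. This is a sketch under assumptions: the counter names are illustrative, and treating every non-zero counter (including missing reach label) as blocking is a choice made here, not something the record specifies.

```typescript
// Gap counters for one vendor; a value of 0 means the gap is closed.
type GapCounters = Record<string, number>;

// List every gap that still has open rows, in insertion order.
function openGaps(counters: GapCounters): string[] {
  return Object.entries(counters)
    .filter(([, count]) => count > 0)
    .map(([name, count]) => `${name}: ${count}`);
}

// A vendor may be reported as complete only when no gap remains.
function canClaimComplete(counters: GapCounters): boolean {
  return openGaps(counters).length === 0;
}

// Values taken from the FS.com final state in this record.
const fsComGaps: GapCounters = {
  missingUrl: 2,
  missingImageUrl: 9,
  missingReachLabel: 4,
  missingFiberType: 9,
  categoryRows: 4,
};

const fsComComplete = canClaimComplete(fsComGaps);
const fsComOpen = openGaps(fsComGaps);
```

With the current counters the gate stays closed, and the list of open gaps doubles as the remaining-work summary for the next run.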