transceiver-db/sync/history/2026-05-09-crawlee-evaluation-and-fscom-url-discovery.md

# Crawlee Evaluation and FS.com URL Discovery

Date: 2026-05-09

## Question

The operator asked, as a highest-priority question, whether these repositories help TIP:

- `https://github.com/apify/crawlee`
- `https://github.com/apify/crawlee-python`
- `https://github.com/hiteshchoudhary/crawlee-project`

## Evaluation

`apify/crawlee` helps directly, but TIP already uses it in the TypeScript scraper stack. The priority is to harden our current usage rather than introduce a new crawler framework.

Best immediate Crawlee practices for TIP:

- keep per-vendor bounded runs
- use stable `uniqueKey`/target IDs so retries do not create duplicate rows
- keep Crawlee storage directories isolated per vendor/run class
- record no-text and max-retry URLs as a separate retry class
- use AutoscaledPool telemetry as a safety signal
- keep Erik at low concurrency and move heavier work to Pi/Proxmox workers
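A minimal sketch of the `uniqueKey`, storage isolation, and retry-class points above. The helper names, the `vendor:targetId` key format, the `storage/<vendor>/<runClass>` layout, and the classification thresholds are all assumptions for illustration, not the code in `fs-com.ts`. The `TargetRequest` shape only mirrors the `url`, `uniqueKey`, and `userData` fields that a Crawlee request accepts.

```typescript
// Local type mirroring the `url`, `uniqueKey`, and `userData` request fields.
interface TargetRequest {
  url: string;
  uniqueKey: string;
  userData: { vendor: string; targetTransceiverId: string };
}

type RetryClass = "no-text" | "max-retry";

// Hypothetical helper: derive the key from vendor + target row ID instead of the URL,
// so a retry of the same row maps to the same key and cannot create a second row.
function buildTargetRequest(
  vendor: string,
  targetTransceiverId: string,
  url: string,
): TargetRequest {
  return {
    url,
    uniqueKey: `${vendor}:${targetTransceiverId}`,
    userData: { vendor, targetTransceiverId },
  };
}

// Hypothetical helper: one storage directory per vendor and run class.
function storageDirFor(vendor: string, runClass: string): string {
  return `storage/${vendor}/${runClass}`;
}

// Hypothetical helper: put a failed URL into its own retry class.
// Returns null when neither class applies.
function classifyFailure(
  textLength: number,
  retryCount: number,
  maxRetries: number,
): RetryClass | null {
  if (retryCount >= maxRetries) return "max-retry";
  if (textLength === 0) return "no-text";
  return null;
}
```

One trade-off of keying on the target row: if a run needs to try several candidate URLs for the same row, they would deduplicate onto one key, so the key would need a candidate suffix in that case. The directory from `storageDirFor` could be handed to Crawlee through the `CRAWLEE_STORAGE_DIR` environment variable.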

`apify/crawlee-python` is useful for future isolated worker experiments on Pi/Proxmox, especially where Python extraction libraries help. It should not replace the current TypeScript crawler core today.

`hiteshchoudhary/crawlee-project` is a small community/demo app, not a production building block for TIP.

## Code

Changed:

- `packages/scraper/src/scrapers/fs-com.ts`

Added:

- `FS_URL_DISCOVERY_ONLY=1`
- target row propagation with `targetTransceiverId`
- image verification for target rows
- deterministic detail verification from H1/part/spec when an FS.com page lacks a spec table
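A hedged sketch of how the discovery-only flag and the table-less detail check might look. Only `FS_URL_DISCOVERY_ONLY` comes from this change. The function names, the `DetailEvidence` shape, and the matching rule (part number must appear in the H1, every expected spec token must appear in the page text) are assumptions, and the actual rule in `fs-com.ts` may differ.

```typescript
// Hypothetical evidence collected from one product page.
interface DetailEvidence {
  h1: string;           // text of the page's H1
  partNumber: string;   // part number expected for the target row
  specTokens: string[]; // spec values expected somewhere in the page text
  pageText: string;     // visible text of the product page
}

// The env map is passed in so the function stays pure and easy to test.
function isUrlDiscoveryOnly(env: Record<string, string | undefined>): boolean {
  return env["FS_URL_DISCOVERY_ONLY"] === "1";
}

// Lowercase and collapse whitespace so matching does not depend on layout.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Assumed rule: no fuzzy scoring, only exact substring checks, so the same
// page always yields the same verdict.
function verifyDetailWithoutSpecTable(e: DetailEvidence): boolean {
  const h1 = normalize(e.h1);
  const text = normalize(e.pageText);
  const part = normalize(e.partNumber);
  if (part.length === 0 || !h1.includes(part)) return false;
  return e.specTokens.every((t) => text.includes(normalize(t)));
}
```

An empty `specTokens` list makes the check pass on the H1 match alone, so a caller would probably want to require at least one token per row.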

## Live Runs

URL discovery pilot:

- target `20`
- scraped `19`
- failed `0`
- no-url rows: `76` -> `57`

Full URL discovery:

- target `56`
- scraped `55`
- failed `1`
- failed URL: `https://www.fs.com/de/products/229461.html`
- no-url rows: `57` -> `2`
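The counters in the two discovery runs can be cross-checked with a few lines of arithmetic. The sketch below assumes that each scraped target resolves exactly one no-url row, which is consistent with the numbers reported here. The `DiscoveryRun` type and function names are illustrative.

```typescript
interface DiscoveryRun {
  target: number;
  scraped: number;
  failed: number;
  noUrlBefore: number;
  noUrlAfter: number;
}

// Assumption: one scraped target resolves one no-url row.
function remainingNoUrl(run: DiscoveryRun): number {
  return run.noUrlBefore - run.scraped;
}

// A run is consistent when its outcomes do not exceed its targets and the
// reported no-url count matches the computed one.
function isConsistent(run: DiscoveryRun): boolean {
  return (
    run.scraped + run.failed <= run.target &&
    remainingNoUrl(run) === run.noUrlAfter
  );
}

// Values copied from the pilot and full URL discovery runs.
const pilot: DiscoveryRun = { target: 20, scraped: 19, failed: 0, noUrlBefore: 76, noUrlAfter: 57 };
const full: DiscoveryRun = { target: 56, scraped: 55, failed: 1, noUrlBefore: 57, noUrlAfter: 2 };

console.log(isConsistent(pilot), isConsistent(full)); // prints "true true"
```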

DB reconciliation:

- target `57`
- scraped `55`
- failed `0`
- new prices `41`
- stock observations `40`
- specs verified `55`

Build:

- `pnpm -C packages/scraper build` passed on Erik

## FS.com Final State

- total rows: `383`
- price verified: `379`
- image verified: `374`
- details verified: `373`
- price+image+details: `373`
- fully verified: `205`
- missing URL: `2`
- missing image URL: `9`
- missing reach label: `4`
- missing fiber type: `9`
- HTML product-like rows: `373`
- HTML product-like complete: `371`
- no-url rows: `2`
- category rows: `4`

Remaining no-url rows:

- `Change`
- `FS-229461`

TIP health after run:

- status: `healthy`
- load status: `ok`
- memory used: `13%`
- global image verified: `10711`
- global details verified: `9929`
- global fully verified: `8526`

## Training Pool

Pushed:

- `4d9a11c crawl: add fscom url discovery learning record`

## Next

Do not claim FS.com is 100% complete yet. Remaining work:

- classify `Change`
- retry or classify `FS-229461`
- classify 4 category rows
- close the image URL and fiber type gaps (`9` rows each in the final state above)
- then move to the next high-value competitor with the same bounded Crawlee pattern
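The "do not claim 100% complete" rule could be enforced as a small gate over the final-state counters. This is a sketch only: the `VendorState` shape and `completionBlockers` name are hypothetical, and the choice of which counters block completion simply follows the remaining-work list above.

```typescript
// Hypothetical subset of the vendor final-state counters.
interface VendorState {
  noUrlRows: number;
  categoryRows: number;
  missingImageUrl: number;
  missingFiberType: number;
}

// Returns one message per open item; an empty list means the vendor
// may be reported as complete.
function completionBlockers(s: VendorState): string[] {
  const blockers: string[] = [];
  if (s.noUrlRows > 0) blockers.push(`${s.noUrlRows} no-url rows`);
  if (s.categoryRows > 0) blockers.push(`${s.categoryRows} unclassified category rows`);
  if (s.missingImageUrl > 0) blockers.push(`${s.missingImageUrl} rows missing image URL`);
  if (s.missingFiberType > 0) blockers.push(`${s.missingFiberType} rows missing fiber type`);
  return blockers;
}

// Counters copied from the FS.com final state in this record.
const fsCom: VendorState = { noUrlRows: 2, categoryRows: 4, missingImageUrl: 9, missingFiberType: 9 };
const open = completionBlockers(fsCom);
console.log(open.length === 0 ? "complete" : `not complete: ${open.join(", ")}`);
```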