sync: record crawlee evaluation and fscom url discovery
parent 3d79f6b8e0
commit 6ee10bf301
@@ -1,9 +1,86 @@
# Current TIP Sync State

-Updated: 2026-05-09 09:18 UTC
+Updated: 2026-05-09 11:59 UTC

## Newest Work

- Priority Crawlee evaluation + FS.com URL discovery on 2026-05-09:
  - operator asked whether these repos help:
    - `https://github.com/apify/crawlee`
    - `https://github.com/apify/crawlee-python`
    - `https://github.com/hiteshchoudhary/crawlee-project`
  - evaluation:
    - `apify/crawlee` is directly relevant and already in use in TIP via TypeScript `PlaywrightCrawler`
    - current TIP benefit is not adding Crawlee, but using Crawlee more deliberately:
      - bounded RequestQueues
      - stable `uniqueKey`
      - explicit retry/no-text classes
      - isolated storage directories
      - AutoscaledPool telemetry as safety signal
      - hard concurrency caps on Erik
    - `apify/crawlee-python` is useful for future isolated Pi/Proxmox workers, especially for Python-native extraction experiments, but should not replace the current TypeScript scraper core today
    - `hiteshchoudhary/crawlee-project` is a small community/demo project, useful as inspiration only; not a production dependency for TIP
  - code improved:
    - `packages/scraper/src/scrapers/fs-com.ts`
      - added `FS_URL_DISCOVERY_ONLY=1`
      - maps existing `FS-<numeric-id>` rows without `product_page_url` to `https://www.fs.com/de/products/<id>.html`
      - carries `targetTransceiverId` through the crawler so verified source evidence updates the original row instead of creating duplicates
      - marks current FS.com product images verified for target rows
      - accepts deterministic H1/part/spec evidence for detail verification when FS.com does not expose a traditional spec table
  - live runs on Erik:
    - URL discovery pilot:
      - target `20`
      - scraped `19`
      - failed `0`
      - no-url rows dropped from `76` to `57`
    - full URL discovery:
      - target `56`
      - scraped `55`
      - failed `1` (`https://www.fs.com/de/products/229461.html`, transient `ERR_NETWORK_CHANGED`)
      - no-url rows dropped to `2`
    - DB reconciliation with improved detail evidence:
      - target `57`
      - scraped `55`
      - failed `0`
      - new prices `41`
      - stock observations `40`
      - specs verified `55`
    - `pnpm -C packages/scraper build` passed on Erik after the code change
  - FS.com final state after URL discovery:
    - total rows: `383`
    - price verified: `379`
    - image verified: `374`
    - details verified: `373`
    - price+image+details: `373`
    - fully verified: `205`
    - missing URL: `2`
    - missing image URL: `9`
    - missing reach label: `4`
    - missing fiber type: `9`
    - HTML product-like rows:
      - total `373`
      - image `372`
      - details `371`
      - complete `371`
    - no-url rows:
      - `Change`
      - `FS-229461`
    - category rows: `4`
  - TIP health after run:
    - status `healthy`
    - load status `ok`
    - memory used `13%`
    - global verified counters:
      - price `11557`
      - image `10711`
      - details `9929`
      - fully `8526`
  - training pool:
    - pushed `4d9a11c crawl: add fscom url discovery learning record`
  - truth:
    - FS.com is still not 100% complete
    - honest current claim: `371/373` HTML product-like rows complete; remaining work is small and classifiable

- TIP FS.com / Fiberstore targeted verification push on 2026-05-09:
  - operator requested FS.com/Fiberstore next, with all crawler/scraper/robot learnings written to the TIPLLM training pool and no external AI
  - code improved:
@@ -0,0 +1,118 @@
# Crawlee Evaluation and FS.com URL Discovery

Date: 2026-05-09

## Question

Operator asked with highest priority whether these repositories help TIP:

- `https://github.com/apify/crawlee`
- `https://github.com/apify/crawlee-python`
- `https://github.com/hiteshchoudhary/crawlee-project`

## Evaluation

`apify/crawlee` helps directly, but TIP already uses it in the TypeScript scraper stack. The priority is to harden our current usage rather than introduce a new crawler framework.

Best immediate Crawlee practices for TIP:

- keep per-vendor bounded runs
- use stable `uniqueKey`/target IDs so retries do not create duplicate rows
- keep Crawlee storage directories isolated per vendor/run class
- record no-text and max-retry URLs as a separate retry class
- use AutoscaledPool telemetry as a safety signal
- keep Erik at low concurrency and move heavier work to Pi/Proxmox workers
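
The bounded-run and stable `uniqueKey` practices can be illustrated without touching Crawlee itself. The following is a minimal sketch, not the code in `fs-com.ts`: the `TargetRow` shape and the `buildTargetRequests` helper are hypothetical, and the real TIP schema is not shown in this record. The returned objects use only the `url`, `uniqueKey`, and `userData` fields that Crawlee request options accept.

```typescript
// Hypothetical row shape; the real TIP schema is not part of this record.
interface TargetRow {
  id: number;     // database id of the transceiver row
  vendor: string; // vendor slug, e.g. "fs-com"
  url: string;    // product page URL to crawl
}

// Plain object using the url / uniqueKey / userData request fields.
interface TargetRequest {
  url: string;
  uniqueKey: string;
  userData: { targetTransceiverId: number };
}

// Build at most `maxTargets` requests, one per row id.
// The uniqueKey is derived from vendor + row id instead of the URL,
// so a retry or a URL variant of the same row maps to the same key.
function buildTargetRequests(rows: TargetRow[], maxTargets: number): TargetRequest[] {
  const seen = new Set<string>();
  const requests: TargetRequest[] = [];
  for (const row of rows) {
    if (requests.length >= maxTargets) break; // bounded run
    const uniqueKey = `${row.vendor}:${row.id}`;
    if (seen.has(uniqueKey)) continue; // same row listed twice
    seen.add(uniqueKey);
    requests.push({
      url: row.url,
      uniqueKey,
      userData: { targetTransceiverId: row.id },
    });
  }
  return requests;
}
```

A list built this way could be handed to `crawler.addRequests(requests)` or `crawler.run(requests)`, and the handler could read `request.userData.targetTransceiverId` to update the original row. Keying on the row id rather than the URL is one possible design choice; the record does not state which key TIP actually uses.
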

`apify/crawlee-python` is useful for future isolated worker experiments on Pi/Proxmox, especially where Python extraction libraries help. It should not replace the current TypeScript crawler core today.

`hiteshchoudhary/crawlee-project` is a small community/demo app, not a production building block for TIP.

## Code

Changed:

- `packages/scraper/src/scrapers/fs-com.ts`

Added:

- `FS_URL_DISCOVERY_ONLY=1`
- target row propagation with `targetTransceiverId`
- image verification for target rows
- H1/part/spec deterministic detail verification when FS.com lacks a spec table
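
The sync state entry for this commit describes the discovery mode as mapping existing `FS-<numeric-id>` rows without `product_page_url` to `https://www.fs.com/de/products/<id>.html`. A minimal sketch of that mapping follows; the function name `fsProductUrlFromPartNumber` is hypothetical and the actual implementation in `fs-com.ts` may differ.

```typescript
// Hypothetical helper, not the actual code in fs-com.ts.
// Maps a part number of the form FS-<numeric-id> to the German product page
// URL pattern named in the sync record. Returns null for anything else.
function fsProductUrlFromPartNumber(partNumber: string): string | null {
  const match = /^FS-(\d+)$/.exec(partNumber.trim());
  if (match === null) return null; // no numeric id to build a URL from
  return `https://www.fs.com/de/products/${match[1]}.html`;
}
```

Under this assumption, `FS-229461` maps to `https://www.fs.com/de/products/229461.html`, while a row named `Change` has no numeric id and yields `null`, so it would need to be classified by other means.
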

## Live Runs

URL discovery pilot:

- target `20`
- scraped `19`
- failed `0`
- no-url rows: `76` -> `57`

Full URL discovery:

- target `56`
- scraped `55`
- failed `1`
- failed URL: `https://www.fs.com/de/products/229461.html`
- no-url rows: `57` -> `2`
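
The sync state entry records the single failure as a transient `ERR_NETWORK_CHANGED`, and the evaluation recommends keeping no-text and max-retry URLs in a separate retry class. One way to express that classification is sketched below. The `FailedAttempt` shape, the class names, and the extra error markers are illustrative assumptions, not TIP's actual types.

```typescript
type RetryClass = "transient-network" | "no-text" | "max-retry";

// Hypothetical summary of one failed or empty crawl attempt.
interface FailedAttempt {
  url: string;
  errorMessage: string | null; // last error message, null if the page loaded
  extractedTextLength: number; // characters of text extracted from the page
  retryCount: number;
  maxRetries: number;
}

// ERR_NETWORK_CHANGED is the error seen in the full run; the other two
// Chromium network error names are illustrative assumptions.
const TRANSIENT_MARKERS = ["ERR_NETWORK_CHANGED", "ERR_CONNECTION_RESET", "ERR_TIMED_OUT"];

function classifyFailure(attempt: FailedAttempt): RetryClass | null {
  const message = attempt.errorMessage ?? "";
  // Network-level errors are worth a later retry regardless of retry count.
  if (TRANSIENT_MARKERS.some((marker) => message.includes(marker))) {
    return "transient-network";
  }
  // The page loaded without an error but produced no extractable text.
  if (attempt.errorMessage === null && attempt.extractedTextLength === 0) {
    return "no-text";
  }
  // Any other error that has exhausted its retries.
  if (attempt.retryCount >= attempt.maxRetries) {
    return "max-retry";
  }
  return null; // not classified; leave for manual review
}
```

Checking the transient markers first means a network blip is never filed as `max-retry`, which keeps rows such as `FS-229461` eligible for a cheap re-run.
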

DB reconciliation:

- target `57`
- scraped `55`
- failed `0`
- new prices `41`
- stock observations `40`
- specs verified `55`
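
The three runs above can be cross-checked with simple arithmetic. The sketch below assumes that every targeted row should end up either scraped or failed; the record does not state this, so a non-zero gap is a question to follow up, not a confirmed error. The `RunSummary` type and `unaccounted` helper are hypothetical.

```typescript
interface RunSummary {
  name: string;
  target: number;
  scraped: number;
  failed: number;
}

// Rows that were targeted but reported neither as scraped nor as failed.
function unaccounted(run: RunSummary): number {
  return run.target - run.scraped - run.failed;
}

// Figures copied from the live runs listed above.
const runs: RunSummary[] = [
  { name: "URL discovery pilot", target: 20, scraped: 19, failed: 0 },
  { name: "Full URL discovery", target: 56, scraped: 55, failed: 1 },
  { name: "DB reconciliation", target: 57, scraped: 55, failed: 0 },
];

const gaps = runs.map((run) => ({ name: run.name, unaccounted: unaccounted(run) }));
// gaps: pilot 1, full discovery 0, reconciliation 2
```

The no-url deltas line up with the scraped counts (`76 - 57 = 19` and `57 - 2 = 55`). The pilot leaves one targeted row unexplained and the reconciliation leaves two; the latter may correspond to the two remaining no-url rows, but the record does not confirm that.
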

Build:

- `pnpm -C packages/scraper build` passed on Erik

## FS.com Final State

- total rows: `383`
- price verified: `379`
- image verified: `374`
- details verified: `373`
- price+image+details: `373`
- fully verified: `205`
- missing URL: `2`
- missing image URL: `9`
- missing reach label: `4`
- missing fiber type: `9`
- HTML product-like rows: `373`
- HTML product-like complete: `371`
- no-url rows: `2`
- category rows: `4`

Remaining no-url rows:

- `Change`
- `FS-229461`

TIP health after run:

- status: `healthy`
- load status: `ok`
- memory used: `13%`
- global image verified: `10711`
- global details verified: `9929`
- global fully verified: `8526`

## Training Pool

Pushed:

- `4d9a11c crawl: add fscom url discovery learning record`

## Next

Do not claim FS.com is 100% complete yet. Remaining work:

- classify `Change`
- retry or classify `FS-229461`
- classify 4 category rows
- close 9 image/fiber gaps
- then move to next high-value competitor with the same bounded Crawlee pattern