transceiver-db/docs/TIP_CRAWLEE_RUNTIME.md
2026-05-09 14:06:34 +02:00

TIP Crawlee Runtime

Decision

TIP standardizes on Crawlee as the crawler runtime.

  • Production TypeScript path: packages/scraper with apify/crawlee and Playwright.
  • Optional Python worker path: packages/crawlee-python with apify/crawlee-python.

TypeScript Core

The TypeScript scraper remains the canonical production path because TIP already uses it for DB writes, price observations, stock observations, image verification and detail verification.

Useful FS.com commands:

pnpm -C packages/scraper run scrape:fs:db-detail
pnpm -C packages/scraper run scrape:fs:url-discovery

Erik safety defaults:

  • keep FS.com at browser concurrency 1
  • use bounded run caps
  • treat no-text and max-retry URLs as retry/classification classes
  • keep Crawlee storage isolated with makeCrawleeConfig(...)
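The retry/classification and run-cap defaults above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity, although the production path is TypeScript. All names here (`Outcome`, `PageResult`, `classify`, `bounded`) and the retry limit of 3 are hypothetical and not taken from packages/scraper.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    NO_TEXT = "no_text"      # page loaded but yielded no extractable text
    MAX_RETRY = "max_retry"  # retry budget exhausted


@dataclass
class PageResult:
    url: str
    text: str
    attempts: int  # total attempts made, including the first one


def classify(result: PageResult, max_retries: int = 3) -> Outcome:
    """Assign a class to a result instead of raising, so the URL can be re-queued."""
    if result.attempts > max_retries:
        return Outcome.MAX_RETRY
    if not result.text.strip():
        return Outcome.NO_TEXT
    return Outcome.OK


def bounded(results, run_cap: int):
    """Yield at most run_cap items, mimicking a bounded run cap."""
    for index, result in enumerate(results):
        if index >= run_cap:
            break
        yield result


if __name__ == "__main__":
    sample = [
        PageResult("https://example.com/a", "SFP-10G module", 1),
        PageResult("https://example.com/b", "   ", 1),
        PageResult("https://example.com/c", "", 4),
    ]
    for item in bounded(sample, run_cap=2):
        print(item.url, classify(item).value)
```

Keeping the class as data rather than an exception means a later run can select only `no_text` or `max_retry` URLs for another pass.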

Python Worker

The Python worker is optional and should run first on Pi/Proxmox/residential nodes. It writes JSONL evidence and does not write directly to the TIP DB.
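As a minimal sketch of what "writes JSONL evidence" can look like, the helper below appends one JSON object per line to a file. The function name `append_evidence` and the record fields (`url`, `status`, `fetched_at`, `payload`) are assumptions for illustration; the actual schema used by tip_crawlee_worker is not specified here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_evidence(out_path: Path, url: str, status: str, payload: dict) -> None:
    """Append a single evidence record as one JSON line (hypothetical schema)."""
    record = {
        "url": url,
        "status": status,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    # Append mode keeps earlier records intact; one object per line is JSONL.
    with out_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the worker only appends to a file, a separate step can review the evidence before anything is imported into the database.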

Install:

cd packages/crawlee-python
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[beautifulsoup]"

Smoke test:

python -m tip_crawlee_worker \
  --mode beautifulsoup \
  --url https://crawlee.dev \
  --out /tmp/tip-crawlee-python-smoke.jsonl \
  --max-requests 1
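After the smoke run, one way to check the output is to confirm that every line of the file parses as JSON. The sketch below assumes only that the file is JSONL; `load_evidence` is a hypothetical helper, and no particular record fields are assumed.

```python
import json
from pathlib import Path


def load_evidence(path: Path) -> list[dict]:
    """Parse a JSONL file, skipping blank lines and failing on invalid JSON."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON line") from exc
    return records


if __name__ == "__main__":
    smoke_file = Path("/tmp/tip-crawlee-python-smoke.jsonl")
    print(f"{len(load_evidence(smoke_file))} record(s) parsed")
```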

Training Pool

Every crawler result, failure class, parser lesson and runtime safety lesson should be written to the TIPLLM training pool and synced through sync/.