# TIP Crawlee Runtime

## Decision

TIP standardizes on Crawlee as the crawler runtime.

- Production TypeScript path: `packages/scraper` with `apify/crawlee` and Playwright.
- Optional Python worker path: `packages/crawlee-python` with `apify/crawlee-python`.

## TypeScript Core

The TypeScript scraper remains the canonical production path because TIP already
uses it for DB writes, price observations, stock observations, image verification
and detail verification.

Useful FS.com commands:

```bash
pnpm -C packages/scraper run scrape:fs:db-detail
pnpm -C packages/scraper run scrape:fs:url-discovery
```

Erik safety defaults:

- keep FS.com at browser concurrency `1`
- use bounded run caps
- treat no-text and max-retry URLs as retry/classification classes
- keep Crawlee storage isolated with `makeCrawleeConfig(...)`

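
The retry/classification default above can be sketched as a small classifier. This is a minimal Python illustration, not the production TypeScript scraper's code: the `UrlOutcome` fields, the `classify` helper and the class labels (`no_text`, `max_retry`, `retry`, `ok`) are all hypothetical names, chosen only to show that such URLs get a named class rather than being silently dropped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlOutcome:
    """Hypothetical per-URL result; the field names are assumptions."""
    url: str
    text: Optional[str]  # extracted page text, None if nothing was extracted
    attempts: int        # total tries so far, including the first one
    succeeded: bool      # whether the last attempt completed without error


def classify(outcome: UrlOutcome, max_retries: int) -> str:
    """Map an outcome to a class label instead of discarding the URL."""
    if not outcome.succeeded and outcome.attempts > max_retries:
        return "max_retry"  # retries exhausted: leave for a later bounded run
    if outcome.succeeded and not (outcome.text or "").strip():
        return "no_text"    # page loaded but yielded no text
    if outcome.succeeded:
        return "ok"
    return "retry"          # failed, but retries remain
```

Keeping the label as a plain string makes it easy to store next to the URL and to count per class after a run.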
## Python Worker

The Python worker is optional and should run first on Pi/Proxmox/residential
nodes. It writes JSONL evidence and does not write directly to the TIP DB.

Install:

```bash
cd packages/crawlee-python
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[beautifulsoup]"
```

Smoke test:

```bash
python -m tip_crawlee_worker \
  --mode beautifulsoup \
  --url https://crawlee.dev \
  --out /tmp/tip-crawlee-python-smoke.jsonl \
  --max-requests 1
```

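
After a smoke run, one way to sanity-check the output is to parse the JSONL file line by line. The sketch below assumes only that each non-empty line is a single JSON object; it makes no assumption about the worker's actual field names. The `load_jsonl` helper is illustrative and not part of `tip_crawlee_worker`.

```python
import json
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Parse a JSONL file into a list of dicts, failing loudly on bad lines."""
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        if not isinstance(record, dict):
            raise ValueError(f"{path}:{lineno}: expected a JSON object")
        records.append(record)
    return records
```

For example, `len(load_jsonl("/tmp/tip-crawlee-python-smoke.jsonl"))` gives the number of evidence records the smoke run wrote.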
## Training Pool

Every crawler result, failure class, parser lesson and runtime safety lesson
should be written to the TIPLLM training pool and synced through `sync/`.
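
As a rough illustration of what writing to the pool could look like, the sketch below appends one JSON line per record. The pool's real format and the layout under `sync/` are not specified here, so everything in this block is an assumption: the JSONL format, the record keys (`kind`, `recorded_at`, `payload`), the kind identifiers and the `append_lesson` helper are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# The four record kinds named above; these identifiers are hypothetical.
KINDS = {"crawler_result", "failure_class", "parser_lesson", "runtime_safety_lesson"}


def append_lesson(pool_file: str, kind: str, payload: dict) -> dict:
    """Append one record to a JSONL pool file (assumed format) and return it."""
    if kind not in KINDS:
        raise ValueError(f"unknown record kind: {kind}")
    record = {
        "kind": kind,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    path = Path(pool_file)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

An append-only file keeps earlier lessons intact, which suits a pool that is synced rather than edited in place.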