- Add global unhandledRejection handler in scheduler daemon to swallow
Crawlee's benign post-run ENOENT lock-file races (prevents process.exit(1))
- Add SKIP_FS_SCRAPER env var: skip FS.com worker on Erik where Cloudflare
WAF blocks datacenter IPs (Mac launchd handles FS.com from residential IP)
- Remove FS.COM from health monitor EXPECTED_VENDORS (skipped on Erik)
- Health monitor: extend pg-boss lookup from 12h → 26h, add completed-job
map; if job ran OK in last 26h + vendor has historical prices → mark
STABLE instead of CRITICAL (fixes ATGBICS/Fluxlight hash-dedup false positives)
- Install Playwright Chromium on Erik (fixes ATGBICS BrowserLaunchError)
- Create missing Crawlee storage dirs on Erik (storage-fs-phase1/2,
storage-ebay-transceivers) to prevent ENOENT on first Crawlee run
Pattern 1 (href→aria-label) finds 127 navigation links on GBICS BigCommerce
pages — none contain GBP prices. Pattern 2 (aria-label→href) correctly
finds 16-30 product links per category page with £XX.XX prices in aria-labels.
The fallback from P1 to P2 now triggers when P1 finds results but none
contain '£', rather than only when P1 finds 0 total results.
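The fallback rule can be sketched as follows. This is a minimal illustration, not the scraper's actual code: `ProductLink`, `Extractor` and `pickProductLinks` are hypothetical names, and the two patterns are passed in as plain functions.

```typescript
// Hypothetical shape of one extracted link (not taken from the scraper source).
interface ProductLink {
  href: string;
  ariaLabel: string;
}

// An extractor stands in for Pattern 1 or Pattern 2.
type Extractor = (html: string) => ProductLink[];

// Keep Pattern 1 only if at least one of its results carries a GBP price;
// otherwise fall back to Pattern 2, even when Pattern 1 returned links.
function pickProductLinks(
  html: string,
  pattern1: Extractor,
  pattern2: Extractor,
): ProductLink[] {
  const p1 = pattern1(html);
  const p1HasPrices = p1.some((link) => link.ariaLabel.includes("£"));
  return p1HasPrices ? p1 : pattern2(html);
}
```

With this check, a page where Pattern 1 returns only navigation links falls through to Pattern 2 instead of yielding zero prices.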
Previous logic fired an alert whenever prices_6h=0, even when prices
were genuinely stable (content hash dedup prevents duplicate inserts).
This caused Flexoptix, ATGBICS and others to trigger alerts every 3h
despite their scrapers running successfully.
New logic:
🔴 CRITICAL: last price > 7 days (genuine failure)
🟡 WARNING: last price 48h–7 days (possibly stale)
✅ STABLE: last price ≤48h, 0 new (prices unchanged, scraper OK)
Also shows pg-boss job state/time alongside each vendor for faster
root-cause diagnosis. Trimmed EXPECTED_VENDORS to vendors with actual
scraper implementations (removed never-scraped placeholders).
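The thresholds above can be expressed as a small classifier. This is a sketch under assumptions: `classifyVendor` and `VendorStatus` are hypothetical names, and the extra `"OK"` state for vendors that did record new prices is not described in this commit.

```typescript
// "OK" is an assumed extra state for vendors with new prices in the window.
type VendorStatus = "CRITICAL" | "WARNING" | "STABLE" | "OK";

const HOUR_MS = 60 * 60 * 1000;

// Classify a vendor by the age of its last price, not by prices_6h alone.
function classifyVendor(
  lastPriceAt: Date,
  newPrices6h: number,
  now: Date = new Date(),
): VendorStatus {
  const ageMs = now.getTime() - lastPriceAt.getTime();
  if (ageMs > 7 * 24 * HOUR_MS) return "CRITICAL"; // older than 7 days: genuine failure
  if (ageMs > 48 * HOUR_MS) return "WARNING"; // 48h to 7 days: possibly stale
  // 48h or newer: zero new rows just means hash dedup skipped unchanged prices.
  return newPrices6h === 0 ? "STABLE" : "OK";
}
```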
After PlaywrightCrawler.run() resolves, Crawlee's internal task loop
schedules one final _isTaskReadyFunction call that tries to read a
request queue .json file already cleaned up during processing. This
ENOENT surfaces as an unhandledRejection, which leads to process.exit(1)
and aborts Phase 2 before prices are written to the database.
Added a targeted unhandledRejection handler in the require.main block
that swallows ENOENT errors from request_queues paths (benign Crawlee
cleanup race) while re-raising all other rejections.
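A minimal sketch of such a handler is shown below. It assumes the rejection reason is a Node.js filesystem error carrying `code` and `path` properties; `isBenignCrawleeEnoent` and `installRejectionHandler` are hypothetical names, and the real handler may match on the error message instead.

```typescript
// True only for ENOENT errors whose path points into a request_queues directory.
// Assumption: the reason is a Node.js fs error exposing `code` and `path`.
function isBenignCrawleeEnoent(reason: unknown): boolean {
  if (typeof reason !== "object" || reason === null) return false;
  const err = reason as { code?: unknown; path?: unknown };
  return (
    err.code === "ENOENT" &&
    typeof err.path === "string" &&
    err.path.includes("request_queues")
  );
}

// Intended to be called once from the entry point (the require.main block).
function installRejectionHandler(): void {
  process.on("unhandledRejection", (reason) => {
    if (isBenignCrawleeEnoent(reason)) {
      console.warn("Ignoring benign Crawlee cleanup race:", reason);
      return;
    }
    // Re-raise everything else so genuine failures still crash the process.
    throw reason;
  });
}
```

Keeping the predicate separate from the handler makes the narrow match easy to unit-test without triggering real rejections.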
- Add makeCrawleeConfig isolation to CheerioCrawler instances
- Switch from named persistent RequestQueue to ephemeral null queues:
named queues retain 'handled' state and skip all URLs on re-runs,
causing 0 observations on every run after the first.
- Applies to both enrichSwitchFromEbay and enrichTransceiversFromEbay.
- Add utils/crawlee-config.ts: makeCrawleeConfig(name) returns a
Crawlee Configuration with isolated localDataDirectory per scraper.
Uses storageClientOptions (not global CRAWLEE_STORAGE_DIR) so
concurrent pg-boss workers in the same process don't race on
the shared env var.
- Apply makeCrawleeConfig to all 6 Crawlee-based scrapers:
optcore (PlaywrightCrawler), atgbics (PlaywrightCrawler),
community-issues (CheerioCrawler + RequestQueue),
edgecore (CheerioCrawler), ufispace (CheerioCrawler),
market-intelligence (CheerioCrawler).
- scheduler.ts: add withIsolatedStorage for optcore and market-intel
workers (was missing, caused storage-fs path bleed from fs scraper).
- ebay-enricher.ts: fix vendor type 'marketplace' -> 'reseller' to
satisfy vendors_type_check constraint
['manufacturer','distributor','oem','reseller','compatible'].
FiberMall:
- Correct /store-XXXXX-name.htm category URLs (was /c/xxx/ → HTTP 404)
- Parser: split on new_proList_mainListLi, price from data-price on
currency_price span — fix 0.00 false-match from SKU variant items
- Also scrape SKU brand variant links from .sku_item divs
- Result: 3,410 prices now in DB (was 0)
Flexoptix:
- Fix extractPrice regex for EUR thousand-separator format
(2,921.60 EUR was parsed as 2 EUR)
- Add OSFP224 / 1.6T search queries (4 new, form factor was missing)
- Fix O.138HG2.C.05 stale price 3009.60→2921.60 EUR
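The thousand-separator fix can be illustrated with a small sketch. `extractEurPrice` is a hypothetical stand-in for the real extractPrice, and it assumes prices use a comma for thousands, a dot for decimals and a trailing "EUR".

```typescript
// Hypothetical stand-in for extractPrice; assumes "1,234.56 EUR" style input.
function extractEurPrice(text: string): number | null {
  // Whole part: comma-grouped thousands or plain digits; optional decimals; then "EUR".
  const match = text.match(/(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?\s*EUR/);
  if (!match) return null;
  const [, whole = "", fraction = ""] = match;
  // Drop the grouping commas before converting, so "2,921.60" becomes 2921.6.
  return Number.parseFloat(whole.replace(/,/g, "") + fraction);
}
```

A pattern that stops at the first comma reads only the leading "2" of "2,921.60 EUR", which is the failure described above.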
Schema: competitor_verified + competitor_verified_at columns
added via ALTER TABLE (were referenced in code but missing in DB)
CHANGELOG: added 6 entries for 2026-04-12
- New scrapers: fibermall.ts (WooCommerce), vcelink.ts (Shopify), opticsbay.ts (WooCommerce)
- QSFPTEK rewritten to use /mall/commodity/list API (old OpenCart /c/*.html paths now return 404)
- New: attribute-based filtering by data rate (1G/10G/25G/40G/100G/200G/400G/800G)
- Scrapes HTML fragments, extracts US$ prices and product URLs
- scheduler.ts: +3 queues/schedules/workers (fibermall, vcelink, opticsbay) → 61 total workers
- index-pi.ts: Pi fleet picks up all 3 new scrapers
Crawlee's SessionPool throws 'Could not find SDK_SESSION_POOL_STATE.json'
when initializing against a freshly-created isolated storage dir.
Setting CRAWLEE_PURGE_ON_START=1 tells Crawlee to start fresh instead
of trying to load non-existent session state — fixes FS.com and ATGBICS
crashes at the start of every 2h cycle after the dirs were cleaned up.
ProLabs uses a B2B quote model: prices require a reseller account and are
not shown publicly (schema.org markup always reports price=0.00). Fighting
the CloudFront WAF with Firefox automation is pointless.
New approach:
- Sitemap-driven: downloads all 14 sitemaps to collect product URLs
- fetch-based: curl-compatible HTTP requests bypass CloudFront TLS detection
- catalog-only: writes part numbers + specs to transceivers table
- Rate-limited: 300ms between requests (~3 req/sec)
- No proxy needed: Pi nodes no longer consumed for ProLabs
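The sitemap-driven step can be sketched as a small `<loc>` extractor. This is an illustration only: `extractSitemapUrls` is a hypothetical name, and a regex is used instead of a full XML parser on the assumption that the sitemaps are flat `<url><loc>` lists.

```typescript
// Hypothetical helper: collect every <loc> URL from one sitemap document.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const url = match[1];
    // Sitemaps escape "&" as "&amp;"; undo that so the URL can be requested.
    if (url) urls.push(url.replace(/&amp;/g, "&"));
  }
  return urls;
}
```

The resulting list would then be requested one URL at a time with the 300ms pause between requests.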
Remove boss.work() registrations for lightweight fetch/cheerio scrapers
from Erik's scheduler. Pis are now the SOLE consumers of these queues:
fluxlight, gbics, optcore, champion-one, sfpcables, blueoptics, fiber24,
tscom, skylane, ascentoptics, gaotek, smartoptics, hubersuhner, news,
market-intel.
Routes requests through CT130/131/132 proxy pool (192.168.178.77/76/74:1080)
when PROXY_URLS env var is set. Uses ProxyConfiguration from crawlee for
PlaywrightCrawler scrapers and socks-proxy-agent for fetch-based scrapers.
EUR prices scraped verbatim from de.fs.com — no conversion needed.
Any USD value is derived downstream from EUR (EUR→USD); EUR is never derived from USD.
Fixes price discrepancy: TIP showed USD 999×0.92=EUR 866 vs real €948 on de.fs.com.
Root cause of fake prices (e.g. 1.30 for 800G OSFP):
- parsePrice accepted any bare number without currency symbol
- Could misread stock counts, page numbers, or CSS values as prices
- Also picked the first number, not the main price
Fix:
- Require explicit currency symbol or decimal format (1234.56)
- Use the LARGEST number found in the price string
- Returns price=0 (rejected) when no valid price pattern found
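The three rules in the fix can be sketched as below. This is not the project's parsePrice: the token pattern and the accepted currency symbols ($, €, £) are assumptions chosen for illustration.

```typescript
// A token is either a currency symbol followed by a number, or a bare number
// in strict two-decimal form such as 1234.56. Bare integers never match.
const PRICE_TOKEN = /[$€£]\s*\d[\d,]*(?:\.\d+)?|\d[\d,]*\.\d{2}(?!\d)/g;

function parsePrice(raw: string): number {
  const tokens = raw.match(PRICE_TOKEN) ?? [];
  const amounts = tokens
    // Strip symbols, spaces and thousand separators before converting.
    .map((token) => Number.parseFloat(token.replace(/[^\d.]/g, "")))
    .filter((n) => Number.isFinite(n) && n > 0);
  // Largest candidate wins; 0 signals "rejected, no valid price pattern".
  return amounts.length > 0 ? Math.max(...amounts) : 0;
}
```

Taking the largest candidate means a string holding both a small stray decimal and the real price resolves to the real price, while strings with only stock counts or page numbers are rejected.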
Blog engine (fo-blog-pipeline.ts):
- Add STEP8b_REDUCTION: cuts article 25-35%, removes repeated concepts
- Add STEP8c_STYLE_LOCK: enforces tone consistency, fixes scope/OPM confusion,
removes inline SKUs from article flow
- Add Gold Standard 3 to calibration (Style B troubleshooting example 2026-04-04)
- Pipeline now 12 steps (was 10), version bumped to v4-reduction-stylelock
blog.ts:
- Wire STEP8b and STEP8c into pipeline between Kill-AI-Tone and QA Check
- Update progress tracking to 12 total steps
- Update pipeline_version to 'v4-reduction-stylelock'
flexoptix-catalog.ts:
- Fix contentHash call: pass object directly, not JSON.stringify(object)
db.ts:
- price_verified=true set in content_hash early-return path (no new observation)
- image_verified=true auto-set in findOrCreateScrapedTransceiver on INSERT/UPDATE
- findOrCreateScrapedTransceiver now sets image_verified=true when writing image_url
- upsertPriceObservation now sets price_verified=true on the transceiver after inserting price
- Both INSERT and UPDATE paths covered for image_verified sync
- Eliminates need for manual backfill after scraper runs
PostgreSQL max_connections was being exceeded (100/100).
- Limit pg-boss internal pool to 4 connections
- Add idle_in_transaction_session_timeout=30s to PostgreSQL config
- max_connections already raised to 300 (container config)
System now stable at ~98/300 connections
Complete Pi scraper entry point covering all pricing, catalog, compat,
intelligence and prediction signal scrapers. Includes 5 new form-factor
coverage scrapers (comms-express, router-switch, multimode-inc,
optictransceiver, wiitek). Erik runs only API+DB, all scraping on Pis.
Add Comms-Express, Router-Switch.com, Multimode Inc, OpticTransceiver.com,
and Wiitek scrapers covering CFP2-DCO, CFP4, OSFP224, QSFP112, CXP, GBIC,
XENPAK, CSFP, SFP-DD, SFP56, QSFP56 and other previously-uncovered form
factors. Each scheduled every 8h. Worker registrations added to scheduler.
Also export a db alias in utils/db.ts to fix the eBay enricher and community scrapers
crashing with "Cannot read properties of undefined (reading 'query')".
- downloadDocuments(): fetches PDFs from product_documents and documents tables
using curl, organises into switches/ transceivers/ whitepapers/ other/ subdirs
- Integrated into runNightlyNasSync() — runs after JSON exports
- rsync incremental — only new/changed files transferred
- NAS dir structure: /volume1/tip-data/datasheets/{switches,transceivers,whitepapers,other}
- max-filesize 50MB guard per file
- utils/logger.ts: minimal console-based logger (debug/info/warn/error)
used by community-issues and ebay-enricher scrapers
- scripts/pi-scraper-setup.sh: step 7 adds optional WireGuard setup
(pass WG_PRIVKEY + WG_ADDR env vars) that connects the Pi to Erik for DB access;
auto-detects dead ethernet and routes WG traffic via a working interface
New scrapers (8):
- BlueOptics (EUR, every 4h)
- ShopFiber24 (EUR, every 4h)
- T&S Communication (USD, every 4h)
- SmartOptics (catalog, every 8h)
- HUBER+SUHNER (catalog, every 8h)
- Skylane Optics (USD, every 4h)
- AscentOptics (USD, every 4h)
- GAO Tek (USD, every 4h)
Scheduler: nightly window → 24/7 continuous (42 jobs total)
- Playwright scrapers: every 8h (FS.com, 10Gtek, ATGBICS, ProLabs)
- Fetch/Cheerio: every 4h (11 lightweight vendors)
- Flexoptix catalog: every 2h (primary price source)
- eBay enrichment: every 6h
- Compatibility matrices: every 12h
- Compute jobs: every 4h
Pi fleet: scripts/pi-scraper-setup.sh for one-command Pi node setup
Remove orphan schedules (addon/naddod/qsfptek) that had no registered workers.
Pre-create request_queues/default, datasets/default, key_value_stores/default
before each scraper run to avoid ENOENT when Crawlee tries to write lock files.
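A minimal sketch of the pre-creation step, assuming it runs synchronously before the crawler starts. `ensureCrawleeStorageDirs` is a hypothetical name; the three subdirectories are the ones listed above.

```typescript
import { existsSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Default stores that must exist before the run, per the commit text.
const CRAWLEE_SUBDIRS = [
  "request_queues/default",
  "datasets/default",
  "key_value_stores/default",
];

// Create any missing store directories under storageRoot and
// return the ones that were newly created on this call.
function ensureCrawleeStorageDirs(storageRoot: string): string[] {
  const created: string[] = [];
  for (const sub of CRAWLEE_SUBDIRS) {
    const dir = join(storageRoot, sub);
    if (!existsSync(dir)) {
      mkdirSync(dir, { recursive: true }); // also creates missing parents
      created.push(dir);
    }
  }
  return created;
}
```

Calling it again on the same root is a no-op, so it is safe to run before every scraper invocation.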
Previously missing from scheduler:
- Champion ONE, Fluxlight, GBICs, SFPCables pricing
- Juniper HCT, SONiC HCL, Ufispace, Edgecore compatibility
- Flexoptix supported vendors
- Switch assets enrichment
Full nightly sequence now covers every scraper in the fleet.
All jobs staggered with 15-30 min gaps to respect vendor rate limits.