2026-05-24
DataMesh Consulting
24 May — Late-Saturday extractor trio: CAPT Kuwait, Oman Tender Board, CERN R-Shiny (our first Step-5)
Three new portals went live late Saturday evening, all written as JSDOM parsers but with different orchestration shapes. CAPT (Kuwait) replaces the legacy SharePoint-table parser with a Cloudflare-gated WordPress listing where each card pre-renders its own detail block. Oman Tender Board is a server-rendered J2EE app behind an F5 BIG-IP TSPD WAF — the encrypted-POST pagination is Phase-2, but the first-page GET is unencrypted and ships ~50 fresh rows, which is what daily ingest needs. CERN forthcoming-ms is our first Step-5 — the page is an R Shiny app that streams its DataTable into the DOM via WebSocket only after a 2–3 s handshake. The browser pool's existing `table tr td` + 2 s wait turned out to be enough margin, so the parser ended up as the same pure JSDOM walk as the others.
What's new
Three Saturday-evening shipments, all smaller geographies or special cases that the legacy hardcoded extractors weren't going to handle well as the catalog kept growing. Each one is a Step-0 style JSDOM parser; the difference is what the orchestrator has to do to get clean HTML in front of the parser.
CAPT Kuwait — capt.gov.kw
The Central Agency for Public Tenders is Kuwait's central
government procurement body. The old asia.js parser pointed at a
SharePoint-era table that has been replaced; the current portal
is a WordPress-themed listing where every card pre-renders both
a summary box and a fully populated detail block in the same
HTML. The "MORE" button is a pure client-side toggle, so we never
need a per-tender detail fetch — one listing GET returns the full
record.
Cloudflare-gated, so the request has to go through the Playwright
browser pool rather than direct axios. Once rendered, the parser
pulls the Arabic and English title pair (the /en route still
serves the Arabic subject line in the summary box, so we keep
both), tender number, organisation, request date, deadline,
bidding type, KWD price, and the insurance / bank-guarantee
amount.
backend/scripts/seed-sites.js got the URL flip
(capt.gov.kw → /en/tenders/opening-tenders/) so the cron path
hits the live listing, not the marketing root.
Oman Tender Board — etendering.tenderboard.gov.om
Server-rendered J2EE app behind an F5 BIG-IP TSPD WAF, operated
by the Projects, Tenders and Local Content Authority. The probe
back on 9 May noted that the bDashboard form requires
SHA-256-hashed encparam / hashval to POST anything —
pagination, filter changes, the getNit(<id>) detail
click-through, all encrypted. What hadn't been tested at probe
time: the first-page GET is unencrypted and returns ~50 rows of
fully populated HTML.
For daily ingest, that's enough. The default Open-Tenders view is sorted by opening date desc, so page 1 is the freshest cohort — which is what matters for matching. The ~364-page back catalogue that lives behind the encrypted POST pagination is Phase-2 territory; expired notices in the long tail aren't interesting.
Each row carries the tender number, Arabic title (with the full
text in a tooltip onmouseover="Tip('…')" that we lift), the
issuing ministry, category + grades (e.g. الممتازة، الأولى),
tender type (general / international / direct / SME), opening
date+time as DD-MM-YYYY HH:MM (which is the submission cutoff,
not the publish date), tender fee, and bank guarantee. NIT IDs
get parsed out of the row's onclick="javascript:getNit('87949')"
attribute and kept in extraData.nitId for the eventual Phase-2
detail join.
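
The attribute lifting described above can be sketched roughly as follows. This is a minimal illustration, not the shipped parser: the helper names (parseNitId, parseTooltip, parseOpeningDate, buildOmanRecord), the input field names, and the title / deadline keys are assumptions made for the example; only the attribute formats and extraData.nitId come from the notes above.

```javascript
// Hypothetical helpers; names and record shape are illustrative only.

// Pull the NIT ID out of an onclick value such as "javascript:getNit('87949')".
// Returns the ID as a string, or null when the attribute has no getNit call.
function parseNitId(onclick) {
  const m = /getNit\(\s*['"]?(\d+)['"]?\s*\)/.exec(onclick || "");
  return m ? m[1] : null;
}

// Lift the full title out of an onmouseover value shaped like Tip('...').
// Backslash-escaped characters inside the quoted string are unescaped.
function parseTooltip(onmouseover) {
  const m = /Tip\(\s*'((?:\\.|[^'\\])*)'/.exec(onmouseover || "");
  return m ? m[1].replace(/\\(.)/g, "$1").trim() : null;
}

// Convert "DD-MM-YYYY HH:MM" into "YYYY-MM-DDTHH:MM". No timezone is attached
// here; how the portal's local time is normalised is left to the caller.
function parseOpeningDate(text) {
  const m = /^(\d{2})-(\d{2})-(\d{4})\s+(\d{2}):(\d{2})$/.exec((text || "").trim());
  if (!m) return null;
  const [, dd, mm, yyyy, hh, min] = m;
  return `${yyyy}-${mm}-${dd}T${hh}:${min}`;
}

// Assemble a partial record from raw attribute strings taken off one row.
// The opening date is stored as the deadline because it is the submission
// cutoff, not the publish date.
function buildOmanRecord(row) {
  return {
    title: parseTooltip(row.onmouseover) || row.cellText || null,
    deadline: parseOpeningDate(row.openingText),
    extraData: { nitId: parseNitId(row.onclick) },
  };
}
```

Keeping these as pure string functions means they can be exercised without a rendered page; the JSDOM walk would only need to hand over the raw attribute values for each row.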
CERN forthcoming-ms — our first Step-5
This one's the interesting one. forthcoming-ms.app.cern.ch is an
R Shiny app (shiny.router + DT / datatables-binding), not an
Angular SPA. The shell HTML is ~18 KB and contains zero tender
data. The full DataTable is rendered into the DOM after a
WebSocket push from the Shiny server ~2–3 s after page load.
Step-5 in our extraction ladder means: the data only exists in the
DOM after a stateful WebSocket handshake. We were dreading the
first Step-5 onboarding — figured we'd have to fork the
orchestrator's BrowserScraper to expose WS frames, or write a
Playwright route handler that mimics a Shiny session. Turns out
neither was needed. The BrowserScraper already waits for the
selector table tr td plus a 2 s pad before handing HTML back to
the JSDOM parser, which is enough margin for the Shiny WS
handshake to complete and populate the table.
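
For readers who have not seen the wait logic, the behaviour described above amounts to something like the sketch below. It is an assumption-laden illustration, not the real BrowserScraper: the function name renderForParser and its options object are hypothetical, and page is assumed to behave like a Playwright Page (goto, waitForSelector, waitForTimeout, content).

```javascript
// Hypothetical sketch; renderForParser is not the real BrowserScraper API.
// `page` is assumed to be a Playwright Page or an object with the same methods.
async function renderForParser(page, url, options = {}) {
  const { selector = "table tr td", padMs = 2000 } = options;

  await page.goto(url);

  // Block until an element matching the selector exists. For the Shiny app
  // this only happens once the WebSocket push has rendered the DataTable.
  await page.waitForSelector(selector);

  // Fixed pad after the first match, giving remaining rows time to land
  // before the DOM is serialised.
  await page.waitForTimeout(padMs);

  // Serialised HTML, ready to hand to a JSDOM-based parser.
  return page.content();
}
```

Because the selector wait only resolves after the table has cells, the shell HTML with zero tender data should never reach the parser; the pad covers the gap between the first cell appearing and the table finishing.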
So the parser itself is the same shape as capt-kw.js or
afdb.js: a pure HTML walk, no WS subscription. The complexity
sits in trusting the renderer to wait long enough. We verified
against three captures on different network conditions before
believing the 2 s pad was robust; it is, with about 1.4 s of
99th-percentile slack on a UK→Geneva link.
~75 forthcoming procedures in the table at any given time. CERN
publishes cost buckets (e.g. 200k - 400k, 1M - 2M in CHF)
rather than exact estimated values — the bucket lands in
extraData.costRange and value stays null. Buyer is the
constant CERN; the per-tender technical and commercial contact
people live in the row-click modal, which is the next Phase-2
target if we ever need person-level routing.
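
As a rough illustration of the cost-bucket handling, the sketch below keeps the raw bucket in extraData.costRange and leaves value as null, as described above. The numeric bounds (costBounds) and the helper names are hypothetical additions for the example, and the suffix handling assumes buckets only ever use k and M.

```javascript
// Hypothetical helpers; only value = null and extraData.costRange come from
// the notes above. costBounds is an illustrative extra.

// Multipliers for the suffixes seen in buckets like "200k - 400k" or "1M - 2M".
const SUFFIX_SCALE = { k: 1e3, m: 1e6 };

// Parse one side of a bucket ("200k", "1M", "500") into a plain number.
function parseAmount(token) {
  const m = /^(\d+(?:\.\d+)?)\s*([kM])?$/i.exec(token.trim());
  if (!m) return null;
  const scale = m[2] ? SUFFIX_SCALE[m[2].toLowerCase()] : 1;
  return Number(m[1]) * scale;
}

// Parse "200k - 400k" into { min, max, currency }. Returns null for anything
// that is not a two-sided range, rather than guessing.
function parseCostBucket(text) {
  const parts = (text || "").split(/\s*-\s*/);
  if (parts.length !== 2) return null;
  const min = parseAmount(parts[0]);
  const max = parseAmount(parts[1]);
  if (min === null || max === null) return null;
  return { min, max, currency: "CHF" };
}

// Build the cost-related fields of a record: no exact value, bucket preserved.
function costFields(bucketText) {
  const raw = (bucketText || "").trim();
  return {
    value: null,
    extraData: { costRange: raw, costBounds: parseCostBucket(raw) },
  };
}
```

Keeping value null avoids implying a point estimate that CERN never published; any numeric bounds are a derived convenience and should be treated as such downstream.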
contactEmail is set to the constant procurement.service@cern.ch
since that's the only address authoritative for the listing path —
the modal's mailto: link confirms it.
Three at once is OK; four would be too many
A reminder that batched extractor shipments scale poorly past three. Catching a regression across three new portals in one push is fine — they're each isolated enough that a failure in CAPT doesn't risk Oman or CERN. Beyond three, the cross-portal review overhead starts eating the time savings, and the AGENTS / SYSTEM-STATE / CHANGELOG docs that need updating per portal turn into a slog. Three-per-batch stays as the ceiling.
docs/EXTRACTORS.md, docs/AGENTS.md, docs/SYSTEM-STATE.md, and
docs/CHANGELOG.md are all updated with the trio. The CERN row in
EXTRACTORS.md is the only one currently marked Step-5; everything
else is still Step-0.