2026-05-24 · DataMesh Consulting

24 May — Late-Saturday extractor trio: CAPT Kuwait, Oman Tender Board, CERN R-Shiny (our first Step-5)

Three new portals went live late Saturday evening, all written as JSDOM parsers but with different orchestration shapes. CAPT (Kuwait) replaces the legacy SharePoint-table parser with a Cloudflare-gated WordPress listing where each card pre-renders its own detail block. Oman Tender Board is a server-rendered J2EE app behind an F5 BIG-IP TSPD WAF — the encrypted-POST pagination is Phase-2, but the first-page GET is unencrypted and ships ~50 fresh rows, which is what daily ingest needs. CERN forthcoming-ms is our first Step-5 — the page is an R Shiny app that streams its DataTable into the DOM via WebSocket only after a 2–3 s handshake. The browser pool's existing `table tr td` + 2 s wait turned out to be enough margin, so the parser ended up as the same pure JSDOM walk as the others.

What's new

Three Saturday-evening shipments, all smaller geographies or special cases that the legacy hardcoded extractors weren't going to handle well as the catalog kept growing. Each one is a Step-0 style JSDOM parser; the difference is what the orchestrator has to do to get clean HTML in front of the parser.

CAPT Kuwait — capt.gov.kw

The Central Agency for Public Tenders is Kuwait's central government procurement body. The old asia.js parser pointed at a SharePoint-era table that has since been replaced; the current portal is a WordPress-themed listing where every card pre-renders both a summary box and a fully populated detail block in the same HTML. The "MORE" button is a pure client-side toggle, so we never need a per-tender detail fetch: one listing GET returns the full record.

The listing is Cloudflare-gated, so the request has to go through the Playwright browser pool rather than a direct axios call. Once rendered, the parser pulls the Arabic and English title pair (the /en route still serves the Arabic subject line in the summary box, so we keep both), tender number, organisation, request date, deadline, bidding type, KWD price, and the insurance / bank-guarantee amount.

backend/scripts/seed-sites.js got the URL flip (capt.gov.kw/en/tenders/opening-tenders/) so the cron path hits the live listing, not the marketing root.

Oman Tender Board — etendering.tenderboard.gov.om

Server-rendered J2EE app behind an F5 BIG-IP TSPD WAF, operated by the Projects, Tenders and Local Content Authority. The probe back on 9 May noted that the bDashboard form requires SHA-256-hashed encparam / hashval to POST anything — pagination, filter changes, the getNit(<id>) detail click-through, all encrypted. What hadn't been tested at probe time: the first page GET is unencrypted and returns ~50 rows of fully populated HTML.

For daily ingest, that's enough. The default Open-Tenders view is sorted by opening date desc, so page 1 is the freshest cohort — which is what matters for matching. The ~364-page back catalogue that lives behind the encrypted POST pagination is Phase-2 territory; expired notices in the long tail aren't interesting.

Each row carries the tender number, Arabic title (with the full text in a tooltip onmouseover="Tip('…')" that we lift), the issuing ministry, category + grades (e.g. الممتازة، الأولى), tender type (general / international / direct / SME), opening date+time as DD-MM-YYYY HH:MM (which is the submission cutoff, not the publish date), tender fee, and bank guarantee. NIT IDs get parsed out of the row's onclick="javascript:getNit('87949')" attribute and kept in extraData.nitId for the eventual Phase-2 detail join.
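The attribute lifting described above can be sketched with a few small string helpers. This is a minimal illustration, not the shipped parser: the function names (parseNitId, parseTipText, parseOpeningDate) are hypothetical, the backslash-escaping inside Tip('…') is an assumption about how the portal quotes its tooltips, and no timezone is attached to the parsed date.

```javascript
// Hypothetical helpers; names and shapes are illustrative only.

// Pull the numeric NIT id out of onclick="javascript:getNit('87949')".
function parseNitId(onclick) {
  const m = /getNit\(\s*['"]?(\d+)['"]?\s*\)/.exec(onclick || "");
  return m ? m[1] : null;
}

// Lift the first string argument out of onmouseover="Tip('…')".
// Assumes single-quoted text with backslash escapes (an assumption).
function parseTipText(onmouseover) {
  const m = /Tip\(\s*'((?:\\.|[^'\\])*)'/.exec(onmouseover || "");
  return m ? m[1].replace(/\\(.)/g, "$1") : null;
}

// Convert "DD-MM-YYYY HH:MM" into "YYYY-MM-DDTHH:MM" (no timezone attached).
// Returns null for anything that does not match or is out of range.
function parseOpeningDate(text) {
  const m = /^\s*(\d{2})-(\d{2})-(\d{4})\s+(\d{2}):(\d{2})\s*$/.exec(text || "");
  if (!m) return null;
  const [, dd, mm, yyyy, hh, min] = m;
  const inRange =
    Number(dd) >= 1 && Number(dd) <= 31 &&
    Number(mm) >= 1 && Number(mm) <= 12 &&
    Number(hh) <= 23 && Number(min) <= 59;
  return inRange ? `${yyyy}-${mm}-${dd}T${hh}:${min}` : null;
}
```

In a JSDOM walk these would be fed from row.getAttribute("onclick") and the title cell's onmouseover attribute, with the NIT id stored under extraData.nitId as described above.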

CERN forthcoming-ms — our first Step-5

This one's the interesting one. forthcoming-ms.app.cern.ch is an R Shiny app (shiny.router + DT / datatables-binding), not an Angular SPA. The shell HTML is ~18 KB and contains zero tender data. The full DataTable is rendered into the DOM after a WebSocket push from the Shiny server ~2–3 s after page load.

Step-5 in our extraction ladder means: the data only exists in the DOM after a stateful WebSocket handshake. We were dreading the first Step-5 onboarding — figured we'd have to fork the orchestrator's BrowserScraper to expose WS frames, or write a Playwright route handler that mimics a Shiny session. Turns out neither was needed. The BrowserScraper already waits for the selector table tr td plus a 2 s pad before handing HTML back to the JSDOM parser, which is enough margin for the Shiny WS handshake to complete and populate the table.
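The wait-then-pad behaviour described above could look roughly like the sketch below. It is not the actual BrowserScraper code: renderWithPad and its option names are hypothetical, the 30 s selector timeout is an assumed default, and `page` is assumed to behave like a Playwright Page (waitForSelector, waitForTimeout and content are real Playwright Page methods).

```javascript
// Hypothetical sketch; renderWithPad and its options are illustrative names.
// `page` is expected to behave like a Playwright Page.
async function renderWithPad(
  page,
  { selector = "table tr td", padMs = 2000, timeoutMs = 30000 } = {}
) {
  // Blocks until at least one matching cell shows up (or the timeout fires).
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // Fixed pad so rows that arrive after the first cell have time to land.
  await page.waitForTimeout(padMs);
  // Serialised DOM, ready to hand to a JSDOM parser.
  return page.content();
}
```

Because the shell HTML carries no table cells, the selector wait cannot resolve until the Shiny push has started populating the table; the pad only needs to cover the remainder of the render.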

So the parser itself is the same shape as capt-kw.js or afdb.js: a pure HTML walk with no WS subscription. The complexity sits in trusting the renderer to wait long enough. We verified against three captures under different network conditions before believing the 2 s pad was robust; it is, with about 1.4 s of 99th-percentile slack on a UK→Geneva link.

~75 forthcoming procedures in the table at any given time. CERN publishes cost buckets (e.g. 200k - 400k, 1M - 2M in CHF) rather than exact estimated values — the bucket lands in extraData.costRange and value stays null. Buyer is the constant CERN; the per-tender technical and commercial contact people live in the row-click modal, which is the next Phase-2 target if we ever need person-level routing.
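One way to handle the cost buckets is sketched below. The post only says that the bucket lands in extraData.costRange and that value stays null; the structured { min, max, currency, raw } shape and the helper names are assumptions made for illustration. Only closed ranges like the two examples above are handled, and anything else returns null rather than guessing.

```javascript
// Hypothetical helpers; the structured costRange shape is an assumption.
const MULTIPLIERS = { k: 1e3, m: 1e6 };

// "200k" -> 200000, "1M" -> 1000000, "750" -> 750; null if unparseable.
function parseAmount(token) {
  const m = /^(\d+(?:\.\d+)?)\s*([kKmM])?$/.exec(token.trim());
  if (!m) return null;
  const scale = m[2] ? MULTIPLIERS[m[2].toLowerCase()] : 1;
  return Number(m[1]) * scale;
}

// "200k - 400k" -> { min: 200000, max: 400000, currency: "CHF", raw: "200k - 400k" }
function parseCostBucket(text) {
  const parts = (text || "").split(/\s*-\s*/);
  if (parts.length !== 2) return null;
  const min = parseAmount(parts[0]);
  const max = parseAmount(parts[1]);
  if (min === null || max === null || min > max) return null;
  return { min, max, currency: "CHF", raw: text.trim() };
}

// The exact value is never published, so it stays null on the record.
function toRecordFields(bucketText) {
  return { value: null, extraData: { costRange: parseCostBucket(bucketText) } };
}
```

Keeping the raw string next to the parsed bounds means a bucket label that fails to parse can still be inspected later.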

contactEmail is set to the constant procurement.service@cern.ch, since that's the only authoritative address for the listing path; the modal's mailto: link confirms it.

Three at once is OK; four would be too many

A reminder that batched extractor shipments scale poorly past three. Catching a regression across three new portals in one push is fine — they're each isolated enough that a failure in CAPT doesn't risk Oman or CERN. Beyond three, the cross-portal review overhead starts eating the time savings, and the AGENTS / SYSTEM-STATE / CHANGELOG docs that need updating per portal turn into a slog. Three-per-batch stays as the ceiling.

docs/EXTRACTORS.md, docs/AGENTS.md, docs/SYSTEM-STATE.md, docs/CHANGELOG.md all updated with the trio. The CERN row in EXTRACTORS.md is the only one currently marked Step-5; everything else is still Step-0.

Methodology: drawn from the week ending 2026-05-24 tender corpus. Tender data sourced from public procurement portals worldwide; see our methodology for the extraction pipeline.