2026-05-26 · DataMesh Consulting

26 May — CEJN Phase-2 detail enrichment + five more extractors land in WIP staging

CEJN (Catalonia's contracting platform) gained opt-in detail enrichment today: `enrichDetails:true` runs a per-row fetch against `/api/cadocuments/GetTenderById` to pick up description text, estimated budget, pipe-delimited CPVs, procurement-officer email/phone, and eight category/procedure flags that don't appear in the listing JSON. Detail fetches run at concurrency 4 and are circuit-broken after 6 consecutive failures, so a bad afternoon at the source doesn't take down the whole ingest. Alongside that, five new extractors landed in the repo but haven't been wired into the URL/API dispatch tables yet: buy-nsw (NSW state procurement), dgmarket (development-bank multi-source aggregator), koneps-kr (Korea g2b), philgeps-ph (Philippines), and sam-gov (US federal GSA). Each ships with a smoke-test harness; the wiring step is intentionally deferred so we can verify capture quality side-by-side against the legacy paths before they go live. RTA Dubai and Qatar Rail Procurement were deactivated in the site list; neither has produced a valid notice in three weeks.

CEJN Phase-2 enrichment — opt-in, circuit-broken

CEJN, the Plataforma de Serveis de Contractació Pública del sector públic de Catalunya, has been shipping at Phase-1 since it landed: listing-JSON only, with title / buyer / publication date / deadline but no description, no value, no CPVs. The listing endpoint returns roughly 18 KB per page; the per-row detail at /api/cadocuments/GetTenderById?id=… returns ~6 KB of structured fields we were leaving on the floor.

Today's change adds opt-in enrichment behind an enrichDetails:true extractor option. With it on, after the listing parse completes we hit the detail endpoint per row with concurrency 4 and pull:

  • notes → description
  • estimatedBudget → value (and currency from currencyCode)
  • pipe-delimited cpvs → cpvCodes[], de-duplicated and validated as 8-digit
  • procurementOfficer.{email,phone,name} → contactEmail, contactPhone, contactName
  • Buyer-org metadata: department, body type, place of contract
  • Eight boolean procurement flags (subjectToHarmonisedRegulation, electronicProcedure, lotsAllowed, variantsAllowed, frameworkAgreement, dynamicPurchasingSystem, eAuction, reservedForSME) → extraData.flags
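The mapping above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the function names (parseCpvs, mapDetail) are hypothetical, the flags are assumed to be top-level booleans on the detail payload, and the strict 8-digit check assumes the source does not append a check-digit suffix to its CPV codes.

```javascript
// Hypothetical mapper for one detail payload. Source key names follow the
// list above; the overall response shape is an assumption.
const FLAG_KEYS = [
  "subjectToHarmonisedRegulation", "electronicProcedure", "lotsAllowed",
  "variantsAllowed", "frameworkAgreement", "dynamicPurchasingSystem",
  "eAuction", "reservedForSME",
];

function parseCpvs(raw) {
  // Split the pipe-delimited string, trim each part, keep only 8-digit
  // codes, then de-duplicate while preserving first-seen order.
  const codes = String(raw ?? "")
    .split("|")
    .map((code) => code.trim())
    .filter((code) => /^\d{8}$/.test(code));
  return [...new Set(codes)];
}

function mapDetail(detail) {
  const officer = detail.procurementOfficer ?? {};
  const flags = {};
  for (const key of FLAG_KEYS) {
    // Only copy genuine booleans so a malformed value never becomes a flag.
    if (typeof detail[key] === "boolean") flags[key] = detail[key];
  }
  return {
    description: detail.notes ?? null,
    value: detail.estimatedBudget ?? null,
    currency: detail.currencyCode ?? null,
    cpvCodes: parseCpvs(detail.cpvs),
    contactEmail: officer.email ?? null,
    contactPhone: officer.phone ?? null,
    contactName: officer.name ?? null,
    extraData: { flags },
  };
}
```

Missing fields map to null rather than being dropped, which keeps the enriched shape stable for any downstream comparison against listing-only rows.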

Error-isolated per row: a 5xx or a missing-record on one tender doesn't fail the listing. A circuit breaker trips after six consecutive failures and shuts enrichment down for the rest of the run — the listing-only payload still ships, so we don't lose the day's notices because the detail API is having an afternoon.
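A minimal sketch of that behaviour, assuming a caller-supplied async fetchDetail(row) function. The helper name enrichRows, its option names, and the detail/enrichError row properties are hypothetical and only illustrate the pattern of bounded concurrency, per-row isolation, and a consecutive-failure breaker.

```javascript
// Sketch only: fetchDetail is supplied by the caller, and every name here
// is illustrative rather than taken from the extractor.
async function enrichRows(
  rows,
  fetchDetail,
  { concurrency = 4, maxConsecutiveFailures = 6 } = {}
) {
  let next = 0;                 // index of the next row to pick up
  let consecutiveFailures = 0;  // failure streak, in completion order
  let tripped = false;          // true once the breaker has opened

  async function worker() {
    while (!tripped && next < rows.length) {
      const row = rows[next++];
      try {
        row.detail = await fetchDetail(row);
        consecutiveFailures = 0; // any success resets the streak
      } catch (err) {
        // Per-row isolation: record the error and keep the listing row.
        row.enrichError = String(err && err.message ? err.message : err);
        consecutiveFailures += 1;
        if (consecutiveFailures >= maxConsecutiveFailures) tripped = true;
      }
    }
  }

  // Run `concurrency` workers that pull rows from the shared cursor.
  await Promise.all(Array.from({ length: concurrency }, worker));
  return { rows, tripped };
}
```

With several workers in flight, "consecutive" here means consecutive in completion order, and requests already in flight when the breaker opens still finish. Every row is returned either way, so the listing-only payload survives a tripped breaker.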

Off by default for the cron path right now. We'll enable it on the next prod-push cycle once a full-listing smoke run confirms the field mapping holds across the long tail of notice types.

Five new extractors — built, smoke-tested, not yet wired

The branch had been carrying five new extractors as uncommitted work for a few days while the CERN / Oman / CAPT trio got the attention. They went in today:

  • buy-nsw.js: Buy NSW, the New South Wales state procurement portal. Public JSON behind a session cookie; Step-0 listing parser plus a per-notice detail fetch for the full description and contact block.
  • dgmarket.js: DgMarket, a development-bank multi-source aggregator (AfDB, ADB, IADB, World Bank, AIIB, EBRD, IsDB). It runs to 726 lines because the per-bank routing logic lives in the same file; we may split it later if the routing tree gets hairy.
  • koneps-kr.js: KONEPS, Korea's national procurement portal (g2b.go.kr). The complement to the g2b probe work that PR #80 set up earlier this week.
  • philgeps-ph.js: PhilGEPS, the Philippines' national government procurement portal.
  • sam-gov.js: SAM.gov, the US federal General Services Administration's opportunities portal. This is the big one: the federal opps API is well-documented but rate-limited at 10 req/s with hourly caps, so the extractor's pagination is built around that ceiling and surfaces remaining-budget in its log lines.
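To illustrate what pagination "built around that ceiling" can look like, here is a small pacing helper. It is a sketch under stated assumptions, not the sam-gov.js internals: the class name RequestBudget, its options, and the injectable now/sleep hooks are hypothetical, and the hourly cap is left as a caller-supplied number because the post does not state it.

```javascript
// Illustrative pacing helper for a serial pagination loop. All names are
// assumptions; only the 10 req/s default comes from the text above.
class RequestBudget {
  constructor({ perSecond = 10, perHour = Infinity, now = Date.now, sleep } = {}) {
    this.minGapMs = 1000 / perSecond; // minimum spacing between requests
    this.perHour = perHour;
    this.now = now;
    this.sleep = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
    this.windowStart = null;          // start of the current hourly window
    this.usedThisHour = 0;
    this.lastAt = -Infinity;          // timestamp of the previous request
  }

  get remaining() {
    return this.perHour - this.usedThisHour;
  }

  // Waits until a request may be sent; resolves to the remaining hourly budget.
  async acquire() {
    const t = this.now();
    if (this.windowStart === null || t - this.windowStart >= 3_600_000) {
      this.windowStart = t;           // open a fresh hourly window
      this.usedThisHour = 0;
    }
    if (this.remaining <= 0) throw new Error("hourly request budget exhausted");
    const wait = this.lastAt + this.minGapMs - t;
    if (wait > 0) await this.sleep(wait);
    this.lastAt = this.now();
    this.usedThisHour += 1;
    return this.remaining;
  }
}
```

A pagination loop would call `await budget.acquire()` before each page fetch and include the returned value in its log line. The helper is meant for one request at a time; concurrent callers would need a queue in front of it.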

Each ships with a scripts/test-<name>-extractor.js harness that does an end-to-end fixture parse and asserts at least one well-formed Tender object. None of them are wired into the URL_EXTRACTORS / API_EXTRACTORS dispatch tables in site-extractors/index.js yet — that step is deliberate. Before we put them into the rotation we want to compare a few hundred capture rows side-by-side against what the legacy hardcoded paths produce for the same portals (where applicable), so any field-mapping regressions show up before they hit live matches.
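The side-by-side check could be as simple as a per-field mismatch count over rows joined on a shared identifier. The sketch below assumes both capture sets are arrays of plain objects; the function name compareCaptures, the default join key sourceId, and the field names in the usage example are hypothetical.

```javascript
// Hypothetical comparison helper: counts, per field, how many candidate rows
// disagree with the legacy row that shares the same key.
function compareCaptures(legacyRows, candidateRows, { key = "sourceId", fields }) {
  const legacyByKey = new Map(legacyRows.map((row) => [row[key], row]));
  const report = { matched: 0, missingInLegacy: 0, mismatches: {} };
  for (const field of fields) report.mismatches[field] = 0;

  for (const row of candidateRows) {
    const legacy = legacyByKey.get(row[key]);
    if (!legacy) {
      report.missingInLegacy += 1; // no legacy counterpart to compare against
      continue;
    }
    report.matched += 1;
    for (const field of fields) {
      // JSON comparison is adequate for scalars and arrays; treat a missing
      // field and an explicit null as equal.
      const a = JSON.stringify(legacy[field] ?? null);
      const b = JSON.stringify(row[field] ?? null);
      if (a !== b) report.mismatches[field] += 1;
    }
  }
  return report;
}

// Usage example with illustrative field names.
const report = compareCaptures(
  [{ sourceId: "a", title: "T1", cpvCodes: ["1"] }, { sourceId: "b", title: "T2" }],
  [
    { sourceId: "a", title: "T1", cpvCodes: ["1", "2"] },
    { sourceId: "b", title: "T2" },
    { sourceId: "c", title: "T3" },
  ],
  { fields: ["title", "cpvCodes"] }
);
console.log(report);
```

A non-zero count on a field points at a mapping difference worth inspecting by hand before wiring; it does not by itself say which side is correct.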

Wiring + activation will be the next push, probably split into two batches — sam-gov + buy-nsw first (English-only, well-instrumented, smaller surface area), then koneps-kr + philgeps-ph + dgmarket in a second wave once the i18n + CPV mapping on the multilingual three is cross-checked.

Etimad and ISDB — quiet enrichment

While the new files were going in, two existing extractors got quiet tightening:

  • etimad.js (Saudi Arabia's eGov procurement portal) had its F5 TSPD cookie warm-up adjusted. The previous fixed timeout was occasionally racing the page render and producing an empty first listing fetch. The warm-up now waits for the cookie header to be present rather than for a wall-clock interval, so slow renders no longer drop the first page.
  • isdb.js (Islamic Development Bank) had its listing parser refined to handle a category cell that the bank occasionally renders as a nested <span> rather than a flat <td>. The old parser was silently skipping those rows.

Neither change is user-visible, but both reduce silent skips on the affected portals.
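The Etimad change swaps a fixed sleep for a condition wait. A generic version of that pattern is sketched below; waitFor and hasCookie are hypothetical helpers, and the cookie-name pattern is passed in by the caller because the post does not give the actual cookie name.

```javascript
// Polls `check` until it returns a truthy value or the deadline passes.
// Names and defaults are illustrative, not taken from etimad.js.
async function waitFor(check, { intervalMs = 100, timeoutMs = 15000, sleep } = {}) {
  const pause = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await check();
    if (value) return value;
    if (Date.now() >= deadline) throw new Error("timed out waiting for condition");
    await pause(intervalMs);
  }
}

// True when any Set-Cookie header value carries a cookie whose name matches
// the supplied pattern (use a non-global RegExp).
function hasCookie(setCookieHeaders, namePattern) {
  return setCookieHeaders.some((header) =>
    namePattern.test(header.split("=")[0].trim())
  );
}
```

The condition is checked before any pause, so a fast render costs no extra delay, while a slow render is bounded by timeoutMs instead of silently producing an empty first page.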

Site list cleanup — two deactivations

RTA Dubai and Qatar Rail Procurement have not produced a valid notice in three weeks despite the extractor returning 200s.

  • RTA Dubai's procurement section migrated to a new portal we haven't onboarded yet; the old URL still serves a page, but the listing block is empty.
  • Qatar Rail Procurement was folded into Qatar Energy's portal; the standalone tendering site no longer resolves.

Both have been flipped inactive in backend/scripts/seed-sites.js so they stop showing up in the operator dashboard's red-light list, and the Qatar Rail URL_FIX entry was removed from backend/prisma/fix-tender-site-urls.ts (the URL it was repointing to has been gone for weeks).

Adding the replacement portals is on the queue for next week.

Tooling

Smaller bits in the same commit, mostly for next-week ergonomics:

  • scripts/capt-kw-probe.js: listing-page probe for the CAPT Kuwait pre-rendered DOM, captured for the bilingual fixture pair that the extractor's tests run against.
  • scripts/etimad-live-probe.js: F5 TSPD cookie + paginated JSON capture for live verification of the cookie-warm-up fix above.
  • scripts/buy-nsw-full-scrape-prod.js: the AfDB-style two-phase prod push wrapper, pre-baked for buy-nsw so the wiring-day push doesn't need bespoke shell glue.
  • scripts/fix-tender-site-urls-prod.sh got the chmod +x it had been missing since it was added.

Fixtures (capt-kw-listing-{ar,en}.html, cejn-listing.json, cejn-detail-115566.json, cern-{ws-samples.txt,xhr.json}) all landed in scripts/__fixtures__/ so the smoke harnesses run offline.

Methodology: drawn from the week ending 2026-05-26 tender corpus. Tender data sourced from public procurement portals worldwide; see our methodology for the extraction pipeline.