2026-05-26 DataMesh Consulting
26 May — CEJN Phase-2 detail enrichment + five more extractors land in WIP staging
CEJN (Catalonia's contracting platform) gained opt-in detail enrichment today — `enrichDetails:true` runs a per-row fetch against `/api/cadocuments/GetTenderById` to pick up description text, estimated budget, pipe-delimited CPVs, procurement-officer email/phone, and eight category/procedure flags that don't appear in the listing JSON. Concurrency 4, circuit-broken after 6 consecutive failures so a bad afternoon at the source doesn't take down the whole ingest. Alongside that, five new extractors landed in the repo but haven't been wired into the URL/API dispatch tables yet — buy-nsw (NSW state procurement), dgmarket (development-bank multi-source aggregator), koneps-kr (Korea g2b), philgeps-ph (Philippines), and sam-gov (US federal GSA). Each ships with a smoke-test harness; the wiring step is intentionally deferred so we can verify capture quality side-by-side against the legacy paths before they go live. RTA Dubai and Qatar Rail Procurement deactivated in the site list — neither has produced a valid notice in three weeks.
CEJN Phase-2 enrichment — opt-in, circuit-broken
CEJN, the Plataforma de Serveis de Contractació Pública del
sector públic de Catalunya, has been shipping at Phase-1 since
it landed: listing-JSON only, with title / buyer /
publication date / deadline but no description, no value, no
CPVs. The listing endpoint returns roughly 18 KB per page; the
per-row detail at /api/cadocuments/GetTenderById?id=… returns
~6 KB of structured fields we were leaving on the floor.
Today's change adds opt-in enrichment behind an
enrichDetails:true extractor option. With it on, after the
listing parse completes we hit the detail endpoint per row with
concurrency 4 and pull:
- notes → description
- estimatedBudget → value (and currency from currencyCode)
- pipe-delimited cpvs → cpvCodes[], de-duplicated and …
- procurementOfficer.{email,phone,name} → contactEmail,
  contactPhone, contactName
- Buyer-org metadata: department, body type, place of contract
- Eight boolean procurement flags
  (subjectToHarmonisedRegulation, electronicProcedure,
  lotsAllowed, variantsAllowed, frameworkAgreement,
  dynamicPurchasingSystem, eAuction, reservedForSME)
  → extraData.flags
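As a rough illustration of the mapping listed above, here is a minimal sketch. The source field names (notes, estimatedBudget, currencyCode, cpvs, procurementOfficer, the eight flags) and the target names come from the list; the function name mapCejnDetail, the null defaults, and the assumption that the flags sit at the top level of the detail JSON are hypothetical. Buyer-org metadata is left out because its field names are not given here.

```javascript
// Hypothetical mapper, not the extractor's actual code.
// Assumption: the eight flags are top-level booleans in the detail payload.
const FLAG_NAMES = [
  'subjectToHarmonisedRegulation', 'electronicProcedure', 'lotsAllowed',
  'variantsAllowed', 'frameworkAgreement', 'dynamicPurchasingSystem',
  'eAuction', 'reservedForSME',
];

function mapCejnDetail(detail) {
  const officer = detail.procurementOfficer ?? {};

  // "a|b| a |" -> ["a", "b"]: split on pipes, trim, drop empties, de-duplicate.
  const cpvCodes = [...new Set(
    String(detail.cpvs ?? '')
      .split('|')
      .map((code) => code.trim())
      .filter(Boolean)
  )];

  // Copy only flags that are real booleans; anything else is ignored.
  const flags = {};
  for (const name of FLAG_NAMES) {
    if (typeof detail[name] === 'boolean') flags[name] = detail[name];
  }

  return {
    description: detail.notes ?? null,
    value: detail.estimatedBudget ?? null,
    currency: detail.currencyCode ?? null,
    cpvCodes,
    contactEmail: officer.email ?? null,
    contactPhone: officer.phone ?? null,
    contactName: officer.name ?? null,
    extraData: { flags },
  };
}
```

Defaulting missing fields to null rather than throwing keeps a sparse detail record from discarding the fields that are present.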
Error-isolated per row: a 5xx or a missing-record on one tender doesn't fail the listing. A circuit breaker trips after six consecutive failures and shuts enrichment down for the rest of the run — the listing-only payload still ships, so we don't lose the day's notices because the detail API is having an afternoon.
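The behaviour described above (bounded concurrency, per-row error isolation, a breaker on consecutive failures) could be sketched roughly as follows. This is an illustration under assumptions, not the extractor's code: enrichRows, the caller-supplied fetchDetail, and the option names are hypothetical. Only the concurrency of 4 and the six-failure threshold come from this post.

```javascript
// Sketch only. fetchDetail(row) is a hypothetical async function that resolves
// to an object of extra fields for that row, or rejects on a 5xx / missing record.
async function enrichRows(rows, fetchDetail, { concurrency = 4, maxConsecutiveFailures = 6 } = {}) {
  // Start from copies of the listing rows so the listing-only payload always survives.
  const enriched = rows.map((row) => ({ ...row }));
  let next = 0;                 // index of the next row to claim
  let consecutiveFailures = 0;  // reset on every success
  let tripped = false;          // once true, no worker claims another row

  async function worker() {
    while (!tripped && next < rows.length) {
      const i = next++; // no await between the check and the claim, so no double-claim
      try {
        const detail = await fetchDetail(rows[i]);
        Object.assign(enriched[i], detail);
        consecutiveFailures = 0;
      } catch (err) {
        // Per-row isolation: the row keeps its listing fields, the run continues.
        consecutiveFailures += 1;
        if (consecutiveFailures >= maxConsecutiveFailures) tripped = true;
      }
    }
  }

  // Requests already in flight when the breaker trips are allowed to finish.
  const workers = Array.from({ length: Math.min(concurrency, rows.length) }, () => worker());
  await Promise.all(workers);
  return { rows: enriched, tripped };
}
```

With several workers running, "consecutive" here means consecutive in completion order, which is one reasonable reading of the rule; a stricter per-worker count would trip later.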
Off by default for the cron path right now. We'll enable it on the next prod-push cycle once a full-listing smoke run confirms the field mapping holds across the long tail of notice types.
Five new extractors — built, smoke-tested, not yet wired
The branch had been carrying five new extractors as uncommitted work for a few days while the CERN / Oman / CAPT trio got the attention. They went in today:
- buy-nsw.js: Buy NSW, the New South Wales state procurement
  portal.
- dgmarket.js: DgMarket, a development-bank multi-source
  aggregator.
- koneps-kr.js: KONEPS, Korea's national procurement portal
  (g2b.go.kr). The complement to the g2b probe work that PR
  #80 set up earlier this week.
- philgeps-ph.js: PhilGEPS, the Philippines federal
  procurement portal.
- sam-gov.js: SAM.gov, the US federal General Services
  Administration (GSA) portal.
Each ships with a scripts/test-<name>-extractor.js harness
that does an end-to-end fixture parse and asserts at least one
well-formed Tender object. None of them are wired into the
URL_EXTRACTORS / API_EXTRACTORS dispatch tables in
site-extractors/index.js yet — that step is deliberate. Before
we put them into the rotation we want to compare a few hundred
capture rows side-by-side against what the legacy hardcoded
paths produce for the same portals (where applicable), so any
field-mapping regressions show up before they hit live matches.
Wiring + activation will be the next push, probably split into
two batches — sam-gov + buy-nsw first (English-only,
well-instrumented, smaller surface area), then koneps-kr +
philgeps-ph + dgmarket in a second wave once the i18n + CPV
mapping on the multilingual three is cross-checked.
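The side-by-side check described above might be sketched like this. It is a hypothetical helper, not something in the repo: diffCaptures, the join key sourceId, and the compared field names are all assumptions chosen for illustration.

```javascript
// Hypothetical comparison helper: joins legacy and new capture rows on a key
// and reports rows missing from the new extractor plus per-field mismatches.
function diffCaptures(legacyRows, newRows, { key = 'sourceId', fields = ['title', 'buyer', 'deadline'] } = {}) {
  const newByKey = new Map(newRows.map((row) => [row[key], row]));
  const report = { matched: 0, missingInNew: [], mismatches: [] };

  for (const legacy of legacyRows) {
    const candidate = newByKey.get(legacy[key]);
    if (!candidate) {
      report.missingInNew.push(legacy[key]);
      continue;
    }
    report.matched += 1;
    for (const field of fields) {
      // Compare as trimmed strings so stray whitespace or 12 vs "12" is not flagged.
      const legacyValue = String(legacy[field] ?? '').trim();
      const newValue = String(candidate[field] ?? '').trim();
      if (legacyValue !== newValue) {
        report.mismatches.push({ key: legacy[key], field, legacyValue, newValue });
      }
    }
  }
  return report;
}
```

Run over a few hundred rows per portal, a report like this makes a field-mapping regression show up as a cluster of mismatches on one field rather than as a vague drop in match quality.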
Etimad and ISDB — quiet enrichment
While the new files were going in, two existing extractors got quiet tightening:
- etimad.js (Saudi Arabia's eGov procurement portal) had …
- isdb.js (Islamic Development Bank) had its listing parser …
  <span> rather than a flat <td>. The
  old parser was silently skipping those rows.
Neither change is user-visible, but both reduce silent skips on the affected portals.
Site list cleanup — two deactivations
RTA Dubai and Qatar Rail Procurement have not produced a valid notice in three weeks despite the extractor returning 200s.
- RTA Dubai's procurement section migrated to a new portal …
- Qatar Rail Procurement was folded into Qatar Energy's …
Both have been flipped inactive in
backend/scripts/seed-sites.js so they stop showing up in the
operator dashboard's red-light list, and the Qatar Rail
URL_FIX entry was removed from backend/prisma/fix-tender-site-urls.ts
(the URL it was repointing to has been gone for weeks).
Adding the replacement portals is on the queue for next week.
Tooling
Smaller bits in the same commit, mostly for next-week ergonomics:
- scripts/capt-kw-probe.js: listing-page probe for the CAPT …
- scripts/etimad-live-probe.js: F5 TSPD cookie + paginated …
- scripts/buy-nsw-full-scrape-prod.js: the AfDB-style …
  buy-nsw so the
  wiring-day push doesn't need bespoke shell glue.
- scripts/fix-tender-site-urls-prod.sh got the chmod +x …
Fixtures (capt-kw-listing-{ar,en}.html, cejn-listing.json,
cejn-detail-115566.json, cern-{ws-samples.txt,xhr.json}) all
landed in scripts/__fixtures__/ so the smoke harnesses run
offline.