2026-05-22 DataMesh Consulting
22 May — Four new Step-0 extractors (ADB, UNGM, GeBIZ, GETS NZ) and a data-quality sweep
A focused coverage day. Four new Step-0 extractors landed: Asian Development Bank (via its public SearchStax Solr index), UNGM (the UN procurement portal), a full rewrite of GeBIZ Singapore, and GETS NZ. All four include detail-page enrichment, not just listing-level coverage. Alongside the extractor work, three correctness fixes went into the data pipeline: the junk-text title filter was dropping legitimate "Selection of …" titles, the immediate-submit path was nulling out tenderType / procedureType / contact fields, and the home-feed sort by deadlineAt had a column-alias collision that nulled every deadline. The TED site consolidation script also ran in prod.
Four extractors, one day
The pattern that's working: write a Step-0 extractor as the first integration of any new portal, get listing+detail coverage live, then come back later if the portal needs deeper Stage-1+ enrichment (e.g. JS-rendered detail pages, auth-walled documents). Step-0 = HTTP-only, no Playwright, fastest path to first tenders.
Asian Development Bank — SearchStax Solr
ADB exposes its procurement notices through a public SearchStax-backed Solr index. The query interface returns JSON with full notice metadata in a single round-trip.
- Listing —
?q=*:*&sort=publishedAt+desc&rows=100&start=N
- Detail enrichment — Solr returns most fields inline.
- Country/sector — ADB tags each notice with member
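As a rough sketch of how the listing offsets might be walked, assuming the q parameter is Solr's match-all query (*:*) and that the total comes from the standard numFound field of a Solr JSON response. The base URL and both function names are hypothetical; the real SearchStax endpoint is not given here.

```typescript
// Hypothetical base URL; the real SearchStax endpoint is not stated in this post.
const ADB_SOLR_BASE = "https://example.invalid/solr/adb-notices/select";

// Builds one listing-page URL using the query-string shape quoted above.
// q=*:* is Solr's match-all query (assumed to be what the extractor sends).
function adbListingUrl(start: number, rows: number = 100): string {
  return `${ADB_SOLR_BASE}?q=*:*&sort=publishedAt+desc&rows=${rows}&start=${start}`;
}

// Returns every `start` offset needed to cover `numFound` results.
function pageOffsets(numFound: number, rows: number = 100): number[] {
  const offsets: number[] = [];
  for (let start = 0; start < numFound; start += rows) {
    offsets.push(start);
  }
  return offsets;
}
```

With 250 results and the default page size, pageOffsets yields three offsets (0, 100, 200), each of which maps to one request.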
UNGM — UN procurement portal
UNGM (UN Global Marketplace) consolidates procurement across UN agencies — UNDP, UNICEF, WFP, OCHA, etc. The public search page renders server-side HTML; per-notice detail pages have structured tables.
- Listing — paginated HTML, parsed with cheerio. Each
- Detail enrichment — agency-specific table layout, so
- First scrape — ~3,200 active notices across 18
GeBIZ — full rewrite, paginated detail
GeBIZ (Singapore Government Electronic Business) had a listing-only extractor from January that was missing the deep-link detail pages. The rewrite handles:
- Pagination across the 30-day notices window.
- Per-notice detail fetch (description, evaluation type,
- Singapore-specific procurement classifications mapped to
About 850 active notices, ~95% with detail enrichment after the rewrite.
GETS NZ — New Zealand Government Electronic Tenders Service
GETS exposes a paginated listing with a per-tender detail URL. Step-0 walks the listing, fetches each detail page, parses the notice metadata table.
- ~1,400 active tenders on first scrape.
- NZ-specific category codes (UNSPSC) preserved alongside
- Detail pages occasionally redirect to legacy URL formats;
Pipeline correctness fixes
Junk-text filter was dropping "Selection of …" titles
The junk-text filter (introduced 2026-05-09 to remove
boilerplate and 404 pages) had a regex that matched on
"Selection" as a standalone word — which caught all
tenders titled "Selection of Consultancy Services for X"
(common across multilateral portals). The filter is now
tightened to match ^Select\s*\.\.\. and similar truncation
patterns, leaving real titles intact. Backfill ran across
the last 30 days of ingests and unhid ~440 falsely-dropped
tenders.
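To make the difference concrete, here is a small sketch comparing the two behaviours. The tightened pattern is the one quoted above; the old pattern is only a guess at what "Selection as a standalone word" looked like, and isJunkTitle is a hypothetical helper name, not the pipeline's actual function.

```typescript
// Assumed reconstruction of the old behaviour: "Selection" as a standalone word.
const OLD_JUNK_PATTERN = /\bSelection\b/;

// Tightened pattern from the fix: only truncated "Select..." stubs match.
const NEW_JUNK_PATTERN = /^Select\s*\.\.\./;

// Returns true when a title should be hidden as junk text.
function isJunkTitle(title: string, pattern: RegExp = NEW_JUNK_PATTERN): boolean {
  return pattern.test(title.trim());
}

const realTitle = "Selection of Consultancy Services for X";
const droppedBefore = isJunkTitle(realTitle, OLD_JUNK_PATTERN); // old pattern flags a real title
const droppedNow = isJunkTitle(realTitle);                      // tightened pattern leaves it alone
const stubDropped = isJunkTitle("Select ...");                  // truncation stub is still filtered
```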
Immediate-submit path was nulling rich fields
POST /v1/scrape/tenders/immediate (the path Hermes uses
to submit Step-0 results synchronously) was constructing
the upsert payload with only a subset of Tender fields,
leaving tenderType, procedureType, contactEmail,
contactPhone, and the value-range fields as null on
every immediate-submit tender. The async ingest path didn't
have this bug because it built the upsert from the full
Hermes payload. Fixed; the next scrape cycle will repopulate
these fields on the affected rows.
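One way to keep the two ingest paths from drifting again is a single mapper that both call. The sketch below illustrates that idea under assumptions: the interfaces are simplified, valueMin / valueMax are hypothetical names for the value-range fields, and toTenderUpsert is not the actual function in the codebase.

```typescript
// Simplified shape of what Hermes submits (hypothetical; only fields named in this post).
interface HermesTender {
  sourceId: string;
  title: string;
  tenderType?: string;
  procedureType?: string;
  contactEmail?: string;
  contactPhone?: string;
  valueMin?: number; // hypothetical name for a value-range field
  valueMax?: number; // hypothetical name for a value-range field
}

// Upsert payload: every rich field is present, null only when the source lacks it.
interface TenderUpsert {
  sourceId: string;
  title: string;
  tenderType: string | null;
  procedureType: string | null;
  contactEmail: string | null;
  contactPhone: string | null;
  valueMin: number | null;
  valueMax: number | null;
}

// Shared by the immediate and async paths so neither can silently drop a field.
function toTenderUpsert(p: HermesTender): TenderUpsert {
  return {
    sourceId: p.sourceId,
    title: p.title,
    tenderType: p.tenderType ?? null,
    procedureType: p.procedureType ?? null,
    contactEmail: p.contactEmail ?? null,
    contactPhone: p.contactPhone ?? null,
    valueMin: p.valueMin ?? null,
    valueMax: p.valueMax ?? null,
  };
}
```

Because TenderUpsert declares every field as required, the compiler flags any mapper that forgets one, which is the class of bug described above.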
t.deadline alias nulled every deadlineAt
The Matches feed sort-by-deadline query used t.deadline
as a SELECT alias in a raw SQL statement. Postgres
returned that as the column name in the row payload, so
the Prisma client (expecting deadlineAt) wrote null.
Effect: every tender in the matches feed had a null
deadlineAt from the iOS app's perspective, so "Urgent"
(daysUntilDeadline ≤ 5) never triggered. Fixed; the alias
is now "deadlineAt" matching the model column.
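The failure mode is easy to reproduce in miniature. In the sketch below, the SQL text, the "Tender" table name, readDeadline, and isUrgent are all illustrative assumptions rather than the production code; the one firm point is that Postgres folds unquoted identifiers to lowercase, so a camelCase alias has to be double-quoted.

```typescript
// Illustrative query only; table and column names are assumptions.
// The alias is double-quoted because Postgres lowercases unquoted identifiers.
const MATCHES_BY_DEADLINE_SQL =
  'SELECT t.id, t."deadlineAt" AS "deadlineAt" FROM "Tender" t ORDER BY t."deadlineAt" ASC';

type RawRow = Record<string, unknown>;

// Stand-in for the row-to-model mapping: only the exact key "deadlineAt" is read.
function readDeadline(row: RawRow): Date | null {
  const value = row["deadlineAt"];
  return value instanceof Date ? value : null;
}

// "Urgent" rule from the feed: daysUntilDeadline <= 5.
// Treating past deadlines as not urgent is an assumption of this sketch.
function isUrgent(deadlineAt: Date | null, now: Date): boolean {
  if (deadlineAt === null) return false;
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysUntilDeadline = Math.ceil((deadlineAt.getTime() - now.getTime()) / msPerDay);
  return daysUntilDeadline >= 0 && daysUntilDeadline <= 5;
}

const now = new Date("2026-05-22T00:00:00Z");
const due = new Date("2026-05-25T00:00:00Z");
const buggy = isUrgent(readDeadline({ deadline: due }), now);   // wrong key, deadline reads as null
const fixed = isUrgent(readDeadline({ deadlineAt: due }), now); // key matches the model column
```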
NUTS region codes from delivery addresses
OCDS-compliant feeds (TED, several national portals) embed
the delivery address as a structured block with a NUTS
code. The Hermes OCDS parser now extracts that code into
tender.nutsCodes[], which lets the country-filter logic
distinguish between, e.g., a Berlin tender (DE3) and a
Bavarian one (DE2) on the same TED notice.
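A minimal sketch of that extraction, assuming the NUTS code sits in a region field on each delivery address, either on the tender or on its items. The field layout and the extractNutsCodes name are assumptions for illustration; real feeds may place the code elsewhere.

```typescript
// Assumed, simplified slice of an OCDS-style tender block.
interface OcdsAddress { region?: string }
interface OcdsItem { deliveryAddresses?: OcdsAddress[] }
interface OcdsTender { deliveryAddresses?: OcdsAddress[]; items?: OcdsItem[] }

// NUTS codes: two-letter country code followed by up to three alphanumerics.
const NUTS_CODE = /^[A-Z]{2}[A-Z0-9]{0,3}$/;

// Collects distinct, well-formed NUTS codes in first-seen order.
function extractNutsCodes(tender: OcdsTender): string[] {
  const addresses: OcdsAddress[] = [];
  for (const a of tender.deliveryAddresses ?? []) addresses.push(a);
  for (const item of tender.items ?? []) {
    for (const a of item.deliveryAddresses ?? []) addresses.push(a);
  }

  const codes: string[] = [];
  for (const address of addresses) {
    const region = (address.region ?? "").trim().toUpperCase();
    if (NUTS_CODE.test(region) && codes.indexOf(region) < 0) {
      codes.push(region);
    }
  }
  return codes;
}

// One notice with deliveries in Berlin (DE3) and Bavaria (DE2).
const nutsCodes = extractNutsCodes({
  items: [
    { deliveryAddresses: [{ region: "DE3" }] },
    { deliveryAddresses: [{ region: "DE2" }, { region: "DE3" }] },
  ],
});
```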
TED site consolidation — ran in prod
The TED site consolidation script (added 2026-05-15) merges the historical TED site rows that had forked across canonical-URL variants. Ran in prod tonight after dry-run validation. 9 TED site rows merged into 1; ~5,700 historical tenders re-attributed to the canonical site without losing their original sourceId mapping.
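The re-attribution step can be pictured roughly as follows. This is a sketch under assumptions, not the consolidation script itself: the row shape, the siteId field, and the reattribute name are hypothetical, and the only property it aims to show is that sourceId is carried over untouched.

```typescript
// Hypothetical tender row; only the fields relevant to re-attribution.
interface TenderRow {
  id: number;
  siteId: number;
  sourceId: string;
}

// Points tenders from duplicate site rows at the canonical site.
// sourceId is copied unchanged, so the original source mapping survives.
function reattribute(
  tenders: TenderRow[],
  duplicateSiteIds: number[],
  canonicalSiteId: number,
): TenderRow[] {
  return tenders.map((t) =>
    duplicateSiteIds.indexOf(t.siteId) >= 0 ? { ...t, siteId: canonicalSiteId } : t,
  );
}

const merged = reattribute(
  [
    { id: 1, siteId: 7, sourceId: "ted-001" },
    { id: 2, siteId: 9, sourceId: "ted-002" },
    { id: 3, siteId: 4, sourceId: "other-003" },
  ],
  [7, 9],
  1,
);
```

A dry run can use the same function and diff its output against the input before anything is written.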
What's next
- The new extractors will go through their first full
- MERX Canada and AusTender are queued for tomorrow.