
2026-05-22 · DataMesh Consulting

22 May — Four new Step-0 extractors (ADB, UNGM, GeBIZ, GETS NZ) and a data-quality sweep

A focused coverage day. Four new Step-0 extractors landed: Asian Development Bank (via its public SearchStax Solr index), UNGM (the UN procurement portal), a full rewrite of GeBIZ Singapore, and GETS NZ. All four include detail-page enrichment, not just listing-level coverage. Alongside the extractor work, three correctness fixes went into the data pipeline: the junk-text title filter was dropping legitimate "Selection of …" titles, the immediate-submit path was nulling out tenderType / procedureType / contact fields, and the home-feed sort by deadlineAt had a column-alias collision that nulled every deadline. The TED site consolidation script also ran in prod.

Four extractors, one day

The pattern that's working: write a Step-0 extractor as the first integration of any new portal, get listing+detail coverage live, then come back later if the portal needs deeper Stage-1+ enrichment (e.g. JS-rendered detail pages, auth-walled documents). Step-0 = HTTP-only, no Playwright, fastest path to first tenders.

Asian Development Bank — SearchStax Solr

ADB exposes its procurement notices through a public SearchStax-backed Solr index. The query interface returns JSON with full notice metadata in a single round-trip.

  • Listing: ?q=*:*&sort=publishedAt+desc&rows=100&start=N for paging. ~5,800 active notices on first scrape.
  • Detail enrichment: Solr returns most fields inline. Description, CPV-equivalent codes, contact info, and value range are all parsed from a single response object.
  • Country/sector: ADB tags each notice with member country and sector codes that we map to our canonical enum.
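
The paging scheme above can be sketched as follows. This is a minimal illustration in Python, not the extractor itself: the function names are hypothetical, and the match-all query `*:*` is an assumption about what the listing request sends. Only the `sort`, `rows`, and `start` parameters come from the listing bullet.

```python
from urllib.parse import urlencode


def solr_page_query(start: int, rows: int = 100) -> str:
    """Build the query string for one page of a Solr select request."""
    params = {
        "q": "*:*",                  # assumed match-all query
        "sort": "publishedAt desc",  # newest first, as in the listing bullet
        "rows": rows,                # page size
        "start": start,              # zero-based offset of the first document
    }
    return urlencode(params)


def page_starts(num_found: int, rows: int = 100) -> range:
    """Return the start offsets needed to cover num_found documents."""
    return range(0, num_found, rows)
```

Solr reports the total hit count on every response, so a caller can read it from the first page and then walk `page_starts(total)` to fetch the rest.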

UNGM — UN procurement portal

UNGM (UN Global Marketplace) consolidates procurement across UN agencies — UNDP, UNICEF, WFP, OCHA, etc. The public search page renders server-side HTML; per-notice detail pages have structured tables.

  • Listing: paginated HTML, parsed with cheerio. Each row gives the notice URL, agency, country, and deadline.
  • Detail enrichment: agency-specific table layout, so the detail parser has a small per-agency dispatch (UNDP and UNICEF use distinct table structures).
  • First scrape: ~3,200 active notices across 18 agencies.
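
A per-agency dispatch of this kind might look like the sketch below. It assumes the detail table has already been reduced to a label-to-value dict. Every label and function name here is hypothetical and chosen only to show the shape of the dispatch, not the actual UNDP or UNICEF layouts.

```python
from typing import Callable, Dict, Optional

Table = Dict[str, str]
Fields = Dict[str, Optional[str]]


def parse_default(table: Table) -> Fields:
    # Fallback for agencies without a dedicated parser (labels are hypothetical).
    return {"title": table.get("Title"), "deadline": table.get("Deadline")}


def parse_undp(table: Table) -> Fields:
    # Hypothetical UNDP labels.
    return {"title": table.get("Procurement title"), "deadline": table.get("Deadline")}


def parse_unicef(table: Table) -> Fields:
    # Hypothetical UNICEF labels.
    return {"title": table.get("Subject"), "deadline": table.get("Closing date")}


DETAIL_PARSERS: Dict[str, Callable[[Table], Fields]] = {
    "UNDP": parse_undp,
    "UNICEF": parse_unicef,
}


def parse_detail(agency: str, table: Table) -> Fields:
    """Pick the agency-specific parser, falling back to the default layout."""
    parser = DETAIL_PARSERS.get(agency.strip().upper(), parse_default)
    return parser(table)
```

With a lookup table, supporting another agency's layout is a one-line registration rather than another branch in the parser.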

GeBIZ — full rewrite, paginated detail

GeBIZ (Singapore Government Electronic Business) had a listing-only extractor from January that was missing the deep-link detail pages. The rewrite handles:

  • Pagination across the 30-day notices window.
  • Per-notice detail fetch (description, evaluation type, document list, contact officer).
  • Singapore-specific procurement classifications mapped to the closest CPV equivalents.

About 850 active notices, ~95% with detail enrichment after the rewrite.

GETS NZ — New Zealand Government Electronic Tenders Service

GETS exposes a paginated listing with a per-tender detail URL. Step-0 walks the listing, fetches each detail page, parses the notice metadata table.

  • ~1,400 active tenders on first scrape.
  • NZ-specific category codes (UNSPSC) preserved alongside the CPV mapping.
  • Detail pages occasionally redirect to legacy URL formats; the extractor follows redirects up to depth 3 before giving up.
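
The bounded redirect handling can be sketched like this. The `fetch` callable is a hypothetical stand-in for the HTTP client and returns a `(status, location, body)` tuple, which keeps the example runnable without network access. It is an illustration of the depth limit, not the extractor's actual code.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# (status code, Location header or None, response body)
Response = Tuple[int, Optional[str], str]

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def fetch_following_redirects(
    url: str,
    fetch: Callable[[str], Response],
    max_depth: int = 3,
) -> Optional[str]:
    """Return the page body, following at most max_depth redirects."""
    # One initial request plus up to max_depth redirected requests.
    for _ in range(max_depth + 1):
        status, location, body = fetch(url)
        if status in REDIRECT_STATUSES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return body if status == 200 else None
    # Still redirecting after max_depth hops: give up.
    return None
```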

Pipeline correctness fixes

Junk-text filter was dropping "Selection of …" titles

The junk-text filter (introduced 2026-05-09 to remove boilerplate and 404 pages) had a regex that matched on "Selection" as a standalone word — which caught all tenders titled "Selection of Consultancy Services for X" (common across multilateral portals). The filter is now tightened to match ^Select\s*\.\.\. and similar truncation patterns, leaving real titles intact. Backfill ran across the last 30 days of ingests and unhid ~440 falsely-dropped tenders.
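
To make the difference concrete, here is a small Python sketch. The old pattern is a reconstruction from the description (a standalone-word match on "Selection") and is an assumption, not the original regex. The new pattern is the one quoted above. The "similar truncation patterns" are not reproduced here.

```python
import re

# Assumed reconstruction of the over-broad pattern: "Selection" as a whole word.
OLD_JUNK = re.compile(r"\bSelection\b")

# Tightened pattern from the fix: a title that starts with a truncated "Select ...".
NEW_JUNK = re.compile(r"^Select\s*\.\.\.")


def is_junk_title(title: str, pattern: re.Pattern = NEW_JUNK) -> bool:
    """Return True when the title looks like truncated boilerplate."""
    return bool(pattern.search(title.strip()))
```

A real title such as "Selection of Consultancy Services for X" matches the old pattern but not the new one, while a truncated "Select ..." placeholder is still caught.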

Immediate-submit path was nulling rich fields

POST /v1/scrape/tenders/immediate (the path Hermes uses to submit Step-0 results synchronously) was constructing the upsert payload with only a subset of Tender fields, leaving tenderType, procedureType, contactEmail, contactPhone, and the value-range fields as null on every immediate-submit tender. The async ingest path didn't have this bug because it built the upsert from the full Hermes payload. Fixed; the next scrape cycle will repopulate these fields on the affected rows.
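
One way to guard against this class of bug is a check that compares the submitted payload with the upsert built from it. The sketch below is illustrative only: the builder functions and the `sourceId` / `title` subset are hypothetical, and the value-range fields are left out because their names are not given here.

```python
from typing import Any, Dict, List

# Rich fields named in the fix above.
RICH_FIELDS = ("tenderType", "procedureType", "contactEmail", "contactPhone")


def build_upsert_subset(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Shape of the bug: a hand-picked subset silently drops everything else.
    return {"sourceId": payload.get("sourceId"), "title": payload.get("title")}


def build_upsert_full(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Shape of the fix: build the upsert from the full payload.
    return dict(payload)


def dropped_fields(payload: Dict[str, Any], upsert: Dict[str, Any]) -> List[str]:
    """List rich fields present in the payload but null or missing in the upsert."""
    return [
        name
        for name in RICH_FIELDS
        if payload.get(name) is not None and upsert.get(name) is None
    ]
```

Running `dropped_fields` against both ingest paths in a test would flag any future divergence between them.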

t.deadline alias nulled every deadlineAt

The Matches feed sort-by-deadline query used t.deadline as a SELECT alias in a raw SQL statement. Postgres returned that as the column name in the row payload, so the Prisma client (expecting deadlineAt) wrote null. Effect: every tender in the matches feed had a null deadlineAt from the iOS app's perspective, so "Urgent" (daysUntilDeadline ≤ 5) never triggered. Fixed; the alias is now "deadlineAt" matching the model column.
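
The failure mode is easy to reproduce outside the ORM. In the sketch below, `map_row` is a hypothetical stand-in for the model mapping, which reads the key `deadlineAt`. A row keyed `deadline` therefore maps to null and the urgency rule never fires. How past deadlines should be treated is not specified above, so the check implements only the stated rule.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def map_row(row: Dict[str, Any]) -> Dict[str, Any]:
    # The model-side mapping looks up "deadlineAt"; any other key yields None.
    return {"id": row.get("id"), "deadlineAt": row.get("deadlineAt")}


def is_urgent(
    deadline_at: Optional[datetime],
    now: datetime,
    threshold_days: int = 5,
) -> bool:
    """Apply the stated rule: daysUntilDeadline <= 5. A null deadline is never urgent."""
    if deadline_at is None:
        return False
    return (deadline_at - now).days <= threshold_days
```

With a deadline three days out, a row keyed `deadline` maps to `deadlineAt = None` and is not urgent, while the same row keyed `deadlineAt` is.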

NUTS region codes from delivery addresses

OCDS-compliant feeds (TED, several national portals) embed the delivery address as a structured block with a NUTS code. The Hermes OCDS parser now extracts that code into tender.nutsCodes[], which lets the country-filter logic distinguish between, e.g., a Berlin tender (DE3) and a Bavarian one (DE2) on the same TED notice.
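
A minimal sketch of the extraction and the prefix-based filter it enables. The `region` key is an assumption about where the code sits in the delivery-address block, and both function names are hypothetical. NUTS codes themselves are a two-letter country code followed by up to three alphanumeric characters, written without a hyphen (DE3 is Berlin, DE2 is Bayern).

```python
import re
from typing import Dict, List

# Two-letter country code plus zero to three level characters (NUTS 0 to NUTS 3).
NUTS_RE = re.compile(r"^[A-Z]{2}[0-9A-Z]{0,3}$")


def extract_nuts_codes(delivery_addresses: List[Dict[str, str]]) -> List[str]:
    """Collect distinct, well-formed NUTS codes in first-seen order."""
    codes: List[str] = []
    for address in delivery_addresses:
        code = (address.get("region") or "").strip().upper()  # assumed field name
        if NUTS_RE.match(code) and code not in codes:
            codes.append(code)
    return codes


def in_region(nuts_codes: List[str], prefix: str) -> bool:
    """True when any code falls under the given NUTS prefix, e.g. 'DE3'."""
    return any(code.startswith(prefix) for code in nuts_codes)
```

Because NUTS codes are hierarchical, a prefix match is enough: a NUTS-3 code such as DE300 falls under DE3, and any DE-prefixed code falls under the country.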

TED site consolidation — ran in prod

The TED site consolidation script (added 2026-05-15) merges the historical TED site rows that had forked across canonical-URL variants. Ran in prod tonight after dry-run validation. 9 TED site rows merged into 1; ~5,700 historical tenders re-attributed to the canonical site without losing their original sourceId mapping.
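
The merge step can be sketched as below, with a dry-run mode that only counts what would move. This is an illustration rather than the script itself: the `siteId` field name and the function are hypothetical, and tenders are plain dicts standing in for database rows.

```python
from typing import Any, Dict, List, Set


def consolidate_sites(
    tenders: List[Dict[str, Any]],
    variant_site_ids: Set[int],
    canonical_site_id: int,
    dry_run: bool = True,
) -> int:
    """Re-attribute tenders from forked site rows to the canonical site.

    Returns the number of tenders affected. Only siteId changes; sourceId is
    never touched, so the original source mapping survives the merge.
    """
    moved = 0
    for tender in tenders:
        if tender["siteId"] in variant_site_ids and tender["siteId"] != canonical_site_id:
            if not dry_run:
                tender["siteId"] = canonical_site_id
            moved += 1
    return moved
```

Running once with `dry_run=True` and comparing the count against expectations before the real pass mirrors the dry-run validation described above.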

What's next

  • The new extractors will go through their first full scrape cycle overnight. Expect to see the first-day counts settle by morning as dedup catches near-duplicates.
  • MERX Canada and AusTender are queued for tomorrow.

Methodology: drawn from the week ending 2026-05-22 tender corpus. Tender data sourced from public procurement portals worldwide; see our methodology for the extraction pipeline.