
[WIP] Attempt Wikidata POIs#580

Draft
migurski wants to merge 11 commits into migurski/continue-overture from migurski/attempt-wikidata-pois

Conversation

@migurski
Collaborator

@migurski migurski commented Mar 13, 2026

migurski and others added 10 commits March 12, 2026 15:07
…verture data

Summarize your findings up to this point in WIKIDATA.md and commit it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndings

Summarize this exploration into WIKIDATA.md and commit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…istic wins

Compare and contrast your two proposed disambiguation approaches; Try that combined approach

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…untime lookup chain

Update WIKIDATA.md with these findings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Propose a script that will generate a fresh copy of wikidata-website-qid.csv.gz when it is run on a schedule; yes, and if it does update WIKIDATA.md and commit both

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hold on, let's have scripts live here in tiles/ and resulting data live under data/sources/ with others

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add WebsiteQidDb: domain→QID lookup parsed from a gzipped CSV
(wikidata-website-qid-2026-03.csv.gz). Overture places features have no
native wikidata field, but often carry website URLs. This enables a
two-hop lookup: websites[0] → domain → Q-ID → QRank score → min_zoom.

- WebsiteQidDb.java: HashMap<String,Long> backed, fromCsv uses
  lastIndexOf(',') to handle domain values containing commas; getQid()
  strips protocol/www/path before lookup
- Basemap.java: download + load websiteQidDb after qrankDb; pass to Pois
- Pois.java: add websiteQidDb field; fallback website→QID lookup in
  processOverture when wikidata tag is absent; add zoo/college/museum
  qrankGrading entries; recalibrate aerodrome/university thresholds so
  Oakland Airport→zoom 11, Oakland Zoo→zoom 12, UCB→zoom 13, OMCA→zoom 14
- Tests: WebsiteQidDbTest (9 tests), 4 new PoisOvertureTest cases with
  real Overture UUIDs (f66024a2 airport, a74a40ae zoo, 67e4f788 UCB,
  474b271e OMCA), LayerTest fixture expanded with all four Q-IDs
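
The domain lookup described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual WebsiteQidDb.java: the class name `WebsiteQidSketch`, the `fromCsvLines` and `normalize` helpers, and the assumed row layout (`domain,Q<digits>`) are all hypothetical. Only the HashMap<String,Long> backing, the lastIndexOf(',') split, and the protocol/www/path stripping come from the commit message.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a domain -> Q-ID lookup; not the PR's WebsiteQidDb.java. */
final class WebsiteQidSketch {
  private final Map<String, Long> qidByDomain = new HashMap<>();

  /** Parses rows assumed to look like "domain,Q123", splitting on the LAST comma. */
  static WebsiteQidSketch fromCsvLines(Iterable<String> lines) {
    WebsiteQidSketch db = new WebsiteQidSketch();
    for (String line : lines) {
      // lastIndexOf keeps any commas inside the domain value on the left side
      int comma = line.lastIndexOf(',');
      if (comma < 0) continue;
      String domain = line.substring(0, comma).trim().toLowerCase(Locale.ROOT);
      String qid = line.substring(comma + 1).trim();
      // Rows without a Q-ID in the last column (e.g. a header) are skipped
      if (domain.isEmpty() || !qid.matches("Q\\d{1,18}")) continue;
      db.qidByDomain.put(domain, Long.parseLong(qid.substring(1)));
    }
    return db;
  }

  /** Strips protocol, a leading "www." and any path, query or fragment. */
  static String normalize(String url) {
    String s = url.trim().toLowerCase(Locale.ROOT);
    int scheme = s.indexOf("://");
    if (scheme >= 0) s = s.substring(scheme + 3);
    if (s.startsWith("www.")) s = s.substring(4);
    int end = s.length();
    for (char c : new char[] {'/', '?', '#'}) {
      int i = s.indexOf(c);
      if (i >= 0 && i < end) end = i;
    }
    return s.substring(0, end);
  }

  /** Returns the numeric Q-ID for a website URL, or null when the domain is unknown. */
  Long getQid(String url) {
    if (url == null) return null;
    return qidByDomain.get(normalize(url));
  }
}
```

In the two-hop chain, the returned Q-ID would then be handed to the existing QRank lookup to pick a min_zoom; that step is left out here because its interface is not shown in this conversation.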

Prompt: "Implement the following plan: WebsiteQidDb + QRank-based
Overture POI Zoom [...] when you add unit tests concerning Overture
features, always include their full UUID so we can trace them back to
the original dataset [...] just use CLI duckdb, we already have it"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spotless reformatted the markdown table during make lint; committed
separately since it was missed from the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ence

Two guards prevent brand websites from inflating POI zoom levels:

1. Category allowlist: only apply website→QID when basic_category is an
   institution-level feature (airport, zoo, museum, college_university,
   etc.). Excludes air_transport_facility_service, travel_service,
   transportation_location, etc. where the website resolves to a brand
   entity (e.g. jetblue.com → Q161086 JetBlue Airways) rather than the
   specific place.

2. Confidence threshold (0.9): low-confidence features are often brand
   counters or services miscategorised as the institution. Real airports,
   zoos, etc. cluster at 0.90+; junk like JetBlue-as-airport appears at
   0.32.
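
The two guards above could be combined into a single predicate along these lines. This is a sketch under stated assumptions: the class and method names are hypothetical, the allowlist contains only the four categories named in the commit message (the "etc." is deliberately not filled in), and treating the 0.9 threshold as inclusive is a guess based on "cluster at 0.90+".

```java
import java.util.Set;

/** Hypothetical sketch of the website -> QID eligibility check; not the PR's Pois.java. */
final class WebsiteQidEligibilitySketch {
  // Institution-level categories named in the commit message; the real list may be longer.
  static final Set<String> ELIGIBLE_CATEGORIES =
      Set.of("airport", "zoo", "museum", "college_university");

  // Assumed inclusive: features at exactly 0.9 pass.
  static final double MIN_CONFIDENCE = 0.9;

  /** True only when both the category guard and the confidence guard pass. */
  static boolean eligible(String basicCategory, double confidence) {
    return basicCategory != null
        && ELIGIBLE_CATEGORIES.contains(basicCategory)
        && confidence >= MIN_CONFIDENCE;
  }
}
```

With this shape, a JetBlue feature categorised as air_transport_facility_service fails the category guard, and one miscategorised as airport at confidence 0.32 fails the confidence guard, so neither inherits the brand's Q-ID.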

Tests: websiteQid_ineligibleCategory_noEarlyZoom (category guard) and
websiteQid_lowConfidence_noEarlyZoom (confidence guard), both using real
Overture UUIDs e67dea74 / 8b6a937e for JetBlue features at OAK.

Prompt: "Do option B [...] Comment about why they are eligible in the
code [...] and test [...] I still see JetBlue appearing at z12 or even
z11, why? [...] good yes and test"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop features below confidence 0.65 (junk tier: ~127k features dominated
by real estate listings, beauty salons, ATMs from uncertain sources).
Within the remaining features, use confidence to break sort key ties so
higher-confidence POIs win label collision resolution at the same zoom.

Sort key: minZoom * 1000 - (int)(confidence * 100), so confidence=0.99
subtracts 99 points and confidence=0.65 subtracts 65; at the same zoom
the higher-confidence feature gets the lower key (higher priority).
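
As a rough illustration of the cutoff and sort key described above, here is a minimal sketch. The class and method names are hypothetical, and treating the 0.65 cutoff as inclusive is an assumption drawn from "drop features below confidence 0.65"; only the formula itself comes from the commit message.

```java
/** Hypothetical sketch of the confidence cutoff and sort key; not the PR's Pois.java. */
final class PoiConfidenceSketch {
  // Features strictly below this confidence are dropped (assumed inclusive at 0.65).
  static final double MIN_CONFIDENCE = 0.65;

  static boolean keep(double confidence) {
    return confidence >= MIN_CONFIDENCE;
  }

  /**
   * Zoom dominates the key; confidence only breaks ties within one zoom level.
   * The (int) cast truncates toward zero, as in the formula from the commit message.
   */
  static int sortKey(int minZoom, double confidence) {
    return minZoom * 1000 - (int) (confidence * 100);
  }
}
```

Because confidence contributes at most 100 points and each zoom level is worth 1000, a lower-zoom feature always sorts ahead of a higher-zoom one regardless of confidence. For example, `sortKey(12, 0.75)` is 11925 and `sortKey(12, 0.5)` is 11950, so the 0.75 feature wins a collision at zoom 12.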

Tests updated: websiteQid_ineligibleCategory_dropped and
websiteQid_lowConfidence_dropped now correctly expect zero features.
kind_nationalPark_fromBasicCategory switched to Pinnacles National Park
(4d619bc0, confidence=0.917) since the previous Alcatraz fixture
(814b8a78, confidence=0.639) falls below the new cutoff.

Prompt: "Let's bring more Overture confidence into POI rendering: make
higher-confidence POIs higher rendering priority, and simply omit ones
below 0.65 (junk tier)"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@migurski migurski self-assigned this Mar 13, 2026
@migurski migurski changed the title from "Attempt Wikidata POIs" to "[WIP] Attempt Wikidata POIs" Mar 13, 2026
… conflicts

Kept HEAD (full WebsiteQidDb machinery) in Pois.java; the only conflict
was a trivial comment difference on the QRank block. In PoisTest.java,
kept HEAD's full test suite (both JetBlue drop tests + all four
website→QID tests) over the cherry-pick's slimmed-down version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
