Harden Google Maps place extraction by michaelmwu · Pull Request #37 · 508-dev/gmaps-scraper

michaelmwu · 2026-05-10T08:32:06Z

Summary

reject first-person/review-style Google Maps descriptions, Google Maps UI action clusters, sponsored labels, and tab-strip text
keep description fallback selection from accepting Overview/Reviews/About tab containers when direct selectors are absent
preserve selected search-card panel lines so fallback detail parsing can still recover name/category/address/plus-code data
ignore owner/user payloads when extracting CIDs while still allowing legitimate numeric CID arrays
add regression coverage for description cleaning, panel-text fallback parsing, tab exclusion, photo URL validation, and CID parsing

Validation

uv run python3 -m unittest tests.test_place_scraper.PlaceScraperTests.test_clean_description_text_rejects_first_person_review_prose
uv run python3 -m unittest discover -s tests -p 'test_place_scraper.py'
PR checks are currently green on GitHub

coderabbitai · 2026-05-10T08:32:17Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Refined description selection and cleaning: expanded review-prose markers and UI-action-cluster detection, rewrote place-page and search-card extractors to select descriptionValue(), tightened photo selectors/URL normalization, simplified Python extraction/fallbacks, and updated tests and snapshots.

Changes

Place extraction & cleaning

Layer / File(s)	Summary
Description markers and UI-cluster detection `src/gmaps_scraper/place_scraper.py`	Expanded `_DESCRIPTION_REVIEW_PROSE_MARKERS`, added `_UI_ACTION_CLUSTER_RE` and `_looks_like_ui_action_cluster()`, and updated `_clean_description_text` and `_looks_like_search_results_label` to reject UI clusters and sponsored labels.
Place-page DOM extractor updates `src/gmaps_scraper/place_scraper.py`, `tests/test_place_scraper.py`	Place-page JS extractor computes element top/bottom boundaries, picks a description via `descriptionValue()`, adds elementTop/elementBottom helpers, expands main photo selectors, and updated JS extractor test expectations.
Search-result card description selection `src/gmaps_scraper/place_scraper.py`, `tests/test_place_scraper.py`	Replaced `findDescriptionLine` with `cardDescription` using an exclusion set and row filters (skip star/tel/website/directions/open/closed patterns); `search_result_description` is sourced from `cardDescription(...)` and tests updated; snapshot builder now preserves `panel_text`.
_build_place_details fallbacks & tests `src/gmaps_scraper/place_scraper.py`, `tests/test_place_scraper.py`	`combined_lines`/fallbacks now derive from `panel_text` only; `_extract_description` simplified to accept snapshot and call `_clean_description_text(snapshot.get("description"))`; multiple tests/snapshots updated (description injection, coords, limited_view, admission_prices, sponsored name rejection).
Photo selectors and URL normalization `src/gmaps_scraper/place_scraper.py`, `tests/test_place_scraper.py`	Expanded `mainPhotoUrl` selector lists; `_normalize_photo_url` rejects additional `lh*.googleusercontent.com/ggpht.com` hosts and requires `w{digits}-h{digits}` size patterns; tests accept a `gps-cs-s` URL variant and reject ad-thumbnail/unshaped-host style photo URLs.

Parser owner-payload guard

Layer / File(s)	Summary
_find_cid_in_value owner guard and test `src/gmaps_scraper/parser.py`, `tests/test_parser.py`	Early guard to treat list-shaped `ListOwner`-like payloads (with `photo_url` or `profile_id`) as non-CID values; unit test added to ensure no CID is extracted from owner metadata.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

508-dev/gmaps-scraper#28: Touches _normalize_photo_url and related photo-URL validation tests.
508-dev/gmaps-scraper#10: Modifies place-page extraction heuristics and description filtering patterns.
508-dev/gmaps-scraper#12: Related saved-list owner parsing changes; relevant to parser owner-payload handling.


🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Harden Google Maps place extraction' is overly broad and generic, referring to hardening/robustness improvements but not conveying the main change (rejecting first-person review descriptions).	Consider a more specific title like 'Reject first-person review descriptions in place extraction' that clearly indicates the primary change.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 139-143: The matching fails for curly apostrophes (e.g., “we’ve
ever had”) because the review-prose markers list contains straight-apostrophe
strings like "we've ever had"; before running the marker checks (the code that
iterates over strings such as "we've ever had", "we have ever had", "we've ever
had", "overrated", etc.) normalize the review text (or both the text and
markers) by replacing curly/apostrophe variants (e.g., \u2019, \u2018) with the
ASCII apostrophe (') and any similar typographic quotes so that "we’ve ever had"
matches "we've ever had".


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8399c54-fd11-40c5-9e46-9693ab22485f

📥 Commits

Reviewing files that changed from the base of the PR and between a7900da and 09ed5db.

📒 Files selected for processing (2)

src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py

coderabbitai · 2026-05-10T08:33:36Z

+    "great experience overall",
    "highly recommended",
    "overrated",
+    "we have ever had",
+    "we've ever had",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize apostrophes before matching review-prose markers.

At Line 143, "we've ever had" won’t match curly-apostrophe text (we’ve ever had), so first-person review prose can still pass through.

Suggested fix

 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09ed5db706

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-10T08:34:02Z

    "highly recommended",
    "overrated",
+    "we have ever had",
+    "we've ever had",


Normalize curly apostrophes for review-prose markers

When the leaked review uses a typographic apostrophe, e.g. Best ramen we’ve ever had ..., this new marker will not match because _looks_like_description_review_prose() only casefold()s the text and does not replace ’ with ' like _looks_like_review_response_text() does. Since had is not one of the first-person experience verbs, a 12+ word review with we’ve ever had and no other marker can still be accepted as a place description, leaving the class of leak this change is meant to reject.


Copilot

Pull request overview

This pull request tightens description cleaning in place_scraper to prevent first-person / review-style prose from being treated as a place description, and adds a regression test for a ramen-class review snippet that previously leaked through.

Changes:

Extend the review-prose marker list to reject additional review-like phrases (e.g., “great experience overall”, “we’ve ever had”).
Add a unit test asserting that a ramen-class review-style description is rejected by _clean_description_text().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`src/gmaps_scraper/place_scraper.py`	Adds new review-prose substring markers used to reject leaked review-style descriptions.
`tests/test_place_scraper.py`	Adds a regression test ensuring a “Best ramen we’ve ever had…”-style snippet is rejected.


    "highly recommended",
    "overrated",
+    "we have ever had",
+    "we've ever had",


+                "(the noodles at least). Everything tastes better when you do it yourself! "
+                "Date day for a Saturday morning class. Great experience overall."
+            )
+        )


coderabbitai

♻️ Duplicate comments (1)

src/gmaps_scraper/place_scraper.py (1)

3748-3752: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize curly apostrophes before review-prose marker matching

At Line 3751, matching is case-folded but not apostrophe-normalized, so curly-apostrophe text can bypass straight-apostrophe markers (for example, we’ve vs we've).

Suggested fix

 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 3748 - 3752, The function
_looks_like_description_review_prose currently casefolds the input but doesn't
normalize curly apostrophes, so markers in _DESCRIPTION_REVIEW_PROSE_MARKERS
like "we've" won't match "we’ve"; modify _looks_like_description_review_prose to
first replace curly apostrophes (e.g., U+2018/U+2019) with the straight ASCII
apostrophe, then casefold that normalized string before running the any(marker
in lowered) test so marker matching succeeds for both curly and straight
apostrophes.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cd966b13-2c3e-4591-9766-8e3924fcb1ce

📥 Commits

Reviewing files that changed from the base of the PR and between 09ed5db and 7061ef2.

📒 Files selected for processing (2)

src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/gmaps_scraper/place_scraper.py:162

_looks_like_description_review_prose() uses value.casefold() without normalizing typographic apostrophes (e.g. ’). Because the marker list mixes ASCII and curly-quote variants (e.g. both i'd... and i’d...) but only has ASCII for others (e.g. we've ever had), some real-world strings like we’ve ever had will not match and will slip through. Consider normalizing the input (and/or markers) the same way _looks_like_review_response_text() does (e.g. replace ’ with ' before searching) so you can keep a single canonical marker set and avoid missed matches.

_DESCRIPTION_REVIEW_PROSE_MARKERS = (
    "boy was it worth",
    "best place to stay",
    "definitely recommend this place",
    "great experience overall",
    "hidden gem-literally",
    "i forgot his name",
    "i'd just finished",
    "i’d just finished",
    "highly recommended",
    "i've tasted",
    "i’ve tasted",
    "it was my first attempt",
    "omfg",
    "overrated",
    "so yummy",
    "the katsu burger",
    "the rooms were huge",
    "we have ever had",
    "we've ever had",
    "what a great hotel",
    "your children",
    "your kids",
    "you should",
)

+            )
+        )
+        self.assertIsNone(
+            _clean_description_text(


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7061ef2501


chatgpt-codex-connector · 2026-05-10T12:47:51Z

+    "omfg",
    "overrated",
+    "so yummy",
+    "the katsu burger",


Avoid rejecting dish-name descriptions

When a legitimate place description mentions a katsu burger in 12+ words, _looks_like_description_review_prose() now discards the entire description solely because it contains the katsu burger. The review fixture this targets would already be rejected by the neighboring omfg / so yummy markers, but this standalone dish-name marker also matches normal restaurant copy such as a description of a signature menu item, causing valid description output to be lost for those places.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 468ab3a0f0


chatgpt-codex-connector · 2026-05-10T13:47:10Z

 )
 _DESCRIPTION_REVIEW_PROSE_MARKERS = (
+    "boy was it worth",
+    "best place to stay",


Avoid dropping hotel marketing descriptions

Because _looks_like_description_review_prose() applies these markers to any 12+ word description without requiring a first-person pronoun, a legitimate hotel summary such as “The best place to stay in Hanoi for families, with spacious rooms and a central location” is now discarded solely due to this substring. That removes valid description output for affected lodging pages rather than just filtering leaked reviews.
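One possible way to address this, sketched under assumptions: split the markers into a strong tier that rejects on its own and a weak tier that only rejects when a first-person pronoun is also present. The tier names, the assignment of phrases to tiers, and the pronoun regex are all hypothetical and are not taken from the PR.

```python
import re

# Hypothetical split: which phrases belong in which tier is an assumption.
_STRONG_MARKERS = ("omfg", "so yummy", "we've ever had")
_WEAK_MARKERS = ("best place to stay", "the katsu burger")
# Hypothetical first-person signal; a real list would likely be longer.
_FIRST_PERSON_RE = re.compile(r"\b(i|me|my|we|our)\b")


def looks_like_review_prose(value: str) -> bool:
    if len(value.split()) < 12:
        return False
    lowered = value.casefold().replace("\u2019", "'")
    # Strong markers are review-only phrases and reject by themselves.
    if any(marker in lowered for marker in _STRONG_MARKERS):
        return True
    # Weak markers also occur in marketing copy, so require a first-person cue.
    has_first_person = _FIRST_PERSON_RE.search(lowered) is not None
    return has_first_person and any(marker in lowered for marker in _WEAK_MARKERS)
```

With this shape, the hotel summary quoted above would be kept because it contains no first-person pronoun, while a review that pairs the same phrase with "we" or "our" would still be rejected.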


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_place_scraper.py`:
- Around line 1541-1545: Several long string literals in the new test cases
(e.g., "The lady was just so so lovely. My feet are just gorgeous. Would
recommend to everyone.", "My stay in Alila was wonderful. Special shout out to
the staff for making it memorable.", "The hotel have a sense of peace and
tranquility once step in. The personal service was delicate.", and "Directions
Save Nearby Send to phone Share About this data Get the most out of Google Maps
Sign in") exceed 100 characters causing E501 failures; fix by breaking each long
literal into shorter pieces using implicit string concatenation inside
parentheses or by splitting into two quoted parts joined with + so each source
line is under 100 chars (locate the list/array of test strings in
tests/test_place_scraper.py by searching for any of the quoted strings above and
update those entries).


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc334091-d552-44c9-9021-fc836c89e07a

📥 Commits

Reviewing files that changed from the base of the PR and between 7061ef2 and a749f22.

📒 Files selected for processing (2)

src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py

coderabbitai · 2026-05-10T15:47:32Z

+            "The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
+            "My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
+            "The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
+            "The staffs also offered great recommendation for drinks based on your preference.",
+            "Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix E501 line-length failures in the new looped test cases

CI is failing on Line 1541, Line 1542, Line 1543, and Line 1545 due to >100-char literals.

Suggested minimal formatting fix

-            "The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
-            "My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
-            "The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
-            "The staffs also offered great recommendation for drinks based on your preference.",
-            "Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",
+            (
+                "The lady was just so so lovely. My feet are just gorgeous. "
+                "Would recommend to everyone."
+            ),
+            (
+                "My stay in Alila was wonderful. Special shout out to the staff "
+                "for making it memorable."
+            ),
+            (
+                "The hotel have a sense of peace and tranquility once step in. "
+                "The personal service was delicate."
+            ),
+            (
+                "The staffs also offered great recommendation for drinks based on "
+                "your preference."
+            ),
+            (
+                "Directions Save Nearby Send to phone Share About this data "
+                "Get the most out of Google Maps Sign in"
+            ),


🧰 Tools

🪛 GitHub Actions: CI / 2_lint.txt

[error] 1541-1541: E501 Line too long (102 > 100) reported by lint (flake8/pycodestyle style check).

[error] 1542-1542: E501 Line too long (102 > 100) reported by lint (flake8/pycodestyle style check).

[error] 1543-1543: E501 Line too long (111 > 100) reported by lint (flake8/pycodestyle style check).

[error] 1545-1545: E501 Line too long (113 > 100) reported by lint (flake8/pycodestyle style check).



Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

 _DESCRIPTION_REVIEW_PROSE_MARKERS = (
+    "boy was it worth",
+    "best place to stay",
+    "definitely recommend this place",
+    "great experience overall",
+    "had a great time",
+    "hidden gem-literally",
+    "i forgot his name",
+    "i'd just finished",
+    "i’d just finished",
    "highly recommended",
+    "i've tasted",
+    "i’ve tasted",
+    "it was my first attempt",
+    "my stay in",
+    "once step in",
+    "offered great recommendation",
+    "omfg",
    "overrated",
+    "so yummy",
+    "the katsu burger",
+    "the rooms were huge",
+    "we have ever had",
+    "we've ever had",
+    "what a great hotel",
+    "would recommend to everyone",
+    "about this data",
+    "get the most out of google maps",
    "your children",
    "your kids",
    "you should",


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a749f22375


chatgpt-codex-connector · 2026-05-10T15:48:09Z

+    "about this data",
+    "get the most out of google maps",


Move chrome markers out of the review-prose gate

These Google Maps chrome markers are added to _DESCRIPTION_REVIEW_PROSE_MARKERS, but _looks_like_description_review_prose() returns False for any value under 12 words before checking this list. When the page text is just the common combined footer phrase About this data Get the most out of Google Maps (9 words) rather than the longer fixture with Directions/Save prefixes, _clean_description_text() still accepts it as a description because it is not an exact stop-marker match. Put these UI/footer phrases in a substring stop list or check them before the review-prose length gate.
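The ordering issue can be illustrated with a short sketch in which UI/footer phrases are checked as substrings before the length-gated review check runs. The names `_UI_CHROME_SUBSTRINGS` and `clean_description_text` and the trimmed marker tuple are assumptions for illustration; the PR's `_clean_description_text` performs other checks that are not modeled here.

```python
from typing import Optional

# Hypothetical names; the PR may structure its stop list differently.
_UI_CHROME_SUBSTRINGS = ("about this data", "get the most out of google maps")
_REVIEW_PROSE_MARKERS = ("highly recommended", "overrated")


def _looks_like_review_prose(value: str) -> bool:
    # The 12-word gate applies only to review-prose markers.
    if len(value.split()) < 12:
        return False
    lowered = value.casefold()
    return any(marker in lowered for marker in _REVIEW_PROSE_MARKERS)


def clean_description_text(value: Optional[str]) -> Optional[str]:
    if not value:
        return None
    text = " ".join(value.split())
    if not text:
        return None
    lowered = text.casefold()
    # Chrome phrases are checked first, so the 12-word gate cannot shield them.
    if any(phrase in lowered for phrase in _UI_CHROME_SUBSTRINGS):
        return None
    if _looks_like_review_prose(text):
        return None
    return text
```

Under this arrangement the nine-word footer string is rejected even though it is too short to reach the review-prose markers.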


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f79b6a241a


chatgpt-codex-connector · 2026-05-12T11:49:18Z

    panel_lines = _body_lines(snapshot.get("panel_text"))
-    body_lines = _body_lines(snapshot.get("body_text"))
-    search_lines = panel_lines or body_lines
-    combined_lines = _dedupe_lines([*panel_lines, *body_lines])
+    search_lines = panel_lines
+    combined_lines = _dedupe_lines(panel_lines)


Preserve search-card panel lines for fallback extraction

When a /maps/search result is selected but opening the place page fails, _search_result_snapshot() still copies the card text only as body_text, while this change makes _build_place_details() ignore body_text and build all line-based fallbacks from panel_text only. In that fallback path the selected card's lines are therefore invisible to _extract_status_from_lines, _extract_phone_from_lines, _extract_plus_code_from_lines, etc., even though the search-result JS now emits panel_text; copy panel_text through or include the search-card body_text in this narrow path.
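The narrow fallback suggested here could look roughly like the sketch below. It is written under assumptions: `_body_lines` is a stand-in for the PR's helper of the same name, `fallback_lines` is a hypothetical function, and the `source` key used to recognize a search-card snapshot is invented for illustration.

```python
from typing import Any, Dict, List, Optional


def _body_lines(value: Optional[str]) -> List[str]:
    """Stand-in for the PR's helper: split text into stripped, non-empty lines."""
    if not value:
        return []
    return [line.strip() for line in value.splitlines() if line.strip()]


def fallback_lines(snapshot: Dict[str, Any]) -> List[str]:
    """Prefer panel_text; use body_text only for search-card snapshots."""
    panel_lines = _body_lines(snapshot.get("panel_text"))
    if panel_lines:
        return panel_lines
    # Hypothetical flag: how a search-card snapshot is identified is an assumption.
    if snapshot.get("source") == "search_card":
        return _body_lines(snapshot.get("body_text"))
    return []
```

The alternative named in the comment, copying `panel_text` through in `_search_result_snapshot()`, would avoid the extra branch entirely.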


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 4347-4350: The check currently only matches labels that start with
"sponsored " and misses the exact "sponsored" label; update the return condition
in the block that computes normalized_lookup (after normalized.casefold()) to
also treat an exact match by adding a check like normalized_lookup ==
"sponsored" (in addition to the existing startswith check) when testing against
_SEARCH_RESULTS_LABELS so both "sponsored" and values starting with "sponsored "
are considered search-result labels.
- Around line 1428-1443: The cardDescription function currently returns the
first matching row text, which can be UI-action tokens like
"Share"/"Save"/"Call" and prevent real description rows from being considered;
update the loop in cardDescription to treat short UI action tokens as
non-descriptive by continuing instead of returning: add a check against common
UI verbs/labels (e.g., /^((share|save|call|website|directions|get
directions|saved in|view menu|write a review)\b)/i or a small-word-length
heuristic) using the existing text variable and rows iteration, and only return
text when it does not match those UI-action patterns and looks like a real
description.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 37f8fa43-135d-40b4-b6af-2b0c5ecb44b4

📥 Commits

Reviewing files that changed from the base of the PR and between a749f22 and f79b6a2.

📒 Files selected for processing (2)

src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py

coderabbitai · 2026-05-12T11:58:05Z

+      if (
+        row.querySelector(
+          ".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
+        )
+      ) {
+        continue;
      }
      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
-        return false;
+        continue;
      }
      if (/^[+()\d\s.-]{7,}$/.test(text)) {
-        return false;
+        continue;
      }
-      return text.length >= 12;
-    }) || null;
+      return text;
+    }
+    return null;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid returning UI-action tokens from cardDescription before real description rows.

If the first matched row is "Share"/"Save"/"Call", this function returns it and the Python cleaner drops it later, so later valid description rows are never considered.

Suggested minimal fix

-      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+      if (
+        /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+      ) {
         continue;
       }
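The skip-then-continue behaviour of the suggested fix can be sketched in Python as an analogue of the JavaScript loop. The prefixes are copied from the suggested regex above; `card_description` operating on plain strings is a simplification, and the DOM selector checks (`.AJB7ye`, `.UsdlK`, star ratings, `tel:` links) are not modeled.

```python
import re
from typing import Iterable, Optional

# Prefixes copied from the suggested regex; treating them as a skip list is
# the review's proposal, not confirmed merged behaviour.
_UI_ROW_RE = re.compile(
    r"^(open|closed|temporarily closed|website|directions|saved in|call|save|saved"
    r"|share|nearby|send to phone|sign in)\b",
    re.IGNORECASE,
)
# Rows that consist only of phone-number characters.
_PHONE_ROW_RE = re.compile(r"^[+()\d\s.-]{7,}$")


def card_description(rows: Iterable[str]) -> Optional[str]:
    """Return the first row that is neither a UI action nor a phone number."""
    for raw in rows:
        text = " ".join(raw.split())
        if not text:
            continue
        if _UI_ROW_RE.match(text) or _PHONE_ROW_RE.match(text):
            continue
        return text
    return None
```

Because UI rows are skipped rather than returned, a later descriptive row is still reachable even when "Share" or "Save" appears first.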


coderabbitai · 2026-05-12T11:58:05Z

+    normalized_lookup = normalized.casefold()
+    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
+        "sponsored "
+    )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat exact "Sponsored" as a search-result label too.

Line 4348 only catches values that start with "sponsored " (with a trailing space). A plain "Sponsored" label can still pass name/category cleaning.

Suggested fix

-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/gmaps_scraper/place_scraper.py` around lines 4347 - 4350, The check currently only matches labels that start with "sponsored " and misses the exact "sponsored" label; update the return condition in the block that computes normalized_lookup (after normalized.casefold()) to also treat an exact match by adding a check like normalized_lookup == "sponsored" (in addition to the existing startswith check) when testing against _SEARCH_RESULTS_LABELS so both "sponsored" and values starting with "sponsored " are considered search-result labels.
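A minimal standalone sketch of the suggested check follows. The contents of `_SEARCH_RESULTS_LABELS` are not shown in this thread, so the set below is a placeholder, and the whitespace normalization stands in for the project's `_clean_text`, whose exact behavior is assumed.

```python
# Placeholder labels: the real _SEARCH_RESULTS_LABELS is not shown in this thread.
_SEARCH_RESULTS_LABELS = frozenset({"results", "search results"})


def looks_like_search_results_label(value: str) -> bool:
    """Treat known labels, a bare "Sponsored", and "Sponsored ..." as labels."""
    # Assumed stand-in for _clean_text: collapse whitespace, then casefold.
    normalized_lookup = " ".join(value.split()).casefold()
    return (
        normalized_lookup in _SEARCH_RESULTS_LABELS
        or normalized_lookup == "sponsored"
        or normalized_lookup.startswith("sponsored ")
    )
```

Matching the exact word and the `"sponsored "` prefix separately keeps a name such as "Sponsor House Cafe" from being rejected, which a bare `startswith("sponsor")` would not.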

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc63505947


chatgpt-codex-connector · 2026-05-12T16:13:26Z

+    if _parse_list_owner(value) is not None:
+        return None


Preserve singleton CID lists

When a place metadata CID is stored in the common singleton-list shape like ["1234567890123456789"], this guard now treats that list as a list-owner payload because _parse_list_owner() accepts any 1-item list with a non-empty string as name. _find_cid() checks metadata index 6 through this path first, so those places lose their cid and fall back to query-only maps_url/dedupe behavior even though a valid CID is present.

Useful? React with 👍 / 👎.
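The failure mode can be reproduced with a simplified sketch. Everything here is a stand-in: `Owner`, `parse_list_owner`, and `normalize_cid_token` are hypothetical reductions of the project's helpers, the owner payload layout (name, photo URL, profile ID) is assumed, and so is the rule that a CID is a 15 to 20 digit decimal string. The guard requires a photo URL or profile ID before a list is treated as an owner payload, which is the condition the later Copilot-reviewed diff appears to use.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Owner:
    name: str
    photo_url: str | None = None
    profile_id: str | None = None


def parse_list_owner(value: list) -> Owner | None:
    """Stand-in parser: any list whose first item is a non-empty string."""
    if not value or not isinstance(value[0], str) or not value[0]:
        return None
    photo_url = value[1] if len(value) > 1 and isinstance(value[1], str) else None
    profile_id = value[2] if len(value) > 2 and isinstance(value[2], str) else None
    return Owner(value[0], photo_url, profile_id)


def normalize_cid_token(token: str) -> str | None:
    # Assumption: a CID is a decimal string of 15 to 20 digits.
    return token if re.fullmatch(r"\d{15,20}", token) else None


def find_cid_in_value(value) -> str | None:
    if isinstance(value, int):
        return normalize_cid_token(str(value))
    if isinstance(value, str):
        return normalize_cid_token(value)
    if not isinstance(value, list):
        return None
    owner = parse_list_owner(value)
    # Skip only lists that carry owner-specific fields, not bare singleton lists.
    if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):
        return None
    for item in value:
        cid = find_cid_in_value(item)
        if cid is not None:
            return cid
    return None
```

With the bare `is not None` guard, `["1234567890123456789"]` would parse as an owner named by the CID string and be discarded; the narrower guard lets it through while still skipping lists that look like user payloads.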

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

 def _find_cid_in_value(value: JSONValue | None) -> str | None:
    if isinstance(value, int):
        return _normalize_cid_token(str(value))
    if isinstance(value, str):
        return _normalize_cid_token(value)
    if not isinstance(value, list):
        return None
+    owner = _parse_list_owner(value)
+    if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):


+    search_lines = panel_lines
+    combined_lines = _dedupe_lines(panel_lines)


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

+  const cardDescription = (article, excludedValues) => {
    const excluded = new Set(excludedValues.map(cleanLine).filter(Boolean));
-    return lines.find((line) => {
-      const text = cleanLine(line);
+    const rows = Array.from(article.querySelectorAll("div.W4Efsd"));
+    for (const row of rows) {
+      const text = cleanLine(row.innerText || row.textContent || "");
      if (!text || excluded.has(text)) {
-        return false;
+        continue;
      }
      if (text.includes("·") || parseCardRating(text) || parseCardReviewCount(text)) {
-        return false;
+        continue;
+      }
+      if (
+        row.querySelector(
+          ".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
+        )
+      ) {
+        continue;
      }
      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
-        return false;
+        continue;
      }
      if (/^[+()\d\s.-]{7,}$/.test(text)) {
-        return false;
+        continue;
      }
-      return text.length >= 12;
-    }) || null;
+      return text;
+    }


+    const candidates = [];
+    for (const element of panel.querySelectorAll("div, span")) {
+      const text = cleanLine(element.innerText || element.textContent || "");
+      if (!text || text.includes("·")) {
+        continue;
+      }
+      if (
+        element.closest(
+          "button, a, [role='button'], [data-item-id], [data-review-id], div.F7nice",
+        )
+      ) {
+        continue;
+      }
+      if (
+        Array.from(element.children).some(
+          (child) => cleanLine(child.innerText || child.textContent || "") === text,
+        )
+      ) {
+        continue;
+      }
+      const top = elementTop(element);
+      if (top === null || top <= titleBottom || top >= boundaryTop) {
+        continue;
+      }
+      candidates.push({top, text});
+    }
+    candidates.sort((left, right) => left.top - right.top);
+    return candidates[0]?.text || null;


coderabbitai

♻️ Duplicate comments (3)

src/gmaps_scraper/place_scraper.py (3)

4346-4353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle exact Sponsored the same way as Sponsored ….

The new check still lets a plain "Sponsored" label through because it only tests startswith("sponsored ").

Suggested fix

-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 4346 - 4353, The function
_looks_like_search_results_label currently lets a plain "Sponsored" pass because
it only tests normalized_lookup.startswith("sponsored "), so update the
condition inside _looks_like_search_results_label (which uses _clean_text and
_SEARCH_RESULTS_LABELS) to treat the exact "sponsored" the same as "sponsored …"
by checking either normalized_lookup == "sponsored" or
normalized_lookup.startswith("sponsored "), combined with the existing
membership test against _SEARCH_RESULTS_LABELS.

3832-3836: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize curly apostrophes before review-prose matching.

value.casefold() still misses inputs like we’ve ever had unless every marker is duplicated in both ASCII and curly forms. That lets targeted review prose through.

Suggested fix

 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'").replace("‘", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 3832 - 3836, The function
_looks_like_description_review_prose should normalize curly apostrophes before
doing casefold/matching; update it to replace Unicode curly apostrophes (e.g.,
U+2019 and U+2018) with the ASCII apostrophe (') on the input string (value =
value.replace("\u2019", "'").replace("\u2018", "'")) before calling casefold(),
then perform the existing marker check against _DESCRIPTION_REVIEW_PROSE_MARKERS
so markers like "we've" match even when the input contains curly apostrophes.

1419-1445: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip short UI-action rows before returning search_result_description.

If the first div.W4Efsd row is "Share", "Save", or "Call", this returns it immediately, the Python cleaner drops it later, and any real description row below never gets considered.

Suggested fix

-      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+      if (
+        /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+      ) {
         continue;
       }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 1419 - 1445, The
cardDescription routine returns short UI-action rows like "Share", "Save", or
"Call" before real descriptions; update cardDescription (used with cleanLine,
parseCardRating, parseCardReviewCount) to explicitly skip common UI-action
labels by adding a check (e.g., a case-insensitive regex for "share", "save",
"call", "directions", "website", "more", etc.) before returning text so those
tokens are continued over and the next non-UI descriptive row is returned for
search_result_description.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce5aad73-7396-4468-a684-de53d766191b

📥 Commits

Reviewing files that changed from the base of the PR and between f79b6a2 and b569056.

📒 Files selected for processing (4)

src/gmaps_scraper/parser.py
src/gmaps_scraper/place_scraper.py
tests/test_parser.py
tests/test_place_scraper.py

Copilot AI review requested due to automatic review settings May 10, 2026 08:32

Copilot started reviewing on behalf of michaelmwu May 10, 2026 08:32

coderabbitai Bot reviewed May 10, 2026

chatgpt-codex-connector Bot reviewed May 10, 2026

Copilot AI reviewed May 10, 2026

michaelmwu force-pushed the codex/reject-ramen-review-description branch from 09ed5db to fda07a5 May 10, 2026 11:14

Copilot AI review requested due to automatic review settings May 10, 2026 12:42

michaelmwu force-pushed the codex/reject-ramen-review-description branch from fda07a5 to 7061ef2 May 10, 2026 12:42

Copilot started reviewing on behalf of michaelmwu May 10, 2026 12:43

coderabbitai Bot reviewed May 10, 2026

Copilot AI reviewed May 10, 2026

Comment thread tests/test_place_scraper.py

)

)

self.assertIsNone(

_clean_description_text(

chatgpt-codex-connector Bot reviewed May 10, 2026

michaelmwu force-pushed the codex/reject-ramen-review-description branch from 7061ef2 to 468ab3a May 10, 2026 13:45

chatgpt-codex-connector Bot reviewed May 10, 2026

Copilot AI review requested due to automatic review settings May 10, 2026 15:45

michaelmwu force-pushed the codex/reject-ramen-review-description branch from 468ab3a to a749f22 May 10, 2026 15:45

Copilot started reviewing on behalf of michaelmwu May 10, 2026 15:46

coderabbitai Bot reviewed May 10, 2026

Copilot AI reviewed May 10, 2026

chatgpt-codex-connector Bot reviewed May 10, 2026

michaelmwu added 2 commits May 12, 2026 20:43

Reject first-person review descriptions

56aa783

Reject Google Maps UI action descriptions

f79b6a2

michaelmwu force-pushed the codex/reject-ramen-review-description branch from a749f22 to f79b6a2 May 12, 2026 11:45

chatgpt-codex-connector Bot reviewed May 12, 2026

coderabbitai Bot reviewed May 12, 2026

Ignore owner payloads when extracting place CIDs

cc63505

Copilot AI review requested due to automatic review settings May 12, 2026 16:10

Copilot started reviewing on behalf of michaelmwu May 12, 2026 16:10

chatgpt-codex-connector Bot reviewed May 12, 2026

Allow numeric CID arrays while skipping owner payloads

d2c6fff

Copilot AI reviewed May 12, 2026

michaelmwu added 2 commits May 13, 2026 07:59

Preserve search result panel text

5396fc6

Fix lint in panel text test

6bac110

Copilot AI review requested due to automatic review settings May 12, 2026 23:00

Copilot started reviewing on behalf of michaelmwu May 12, 2026 23:01

Copilot AI reviewed May 12, 2026

Ignore tab strips in description fallback

b569056

coderabbitai Bot reviewed May 13, 2026

michaelmwu changed the title ~~Reject first-person review descriptions~~ Harden Google Maps place extraction May 13, 2026

michaelmwu merged commit 8a94e9c into main May 13, 2026
4 checks passed

michaelmwu deleted the codex/reject-ramen-review-description branch May 13, 2026 23:16

-            "The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
-            "My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
-            "The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
-            "The staffs also offered great recommendation for drinks based on your preference.",
-            "Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",
+            (
+                "The lady was just so so lovely. My feet are just gorgeous. "
+                "Would recommend to everyone."
+            ),
+            (
+                "My stay in Alila was wonderful. Special shout out to the staff "
+                "for making it memorable."
+            ),
+            (
+                "The hotel have a sense of peace and tranquility once step in. "
+                "The personal service was delicate."
+            ),
+            (
+                "The staffs also offered great recommendation for drinks based on "
+                "your preference."
+            ),
+            (
+                "Directions Save Nearby Send to phone Share About this data "
+                "Get the most out of Google Maps Sign in"
+            ),

