Harden Google Maps place extraction#37

Merged
michaelmwu merged 7 commits into main from codex/reject-ramen-review-description
May 13, 2026

Conversation

@michaelmwu
Member

@michaelmwu michaelmwu commented May 10, 2026

Summary

  • reject first-person/review-style Google Maps descriptions, Google Maps UI action clusters, sponsored labels, and tab-strip text
  • keep description fallback selection from accepting Overview/Reviews/About tab containers when direct selectors are absent
  • preserve selected search-card panel lines so fallback detail parsing can still recover name/category/address/plus-code data
  • ignore owner/user payloads when extracting CIDs while still allowing legitimate numeric CID arrays
  • add regression coverage for description cleaning, panel-text fallback parsing, tab exclusion, photo URL validation, and CID parsing

Validation

  • uv run python3 -m unittest tests.test_place_scraper.PlaceScraperTests.test_clean_description_text_rejects_first_person_review_prose
  • uv run python3 -m unittest discover -s tests -p 'test_place_scraper.py'
  • PR checks are currently green on GitHub

Copilot AI review requested due to automatic review settings May 10, 2026 08:32
@coderabbitai

coderabbitai Bot commented May 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Refined description selection and cleaning: expanded review-prose markers and UI-action-cluster detection, rewrote place-page and search-card extractors to select descriptionValue(), tightened photo selectors/URL normalization, simplified Python extraction/fallbacks, and updated tests and snapshots.

Changes

Place extraction & cleaning

Layer / File(s) Summary
Description markers and UI-cluster detection
src/gmaps_scraper/place_scraper.py
Expanded _DESCRIPTION_REVIEW_PROSE_MARKERS, added _UI_ACTION_CLUSTER_RE and _looks_like_ui_action_cluster(), and updated _clean_description_text and _looks_like_search_results_label to reject UI clusters and sponsored labels.
Place-page DOM extractor updates
src/gmaps_scraper/place_scraper.py, tests/test_place_scraper.py
Place-page JS extractor computes element top/bottom boundaries, picks a description via descriptionValue(), adds elementTop/elementBottom helpers, expands main photo selectors, and updated JS extractor test expectations.
Search-result card description selection
src/gmaps_scraper/place_scraper.py, tests/test_place_scraper.py
Replaced findDescriptionLine with cardDescription using an exclusion set and row filters (skip star/tel/website/directions/open/closed patterns); search_result_description is sourced from cardDescription(...) and tests updated; snapshot builder now preserves panel_text.
_build_place_details fallbacks & tests
src/gmaps_scraper/place_scraper.py, tests/test_place_scraper.py
combined_lines/fallbacks now derive from panel_text only; _extract_description simplified to accept snapshot and call _clean_description_text(snapshot.get("description")); multiple tests/snapshots updated (description injection, coords, limited_view, admission_prices, sponsored name rejection).
Photo selectors and URL normalization
src/gmaps_scraper/place_scraper.py, tests/test_place_scraper.py
Expanded mainPhotoUrl selector lists; _normalize_photo_url rejects additional lh*.googleusercontent.com/ggpht.com hosts and requires w{digits}-h{digits} size patterns; tests accept a gps-cs-s URL variant and reject ad-thumbnail/unshaped-host style photo URLs.

Parser owner-payload guard

Layer / File(s) Summary
_find_cid_in_value owner guard and test
src/gmaps_scraper/parser.py, tests/test_parser.py
Early guard to treat list-shaped ListOwner-like payloads (with photo_url or profile_id) as non-CID values; unit test added to ensure no CID is extracted from owner metadata.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I nibble through DOMs and tidy up the text,
I hop past starry rows and sponsored pretext.
I prune the photo links and guard the IDs,
Clean snippets hop out, neat as carrot-tide.
Tiny paws, bright nose — scraper joy amplified.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Harden Google Maps place extraction' is overly broad and generic, referring to hardening/robustness improvements but not conveying the main change (rejecting first-person review descriptions). Consider a more specific title like 'Reject first-person review descriptions in place extraction' that clearly indicates the primary change.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 139-143: The matching fails for curly apostrophes (e.g., “we’ve
ever had”) because the review-prose markers list contains straight-apostrophe
strings like "we've ever had"; before running the marker checks (the code that
iterates over strings such as "we've ever had", "we have ever had", "we've ever
had", "overrated", etc.) normalize the review text (or both the text and
markers) by replacing curly/apostrophe variants (e.g., \u2019, \u2018) with the
ASCII apostrophe (') and any similar typographic quotes so that "we’ve ever had"
matches "we've ever had".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8399c54-fd11-40c5-9e46-9693ab22485f

📥 Commits

Reviewing files that changed from the base of the PR and between a7900da and 09ed5db.

📒 Files selected for processing (2)
  • src/gmaps_scraper/place_scraper.py
  • tests/test_place_scraper.py

Comment on lines +139 to +143
"great experience overall",
"highly recommended",
"overrated",
"we have ever had",
"we've ever had",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize apostrophes before matching review-prose markers.

At Line 143, "we've ever had" won’t match curly-apostrophe text (we’ve ever had), so first-person review prose can still pass through.

Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
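
The normalization suggested above can be sketched in isolation. This is a minimal illustration, not the repository's code: `_MARKERS`, `_APOSTROPHE_TABLE`, and `looks_like_review_prose` are hypothetical stand-ins for the real marker tuple and helper, and the sketch also maps U+2018, which the one-line fix does not.

```python
# Hypothetical stand-in for the module's real marker tuple.
_MARKERS = ("we have ever had", "we've ever had", "great experience overall")

# Map the common typographic apostrophes (U+2018, U+2019) to ASCII.
_APOSTROPHE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'"})


def looks_like_review_prose(value: str) -> bool:
    """Return True when a 12+ word value contains a review-prose marker."""
    if len(value.split()) < 12:
        return False
    # Normalize apostrophes after casefolding so one ASCII marker covers both forms.
    lowered = value.casefold().translate(_APOSTROPHE_TABLE)
    return any(marker in lowered for marker in _MARKERS)
```

With this in place, a single ASCII marker such as "we've ever had" is enough, so the curly-quote duplicates in the marker list would no longer be needed.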

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09ed5db706

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"highly recommended",
"overrated",
"we have ever had",
"we've ever had",

P2 Badge Normalize curly apostrophes for review-prose markers

When the leaked review uses a typographic apostrophe, e.g. Best ramen we’ve ever had ..., this new marker will not match because _looks_like_description_review_prose() only casefold()s the text and does not replace ’ with ' like _looks_like_review_response_text() does. Since had is not one of the first-person experience verbs, a 12+ word review with we’ve ever had and no other marker can still be accepted as a place description, leaving open the class of leak this change is meant to reject.

Copilot AI left a comment

Pull request overview

This pull request tightens description cleaning in place_scraper to prevent first-person / review-style prose from being treated as a place description, and adds a regression test for a ramen-class review snippet that previously leaked through.

Changes:

  • Extend the review-prose marker list to reject additional review-like phrases (e.g., “great experience overall”, “we’ve ever had”).
  • Add a unit test asserting that a ramen-class review-style description is rejected by _clean_description_text().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/gmaps_scraper/place_scraper.py Adds new review-prose substring markers used to reject leaked review-style descriptions.
tests/test_place_scraper.py Adds a regression test ensuring a “Best ramen we’ve ever had…”-style snippet is rejected.


"highly recommended",
"overrated",
"we have ever had",
"we've ever had",
"(the noodles at least). Everything tastes better when you do it yourself! "
"Date day for a Saturday morning class. Great experience overall."
)
)
@michaelmwu michaelmwu force-pushed the codex/reject-ramen-review-description branch from 09ed5db to fda07a5 Compare May 10, 2026 11:14
Copilot AI review requested due to automatic review settings May 10, 2026 12:42
@michaelmwu michaelmwu force-pushed the codex/reject-ramen-review-description branch from fda07a5 to 7061ef2 Compare May 10, 2026 12:42
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/gmaps_scraper/place_scraper.py (1)

3748-3752: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize curly apostrophes before review-prose marker matching

At Line 3751, matching is case-folded but not apostrophe-normalized, so curly-apostrophe text can bypass straight-apostrophe markers (for example, we’ve vs we've).

Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 3748 - 3752, The function
_looks_like_description_review_prose currently casefolds the input but doesn't
normalize curly apostrophes, so markers in _DESCRIPTION_REVIEW_PROSE_MARKERS
like "we've" won't match "we’ve"; modify _looks_like_description_review_prose to
first replace curly apostrophes (e.g., U+2018/U+2019) with the straight ASCII
apostrophe, then casefold that normalized string before running the any(marker
in lowered) test so marker matching succeeds for both curly and straight
apostrophes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cd966b13-2c3e-4591-9766-8e3924fcb1ce

📥 Commits

Reviewing files that changed from the base of the PR and between 09ed5db and 7061ef2.

📒 Files selected for processing (2)
  • src/gmaps_scraper/place_scraper.py
  • tests/test_place_scraper.py

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/gmaps_scraper/place_scraper.py:162

  • _looks_like_description_review_prose() uses value.casefold() without normalizing typographic apostrophes (e.g. ’). Because the marker list mixes ASCII and curly-quote variants for some entries (e.g. both i'd... and i’d...) but has only the ASCII form for others (e.g. we've ever had), real-world strings like we’ve ever had will not match and will slip through. Consider normalizing the input (and/or markers) the same way _looks_like_review_response_text() does (e.g. replace ’ with ' before searching) so you can keep a single canonical marker set and avoid missed matches.
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
    "boy was it worth",
    "best place to stay",
    "definitely recommend this place",
    "great experience overall",
    "hidden gem-literally",
    "i forgot his name",
    "i'd just finished",
    "i’d just finished",
    "highly recommended",
    "i've tasted",
    "i’ve tasted",
    "it was my first attempt",
    "omfg",
    "overrated",
    "so yummy",
    "the katsu burger",
    "the rooms were huge",
    "we have ever had",
    "we've ever had",
    "what a great hotel",
    "your children",
    "your kids",
    "you should",
)

)
)
self.assertIsNone(
_clean_description_text(
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7061ef2501


"omfg",
"overrated",
"so yummy",
"the katsu burger",

P2 Badge Avoid rejecting dish-name descriptions

When a legitimate place description mentions a katsu burger in 12+ words, _looks_like_description_review_prose() now discards the entire description solely because it contains the katsu burger. The review fixture this targets would already be rejected by the neighboring omfg / so yummy markers, but this standalone dish-name marker also matches normal restaurant copy such as a description of a signature menu item, causing valid description output to be lost for those places.


@michaelmwu michaelmwu force-pushed the codex/reject-ramen-review-description branch from 7061ef2 to 468ab3a Compare May 10, 2026 13:45
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 468ab3a0f0


)
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
"boy was it worth",
"best place to stay",

P2 Badge Avoid dropping hotel marketing descriptions

Because _looks_like_description_review_prose() applies these markers to any 12+ word description without requiring a first-person pronoun, a legitimate hotel summary such as “The best place to stay in Hanoi for families, with spacious rooms and a central location” is now discarded solely due to this substring. That removes valid description output for affected lodging pages rather than just filtering leaked reviews.
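
One way to address this kind of false positive is to split the markers into two tiers and let ambiguous phrases count only when the text also contains a first-person pronoun. The sketch below is an assumption for illustration, not what the PR ships: the tier names, the pronoun set, and `looks_like_review_prose` are hypothetical.

```python
import re

# Hypothetical tiers: strong markers reject on their own; ambiguous ones
# also appear in normal marketing copy and need a first-person signal.
_STRONG_MARKERS = ("omfg", "so yummy", "we've ever had")
_AMBIGUOUS_MARKERS = ("best place to stay", "the katsu burger")

# Word-boundary match, so "i" does not fire inside "in" or "families".
_FIRST_PERSON_RE = re.compile(r"\b(?:i|we|my|our)\b")


def looks_like_review_prose(value: str) -> bool:
    if len(value.split()) < 12:
        return False
    lowered = value.casefold().replace("\u2019", "'")
    if any(marker in lowered for marker in _STRONG_MARKERS):
        return True
    has_first_person = _FIRST_PERSON_RE.search(lowered) is not None
    return has_first_person and any(m in lowered for m in _AMBIGUOUS_MARKERS)
```

Under these assumptions the Hanoi hotel summary quoted above passes through, while a review that pairs "best place to stay" with "we" or "our" is still rejected.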


Copilot AI review requested due to automatic review settings May 10, 2026 15:45
@michaelmwu michaelmwu force-pushed the codex/reject-ramen-review-description branch from 468ab3a to a749f22 Compare May 10, 2026 15:45
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_place_scraper.py`:
- Around line 1541-1545: Several long string literals in the new test cases
(e.g., "The lady was just so so lovely. My feet are just gorgeous. Would
recommend to everyone.", "My stay in Alila was wonderful. Special shout out to
the staff for making it memorable.", "The hotel have a sense of peace and
tranquility once step in. The personal service was delicate.", and "Directions
Save Nearby Send to phone Share About this data Get the most out of Google Maps
Sign in") exceed 100 characters causing E501 failures; fix by breaking each long
literal into shorter pieces using implicit string concatenation inside
parentheses or by splitting into two quoted parts joined with + so each source
line is under 100 chars (locate the list/array of test strings in
tests/test_place_scraper.py by searching for any of the quoted strings above and
update those entries).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc334091-d552-44c9-9021-fc836c89e07a

📥 Commits

Reviewing files that changed from the base of the PR and between 7061ef2 and a749f22.

📒 Files selected for processing (2)
  • src/gmaps_scraper/place_scraper.py
  • tests/test_place_scraper.py

Comment thread tests/test_place_scraper.py Outdated
Comment on lines +1541 to +1545
"The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
"My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
"The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
"The staffs also offered great recommendation for drinks based on your preference.",
"Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix E501 line-length failures in the new looped test cases

CI is failing on Line 1541, Line 1542, Line 1543, and Line 1545 due to >100-char literals.

Suggested minimal formatting fix
-            "The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
-            "My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
-            "The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
-            "The staffs also offered great recommendation for drinks based on your preference.",
-            "Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",
+            (
+                "The lady was just so so lovely. My feet are just gorgeous. "
+                "Would recommend to everyone."
+            ),
+            (
+                "My stay in Alila was wonderful. Special shout out to the staff "
+                "for making it memorable."
+            ),
+            (
+                "The hotel have a sense of peace and tranquility once step in. "
+                "The personal service was delicate."
+            ),
+            (
+                "The staffs also offered great recommendation for drinks based on "
+                "your preference."
+            ),
+            (
+                "Directions Save Nearby Send to phone Share About this data "
+                "Get the most out of Google Maps Sign in"
+            ),
🧰 Tools
🪛 GitHub Actions: CI / 2_lint.txt

[error] 1541-1541: E501 Line too long (102 > 100) reported by lint (flake8/pycodestyle style check).


[error] 1542-1542: E501 Line too long (102 > 100) reported by lint (flake8/pycodestyle style check).


[error] 1543-1543: E501 Line too long (111 > 100) reported by lint (flake8/pycodestyle style check).


[error] 1545-1545: E501 Line too long (113 > 100) reported by lint (flake8/pycodestyle style check).



Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines 138 to 168
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
"boy was it worth",
"best place to stay",
"definitely recommend this place",
"great experience overall",
"had a great time",
"hidden gem-literally",
"i forgot his name",
"i'd just finished",
"i’d just finished",
"highly recommended",
"i've tasted",
"i’ve tasted",
"it was my first attempt",
"my stay in",
"once step in",
"offered great recommendation",
"omfg",
"overrated",
"so yummy",
"the katsu burger",
"the rooms were huge",
"we have ever had",
"we've ever had",
"what a great hotel",
"would recommend to everyone",
"about this data",
"get the most out of google maps",
"your children",
"your kids",
"you should",
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a749f22375


Comment on lines +164 to +165
"about this data",
"get the most out of google maps",

P2 Badge Move chrome markers out of the review-prose gate

These Google Maps chrome markers are added to _DESCRIPTION_REVIEW_PROSE_MARKERS, but _looks_like_description_review_prose() returns False for any value under 12 words before checking this list. When the page text is just the common combined footer phrase About this data Get the most out of Google Maps (9 words) rather than the longer fixture with Directions/Save prefixes, _clean_description_text() still accepts it as a description because it is not an exact stop-marker match. Put these UI/footer phrases in a substring stop list or check them before the review-prose length gate.
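
The ordering issue can be shown with a small sketch in which the chrome check runs before, and independently of, the 12-word review gate. The names `_UI_CHROME_MARKERS`, `_REVIEW_MARKERS`, and `clean_description_text` are hypothetical simplifications of the real module, which performs more checks than shown here.

```python
from typing import Optional

# Hypothetical substring stop list for Google Maps footer/chrome text.
_UI_CHROME_MARKERS = ("about this data", "get the most out of google maps")
# Hypothetical subset of the review-prose markers.
_REVIEW_MARKERS = ("great experience overall", "we've ever had")


def clean_description_text(value: Optional[str]) -> Optional[str]:
    if not value:
        return None
    text = " ".join(value.split())  # collapse runs of whitespace
    if not text:
        return None
    lowered = text.casefold()
    # Chrome check runs first and has no minimum word count.
    if any(marker in lowered for marker in _UI_CHROME_MARKERS):
        return None
    # Review-prose markers keep their 12-word gate.
    if len(text.split()) >= 12 and any(m in lowered for m in _REVIEW_MARKERS):
        return None
    return text
```

Because the chrome markers no longer sit behind the length gate, the 9-word footer phrase is rejected in the same way as the longer fixture.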


@michaelmwu michaelmwu force-pushed the codex/reject-ramen-review-description branch from a749f22 to f79b6a2 Compare May 12, 2026 11:45
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f79b6a241a


Comment on lines 2492 to +2494
panel_lines = _body_lines(snapshot.get("panel_text"))
body_lines = _body_lines(snapshot.get("body_text"))
search_lines = panel_lines or body_lines
combined_lines = _dedupe_lines([*panel_lines, *body_lines])
search_lines = panel_lines
combined_lines = _dedupe_lines(panel_lines)

P2 Badge Preserve search-card panel lines for fallback extraction

When a /maps/search result is selected but opening the place page fails, _search_result_snapshot() still copies the card text only as body_text, while this change makes _build_place_details() ignore body_text and build all line-based fallbacks from panel_text only. In that fallback path the selected card's lines are therefore invisible to _extract_status_from_lines, _extract_phone_from_lines, _extract_plus_code_from_lines, etc., even though the search-result JS now emits panel_text; copy panel_text through or include the search-card body_text in this narrow path.
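
The "copy panel_text through" option might look roughly like the sketch below. It is illustrative only: the card field names (`text`, `panel_text`) and the helpers `search_result_snapshot`, `body_lines`, and `fallback_lines` are hypothetical and do not reproduce the repository's functions.

```python
from typing import Any, Dict, List


def search_result_snapshot(card: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical snapshot builder that carries the card's panel text through."""
    text = card.get("text") or ""
    return {
        "body_text": text,
        # Fall back to the card text so line-based parsers still see it.
        "panel_text": card.get("panel_text") or text,
    }


def body_lines(value: Any) -> List[str]:
    """Split text into stripped, non-empty lines; non-strings yield no lines."""
    if not isinstance(value, str):
        return []
    return [line.strip() for line in value.splitlines() if line.strip()]


def fallback_lines(snapshot: Dict[str, Any]) -> List[str]:
    """Line-based fallbacks read panel_text only, as in the change under review."""
    return body_lines(snapshot.get("panel_text"))
```

With the snapshot carrying `panel_text`, the detail builder can keep reading from one field while the selected card's lines stay visible to the status, phone, and plus-code extractors.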

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 4347-4350: The check currently only matches labels that start with
"sponsored " and misses the exact "sponsored" label; update the return condition
in the block that computes normalized_lookup (after normalized.casefold()) to
also treat an exact match by adding a check like normalized_lookup ==
"sponsored" (in addition to the existing startswith check) when testing against
_SEARCH_RESULTS_LABELS so both "sponsored" and values starting with "sponsored "
are considered search-result labels.
- Around line 1428-1443: The cardDescription function currently returns the
first matching row text, which can be UI-action tokens like
"Share"/"Save"/"Call" and prevent real description rows from being considered;
update the loop in cardDescription to treat short UI action tokens as
non-descriptive by continuing instead of returning: add a check against common
UI verbs/labels (e.g., /^((share|save|call|website|directions|get
directions|saved in|view menu|write a review)\b)/i or a small-word-length
heuristic) using the existing text variable and rows iteration, and only return
text when it does not match those UI-action patterns and looks like a real
description.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 37f8fa43-135d-40b4-b6af-2b0c5ecb44b4

📥 Commits

Reviewing files that changed from the base of the PR and between a749f22 and f79b6a2.

📒 Files selected for processing (2)
  • src/gmaps_scraper/place_scraper.py
  • tests/test_place_scraper.py

Comment on lines +1428 to +1443
+      if (
+        row.querySelector(
+          ".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
+        )
+      ) {
+        continue;
+      }
       if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
-        return false;
+        continue;
       }
       if (/^[+()\d\s.-]{7,}$/.test(text)) {
-        return false;
+        continue;
       }
-      return text.length >= 12;
-    }) || null;
+      return text;
+    }
+    return null;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid returning UI-action tokens from cardDescription before real description rows.

If the first matched row is "Share"/"Save"/"Call", this function returns it and the Python cleaner drops it later, so later valid description rows are never considered.

Suggested minimal fix
-      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+      if (
+        /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+      ) {
         continue;
       }
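
The skip-then-return behaviour can be sketched in Python for illustration. The real cardDescription is JavaScript embedded in place_scraper.py and works on DOM rows; this version takes plain strings, and `first_description_row` and its regex constants are hypothetical names. It shows why skipping matters: a UI-action row is passed over rather than returned, so a later descriptive row can still win.

```python
import re
from typing import Iterable, Optional

# Labels taken from the suggested regex above; treated as non-descriptive rows.
_UI_ACTION_RE = re.compile(
    r"^(open|closed|temporarily closed|website|directions|saved in|call|save|saved"
    r"|share|nearby|send to phone|sign in)\b",
    re.IGNORECASE,
)
# Rows made up only of phone-number characters.
_PHONE_RE = re.compile(r"^[+()\d\s.-]{7,}$")


def first_description_row(rows: Iterable[str], excluded: Iterable[str] = ()) -> Optional[str]:
    """Return the first row that looks like a description, or None."""
    excluded_set = {" ".join(value.split()) for value in excluded}
    excluded_set.discard("")
    for row in rows:
        text = " ".join(row.split())  # collapse whitespace, like cleanLine
        if not text or text in excluded_set:
            continue
        if "·" in text:  # rating/category rows use a middle-dot separator
            continue
        if _UI_ACTION_RE.match(text) or _PHONE_RE.match(text):
            continue  # skip instead of returning, so later rows are still considered
        return text
    return None
```

For example, rows of "Share", "Save", "Call" followed by a sentence-like row return the sentence-like row rather than "Share".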

Comment on lines +4347 to +4350
+    normalized_lookup = normalized.casefold()
+    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
+        "sponsored "
+    )

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat exact "Sponsored" as a search-result label too.

Line 4348 only catches values that start with "sponsored " (with a trailing space). A plain "Sponsored" label can still pass name/category cleaning.

Suggested fix
-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )
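
A self-contained sketch of the suggested condition, assuming a small illustrative label set (the real contents of _SEARCH_RESULTS_LABELS are not shown in this thread) and a hypothetical function name without the leading underscore:

```python
# Assumption: illustrative labels only; not the repository's _SEARCH_RESULTS_LABELS.
SEARCH_RESULTS_LABELS = frozenset({"results", "search results"})


def looks_like_search_results_label(value: str) -> bool:
    """True for known list labels, a bare "Sponsored", or "Sponsored <anything>"."""
    normalized = " ".join(value.split()).casefold()
    return (
        normalized in SEARCH_RESULTS_LABELS
        or normalized == "sponsored"
        or normalized.startswith("sponsored ")
    )
```

The trailing space in the prefix check is kept so that names such as "Sponsorship Cafe" are not rejected; the exact-match branch covers the bare label.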

Copilot AI review requested due to automatic review settings May 12, 2026 16:10

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc63505947

Comment thread src/gmaps_scraper/parser.py Outdated
Comment on lines +647 to +648
+    if _parse_list_owner(value) is not None:
+        return None

P1 Badge Preserve singleton CID lists

When a place metadata CID is stored in the common singleton-list shape like ["1234567890123456789"], this guard now treats that list as a list-owner payload because _parse_list_owner() accepts any 1-item list with a non-empty string as name. _find_cid() checks metadata index 6 through this path first, so those places lose their cid and fall back to query-only maps_url/dedupe behavior even though a valid CID is present.
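
A sketch of a narrower guard, under loud assumptions: `Owner`, `parse_list_owner`, `normalize_cid_token`, and the recursive walk are hypothetical stand-ins written for this example, and the list shape `[name, photo_url?, profile_id?]` is invented. Only the guard condition itself (reject a list only when the parsed owner has a photo URL or profile ID) follows the later revision of this function quoted further down the thread.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Owner:
    name: str
    photo_url: Optional[str] = None
    profile_id: Optional[str] = None


def parse_list_owner(value: list) -> Optional[Owner]:
    """Hypothetical stand-in: accepts [name, photo_url?, profile_id?]."""
    if not value or not isinstance(value[0], str) or not value[0]:
        return None
    extras = [item if isinstance(item, str) and item else None for item in value[1:3]]
    extras += [None] * (2 - len(extras))
    return Owner(value[0], extras[0], extras[1])


def normalize_cid_token(token: str) -> Optional[str]:
    """Assumption: a CID token is a non-empty run of ASCII decimal digits."""
    token = token.strip()
    return token if token.isascii() and token.isdigit() else None


def find_cid_in_value(value: Any) -> Optional[str]:
    if isinstance(value, bool):  # bool is an int subclass; never a CID
        return None
    if isinstance(value, int):
        return normalize_cid_token(str(value))
    if isinstance(value, str):
        return normalize_cid_token(value)
    if not isinstance(value, list):
        return None
    owner = parse_list_owner(value)
    # Reject only lists with owner-specific evidence, not bare [token] lists.
    if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):
        return None
    for item in value:
        cid = find_cid_in_value(item)
        if cid is not None:
            return cid
    return None
```

With this guard a singleton list such as ["1234567890123456789"] still yields its CID, while a list that carries a photo URL or profile ID is skipped.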

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines 640 to +648
def _find_cid_in_value(value: JSONValue | None) -> str | None:
    if isinstance(value, int):
        return _normalize_cid_token(str(value))
    if isinstance(value, str):
        return _normalize_cid_token(value)
    if not isinstance(value, list):
        return None
    owner = _parse_list_owner(value)
    if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):
Comment on lines +2493 to +2494
+    search_lines = panel_lines
+    combined_lines = _dedupe_lines(panel_lines)
Copilot AI review requested due to automatic review settings May 12, 2026 23:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +1417 to +1442
+  const cardDescription = (article, excludedValues) => {
     const excluded = new Set(excludedValues.map(cleanLine).filter(Boolean));
-    return lines.find((line) => {
-      const text = cleanLine(line);
+    const rows = Array.from(article.querySelectorAll("div.W4Efsd"));
+    for (const row of rows) {
+      const text = cleanLine(row.innerText || row.textContent || "");
       if (!text || excluded.has(text)) {
-        return false;
+        continue;
       }
       if (text.includes("·") || parseCardRating(text) || parseCardReviewCount(text)) {
-        return false;
+        continue;
       }
+      if (
+        row.querySelector(
+          ".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
+        )
+      ) {
+        continue;
+      }
       if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
-        return false;
+        continue;
       }
       if (/^[+()\d\s.-]{7,}$/.test(text)) {
-        return false;
+        continue;
       }
-      return text.length >= 12;
-    }) || null;
+      return text;
+    }
Comment on lines +640 to +667
const candidates = [];
for (const element of panel.querySelectorAll("div, span")) {
  const text = cleanLine(element.innerText || element.textContent || "");
  if (!text || text.includes("·")) {
    continue;
  }
  if (
    element.closest(
      "button, a, [role='button'], [data-item-id], [data-review-id], div.F7nice",
    )
  ) {
    continue;
  }
  if (
    Array.from(element.children).some(
      (child) => cleanLine(child.innerText || child.textContent || "") === text,
    )
  ) {
    continue;
  }
  const top = elementTop(element);
  if (top === null || top <= titleBottom || top >= boundaryTop) {
    continue;
  }
  candidates.push({top, text});
}
candidates.sort((left, right) => left.top - right.top);
return candidates[0]?.text || null;

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (3)
src/gmaps_scraper/place_scraper.py (3)

4346-4353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle exact "Sponsored" the same way as "Sponsored …".

The new check still lets a plain "Sponsored" label through because it only tests startswith("sponsored ").

Suggested fix
-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 4346 - 4353, The function
_looks_like_search_results_label currently lets a plain "Sponsored" pass because
it only tests normalized_lookup.startswith("sponsored "), so update the
condition inside _looks_like_search_results_label (which uses _clean_text and
_SEARCH_RESULTS_LABELS) to treat the exact "sponsored" the same as "sponsored …"
by checking either normalized_lookup == "sponsored" or
normalized_lookup.startswith("sponsored "), combined with the existing
membership test against _SEARCH_RESULTS_LABELS.

3832-3836: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize curly apostrophes before review-prose matching.

value.casefold() still misses inputs like "we’ve ever had" unless every marker is duplicated in both ASCII and curly forms. That lets targeted review prose through.

Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'").replace("‘", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
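
A runnable sketch of the normalization, assuming a tiny illustrative marker tuple (the real _DESCRIPTION_REVIEW_PROSE_MARKERS is not shown in this thread) and a hypothetical function name:

```python
# Assumption: illustrative markers only, all written with ASCII apostrophes.
REVIEW_PROSE_MARKERS = ("we've", "i've", "my favorite")


def looks_like_review_prose(value: str) -> bool:
    """Flag long, first-person text regardless of apostrophe style."""
    if len(value.split()) < 12:
        return False
    # Map curly single quotes (U+2019, U+2018) to ASCII before matching.
    lowered = value.replace("\u2019", "'").replace("\u2018", "'").casefold()
    return any(marker in lowered for marker in REVIEW_PROSE_MARKERS)
```

Normalizing the input once means the marker list only needs ASCII spellings; casefold() on its own leaves U+2019 unchanged, which is why the curly form slipped through.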
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 3832 - 3836, The function
_looks_like_description_review_prose should normalize curly apostrophes before
doing casefold/matching; update it to replace Unicode curly apostrophes (e.g.,
U+2019 and U+2018) with the ASCII apostrophe (') on the input string (value =
value.replace("\u2019", "'").replace("\u2018", "'")) before calling casefold(),
then perform the existing marker check against _DESCRIPTION_REVIEW_PROSE_MARKERS
so markers like "we've" match even when the input contains curly apostrophes.

1419-1445: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip short UI-action rows before returning search_result_description.

If the first div.W4Efsd row is "Share", "Save", or "Call", this returns it immediately, the Python cleaner drops it later, and any real description row below never gets considered.

Suggested fix
-      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+      if (
+        /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+      ) {
         continue;
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 1419 - 1445, The
cardDescription routine returns short UI-action rows like "Share", "Save", or
"Call" before real descriptions; update cardDescription (used with cleanLine,
parseCardRating, parseCardReviewCount) to explicitly skip common UI-action
labels by adding a check (e.g., a case-insensitive regex for "share", "save",
"call", "directions", "website", "more", etc.) before returning text so those
tokens are continued over and the next non-UI descriptive row is returned for
search_result_description.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce5aad73-7396-4468-a684-de53d766191b

📥 Commits

Reviewing files that changed from the base of the PR and between f79b6a2 and b569056.

📒 Files selected for processing (4)
  • src/gmaps_scraper/parser.py
  • src/gmaps_scraper/place_scraper.py
  • tests/test_parser.py
  • tests/test_place_scraper.py

@michaelmwu michaelmwu changed the title Reject first-person review descriptions Harden Google Maps place extraction May 13, 2026
@michaelmwu michaelmwu merged commit 8a94e9c into main May 13, 2026
4 checks passed
@michaelmwu michaelmwu deleted the codex/reject-ramen-review-description branch May 13, 2026 23:16