Harden Google Maps place extraction #37
📝 Walkthrough
Refined description selection and cleaning: expanded review-prose markers and UI-action-cluster detection, rewrote place-page and search-card extractors to select descriptionValue(), tightened photo selectors/URL normalization, simplified Python extraction/fallbacks, and updated tests and snapshots.
Changes
- Place extraction & cleaning
- Parser owner-payload guard
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 139-143: The matching fails for curly apostrophes (e.g., “we’ve
ever had”) because the review-prose markers list contains straight-apostrophe
strings like "we've ever had"; before running the marker checks (the code that
iterates over strings such as "we've ever had", "we have ever had", "we've ever
had", "overrated", etc.) normalize the review text (or both the text and
markers) by replacing curly/apostrophe variants (e.g., \u2019, \u2018) with the
ASCII apostrophe (') and any similar typographic quotes so that "we’ve ever had"
matches "we've ever had".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b8399c54-fd11-40c5-9e46-9693ab22485f
📒 Files selected for processing (2)
src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py
    "great experience overall",
    "highly recommended",
    "overrated",
    "we have ever had",
    "we've ever had",
Normalize apostrophes before matching review-prose markers.
At Line 143, "we've ever had" won’t match curly-apostrophe text (we’ve ever had), so first-person review prose can still pass through.
Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09ed5db706
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    "highly recommended",
    "overrated",
    "we have ever had",
    "we've ever had",
Normalize curly apostrophes for review-prose markers
When the leaked review uses a typographic apostrophe, e.g. Best ramen we’ve ever had ..., this new marker will not match because _looks_like_description_review_prose() only casefold()s the text and does not replace ’ with ' like _looks_like_review_response_text() does. Since had is not one of the first-person experience verbs, a 12+ word review with we’ve ever had and no other marker can still be accepted as a place description, leaving the class of leak this change is meant to reject.
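The gap described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `_MARKERS` is a hypothetical, trimmed-down stand-in for `_DESCRIPTION_REVIEW_PROSE_MARKERS`, and the translation table is one possible way to fold U+2018/U+2019 into the ASCII apostrophe before matching.

```python
# Hypothetical, shortened marker list; the real tuple in place_scraper.py is longer.
_MARKERS = ("we have ever had", "we've ever had", "overrated")

# Map typographic single quotes (U+2018, U+2019) to the ASCII apostrophe.
_APOSTROPHE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'"})


def looks_like_review_prose(value: str) -> bool:
    """Return True when a 12+ word text contains a review-prose marker."""
    if len(value.split()) < 12:
        return False
    # Normalize apostrophes after casefolding so one canonical marker set suffices.
    lowered = value.casefold().translate(_APOSTROPHE_TABLE)
    return any(marker in lowered for marker in _MARKERS)


sample = "Best ramen we\u2019ve ever had, and the broth was rich and perfectly seasoned."
matched_without_normalizing = any(m in sample.casefold() for m in _MARKERS)
matched_with_normalizing = looks_like_review_prose(sample)
```

With normalization in place, the duplicated curly-quote entries in the marker list would presumably no longer be needed, though that cleanup is outside the scope of this comment.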
Pull request overview
This pull request tightens description cleaning in place_scraper to prevent first-person / review-style prose from being treated as a place description, and adds a regression test for a ramen-class review snippet that previously leaked through.
Changes:
- Extend the review-prose marker list to reject additional review-like phrases (e.g., “great experience overall”, “we’ve ever had”).
- Add a unit test asserting that a ramen-class review-style description is rejected by _clean_description_text().
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/gmaps_scraper/place_scraper.py | Adds new review-prose substring markers used to reject leaked review-style descriptions. |
| tests/test_place_scraper.py | Adds a regression test ensuring a “Best ramen we’ve ever had…”-style snippet is rejected. |
    "highly recommended",
    "overrated",
    "we have ever had",
    "we've ever had",

                "(the noodles at least). Everything tastes better when you do it yourself! "
                "Date day for a Saturday morning class. Great experience overall."
            )
        )
09ed5db to fda07a5
fda07a5 to 7061ef2
♻️ Duplicate comments (1)
src/gmaps_scraper/place_scraper.py (1)
3748-3752: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Normalize curly apostrophes before review-prose marker matching
At Line 3751, matching is case-folded but not apostrophe-normalized, so curly-apostrophe text can bypass straight-apostrophe markers (for example, we’ve vs we've).
Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 3748-3752: The function _looks_like_description_review_prose
currently casefolds the input but doesn't normalize curly apostrophes, so
markers in _DESCRIPTION_REVIEW_PROSE_MARKERS like "we've" won't match "we’ve";
modify _looks_like_description_review_prose to first replace curly apostrophes
(e.g., U+2018/U+2019) with the straight ASCII apostrophe, then casefold that
normalized string before running the any(marker in lowered) test so marker
matching succeeds for both curly and straight apostrophes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cd966b13-2c3e-4591-9766-8e3924fcb1ce
📒 Files selected for processing (2)
src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/gmaps_scraper/place_scraper.py:162
_looks_like_description_review_prose() uses value.casefold() without normalizing typographic apostrophes (e.g. ’). Because the marker list mixes ASCII and curly-quote variants (e.g. both i'd... and i’d...) but only has ASCII for others (e.g. we've ever had), some real-world strings like we’ve ever had will not match and will slip through. Consider normalizing the input (and/or markers) the same way _looks_like_review_response_text() does (e.g. replace ’ with ' before searching) so you can keep a single canonical marker set and avoid missed matches.
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
"boy was it worth",
"best place to stay",
"definitely recommend this place",
"great experience overall",
"hidden gem-literally",
"i forgot his name",
"i'd just finished",
"i’d just finished",
"highly recommended",
"i've tasted",
"i’ve tasted",
"it was my first attempt",
"omfg",
"overrated",
"so yummy",
"the katsu burger",
"the rooms were huge",
"we have ever had",
"we've ever had",
"what a great hotel",
"your children",
"your kids",
"you should",
)
            )
        )
        self.assertIsNone(
            _clean_description_text(
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7061ef2501
    "omfg",
    "overrated",
    "so yummy",
    "the katsu burger",
Avoid rejecting dish-name descriptions
When a legitimate place description mentions a katsu burger in 12+ words, _looks_like_description_review_prose() now discards the entire description solely because it contains the katsu burger. The review fixture this targets would already be rejected by the neighboring omfg / so yummy markers, but this standalone dish-name marker also matches normal restaurant copy such as a description of a signature menu item, causing valid description output to be lost for those places.
7061ef2 to 468ab3a
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 468ab3a0f0
)
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
    "boy was it worth",
    "best place to stay",
Avoid dropping hotel marketing descriptions
Because _looks_like_description_review_prose() applies these markers to any 12+ word description without requiring a first-person pronoun, a legitimate hotel summary such as “The best place to stay in Hanoi for families, with spacious rooms and a central location” is now discarded solely due to this substring. That removes valid description output for affected lodging pages rather than just filtering leaked reviews.
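One way to keep such markers without discarding marketing copy is to require a first-person signal before an ambiguous phrase counts. The sketch below is an illustration under assumptions, not the repository's implementation: the split into `_STRONG_MARKERS` and `_AMBIGUOUS_MARKERS`, the pronoun regex, and the function name are all hypothetical.

```python
import re

# Hypothetical split: strong markers reject on their own, ambiguous ones
# only when the text also contains a first-person pronoun.
_STRONG_MARKERS = ("omfg", "so yummy", "we've ever had")
_AMBIGUOUS_MARKERS = ("best place to stay", "the katsu burger")
_FIRST_PERSON = re.compile(r"\b(i|we|my|our|me|us)\b")


def looks_like_review_prose(value: str) -> bool:
    if len(value.split()) < 12:
        return False
    lowered = value.casefold().replace("\u2019", "'")
    if any(marker in lowered for marker in _STRONG_MARKERS):
        return True
    # Ambiguous phrases need a first-person pronoun to count as review prose.
    if _FIRST_PERSON.search(lowered) is None:
        return False
    return any(marker in lowered for marker in _AMBIGUOUS_MARKERS)


hotel_copy = (
    "The best place to stay in Hanoi for families, "
    "with spacious rooms and a central location"
)
guest_review = (
    "This was the best place to stay on our trip "
    "and we would happily come back again"
)
```

Under these assumptions the hotel summary quoted in the comment is kept while the first-person variant is still rejected. Simply removing the ambiguous markers would be an equally valid and smaller change.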
468ab3a to a749f22
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_place_scraper.py`:
- Around line 1541-1545: Several long string literals in the new test cases
(e.g., "The lady was just so so lovely. My feet are just gorgeous. Would
recommend to everyone.", "My stay in Alila was wonderful. Special shout out to
the staff for making it memorable.", "The hotel have a sense of peace and
tranquility once step in. The personal service was delicate.", and "Directions
Save Nearby Send to phone Share About this data Get the most out of Google Maps
Sign in") exceed 100 characters causing E501 failures; fix by breaking each long
literal into shorter pieces using implicit string concatenation inside
parentheses or by splitting into two quoted parts joined with + so each source
line is under 100 chars (locate the list/array of test strings in
tests/test_place_scraper.py by searching for any of the quoted strings above and
update those entries).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dc334091-d552-44c9-9021-fc836c89e07a
📒 Files selected for processing (2)
src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py
            "The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
            "My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
            "The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
            "The staffs also offered great recommendation for drinks based on your preference.",
            "Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",
Fix E501 line-length failures in the new looped test cases
CI is failing on Line 1541, Line 1542, Line 1543, and Line 1545 due to >100-char literals.
Suggested minimal formatting fix
- "The lady was just so so lovely. My feet are just gorgeous. Would recommend to everyone.",
- "My stay in Alila was wonderful. Special shout out to the staff for making it memorable.",
- "The hotel have a sense of peace and tranquility once step in. The personal service was delicate.",
- "The staffs also offered great recommendation for drinks based on your preference.",
- "Directions Save Nearby Send to phone Share About this data Get the most out of Google Maps Sign in",
+ (
+ "The lady was just so so lovely. My feet are just gorgeous. "
+ "Would recommend to everyone."
+ ),
+ (
+ "My stay in Alila was wonderful. Special shout out to the staff "
+ "for making it memorable."
+ ),
+ (
+ "The hotel have a sense of peace and tranquility once step in. "
+ "The personal service was delicate."
+ ),
+ (
+ "The staffs also offered great recommendation for drinks based on "
+ "your preference."
+ ),
+ (
+ "Directions Save Nearby Send to phone Share About this data "
+ "Get the most out of Google Maps Sign in"
+ ),
🧰 Tools
🪛 GitHub Actions: CI / 2_lint.txt
[error] 1541-1541: E501 Line too long (102 > 100) reported by lint (flake8/pycodestyle style check).
[error] 1542-1542: E501 Line too long (102 > 100) reported by lint (flake8/pycodestyle style check).
[error] 1543-1543: E501 Line too long (111 > 100) reported by lint (flake8/pycodestyle style check).
[error] 1545-1545: E501 Line too long (113 > 100) reported by lint (flake8/pycodestyle style check).
🪛 GitHub Actions: CI / lint
[error] 1541-1541: Lint (E501): Line too long (102 > 100).
[error] 1542-1542: Lint (E501): Line too long (102 > 100).
[error] 1543-1543: Lint (E501): Line too long (111 > 100).
[error] 1545-1545: Lint (E501): Line too long (113 > 100).
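The suggested wrapping relies on Python joining adjacent string literals at compile time, so the test data is unchanged. A small sketch, using one of the flagged fixture strings; the variable names are illustrative only:

```python
# Adjacent string literals inside parentheses are concatenated by the parser,
# so this is a single str value, not a tuple.
wrapped = (
    "The hotel have a sense of peace and tranquility once step in. "
    "The personal service was delicate."
)

# Equivalent explicit concatenation, shown only for comparison.
first = "The hotel have a sense of peace and tranquility once step in. "
second = "The personal service was delicate."
explicit = first + second

same_value = wrapped == explicit
```

Note the trailing space at the end of the first piece: dropping it would silently change the fixture text, which is the main thing to check when applying the wrap.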
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
    "boy was it worth",
    "best place to stay",
    "definitely recommend this place",
    "great experience overall",
    "had a great time",
    "hidden gem-literally",
    "i forgot his name",
    "i'd just finished",
    "i’d just finished",
    "highly recommended",
    "i've tasted",
    "i’ve tasted",
    "it was my first attempt",
    "my stay in",
    "once step in",
    "offered great recommendation",
    "omfg",
    "overrated",
    "so yummy",
    "the katsu burger",
    "the rooms were huge",
    "we have ever had",
    "we've ever had",
    "what a great hotel",
    "would recommend to everyone",
    "about this data",
    "get the most out of google maps",
    "your children",
    "your kids",
    "you should",
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a749f22375
    "about this data",
    "get the most out of google maps",
Move chrome markers out of the review-prose gate
These Google Maps chrome markers are added to _DESCRIPTION_REVIEW_PROSE_MARKERS, but _looks_like_description_review_prose() returns False for any value under 12 words before checking this list. When the page text is just the common combined footer phrase About this data Get the most out of Google Maps (9 words) rather than the longer fixture with Directions/Save prefixes, _clean_description_text() still accepts it as a description because it is not an exact stop-marker match. Put these UI/footer phrases in a substring stop list or check them before the review-prose length gate.
a749f22 to f79b6a2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f79b6a241a
     panel_lines = _body_lines(snapshot.get("panel_text"))
     body_lines = _body_lines(snapshot.get("body_text"))
-    search_lines = panel_lines or body_lines
-    combined_lines = _dedupe_lines([*panel_lines, *body_lines])
+    search_lines = panel_lines
+    combined_lines = _dedupe_lines(panel_lines)
Preserve search-card panel lines for fallback extraction
When a /maps/search result is selected but opening the place page fails, _search_result_snapshot() still copies the card text only as body_text, while this change makes _build_place_details() ignore body_text and build all line-based fallbacks from panel_text only. In that fallback path the selected card's lines are therefore invisible to _extract_status_from_lines, _extract_phone_from_lines, _extract_plus_code_from_lines, etc., even though the search-result JS now emits panel_text; copy panel_text through or include the search-card body_text in this narrow path.
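The second option in the comment (include the search-card `body_text` only in the narrow fallback path) might look roughly like this. Everything here is an assumption for illustration: `body_lines` is a stand-in for `_body_lines`, and the `"source"` key used to recognize a search-card snapshot is hypothetical, since the real snapshot shape is not shown in this thread.

```python
from typing import List, Optional


def body_lines(text: Optional[str]) -> List[str]:
    """Stand-in for _body_lines: split into stripped, non-empty lines."""
    return [line.strip() for line in (text or "").splitlines() if line.strip()]


def fallback_lines(snapshot: dict) -> List[str]:
    """Prefer panel_text; use body_text only for search-card snapshots."""
    panel = body_lines(snapshot.get("panel_text"))
    if panel:
        return panel
    # Hypothetical flag marking a snapshot built from a selected search card.
    if snapshot.get("source") == "search_card":
        return body_lines(snapshot.get("body_text"))
    return []


card_snapshot = {"panel_text": None, "body_text": "Ramen Bar\nOpen now", "source": "search_card"}
page_snapshot = {"panel_text": "Open now\n+1 555 0100", "body_text": "page chrome"}
```

Copying `panel_text` through in `_search_result_snapshot()` would avoid the extra branch entirely and may be the smaller fix.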
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 4347-4350: The check currently only matches labels that start with
"sponsored " and misses the exact "sponsored" label; update the return condition
in the block that computes normalized_lookup (after normalized.casefold()) to
also treat an exact match by adding a check like normalized_lookup ==
"sponsored" (in addition to the existing startswith check) when testing against
_SEARCH_RESULTS_LABELS so both "sponsored" and values starting with "sponsored "
are considered search-result labels.
- Around line 1428-1443: The cardDescription function currently returns the
first matching row text, which can be UI-action tokens like
"Share"/"Save"/"Call" and prevent real description rows from being considered;
update the loop in cardDescription to treat short UI action tokens as
non-descriptive by continuing instead of returning: add a check against common
UI verbs/labels (e.g., /^((share|save|call|website|directions|get
directions|saved in|view menu|write a review)\b)/i or a small-word-length
heuristic) using the existing text variable and rows iteration, and only return
text when it does not match those UI-action patterns and looks like a real
description.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 37f8fa43-135d-40b4-b6af-2b0c5ecb44b4
📒 Files selected for processing (2)
src/gmaps_scraper/place_scraper.py
tests/test_place_scraper.py
+    if (
+      row.querySelector(
+        ".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
+      )
+    ) {
+      continue;
+    }
     if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
-      return false;
+      continue;
     }
     if (/^[+()\d\s.-]{7,}$/.test(text)) {
-      return false;
+      continue;
     }
-    return text.length >= 12;
-  }) || null;
+    return text;
+  }
+  return null;
Avoid returning UI-action tokens from cardDescription before real description rows.
If the first matched row is "Share"/"Save"/"Call", this function returns it and the Python cleaner drops it later, so later valid description rows are never considered.
Suggested minimal fix
-    if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+    if (
+      /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+    ) {
       continue;
     }
    normalized_lookup = normalized.casefold()
    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
        "sponsored "
    )
Treat exact "Sponsored" as a search-result label too.
Line 4348 only catches values that start with "sponsored " (with a trailing space). A plain "Sponsored" label can still pass name/category cleaning.
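A standalone check makes the edge case easy to see. This is a sketch only: the contents of `_SEARCH_RESULTS_LABELS` below are made up for the example, and the whitespace collapse stands in for whatever `_clean_text` does in the real module.

```python
# Hypothetical label set; the real _SEARCH_RESULTS_LABELS is not shown in this thread.
_SEARCH_RESULTS_LABELS = frozenset({"results", "search results"})


def looks_like_search_results_label(value: str) -> bool:
    # Collapse whitespace and casefold, approximating the real normalization.
    lookup = " ".join(value.split()).casefold()
    return (
        lookup in _SEARCH_RESULTS_LABELS
        or lookup == "sponsored"  # exact label, no trailing text
        or lookup.startswith("sponsored ")  # label followed by more text
    )


plain = looks_like_search_results_label("Sponsored")
prefixed = looks_like_search_results_label("Sponsored · Ramen")
lookalike = looks_like_search_results_label("Sponsorship Cafe")
```

Keeping the trailing space in the `startswith` test matters: without it, a business name that merely begins with "Sponsor" would be treated as a label.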
Suggested fix
-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc63505947
    if _parse_list_owner(value) is not None:
        return None
When a place metadata CID is stored in the common singleton-list shape like ["1234567890123456789"], this guard now treats that list as a list-owner payload because _parse_list_owner() accepts any 1-item list with a non-empty string as name. _find_cid() checks metadata index 6 through this path first, so those places lose their cid and fall back to query-only maps_url/dedupe behavior even though a valid CID is present.
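A narrower guard, which only skips lists that carry real owner evidence, can be sketched like this. The helpers are simplified stand-ins: the `[name, photo_url, profile_id]` list shape, the `Owner` dataclass, and the digits-only CID rule in `normalize_cid_token` are assumptions made for the example, not the parser's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Owner:
    name: str
    photo_url: Optional[str] = None
    profile_id: Optional[str] = None


def parse_list_owner(value) -> Optional[Owner]:
    """Stand-in: accepts any list whose first item is a non-empty string."""
    if not isinstance(value, list) or not value:
        return None
    name = value[0]
    if not isinstance(name, str) or not name:
        return None
    photo_url = value[1] if len(value) > 1 and isinstance(value[1], str) else None
    profile_id = value[2] if len(value) > 2 and isinstance(value[2], str) else None
    return Owner(name, photo_url, profile_id)


def normalize_cid_token(token: str) -> Optional[str]:
    """Hypothetical rule: a CID is a run of at least 10 decimal digits."""
    token = token.strip()
    return token if token.isdigit() and len(token) >= 10 else None


def find_cid_in_value(value) -> Optional[str]:
    if isinstance(value, int):
        return normalize_cid_token(str(value))
    if isinstance(value, str):
        return normalize_cid_token(value)
    if not isinstance(value, list):
        return None
    owner = parse_list_owner(value)
    # Skip only lists with owner evidence beyond a bare name.
    if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):
        return None
    for item in value:
        found = find_cid_in_value(item)
        if found is not None:
            return found
    return None


singleton = find_cid_in_value(["1234567890123456789"])
owner_payload = find_cid_in_value(["Jane Doe", "https://example.com/p.jpg", "1234567890123456789"])
```

With this shape of guard, the singleton-list CID survives while a payload that also carries a photo URL or profile id is still ignored.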
def _find_cid_in_value(value: JSONValue | None) -> str | None:
    if isinstance(value, int):
        return _normalize_cid_token(str(value))
    if isinstance(value, str):
        return _normalize_cid_token(value)
    if not isinstance(value, list):
        return None
    owner = _parse_list_owner(value)
    if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):

    search_lines = panel_lines
    combined_lines = _dedupe_lines(panel_lines)
 const cardDescription = (article, excludedValues) => {
   const excluded = new Set(excludedValues.map(cleanLine).filter(Boolean));
-  return lines.find((line) => {
-    const text = cleanLine(line);
+  const rows = Array.from(article.querySelectorAll("div.W4Efsd"));
+  for (const row of rows) {
+    const text = cleanLine(row.innerText || row.textContent || "");
     if (!text || excluded.has(text)) {
-      return false;
+      continue;
     }
     if (text.includes("·") || parseCardRating(text) || parseCardReviewCount(text)) {
-      return false;
+      continue;
     }
+    if (
+      row.querySelector(
+        ".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
+      )
+    ) {
+      continue;
+    }
     if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
-      return false;
+      continue;
     }
     if (/^[+()\d\s.-]{7,}$/.test(text)) {
-      return false;
+      continue;
     }
-    return text.length >= 12;
-  }) || null;
+    return text;
+  }
  const candidates = [];
  for (const element of panel.querySelectorAll("div, span")) {
    const text = cleanLine(element.innerText || element.textContent || "");
    if (!text || text.includes("·")) {
      continue;
    }
    if (
      element.closest(
        "button, a, [role='button'], [data-item-id], [data-review-id], div.F7nice",
      )
    ) {
      continue;
    }
    if (
      Array.from(element.children).some(
        (child) => cleanLine(child.innerText || child.textContent || "") === text,
      )
    ) {
      continue;
    }
    const top = elementTop(element);
    if (top === null || top <= titleBottom || top >= boundaryTop) {
      continue;
    }
    candidates.push({top, text});
  }
  candidates.sort((left, right) => left.top - right.top);
  return candidates[0]?.text || null;
♻️ Duplicate comments (3)
src/gmaps_scraper/place_scraper.py (3)
4346-4353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Handle exact `Sponsored` the same way as `Sponsored …`.
The new check still lets a plain `"Sponsored"` label through because it only tests `startswith("sponsored ")`.
Suggested fix
-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@src/gmaps_scraper/place_scraper.py` around lines 4346 - 4353, The function _looks_like_search_results_label currently lets a plain "Sponsored" pass because it only tests normalized_lookup.startswith("sponsored "), so update the condition inside _looks_like_search_results_label (which uses _clean_text and _SEARCH_RESULTS_LABELS) to treat the exact "sponsored" the same as "sponsored …" by checking either normalized_lookup == "sponsored" or normalized_lookup.startswith("sponsored "), combined with the existing membership test against _SEARCH_RESULTS_LABELS.
3832-3836: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Normalize curly apostrophes before review-prose matching.
`value.casefold()` still misses inputs like `we’ve ever had` unless every marker is duplicated in both ASCII and curly forms. That lets targeted review prose through.
Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'").replace("‘", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@src/gmaps_scraper/place_scraper.py` around lines 3832 - 3836, The function _looks_like_description_review_prose should normalize curly apostrophes before doing casefold/matching; update it to replace Unicode curly apostrophes (e.g., U+2019 and U+2018) with the ASCII apostrophe (') on the input string (value = value.replace("\u2019", "'").replace("\u2018", "'")) before calling casefold(), then perform the existing marker check against _DESCRIPTION_REVIEW_PROSE_MARKERS so markers like "we've" match even when the input contains curly apostrophes.
1419-1445: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Skip short UI-action rows before returning `search_result_description`.
If the first `div.W4Efsd` row is `"Share"`, `"Save"`, or `"Call"`, this returns it immediately, the Python cleaner drops it later, and any real description row below never gets considered.
Suggested fix
-      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+      if (
+        /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+      ) {
         continue;
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
In `@src/gmaps_scraper/place_scraper.py` around lines 1419 - 1445, The cardDescription routine returns short UI-action rows like "Share", "Save", or "Call" before real descriptions; update cardDescription (used with cleanLine, parseCardRating, parseCardReviewCount) to explicitly skip common UI-action labels by adding a check (e.g., a case-insensitive regex for "share", "save", "call", "directions", "website", "more", etc.) before returning text so those tokens are continued over and the next non-UI descriptive row is returned for search_result_description.
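The skip-and-continue behavior can be illustrated with a Python port of the row loop. This is a sketch only: the real routine is JavaScript evaluated in the page, `first_description_row` is a hypothetical name, rows are modeled as plain strings, and the `parseCardRating` / `parseCardReviewCount` and DOM selector checks are left out. The label list is copied from the suggested regex.

```python
import re

# Label list copied from the suggested regex; the name is hypothetical.
_UI_ACTION_ROW = re.compile(
    r"^(open|closed|temporarily closed|website|directions|saved in|call|save"
    r"|saved|share|nearby|send to phone|sign in)\b",
    re.IGNORECASE,
)
# Rows made up only of phone-number characters, as in the original check.
_PHONE_ROW = re.compile(r"^[+()\d\s.-]{7,}$")


def first_description_row(rows, excluded=()):
    """Return the first row that is not metadata or a UI action, else None."""
    excluded_set = {" ".join(v.split()) for v in excluded if v}
    for row in rows:
        text = " ".join(row.split())
        if not text or text in excluded_set:
            continue
        if "·" in text:  # rating / category / price rows
            continue
        if _UI_ACTION_ROW.match(text) or _PHONE_ROW.match(text):
            continue  # keep scanning instead of returning a UI label
        return text
    return None
```

One trade-off of a prefix regex: a genuine description that starts with one of these words (for example "Open-air rooftop bar") is skipped as well, so the label list is worth keeping short.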
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/gmaps_scraper/place_scraper.py`:
- Around line 4346-4353: The function _looks_like_search_results_label currently
lets a plain "Sponsored" pass because it only tests
normalized_lookup.startswith("sponsored "), so update the condition inside
_looks_like_search_results_label (which uses _clean_text and
_SEARCH_RESULTS_LABELS) to treat the exact "sponsored" the same as "sponsored …"
by checking either normalized_lookup == "sponsored" or
normalized_lookup.startswith("sponsored "), combined with the existing
membership test against _SEARCH_RESULTS_LABELS.
- Around line 3832-3836: The function _looks_like_description_review_prose
should normalize curly apostrophes before doing casefold/matching; update it to
replace Unicode curly apostrophes (e.g., U+2019 and U+2018) with the ASCII
apostrophe (') on the input string (value = value.replace("\u2019",
"'").replace("\u2018", "'")) before calling casefold(), then perform the
existing marker check against _DESCRIPTION_REVIEW_PROSE_MARKERS so markers like
"we've" match even when the input contains curly apostrophes.
- Around line 1419-1445: The cardDescription routine returns short UI-action
rows like "Share", "Save", or "Call" before real descriptions; update
cardDescription (used with cleanLine, parseCardRating, parseCardReviewCount) to
explicitly skip common UI-action labels by adding a check (e.g., a
case-insensitive regex for "share", "save", "call", "directions", "website",
"more", etc.) before returning text so those tokens are continued over and the
next non-UI descriptive row is returned for search_result_description.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ce5aad73-7396-4468-a684-de53d766191b
📒 Files selected for processing (4)
src/gmaps_scraper/parser.py
src/gmaps_scraper/place_scraper.py
tests/test_parser.py
tests/test_place_scraper.py
Summary
Validation