3 changes: 3 additions & 0 deletions src/gmaps_scraper/parser.py
@@ -644,6 +644,9 @@ def _find_cid_in_value(value: JSONValue | None) -> str | None:
return _normalize_cid_token(value)
if not isinstance(value, list):
return None
owner = _parse_list_owner(value)
if owner is not None and (owner.photo_url is not None or owner.profile_id is not None):
return None

numeric_texts = [
text
193 changes: 151 additions & 42 deletions src/gmaps_scraper/place_scraper.py
@@ -99,13 +99,21 @@
"share",
"website",
}
_UI_ACTION_CLUSTER_LABEL_PATTERN = (
r"(?:call|directions|save|saved|nearby|send to phone|share|website|sign in)"
)
_UI_ACTION_CLUSTER_RE = re.compile(
rf"^{_UI_ACTION_CLUSTER_LABEL_PATTERN}(?:\s+{_UI_ACTION_CLUSTER_LABEL_PATTERN}){{2,}}$",
re.IGNORECASE,
)
_DESCRIPTION_STOP_MARKERS = {
"photos",
"about this data",
"sponsored",
"write a review",
"claim this business",
"suggest an edit",
"overview reviews about",
"limited view of google maps",
"get the most out of google maps",
"our policies do not permit contributions to this type of place.",
@@ -136,8 +144,33 @@
"i'm sorry to inform",
)
_DESCRIPTION_REVIEW_PROSE_MARKERS = (
"best place to stay",
P2 Badge Avoid dropping hotel marketing descriptions

Because _looks_like_description_review_prose() applies these markers to any 12+ word description without requiring a first-person pronoun, a legitimate hotel summary such as “The best place to stay in Hanoi for families, with spacious rooms and a central location” is now discarded solely due to this substring. That removes valid description output for affected lodging pages rather than just filtering leaked reviews.
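
One possible way to address this, sketched under assumptions: keep a short list of strong markers that always reject, and let ambiguous markers reject only when the text also contains a first-person pronoun. The names `_STRONG_MARKERS`, `_AMBIGUOUS_MARKERS`, `_FIRST_PERSON_RE`, and `looks_like_review_prose` are hypothetical, and the split of markers between the two tuples is illustrative rather than taken from the PR.

```python
import re

# Hypothetical split of the marker list; the PR keeps a single tuple.
_STRONG_MARKERS = ("omfg", "so yummy", "i forgot his name")
_AMBIGUOUS_MARKERS = ("best place to stay", "highly recommended")

# First-person pronouns as whole words. The apostrophe in a contraction
# such as "i've" is a non-word character, so "\bi\b" still matches there.
_FIRST_PERSON_RE = re.compile(r"\b(?:i|we|my|our|me|us)\b")


def looks_like_review_prose(value: str) -> bool:
    """Return True when a 12+ word text reads like a leaked review."""
    if len(value.split()) < 12:
        return False
    lowered = value.casefold().replace("\u2019", "'")
    if any(marker in lowered for marker in _STRONG_MARKERS):
        return True
    has_first_person = _FIRST_PERSON_RE.search(lowered) is not None
    return has_first_person and any(
        marker in lowered for marker in _AMBIGUOUS_MARKERS
    )
```

With this gate, the hotel summary quoted above would pass, while the same phrase inside a sentence that starts with "We stayed" would still be rejected.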

"boy was it worth",
"definitely recommend this place",
"great experience overall",
"had a great time",
"hidden gem-literally",
"highly recommended",
"i forgot his name",
"i'd just finished",
"i've tasted",
"i’d just finished",
"i’ve tasted",
"it was my first attempt",
"my stay in",
"once step in",
"offered great recommendation",
"omfg",
"overrated",
"so yummy",
"the katsu burger",
P2 Badge Avoid rejecting dish-name descriptions

When a legitimate place description of 12+ words mentions a katsu burger, _looks_like_description_review_prose() now discards the entire description solely because it contains "the katsu burger". The review fixture this targets would already be rejected by the neighboring "omfg" / "so yummy" markers, but this standalone dish-name marker also matches normal restaurant copy, such as a description of a signature menu item, so valid description output is lost for those places.

"the rooms were huge",
"we have ever had",
"we've ever had",
Comment on lines +150 to +169
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize apostrophes before matching review-prose markers.

At Line 143, "we've ever had" won’t match curly-apostrophe text (we’ve ever had), so first-person review prose can still pass through.

Suggested fix
 def _looks_like_description_review_prose(value: str) -> bool:
     if len(value.split()) < 12:
         return False
-    lowered = value.casefold()
+    lowered = value.casefold().replace("’", "'")
     return any(marker in lowered for marker in _DESCRIPTION_REVIEW_PROSE_MARKERS)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 139 - 143, The matching
fails for curly apostrophes (e.g., “we’ve ever had”) because the review-prose
markers list contains straight-apostrophe strings like "we've ever had"; before
running the marker checks (the code that iterates over strings such as "we've
ever had", "we have ever had", "we've ever had", "overrated", etc.) normalize
the review text (or both the text and markers) by replacing curly/apostrophe
variants (e.g., \u2019, \u2018) with the ASCII apostrophe (') and any similar
typographic quotes so that "we’ve ever had" matches "we've ever had".
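
A minimal, self-contained sketch of the normalization described above. The helper name `matches_review_marker` and the `_APOSTROPHE_TABLE` constant are hypothetical; the three markers are copied from the list in this diff, and the 12-word gate mirrors the suggested fix.

```python
# Straight-apostrophe markers, copied from the list in this diff.
_MARKERS = ("we've ever had", "i've tasted", "i'd just finished")

# Map typographic single quotes (U+2018, U+2019) to the ASCII apostrophe.
_APOSTROPHE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'"})


def matches_review_marker(value: str) -> bool:
    """Check markers after folding case and normalizing apostrophes."""
    if len(value.split()) < 12:
        return False
    lowered = value.casefold().translate(_APOSTROPHE_TABLE)
    return any(marker in lowered for marker in _MARKERS)
```

If the text is normalized this way, the curly-apostrophe duplicates in the marker tuple ("i’d just finished", "i’ve tasted") become redundant, since every input is compared in its straight-apostrophe form.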

P2 Badge Normalize curly apostrophes for review-prose markers

When the leaked review uses a typographic apostrophe, e.g. Best ramen we’ve ever had ..., this new marker will not match because _looks_like_description_review_prose() only casefold()s the text and does not replace ’ with ' like _looks_like_review_response_text() does. Since had is not one of the first-person experience verbs, a 12+ word review with we’ve ever had and no other marker can still be accepted as a place description, leaving open the class of leak this change is meant to reject.

"what a great hotel",
"would recommend to everyone",
"about this data",
"get the most out of google maps",
Comment on lines +172 to +173
P2 Badge Move chrome markers out of the review-prose gate

These Google Maps chrome markers are added to _DESCRIPTION_REVIEW_PROSE_MARKERS, but _looks_like_description_review_prose() returns False for any value under 12 words before checking this list. When the page text is just the common combined footer phrase "About this data Get the most out of Google Maps" (10 words) rather than the longer fixture with Directions/Save prefixes, _clean_description_text() still accepts it as a description because it is not an exact stop-marker match. Put these UI/footer phrases in a substring stop list or check them before the review-prose length gate.
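
A rough sketch of the ordering this comment asks for, assuming a separate substring stop list that is checked before any word-count gate. `_CHROME_SUBSTRINGS`, `_REVIEW_MARKERS`, and `clean_description` are hypothetical names, and the function is a simplified stand-in for `_clean_description_text()`, not its actual body.

```python
from __future__ import annotations

# Hypothetical substring stop list for Google Maps footer/chrome text.
_CHROME_SUBSTRINGS = ("about this data", "get the most out of google maps")

# Stand-in for the review-prose markers, which stay behind the length gate.
_REVIEW_MARKERS = ("omfg", "so yummy")


def clean_description(value: str) -> str | None:
    """Reject chrome text at any length, review prose only at 12+ words."""
    normalized = " ".join(value.split())
    if not normalized:
        return None
    lowered = normalized.casefold()
    # Chrome check runs first, so short footer phrases are still rejected.
    if any(marker in lowered for marker in _CHROME_SUBSTRINGS):
        return None
    if len(lowered.split()) >= 12 and any(
        marker in lowered for marker in _REVIEW_MARKERS
    ):
        return None
    return normalized
```

Keeping the two lists separate also makes the intent of each check easier to read: one filters page chrome, the other filters leaked review text.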

"your children",
"your kids",
"you should",
@@ -573,6 +606,68 @@
}
return null;
};
const elementTop = (element) => {
const rect = element?.getBoundingClientRect?.();
return rect && rect.height > 0 ? rect.top : null;
};
const elementBottom = (element) => {
const rect = element?.getBoundingClientRect?.();
return rect && rect.height > 0 ? rect.bottom : null;
};
const descriptionBoundaryTop = () => {
const rows = Array.from(panel.querySelectorAll("[data-item-id]"))
.map(elementTop)
.filter((value) => value !== null);
if (rows.length > 0) {
return Math.min(...rows);
}
const addressRow = addressRowElement();
const addressTop = elementTop(addressRow);
return addressTop === null ? Infinity : addressTop;
};
const descriptionValue = () => {
const direct = firstText([".WeS02d", ".PYvSYb"]);
if (direct) {
return direct;
}
const titleBottom = Math.max(
...[
elementBottom(titleElement),
...Array.from(panel.querySelectorAll("div.F7nice")).map(elementBottom),
].filter((value) => value !== null),
0,
);
const boundaryTop = descriptionBoundaryTop();
const candidates = [];
for (const element of panel.querySelectorAll("div, span")) {
const text = cleanLine(element.innerText || element.textContent || "");
if (!text || text.includes("·")) {
continue;
}
if (
element.closest(
"button, a, [role='button'], [role='tab'], [role='tablist'], "
+ "[data-item-id], [data-review-id], div.F7nice",
)
) {
continue;
}
if (
Array.from(element.children).some(
(child) => cleanLine(child.innerText || child.textContent || "") === text,
)
) {
continue;
}
const top = elementTop(element);
if (top === null || top <= titleBottom || top >= boundaryTop) {
continue;
}
candidates.push({top, text});
}
candidates.sort((left, right) => left.top - right.top);
return candidates[0]?.text || null;
};

const normalizeCount = (value) => {
if (!value) {
@@ -683,22 +778,19 @@
}

const mainPhotoUrl = firstImageUrl([
"div.RZ66Rb button[jsaction*='heroHeaderImage'] img",
"button[jsaction*='heroHeaderImage'] img",
"button[aria-label^='Photo of'] img",
"button[aria-label^='写真'] img",
"button[jsaction*='image'] img",
"button[jsaction*='photo'] img",
"div.ZKCDEc [data-photo-index='0'] img",
"[data-photo-index='0'] img",
"[data-photo-index] img",
], document)
])
|| firstBackgroundImageUrl([
"button[jsaction*='image']",
"button[jsaction*='photo']",
"div.RZ66Rb button[jsaction*='heroHeaderImage']",
"button[jsaction*='heroHeaderImage']",
"div.ZKCDEc [data-photo-index='0']",
"[data-photo-index='0']",
"[data-photo-index]",
"[aria-label*='Photo']",
"[aria-label*='photo']",
"[aria-label*='写真']",
"[aria-label*='画像']",
], document);
]);
const photoUrl = mainPhotoUrl
|| firstAttr(["meta[property='og:image']", "meta[itemprop='image']"], "content", document);

@@ -1033,7 +1125,7 @@
"button[data-item-id^='phone:']",
]),
plus_code: itemValue("oloc"),
description: firstText([".WeS02d", ".PYvSYb"]),
description: descriptionValue(),
review_topics: collectReviewTopics(),
admission_prices: collectLeafPrices(sectionRootByHeading([
"Admission",
@@ -1324,24 +1416,33 @@
}
return {category: null, address: null};
};
const findDescriptionLine = (lines, excludedValues) => {
const cardDescription = (article, excludedValues) => {
const excluded = new Set(excludedValues.map(cleanLine).filter(Boolean));
return lines.find((line) => {
const text = cleanLine(line);
const rows = Array.from(article.querySelectorAll("div.W4Efsd"));
for (const row of rows) {
const text = cleanLine(row.innerText || row.textContent || "");
if (!text || excluded.has(text)) {
return false;
continue;
}
if (text.includes("·") || parseCardRating(text) || parseCardReviewCount(text)) {
return false;
continue;
}
if (
row.querySelector(
".AJB7ye, .UsdlK, [role='img'][aria-label*='star' i], a[href^='tel:']",
)
) {
continue;
}
if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
return false;
continue;
}
if (/^[+()\d\s.-]{7,}$/.test(text)) {
return false;
continue;
}
return text.length >= 12;
}) || null;
return text;
}
return null;
Comment on lines +1430 to +1445
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid returning UI-action tokens from cardDescription before real description rows.

If the first matched row is "Share"/"Save"/"Call", this function returns it and the Python cleaner drops it later, so later valid description rows are never considered.

Suggested minimal fix
-      if (/^(open|closed|temporarily closed|website|directions|saved in)\b/i.test(text)) {
+      if (
+        /^(open|closed|temporarily closed|website|directions|saved in|call|save|saved|share|nearby|send to phone|sign in)\b/i.test(text)
+      ) {
         continue;
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 1428 - 1443, The
cardDescription function currently returns the first matching row text, which
can be UI-action tokens like "Share"/"Save"/"Call" and prevent real description
rows from being considered; update the loop in cardDescription to treat short UI
action tokens as non-descriptive by continuing instead of returning: add a check
against common UI verbs/labels (e.g., /^((share|save|call|website|directions|get
directions|saved in|view menu|write a review)\b)/i or a small-word-length
heuristic) using the existing text variable and rows iteration, and only return
text when it does not match those UI-action patterns and looks like a real
description.
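
The "skip, do not return" behavior can be illustrated with a Python analogue of the row loop. The real `cardDescription` is JavaScript embedded in the scraper, so this is only a sketch of the control flow; `_UI_ROW_RE` and `first_description_row` are hypothetical names, and the label list is the one from the suggested fix above.

```python
from __future__ import annotations

import re

# Labels from the suggested fix; a row that starts with one is skipped.
_UI_ROW_RE = re.compile(
    r"^(?:open|closed|temporarily closed|website|directions|saved in"
    r"|call|save|saved|share|nearby|send to phone|sign in)\b",
    re.IGNORECASE,
)


def first_description_row(rows: list[str]) -> str | None:
    """Return the first row that is not a UI label, or None."""
    for row in rows:
        text = " ".join(row.split())
        if not text or _UI_ROW_RE.match(text):
            continue  # keep scanning instead of returning a UI token
        return text
    return None
```

One trade-off worth noting: because the pattern is a prefix match, a genuine description that happens to begin with "Save" or "Open" would also be skipped. Matching the whole row instead would be stricter but would miss labels followed by extra text.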

};
const safeDecodeURIComponent = (value) => {
try {
@@ -1373,10 +1474,11 @@
review_count: reviewCount,
category: categoryAddress.category,
address: categoryAddress.address,
search_result_description: findDescriptionLine(
lines,
search_result_description: cardDescription(
article,
[name, categoryAddress.category, categoryAddress.address],
),
panel_text: lines.join("\n"),
body_text: lines.join("\n"),
};
};
@@ -2321,6 +2423,7 @@ def _search_result_snapshot(candidate: str | Mapping[str, object]) -> dict[str,
"category",
"address",
"search_result_description",
"panel_text",
"body_text",
):
value = candidate.get(key)
@@ -2390,9 +2493,8 @@ def _build_place_details(
snapshot: Mapping[str, object],
) -> PlaceDetails:
panel_lines = _body_lines(snapshot.get("panel_text"))
body_lines = _body_lines(snapshot.get("body_text"))
search_lines = panel_lines or body_lines
combined_lines = _dedupe_lines([*panel_lines, *body_lines])
search_lines = panel_lines
combined_lines = _dedupe_lines(panel_lines)
Comment on lines 2495 to +2497
P2 Badge Preserve search-card panel lines for fallback extraction

When a /maps/search result is selected but opening the place page fails, _search_result_snapshot() still copies the card text only as body_text, while this change makes _build_place_details() ignore body_text and build all line-based fallbacks from panel_text only. In that fallback path the selected card's lines are therefore invisible to _extract_status_from_lines, _extract_phone_from_lines, _extract_plus_code_from_lines, etc., even though the search-result JS now emits panel_text; copy panel_text through or include the search-card body_text in this narrow path.
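
A sketch of the first option ("copy panel_text through"), assuming the fix lives in the search-card snapshot builder so that place-page body text stays excluded. `_COPIED_KEYS` and `search_card_snapshot` are hypothetical stand-ins and do not reproduce the real `_search_result_snapshot()`.

```python
from __future__ import annotations

from collections.abc import Mapping

# Hypothetical subset of the keys copied from a search-result card.
_COPIED_KEYS = ("category", "address", "panel_text", "body_text")


def search_card_snapshot(candidate: Mapping[str, object]) -> dict[str, object]:
    """Copy card fields and make sure panel_text is populated."""
    snapshot = {
        key: candidate[key]
        for key in _COPIED_KEYS
        if candidate.get(key) is not None
    }
    # Older cards may only carry body_text; reuse it for line-based fallbacks.
    if "panel_text" not in snapshot and "body_text" in snapshot:
        snapshot["panel_text"] = snapshot["body_text"]
    return snapshot
```

Because the fallback is applied only when a snapshot is built from a search card, a `_build_place_details()` that reads `panel_text` alone would still see the card's lines in the failed-open path.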

category = _clean_category_text(snapshot.get("category")) or _extract_category_from_lines(
search_lines
)
@@ -2484,7 +2586,7 @@ def _build_place_details(
plus_code=_clean_plus_code_text(snapshot.get("plus_code"))
or _extract_plus_code_from_lines(combined_lines),
address_parts=_extract_address_parts(snapshot.get("address_parts")),
description=_extract_description(snapshot, combined_lines),
description=_extract_description(snapshot),
search_result_description=_clean_description_text(
snapshot.get("search_result_description")
),
@@ -3624,18 +3726,8 @@ def _extract_plus_code_from_lines(lines: list[str]) -> str | None:
return None


def _extract_description(snapshot: Mapping[str, object], lines: list[str]) -> str | None:
direct = _clean_description_text(snapshot.get("description"))
if direct is not None:
return direct
for index, line in enumerate(lines):
if line.startswith("Seasonal ") or line.startswith("Modern setting "):
return line
if line == "Share" and index + 1 < len(lines):
candidate = _clean_description_text(lines[index + 1])
if candidate is not None and candidate.lower() not in _DESCRIPTION_STOP_MARKERS:
return candidate
return None
def _extract_description(snapshot: Mapping[str, object]) -> str | None:
return _clean_description_text(snapshot.get("description"))


def _clean_description_text(value: object) -> str | None:
@@ -3652,7 +3744,11 @@ def _clean_description_text(value: object) -> str | None:
return None
if _looks_like_status_text(normalized):
return None
if _looks_like_search_results_label(normalized) or _looks_like_ui_action_label(normalized):
if (
_looks_like_search_results_label(normalized)
or _looks_like_ui_action_label(normalized)
or _looks_like_ui_action_cluster(normalized)
):
return None
if (
_looks_like_description_review_prose(normalized)
@@ -3674,6 +3770,12 @@ def _clean_description_text(value: object) -> str | None:
return normalized


def _looks_like_ui_action_cluster(value: str) -> bool:
text = re.sub(r"[\ue000-\uf8ff]", " ", value)
text = re.sub(r"\s+", " ", text).strip(" .")
return _UI_ACTION_CLUSTER_RE.fullmatch(text) is not None


def _strip_description_service_options(value: str) -> str | None:
segments = [_clean_description_segment(part) for part in re.split(r"[·•⋅]+", value)]
cleaned_segments = [segment for segment in segments if segment]
@@ -3796,6 +3898,10 @@ def _normalize_photo_url(value: object) -> str | None:
"googleusercontent.com" in host or host.endswith("ggpht.com")
) and path.startswith(("/a-", "/a/")):
return None
if re.fullmatch(r"lh[0-9]+\.(?:googleusercontent\.com|ggpht\.com)", host) is None:
return None
if re.search(r"(?:=|-)w[0-9]+-h[0-9]+(?:-|$)", normalized) is None:
return None
return normalized


@@ -4241,7 +4347,10 @@ def _looks_like_search_results_label(value: str) -> bool:
normalized = _clean_text(value)
if normalized is None:
return False
return normalized.casefold() in _SEARCH_RESULTS_LABELS
normalized_lookup = normalized.casefold()
return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
"sponsored "
)
Comment on lines +4350 to +4353
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat exact "Sponsored" as a search-result label too.

Line 4348 only catches values that start with "sponsored " (with a trailing space). A plain "Sponsored" label can still pass name/category cleaning.

Suggested fix
-    return normalized_lookup in _SEARCH_RESULTS_LABELS or normalized_lookup.startswith(
-        "sponsored "
-    )
+    return (
+        normalized_lookup in _SEARCH_RESULTS_LABELS
+        or normalized_lookup == "sponsored"
+        or normalized_lookup.startswith("sponsored ")
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/gmaps_scraper/place_scraper.py` around lines 4347 - 4350, The check
currently only matches labels that start with "sponsored " and misses the exact
"sponsored" label; update the return condition in the block that computes
normalized_lookup (after normalized.casefold()) to also treat an exact match by
adding a check like normalized_lookup == "sponsored" (in addition to the
existing startswith check) when testing against _SEARCH_RESULTS_LABELS so both
"sponsored" and values starting with "sponsored " are considered search-result
labels.
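
A self-contained sketch of the check with the exact-match case added. The contents of `_SEARCH_RESULTS_LABELS` here are a placeholder (the real set is not shown in this diff), and whitespace collapsing stands in for `_clean_text()`, whose behavior is assumed.

```python
# Placeholder contents; the real label set is not visible in this diff.
_SEARCH_RESULTS_LABELS = frozenset({"results"})


def looks_like_search_results_label(value: str) -> bool:
    """Treat 'Sponsored' alone or as a leading word as a search label."""
    lookup = " ".join(value.split()).casefold()
    if not lookup:
        return False
    return (
        lookup in _SEARCH_RESULTS_LABELS
        or lookup == "sponsored"
        or lookup.startswith("sponsored ")
    )
```

Keeping the trailing space in the `startswith` check, rather than matching the bare prefix, avoids rejecting longer words that merely begin with the same letters.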



def _looks_like_ui_action_label(value: str) -> bool:
20 changes: 20 additions & 0 deletions tests/test_parser.py
@@ -350,6 +350,26 @@ def test_does_not_use_owner_profile_id_as_place_cid(self) -> None:
"https://www.google.com/maps/search/?api=1&query=Northwind+Cafe%2C+Example+District",
)

def test_does_not_use_owner_payload_inside_metadata_as_place_cid(self) -> None:
runtime_state = copy.deepcopy(["noise", _LIST_NODE])
first_place = runtime_state[1][8][0]
assert isinstance(first_place, list)
first_metadata = first_place[1]
assert isinstance(first_metadata, list)

first_metadata[6] = [
"Fixture Owner",
"https://lh3.googleusercontent.com/a-/fixture-owner",
"104356373423434804635",
]

parsed = parse_saved_list_artifacts(_LIST_URL, runtime_state=runtime_state)

self.assertEqual(len(parsed.places), 2)
self.assertEqual(parsed.places[0].cid, None)
self.assertEqual(parsed.places[0].google_id, "/g/11northwind")
self.assertNotEqual(parsed.places[0].cid, "104356373423434804635")

def test_extracts_favorite_and_note_from_user_payload_shape(self) -> None:
runtime_state = [
"noise",