Auto-translated map_item updates from 4CAT (bootstrap, 13 datasources) #90

Draft
4cat-to-zeeschuimer-automation-pr[bot] wants to merge 44 commits into master from auto/4cat-map-item-sync-bootstrap

Conversation

@4cat-to-zeeschuimer-automation-pr

🤖 This PR was auto-generated by the 4CAT map_item sync workflow. The JavaScript was produced by an LLM and requires human review before merging — including manual fixes for any lint warnings flagged below.

Generation parameters

  • Model: gpt-oss-120b (provider: litellm)
  • Total LLM time: 381.15s
  • Trigger: manual workflow_dispatch with bootstrap=true (initial sync of all Zeeschuimer datasources).

Summary

  • ✅ 13 translated
  • ⚠️ 6 translated with lint warnings (require manual fix)
  • ❌ 1 failed
  • ❔ 0 skipped

| Datasource | Module | Time | Warnings |
| --- | --- | --- | --- |
| datasources/douyin/search_douyin.py | modules/douyin.js | 42.77s | ⚠️ 1 |
| datasources/gab/search_gab.py | modules/gab.js | 27.53s | |
| datasources/imgur/search_imgur.py | modules/imgur.js | 12.93s | |
| datasources/instagram/search_instagram.py | modules/instagram.js | 52.96s | ⚠️ 1 |
| datasources/linkedin/search_linkedin.py | modules/linkedin.js | 52.03s | ⚠️ 1 |
| datasources/ninegag/search_9gag.py | modules/9gag.js | 14.75s | |
| datasources/pinterest/search_pinterest.py | modules/pinterest.js | 25.6s | |
| datasources/threads/search_threads.py | modules/threads.js | 25.11s | ⚠️ 2 |
| datasources/tiktok/search_tiktok.py | modules/tiktok.js | 18.64s | ⚠️ 2 |
| datasources/tiktok_comments/search_tiktok_comments.py | modules/tiktok-comments.js | 13.58s | |
| datasources/truth/search_truth.py | modules/truth.js | 19.59s | |
| datasources/xiaohongshu/search_rednote.py | modules/rednote.js | 24.17s | ⚠️ 4 |
| datasources/xiaohongshu_comments/search_rednote_comments.py | modules/rednote-comments.js | 13.12s | |

⚠️ Lint warnings — fix before merging

The following datasources translated successfully but the static lint flagged issues that need human fixes. The auto-generated code was spliced into the JS module as-is; please patch the file directly in this PR.

datasources/douyin/search_douyin.py -> modules/douyin.js

  • [helpers_to_add[0]] Regex detected. The current LLM translates regex unreliably (escapes, character classes, flags) — please verify the regex behavior against the Python original by hand.

datasources/instagram/search_instagram.py -> modules/instagram.js

  • [helpers_to_add[0]] Literal newline inside a string literal — JS strings can't span lines without escape ("\n") or template literals (`\n`).

datasources/linkedin/search_linkedin.py -> modules/linkedin.js

  • [helpers_to_add[1]] Regex detected. The current LLM translates regex unreliably (escapes, character classes, flags) — please verify the regex behavior against the Python original by hand.

datasources/threads/search_threads.py -> modules/threads.js

  • [map_item_function] Literal newline inside a string literal — JS strings can't span lines without escape ("\n") or template literals (`\n`).
  • [map_item_function] Regex detected. The current LLM translates regex unreliably (escapes, character classes, flags) — please verify the regex behavior against the Python original by hand.

datasources/tiktok/search_tiktok.py -> modules/tiktok.js

  • [map_item_function] Even without an f prefix, "text {var}" / 'text {var}' are literal text in JavaScript — no interpolation happens. Whenever the original Python used an f-string, the JS must use a template literal (backticks).
  • [map_item_function] .get( call found. Python dict.get(k[, default]) does not exist in JavaScript — use [k] / [k] ?? default. NOTE: this check is a plain substring match, so it also flags legitimate JS .get() on Map, URLSearchParams, Headers, etc. — ignore the warning if the receiver is one of those.
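
For reviewers patching these two warnings, here is a minimal sketch of the fixes the lint is asking for. The `post` object, its fields, and the example.com URL are hypothetical and are not taken from the TikTok payload or the generated module.

```javascript
// Hypothetical input; field names are illustrative only.
const post = { id: "123", author: { uniqueId: "example_user" } };

// Python dict.get(k, default) has no plain-object equivalent in JS.
// Bracket access plus ?? gives a similar fallback.
const user = post.author?.uniqueId ?? "";
const description = post["desc"] ?? "";

// Braces inside ordinary quotes stay literal text in JS.
const literalText = "https://example.com/@{user}/video/{id}";

// A template literal (backticks and ${...}) is the f-string counterpart.
const interpolated = `https://example.com/@${user}/video/${post.id}`;

// .get() is legitimate on Map, URLSearchParams, Headers, and similar.
const counts = new Map([["likes", 3]]);
const likes = counts.get("likes");

console.log(literalText);
console.log(interpolated, description, likes);
```

One difference to keep in mind while reviewing: `??` also falls back when the value is present but `null`, whereas Python's `dict.get(k, default)` returns a stored `None` unchanged.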

datasources/xiaohongshu/search_rednote.py -> modules/rednote.js

  • [helpers_to_add[0]] Literal newline inside a string literal — JS strings can't span lines without escape ("\n") or template literals (`\n`).
  • [helpers_to_add[0]] Regex detected. The current LLM translates regex unreliably (escapes, character classes, flags) — please verify the regex behavior against the Python original by hand.
  • [helpers_to_add[1]] Literal newline inside a string literal — JS strings can't span lines without escape ("\n") or template literals (`\n`).
  • [helpers_to_add[1]] Regex detected. The current LLM translates regex unreliably (escapes, character classes, flags) — please verify the regex behavior against the Python original by hand.

datasources/douyin/search_douyin.py -> modules/douyin.js (42.77s) ⚠️

LLM commentary:

The JS version mirrors the Python logic closely. Helper getChineseNumber implements the same conversion of Chinese number strings (e.g. "1.2万" → 12000). All .get calls are replaced with optional chaining or explicit fallback (??). MissingMappedField objects are instantiated where the Python code used a default. The timestamp is formatted with the global formatUtcTimestamp helper, which expects a Unix‑seconds value. Keys that differ between embedded and non‑embedded payloads are set in the appropriate branch. The function assumes the global helpers (MappedItem, MissingMappedField, normalize_url_encoding, formatUtcTimestamp) are already loaded by Zeeschuimer. No additional imports are required.
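
As a reference point when checking the flagged helper by hand, the conversion described above could look roughly like the sketch below. `chineseNumberSketch` and its unit table are assumptions made for illustration; they are not the generated `getChineseNumber` code, and the real helper may handle more units or input shapes.

```javascript
// Hypothetical stand-in for the helper described in the commentary.
// Only two common units are covered here: 万 (10^4) and 亿 (10^8).
const UNIT_MULTIPLIERS = new Map([
  ["万", 1e4],
  ["亿", 1e8],
]);

function chineseNumberSketch(value) {
  if (typeof value === "number") {
    return value; // already numeric, nothing to convert
  }
  const text = String(value).trim();
  const multiplier = UNIT_MULTIPLIERS.get(text.slice(-1));
  if (multiplier === undefined) {
    return Number(text); // plain digits without a unit suffix
  }
  // Round to avoid floating point residue from the multiplication.
  return Math.round(parseFloat(text.slice(0, -1)) * multiplier);
}

console.log(chineseNumberSketch("1.2万")); // 12000
console.log(chineseNumberSketch("512")); // 512
```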

datasources/gab/search_gab.py -> modules/gab.js (27.53s)

LLM commentary:

The JS version mirrors the Python map_item logic. All dict.get calls are replaced with optional chaining and nullish coalescing. Timestamp handling uses the global formatUtcTimestamp helper (expects seconds). Media handling reproduces the unknown-data collection and filters out missing URLs. Group, author, and card objects may be null; fields fall back to null when the container is absent, matching Python's conditional expressions. MissingMappedField is instantiated when the import metadata lacks a timestamp. No additional imports or helper functions were needed beyond the globals provided by Zeeschuimer.

datasources/imgur/search_imgur.py -> modules/imgur.js (12.93s)

LLM commentary:

The JS version mirrors the Python map_item exactly. Timestamp parsing uses Date.parse on the ISO string and then formatUtcTimestamp (global helper) to get the same "YYYY-MM-DD HH:MM:SS" format. Optional chaining (?.) guards against missing nested objects (e.g., cover). Boolean fields are converted to "yes"/"no" strings. All field names match the Python output. No extra imports or helper functions are required because Zeeschuimer provides MappedItem, normalize_url_encoding, formatUtcTimestamp, and MapItemException globally.
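
The pattern described here can be sketched as follows. `formatUtcSketch` is a hypothetical stand-in written for this example (the real `formatUtcTimestamp` is provided by Zeeschuimer and may differ), and the `item` fields are made up.

```javascript
// Hypothetical stand-in: formats Unix seconds as "YYYY-MM-DD HH:MM:SS" in UTC.
function formatUtcSketch(unixSeconds) {
  return new Date(unixSeconds * 1000)
    .toISOString() // e.g. "2024-01-02T03:04:05.000Z"
    .slice(0, 19)
    .replace("T", " ");
}

// Hypothetical item; field names are illustrative only.
const item = { created_at: "2024-01-02T03:04:05Z", is_nsfw: false, cover: null };

// Date.parse returns milliseconds, so divide by 1000 to get seconds.
const seconds = Math.floor(Date.parse(item.created_at) / 1000);

const mapped = {
  timestamp: formatUtcSketch(seconds),
  nsfw: item.is_nsfw ? "yes" : "no", // boolean to "yes"/"no" string
  cover_url: item.cover?.url ?? "", // optional chaining guards a missing cover
};

console.log(mapped);
```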

datasources/instagram/search_instagram.py -> modules/instagram.js (52.96s) ⚠️

LLM commentary:

Implemented map_item as a thin dispatcher that mirrors the Python logic. Added three parser helpers (Polaris, Graph, Itemlist) plus a small extractHashtags helper and the required constants. All Python constructs have been converted to JavaScript equivalents:

  • Used optional chaining and nullish coalescing instead of dict.get.
  • Replaced Python regex findall with matchAll to capture hashtag groups.
  • Used the global formatUtcTimestamp helper for timestamp formatting.
  • Wrapped all MissingMappedField, MappedItem and MapItemException calls with new.
  • Preserved field names exactly as in the Python version.
  • Throwing MapItemException for ads and for unexpected format changes matches the original behaviour.

No external imports are needed because Zeeschuimer provides the required globals. The code assumes those globals exist at runtime.
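
For reference while reviewing the hashtag helper, this is roughly how Python's `findall` with one capture group maps onto `matchAll`. The pattern and the function name below are assumptions for illustration, not the regex used in the Python original or in the generated module.

```javascript
// Hypothetical helper; the pattern is an assumption, not the 4CAT regex.
function extractHashtagsSketch(caption) {
  // matchAll throws a TypeError unless the regex has the g flag.
  // The u flag with \p{L}/\p{N} approximates Python 3's Unicode-aware \w;
  // a bare \w in JS only matches ASCII letters, digits, and underscore.
  const pattern = /#([\p{L}\p{N}_]+)/gu;
  // findall with a single group returns the group, so take match[1].
  return Array.from(caption.matchAll(pattern), (match) => match[1]);
}

console.log(extractHashtagsSketch("Sunset #beach #日本 plain text"));
```

The Unicode handling is the part most worth comparing against the Python source, since it is easy to lose non-ASCII hashtags without any error being raised.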

datasources/linkedin/search_linkedin.py -> modules/linkedin.js (52.03s) ⚠️

LLM commentary:

Implemented the Python map_item logic in JavaScript, preserving field names exactly. Added two helper functions (getAuthor and parseTimeAgo) because they are used internally. Used global helpers MappedItem, normalize_url_encoding, and formatUtcTimestamp. Replaced Python's datetime formatting with formatUtcTimestamp, and used regex test for is_promoted logic. All optional property accesses use optional chaining and nullish coalescing to avoid TypeErrors. Image and video thumbnail extraction follows the same priority rules as the Python version. No external imports are required; helpers are added to the helpers_to_add array.

datasources/ninegag/search_9gag.py -> modules/9gag.js (14.75s)

LLM commentary:

The JS version mirrors the Python map_item logic. Timestamps are handled with the global formatUtcTimestamp helper, which expects a Unix‑seconds value (post.creationTs). Image and video selection replicate the Python sorting by pixel count, and video URL priority follows the same order (av1, h265, vp9, vp8). Anonymous posts are given the placeholder creator object. All optional fields use optional chaining and nullish coalescing to avoid undefined values. No extra imports or helper functions are required because Zeeschuimer already provides MappedItem, normalize_url_encoding, and formatUtcTimestamp as globals.
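
A rough sketch of the two selection rules mentioned above, for comparison with the generated code. All key names (`url`, `av1Url`, `h265Url`, `vp9Url`, `vp8Url`) and the `variants` object are hypothetical; only the codec order comes from the commentary.

```javascript
// Hypothetical post fragment; the keys are illustrative, not the 9gag schema.
const variants = {
  thumb: { width: 460, height: 300, url: "https://example.com/thumb.jpg" },
  full: { width: 700, height: 456, url: "https://example.com/full.jpg" },
  clip: {
    width: 460,
    height: 300,
    vp9Url: "https://example.com/clip.webm",
    h265Url: "https://example.com/clip-h265.mp4",
  },
};

// Pick the image with the largest pixel count (width * height).
function largestImageUrlSketch(candidates) {
  const withUrl = Object.values(candidates).filter((v) => v.url);
  withUrl.sort((a, b) => b.width * b.height - a.width * a.height);
  return withUrl[0]?.url ?? "";
}

// Codec order as stated in the commentary: av1, h265, vp9, vp8.
const VIDEO_KEY_PRIORITY = ["av1Url", "h265Url", "vp9Url", "vp8Url"];

function preferredVideoUrlSketch(variant) {
  for (const key of VIDEO_KEY_PRIORITY) {
    if (variant?.[key]) {
      return variant[key];
    }
  }
  return "";
}

const imageUrl = largestImageUrlSketch(variants);
const videoUrl = preferredVideoUrlSketch(variants.clip);
console.log(imageUrl, videoUrl);
```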

datasources/pinterest/search_pinterest.py -> modules/pinterest.js (25.6s)

LLM commentary:

Implemented map_item with two internal helpers mirroring the Python logic. Timestamp parsing uses Date and the global formatUtcTimestamp helper; on failure MissingMappedField objects are returned to match Python's MissingMappedField(''). All field names are preserved exactly. Optional chaining and nullish coalescing replace Python dict.get calls. Image URL extraction follows the same precedence as the Python version. No external imports are required because Zeeschuimer provides MappedItem, MissingMappedField, normalize_url_encoding, and formatUtcTimestamp globally.

datasources/threads/search_threads.py -> modules/threads.js (25.11s) ⚠️

LLM commentary:

Implemented the Python map_item logic in JavaScript. Used formatUtcTimestamp (global) for the timestamp string, URL and URLSearchParams to extract the optional 'u' query parameter, and matchAll with a captured‑group regex to build the hashtags list. Optional chaining and nullish coalescing guard against missing fields. Empty containers are joined with commas to match the Python ','.join behaviour. All fields from the Python MappedItem are reproduced with identical names and comparable values. No extra imports are required because all helpers are globals.
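
The `u` parameter extraction and the empty-list join can be sketched like this. The function name and the example URLs are hypothetical; whether the generated code falls back to the original link in the same way is an assumption.

```javascript
// Hypothetical helper: unwrap a redirect link that carries its target in ?u=.
function unwrapLinkSketch(link) {
  try {
    // searchParams.get returns null when the parameter is absent,
    // and percent-decodes the value when it is present.
    const target = new URL(link).searchParams.get("u");
    return target ?? link;
  } catch {
    return link; // not a parseable absolute URL, keep as-is
  }
}

const wrapped = "https://l.example.com/?u=https%3A%2F%2Fexample.org%2Fpage&e=1";
const unwrapped = unwrapLinkSketch(wrapped);

// Joining an empty array yields "", like Python's ",".join([]).
const noHashtags = [].join(",");

console.log(unwrapped, JSON.stringify(noHashtags));
```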

datasources/tiktok/search_tiktok.py -> modules/tiktok.js (18.64s) ⚠️

LLM commentary:

Implemented the Python map_item logic in JavaScript, preserving field names exactly. Used global helpers MappedItem, normalize_url_encoding, and formatUtcTimestamp. Handled thumbnail selection by extracting the last shareCover entry (if an array) and filtering by the x-expires query parameter. All optional fields default to empty strings or appropriate fallbacks to match Python's .get(..., "") behavior. No additional imports or helper functions were needed.

datasources/tiktok_comments/search_tiktok_comments.py -> modules/tiktok-comments.js (13.58s)

LLM commentary:

The JS version mirrors the Python map_item logic. Timestamp formatting uses the global formatUtcTimestamp helper to match Python's '%Y-%m-%d %H:%M:%S'. Nested fields are accessed safely with optional chaining and nullish coalescing, providing null defaults where Python would have used .get(..., default). Boolean checks are converted to !!value and ternary expressions to produce "yes"/"no" strings. No additional imports or helper functions are required because all needed utilities (MappedItem, normalize_url_encoding, formatUtcTimestamp) are globals provided by Zeeschuimer.

datasources/truth/search_truth.py -> modules/truth.js (19.59s)

LLM commentary:

The JS version mirrors the Python logic exactly: media handling, group fallback, thread ID resolution (including nested replies), and collection of mentions/hashtags. Dates are parsed with the native Date constructor and formatted manually to match Python's "%Y-%m-%d %H:%M:%S" output. All optional fields use null or default values so the resulting JSON retains keys (undefined would be dropped). Errors are accumulated and passed as the second argument to MappedItem, matching the Python MappedItem(message=...). No external imports are required because Zeeschuimer provides MappedItem and normalize_url_encoding globally.
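
The point about `undefined` versus `null` is easy to confirm, and manual date formatting could look roughly like the sketch below. `formatManualSketch` is a hypothetical name, and the use of UTC getters is an assumption; the commentary does not say whether the generated code formats in UTC or local time.

```javascript
// JSON.stringify drops keys whose value is undefined but keeps null.
const withUndefined = JSON.stringify({ id: "1", group: undefined });
const withNull = JSON.stringify({ id: "1", group: null });

// Hypothetical manual formatter producing "%Y-%m-%d %H:%M:%S" in UTC.
function formatManualSketch(date) {
  const pad = (n) => String(n).padStart(2, "0");
  const day = `${date.getUTCFullYear()}-${pad(date.getUTCMonth() + 1)}-${pad(date.getUTCDate())}`;
  const time = `${pad(date.getUTCHours())}:${pad(date.getUTCMinutes())}:${pad(date.getUTCSeconds())}`;
  return `${day} ${time}`;
}

const formatted = formatManualSketch(new Date("2024-03-05T07:08:09Z"));
console.log(withUndefined, withNull, formatted);
```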

datasources/xiaohongshu/search_rednote.py -> modules/rednote.js (24.17s) ⚠️

LLM commentary:

Implemented the three mapping branches (API explore, embedded JSON, HTML) as separate helper functions. All Python constructs have been converted to JavaScript equivalents:

  • f-strings become template literals.
  • dict.get() replaced with optional chaining and nullish coalescing.
  • Regex findall with capture groups uses matchAll and extracts group 1.
  • MissingMappedField instances are created with new MissingMappedField("") to mirror the Python placeholder.
  • Timestamps (ms) are converted to seconds before calling the global formatUtcTimestamp helper.
  • Image handling mirrors the original logic, returning a MissingMappedField when no images are found.
  • The tile‑stub rejection mirrors the Python condition and throws a MapItemException.

No external imports are required; all helpers (MappedItem, MissingMappedField, normalize_url_encoding, formatUtcTimestamp, MapItemException) are globals provided by Zeeschuimer. The helper functions are supplied in helpers_to_add so they are available when map_item is executed.

datasources/xiaohongshu_comments/search_rednote_comments.py -> modules/rednote-comments.js (13.12s)

LLM commentary:

The JS version mirrors the Python map_item exactly. Timestamp conversion uses the global formatUtcTimestamp helper, which expects a Unix timestamp in seconds, matching the Python datetime.fromtimestamp(ms/1000). The collected_from_url field uses optional chaining and nullish coalescing to safely access the nested __import_meta object. ip_location follows the Python logic: if the value is truthy it is kept, otherwise a new MissingMappedField('') is inserted. All other fields are direct property accesses, with safe fallbacks for optional fields (content, user_info). No additional imports or helper functions are required because all needed utilities are provided globally by Zeeschuimer.
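
Since this mapper could not be tested against real data, a small sketch of the three behaviours described above may help manual review. `MissingMappedFieldSketch` and `formatUtcSketch` are hypothetical stand-ins for the Zeeschuimer globals, and the `comment` fields, including `source_url`, are made up.

```javascript
// Hypothetical stand-ins for the Zeeschuimer globals.
class MissingMappedFieldSketch {
  constructor(value) {
    this.value = value;
  }
}

function formatUtcSketch(unixSeconds) {
  return new Date(unixSeconds * 1000).toISOString().slice(0, 19).replace("T", " ");
}

// Hypothetical comment; create_time is in milliseconds, as in the commentary.
const comment = { create_time: 1704164645000, ip_location: "" };

const mappedComment = {
  // Milliseconds must become seconds before formatting.
  timestamp: formatUtcSketch(Math.floor(comment.create_time / 1000)),
  // Truthy values are kept; empty or missing ones become a placeholder.
  ip_location: comment.ip_location
    ? comment.ip_location
    : new MissingMappedFieldSketch(""),
  // __import_meta may be absent, so guard the nested access.
  collected_from_url: comment.__import_meta?.source_url ?? "",
};

console.log(mappedComment);
```

Passing the millisecond value straight to a seconds-based formatter would not fail loudly, it would just produce a date far in the future, so this conversion is worth checking explicitly.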

Failures

  • datasources/twitter-import/search_twitter.py (after 38.29s): could not parse a JSON object from the model reply

@dale-wahl

Member

Worked through my tests. And we're clean for this:

Test Suites: 1 passed, 1 total
Tests:       1439 passed, 1439 total
Snapshots:   0 total
Time:        79.915 s
Ran all test suites.

=== map_item compare summary ===
  ✓ PASS  pinterest        5daeba72a2dfbb5ed8c855f824a61570  — 110/110 items match
  ✓ PASS  instagram        945bc3cd29726d676419339e3c8feeb9  — 124/124 items match
  ✓ PASS  9gag             92cb4f4865cd259ab868846b6e007000  — 48/48 items match
  ✓ PASS  douyin           9e355eb06a266576aebcd6ef3ab8f1c3  — 70/70 items match
  ✓ PASS  gab              05f95dfd6f827a9025a3d1f0828feb09  — 43/43 items match
  ✓ PASS  imgur            64d532cdae21835ee2313bc2bb9a060f  — 238/238 items match
  ✓ PASS  threads          474112f148078b2a5d350ab3d80e93ce  — 54/54 items match
  ✓ PASS  rednote          8e5f759a145555b8134f70afaea57109  — 171/171 items match
  ✓ PASS  tiktok-comments  f4644cb7ad521b0fed483a3172460a92  — 141/141 items match
  ✓ PASS  tiktok           c4ce8cafddc41555e9aed2901e183e53  — 120/120 items match
  ✓ PASS  truth            474f560c67d914cfc316a948a6994cb9  — 137/137 items match
  ✓ PASS  rednote          a74cc6406430422e866bce7443224a19  — 171/171 items match
12 datasource(s): 12 passed, 0 failed, 0 skipped

Looks like I am missing some datasources (and randomly have two rednotes).
I cannot get rednote-comments, and linkedin is dead (right?). Twitter/X failed to translate, so I will try that again in a separate PR.

@dale-wahl

Member

@stijn-uva these datasets were all pretty recent ones I collected to have something to test against, and they are not necessarily comprehensive. Also, I was not super sure how to test all of this together, so I ended up merging the map_item_testing_actual_tests branch into this one (and then made updates to the tests themselves here). Thus it is messier than I would have liked. But you should be able to test it out and download mapped CSVs from Zeeschuimer!

Perhaps we should delete the linkedin and rednote-comments mappers until we can test them properly.
