From a908b730fd997aabfc4d1b52317a3a861462e5cb Mon Sep 17 00:00:00 2001
From: Marcel Rebro
Date: Wed, 6 May 2026 19:00:34 +0200
Subject: [PATCH 1/8] docs(cheerio-scraper): rewrite README per PMM brief
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements items #2, #3, #4, #5, #6, #7 from Fabian's PMM brief:

- Adds inline non-technical redirect to AI Web Scraper / Academy
  tutorial.
- Rewrites "Cost of usage" with two sample test runs against
  docs.apify.com (light vs heavier page function) and a clear caveat
  that exact cost depends on site complexity, page function, link
  graph, proxy, and memory.
- Moves "Content types" under "Input configuration" (right before the
  page function section).
- Adds AI Web Scraper mention to "Limitations".
- Adds an "Integrations" section (Zapier, Make, Apify API).
- Adds an "FAQ" section (page function, Puppeteer vs Cheerio,
  Playwright vs Cheerio, build your own with Crawlee).

Items #1 (new "What is Cheerio Scraper?" H2 framing) and #8
(janitorial — Node.js version staleness, Web Scraper "Puppeteer-only"
description) are still pending SME input and intentionally not in this
draft.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../actor-scraper/cheerio-scraper/README.md | 103 +++++++++++++-----
 1 file changed, 74 insertions(+), 29 deletions(-)

diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md
index 8370736d..137d3f18 100644
--- a/packages/actor-scraper/cheerio-scraper/README.md
+++ b/packages/actor-scraper/cheerio-scraper/README.md
@@ -5,12 +5,24 @@ browser but instead constructs a DOM from an HTML string. It then provides the u
 
 Cheerio Scraper is ideal for scraping web pages that do not rely on client-side JavaScript to serve their content and can be up to 20 times faster than using a full-browser solution such as Puppeteer.
 
-If you're unfamiliar with web scraping or web development in general,
-you might prefer to start with [**Scraping with Web Scraper**](https://docs.apify.com/tutorials/apify-scrapers/web-scraper) tutorial from the Apify documentation and then continue with [**Scraping with Cheerio Scraper**](https://docs.apify.com/tutorials/apify-scrapers/cheerio-scraper), a tutorial which will walk you through all the steps and provide a number of examples.
+Cheerio Scraper is built for technical users comfortable with [jQuery](https://jquery.com) and [Cheerio](https://cheerio.js.org). If you're not a developer, you'll likely have a better experience with [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — describe what you want to extract in plain English, no page function required. If you'd like to learn how Cheerio Scraper works step by step, follow the [**Scraping with Cheerio Scraper**](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper) tutorial in the Apify Academy.
 
 ## Cost of usage
 
-You can find the average usage cost for this Actor on the [pricing page](https://apify.com/pricing) under the `Which plan do I need?` section. Cheerio Scraper is equivalent to `Simple HTML pages` while Web Scraper, Puppeteer Scraper and Playwright Scraper are equivalent to `Full web pages`. These cost estimates are based on averages and might be lower or higher depending on how heavy the pages you scrape are.
+Cheerio Scraper is billed by [platform usage](https://apify.com/pricing) (compute units, storage operations, data transfer) rather than a flat per-result fee, so the exact cost of a run is hard to predict. It depends on **how many pages you crawl**, **how rich your page function is**, **how many links each page produces**, **page size**, **proxy choice**, and **memory allocation**. Treat the numbers below as illustrative samples, not a guaranteed price — for your own use case, run a small test first and extrapolate.
+
+For a quick orientation, the [pricing page](https://apify.com/pricing) lists average estimates under `Which plan do I need?`. Cheerio Scraper is equivalent to `Simple HTML pages`; Web Scraper, Puppeteer Scraper and Playwright Scraper are equivalent to `Full web pages`.
+
+### Sample runs
+
+Both samples below crawled the same site ([`docs.apify.com`](https://docs.apify.com)) on default settings (1024 MB memory, Apify Proxy, concurrency 50). They differ in page-function complexity and crawl size.
+
+| Sample | Pages | Page function | Runtime | Compute units | Total cost |
+|-----------------------------------------|------:|------------------------------------------------------------------------------------------|---------:|--------------:|-----------:|
+| Lightweight (title, h1, meta description) | 237 | 4 selectors | 3 min 15 s | 0.054 CU | **$0.024** |
+| Heavier (all h2/h3, internal link list, code-block count, word count) | 485 | 8+ selectors plus body word count | 6 min 38 s | 0.111 CU | **$0.048** |
+
+Both samples worked out to roughly **$0.0001 per result** (~$0.10 per 1,000 results) on this site. Cost is dominated by compute units (~46%) and request-queue writes (~40%). On heavier sites — large pages, long link graphs, residential proxy, slower responses — the per-result figure can be several times higher, so use these numbers as a starting point only.
 
 ## Usage
 
@@ -36,32 +48,6 @@ Cheerio Scraper has a number of advanced configuration settings to improve perfo
 
 Under the hood, Cheerio Scraper is built using the [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) class from Crawlee. If you'd like to learn more about the inner workings of the scraper, see the respective documentation.
 
-## Content types
-
-By default, Cheerio Scraper only processes web pages with the `text/html`, `application/json`, `application/xml`, `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header),
-and skips pages with other content types.
-If you want the crawler to process other content types,
-use the **Additional MIME types** (`additionalMimeTypes`) input option.
-
-Note that while the default `Accept` HTTP header will allow any content type to be received,
-HTML and XML are preferred over JSON and other types. Thus, if you're allowing additional MIME
-types, and you're still receiving invalid responses, be sure to override the `Accept`
-HTTP header setting in the requests from the scraper,
-either in [**Start URLs**](#start-urls), [**Pseudo URLs**](#pseudo-urls) or in the **Prepare request function**.
-
-The web pages with various content types are parsed differently and
-thus the `context` parameter of the [**Page function**](#page-function) will have different values:
-
-| **Content types** | [`context.body`](#body-stringbuffer) | [`context.$`](#-function) | [`context.json`](#json-object) |
-| ------------------------------------------------------- | ------------------------------------ | ------------------------- | ------------------------------ |
-| `text/html`, `application/xhtml+xml`, `application/xml` | `String` | `Function` | `null` |
-| `application/json` | `String` | `null` | `Object` |
-| Other | `Buffer` | `null` | `null` |
-
-The `Content-Type` HTTP header of the web page is parsed using the
-content-type NPM package
-and the result is stored in the [`context.contentType`](#contenttype-object) object.
-
 ## Limitations
 
 The Actor does not employ a full-featured web browser such as Chromium or Firefox, so it will not be sufficient for web pages that render their content dynamically using client-side JavaScript. To scrape such sites, you might prefer to use [**Web Scraper**](https://apify.com/apify/web-scraper) (`apify/web-scraper`), which loads pages in a full browser and renders dynamic content.
@@ -75,6 +61,8 @@ If you require other modules for your scraping, you'll need to develop a complet
 You can use the [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) class from Crawlee to get most of the functionality of Cheerio Scraper out of the box.
 
+Don't know how to code a page function? The [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) lets you describe what to extract in plain English instead — no JavaScript required.
+
 ## Input configuration
 
 As input, Cheerio Scraper Actor accepts a number of configurations. These can be entered either manually in the user interface in [Apify Console](https://console.apify.com), or programmatically in a JSON object using the [Apify API](https://apify.com/docs/api/v2#/reference/actors/run-collection/run-actor). For a complete list of input fields and their types, please visit the [Input](https://apify.com/apify/cheerio-scraper/input-schema) tab.
@@ -154,6 +142,33 @@ Note that you don't need to use the **Pseudo-URLs** setting at all,
 because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()` from the **[Page function](#page-function)**.
 
+### Content types
+
+By default, Cheerio Scraper only processes web pages with the `text/html`, `application/json`, `application/xml`, `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header),
+and skips pages with other content types. This is an edge-case setting — most users won't need to change it. The most common reason to do so is when paginating through endpoints that return non-default content types (for example, a JSON API that drives the listing pages).
+
+If you want the crawler to process other content types,
+use the **Additional MIME types** (`additionalMimeTypes`) input option.
+
+Note that while the default `Accept` HTTP header will allow any content type to be received,
+HTML and XML are preferred over JSON and other types. Thus, if you're allowing additional MIME
+types, and you're still receiving invalid responses, be sure to override the `Accept`
+HTTP header setting in the requests from the scraper,
+either in [**Start URLs**](#start-urls), [**Pseudo URLs**](#pseudo-urls) or in the **Prepare request function**.
+
+The web pages with various content types are parsed differently and
+thus the `context` parameter of the [**Page function**](#page-function) will have different values:
+
+| **Content types** | [`context.body`](#body-stringbuffer) | [`context.$`](#-function) | [`context.json`](#json-object) |
+| ------------------------------------------------------- | ------------------------------------ | ------------------------- | ------------------------------ |
+| `text/html`, `application/xhtml+xml`, `application/xml` | `String` | `Function` | `null` |
+| `application/json` | `String` | `null` | `Object` |
+| Other | `Buffer` | `null` | `null` |
+
+The `Content-Type` HTTP header of the web page is parsed using the
+content-type NPM package
+and the result is stored in the [`context.contentType`](#contenttype-object) object.
+
 ### Page function
 
 The **Page function** (`pageFunction`) field contains a single JavaScript function that enables the user to extract data from the web page, access its DOM, add new URLs to the request queue, and otherwise control Cheerio Scraper's operation.
+## Integrations + +Cheerio Scraper plugs into the rest of your stack through Apify's integrations layer. The most common ways to wire it up: + +- **[Zapier](https://apify.com/integrations/zapier)** — trigger runs and route scraped data to thousands of Zapier-compatible apps (Google Sheets, Airtable, Slack, HubSpot, and more) without writing code. +- **[Make](https://apify.com/integrations/make)** — build no-code automations that start a Cheerio Scraper run, transform the dataset, and forward results to other services. +- **[Apify API](https://docs.apify.com/api/v2)** — call the Actor programmatically, pass input as JSON, and pull results from the dataset. Ideal for embedding scraping into your own backend. + +For the full list, see [Apify integrations](https://docs.apify.com/platform/integrations). + +## FAQ + +### How do I build a page function? + +The fastest way is the step-by-step [**Scraping with Cheerio Scraper**](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper) tutorial in the Apify Academy. It walks you through selecting elements with Cheerio, returning data, and following links. + +If you'd rather skip the page function entirely, try the [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — you describe what to extract in plain English and the Actor handles the rest. + +### When should I use Puppeteer Scraper instead of Cheerio Scraper? + +Use **Cheerio Scraper** for static HTML pages — it's faster and cheaper because it doesn't run a browser. Use [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) (or the simpler [**Web Scraper**](https://apify.com/apify/web-scraper)) when the content you need is rendered by client-side JavaScript and isn't present in the raw HTML response. + +### When should I use Playwright Scraper instead of Cheerio Scraper? + +Same trade-off as above: if the page needs a real browser to render its content, reach for [**Playwright Scraper**](https://apify.com/apify/playwright-scraper). 
Playwright also has stronger support for Firefox and WebKit than Puppeteer if your target site behaves differently across browsers. + +### Can I build my own Actor with Cheerio? + +Yes. Cheerio Scraper is open source — see the [source on GitHub](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper). To build a custom Actor with the same engine, use Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class directly — you get full control over the crawl while keeping Cheerio's parsing and Apify's platform features. + ## Additional resources Congratulations! You've learned how Cheerio Scraper works. From 36fb789729be9aa829bd5b95511fd3816631ef3f Mon Sep 17 00:00:00 2001 From: Marcel Rebro Date: Wed, 6 May 2026 19:07:23 +0200 Subject: [PATCH 2/8] docs(cheerio-scraper): align README with Apify-owned house pattern Applies precedents established in Marcel's prior Actor README updates (Facebook Groups/Reviews/Posts, Instagram Comment Scraper): - Rewrites "What is Cheerio Scraper?" as a single H2 with a short lead, emoji feature bullets, and a closing audience/use-case paragraph (replaces the previous multi-paragraph academic intro, while keeping the technical-audience framing and the inline AI Web Scraper redirect). - Moves the top-level "Integrations" H2 into the FAQ as a sub- question with the standard service list and webhooks line, matching the house pattern. - Adds the standard FAQ sub-questions used across all Apify-owned scraper READMEs: API access, MCP server, proxies, legality, and "not working?". Items #1 (overlap with existing "What is Cheerio Scraper?" framing) and #5 (trailing "What is Cheerio web scraper?" mention) are reconciled as a single section, per house pattern. Items #1 and #8 janitorial cleanups still pending SME input. 
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../actor-scraper/cheerio-scraper/README.md | 57 ++++++++++++++-----
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md
index 137d3f18..197243b5 100644
--- a/packages/actor-scraper/cheerio-scraper/README.md
+++ b/packages/actor-scraper/cheerio-scraper/README.md
@@ -1,11 +1,22 @@
-Cheerio Scraper is a ready-made solution for crawling websites using plain HTTP requests. It retrieves the HTML pages, parses them using the [Cheerio](https://cheerio.js.org) Node.js library and lets you extract any data from them. Fast.
+## What is Cheerio Scraper?
 
-Cheerio is a server-side version of the popular [jQuery](https://jquery.com) library. It does not require a
-browser but instead constructs a DOM from an HTML string. It then provides the user an API to work with that DOM.
+It's a fast, server-side scraper that pulls plain HTML over HTTP and parses it with [Cheerio](https://cheerio.js.org) — the server-side equivalent of [jQuery](https://jquery.com). No browser, no client-side JavaScript: just the raw HTML response and a familiar selector API. With Cheerio Scraper, you can:
 
-Cheerio Scraper is ideal for scraping web pages that do not rely on client-side JavaScript to serve their content and can be up to 20 times faster than using a full-browser solution such as Puppeteer.
+⚡ Run **up to 20× faster** than full-browser scrapers — no Chrome to spin up
 
-Cheerio Scraper is built for technical users comfortable with [jQuery](https://jquery.com) and [Cheerio](https://cheerio.js.org). If you're not a developer, you'll likely have a better experience with [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — describe what you want to extract in plain English, no page function required. If you'd like to learn how Cheerio Scraper works step by step, follow the [**Scraping with Cheerio Scraper**](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper) tutorial in the Apify Academy.
+🧩 Use **jQuery-style selectors** via [Cheerio](https://cheerio.js.org) to extract any data from the parsed DOM
+
+🔗 **Crawl recursively** with Link selector, Glob Patterns, and Pseudo-URLs — pagination, sitemaps, full-site crawls
+
+🛠 Write a **custom page function** in JavaScript with full access to Cheerio, the request, the response, the dataset, and the request queue
+
+📦 Export results as **JSON, CSV, XML, Excel, or HTML**, or pull them via the [Apify API](https://docs.apify.com/api/v2)
+
+🔌 Plug into **Make, Zapier, webhooks, MCP servers**, and the rest of [Apify's integrations](https://apify.com/integrations)
+
+🪪 **Open source** — see the [source on GitHub](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper), or build your own with Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler)
+
+Cheerio Scraper is built for technical users comfortable with [jQuery](https://jquery.com) and Cheerio. If you're not a developer, you'll likely have a better experience with [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — describe what you want to extract in plain English, no page function required. To learn how Cheerio Scraper works step by step, follow the [**Scraping with Cheerio Scraper**](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper) tutorial in the Apify Academy.
 
 ## Cost of usage
 
@@ -581,16 +592,6 @@ For more information, see [Datasets](https://docs.apify.com/storage#dataset) in
 or the [Get dataset items](https://docs.apify.com/api/v2#/reference/datasets/item-collection) endpoint in Apify API reference.
 
-## Integrations
-
-Cheerio Scraper plugs into the rest of your stack through Apify's integrations layer. The most common ways to wire it up:
-
-- **[Zapier](https://apify.com/integrations/zapier)** — trigger runs and route scraped data to thousands of Zapier-compatible apps (Google Sheets, Airtable, Slack, HubSpot, and more) without writing code.
-- **[Make](https://apify.com/integrations/make)** — build no-code automations that start a Cheerio Scraper run, transform the dataset, and forward results to other services.
-- **[Apify API](https://docs.apify.com/api/v2)** — call the Actor programmatically, pass input as JSON, and pull results from the dataset. Ideal for embedding scraping into your own backend.
-
-For the full list, see [Apify integrations](https://docs.apify.com/platform/integrations).
-
 ## FAQ
 
 ### How do I build a page function?
@@ -611,6 +612,32 @@ Same trade-off as above: if the page needs a real browser to render its content,
 
 Yes. Cheerio Scraper is open source — see the [source on GitHub](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper). To build a custom Actor with the same engine, use Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class directly — you get full control over the crawl while keeping Cheerio's parsing and Apify's platform features.
 
+### Can I export Cheerio Scraper data using the Apify API?
+
+Yes. The Apify API gives you programmatic access to your runs and datasets. To access the API using Node.js, use the `apify-client` [NPM package](https://apify.com/apify/cheerio-scraper/api/javascript). To access the API using Python, use the `apify-client` [PyPI package](https://apify.com/apify/cheerio-scraper/api/python). Check out the [Apify API reference](https://docs.apify.com/api/v2) docs or click on the [API tab](https://apify.com/apify/cheerio-scraper/api) for code examples.
+
+### Can I use Cheerio Scraper through an MCP server?
+
+Yes. With Apify's [MCP server](https://apify.com/apify/cheerio-scraper/api/mcp) you can run Cheerio Scraper inside AI agent workflows from clients like Claude Desktop and LibreChat, or build your own. See the [MCP tab](https://apify.com/apify/cheerio-scraper/api/mcp) for setup details.
+
+### Do I need proxies to use Cheerio Scraper?
+
+You usually do, especially for sites with anti-scraping protections. Cheerio Scraper integrates with [Apify Proxy](https://apify.com/proxy): datacenter proxies are included in the Free plan; residential proxies are available on paid plans. Configure them under [**Proxy configuration**](#proxy-configuration).
+
+### Can I integrate Cheerio Scraper with other apps?
+
+Yes. Cheerio Scraper can be connected with almost any cloud service or web app thanks to [integrations on the Apify platform](https://apify.com/integrations). You can integrate with Make, Zapier, ChatGPT, Slack, Airbyte, GitHub, Google Sheets, Asana, Google Drive, Keboola, MCP Servers, and more.
+
+You can also use [webhooks](https://docs.apify.com/integrations/webhooks) to carry out an action whenever an event occurs, e.g., get a notification whenever a Cheerio Scraper run successfully finishes.
+
+### Is it legal to scrape with Cheerio Scraper?
+
+Cheerio Scraper extracts whatever the target site serves over public HTTP — your responsibility is to scrape ethically and respect the site's terms of service, `robots.txt`, and applicable law. You should not scrape personal data unless you have a legitimate reason to do so. Read more on the [legality of web scraping](https://blog.apify.com/is-web-scraping-legal/) and [ethical scraping](https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/).
+
+### Cheerio Scraper is not working?
+
+We're always working on improving the performance of our Actors. If you've got technical feedback or found a bug, please create an issue on the Actor's [Issues tab](https://apify.com/apify/cheerio-scraper/issues/open).
+
 ## Additional resources
 
 Congratulations! You've learned how Cheerio Scraper works.

From d2506992b219c4c0e6228faa447a62b9b8658513 Mon Sep 17 00:00:00 2001
From: Marcel Rebro
Date: Wed, 6 May 2026 19:12:39 +0200
Subject: [PATCH 3/8] docs(cheerio-scraper): describe request queue and uniqueKey behavior
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaces the long-standing "TODO: Describe how the queue works, unique
key etc. plus link" placeholder with a short paragraph that covers
what the request queue is, how uniqueKey deduplication works, the
URL-fragment stripping default, and how to override uniqueKey from
enqueueRequest() in the page function.

The second pre-existing TODO (lines 425-426 about
prepareRequestFunction) is intentionally left in place — it depends on
whether the feature still exists in the current build, which is a
question for the dev team.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 packages/actor-scraper/cheerio-scraper/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md
index 197243b5..1f74fcb5 100644
--- a/packages/actor-scraper/cheerio-scraper/README.md
+++ b/packages/actor-scraper/cheerio-scraper/README.md
@@ -90,7 +90,7 @@ Optionally, each URL can be associated with custom user data - a JSON object tha
 your JavaScript code in the [**Page function**](#page-function) under `context.request.userData`. This is useful for determining which start URL is currently loaded, in order to perform some page-specific actions. For example, when crawling an online store, you might want to perform different actions on a page listing the products vs. a product detail page. For details, see the [**Web scraping tutorial**](https://docs.apify.com/tutorials/apify-scrapers/getting-started#the-start-url) in the Apify documentation.
-
+Cheerio Scraper uses an Apify [request queue](https://docs.apify.com/platform/storage/request-queue) to track the URLs it has loaded and the URLs it still needs to load. Each request is identified by a `uniqueKey` — by default the request URL, with the URL fragment (`#...`) stripped unless the **Keep URL fragments** option is enabled. Requests whose `uniqueKey` has already been seen are skipped, so the same page isn't loaded twice. You can override `uniqueKey` per request when calling `context.enqueueRequest()` from the page function — useful when you need to scrape the same URL multiple times with different `userData`.
 
 ### Link selector
 

From a8b8d6c841dbec05a7ba431adf2527dd72bd76cc Mon Sep 17 00:00:00 2001
From: Marcel Rebro
Date: Wed, 6 May 2026 19:22:30 +0200
Subject: [PATCH 4/8] docs(cheerio-scraper): promote Integrations to top-level section

Moves the integrations content out of the FAQ ("Can I integrate
Cheerio Scraper with other apps?") and into a dedicated
`## Integrations` H2 placed between `## Results` and `## FAQ`. Same
copy, just promoted from a sub-question.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 packages/actor-scraper/cheerio-scraper/README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md
index 1f74fcb5..b1181ae8 100644
--- a/packages/actor-scraper/cheerio-scraper/README.md
+++ b/packages/actor-scraper/cheerio-scraper/README.md
@@ -592,6 +592,12 @@ For more information, see [Datasets](https://docs.apify.com/storage#dataset) in
 or the [Get dataset items](https://docs.apify.com/api/v2#/reference/datasets/item-collection) endpoint in Apify API reference.
 
+## Integrations
+
+Cheerio Scraper can be connected with almost any cloud service or web app thanks to [integrations on the Apify platform](https://apify.com/integrations). You can integrate with Make, Zapier, ChatGPT, Slack, Airbyte, GitHub, Google Sheets, Asana, Google Drive, Keboola, MCP Servers, and more.
+
+You can also use [webhooks](https://docs.apify.com/integrations/webhooks) to carry out an action whenever an event occurs, e.g., get a notification whenever a Cheerio Scraper run successfully finishes.
+
 ## FAQ
 
 ### How do I build a page function?
@@ -624,12 +630,6 @@ Yes. With Apify's [MCP server](https://apify.com/apify/cheerio-scraper/api/mcp)
 
 You usually do, especially for sites with anti-scraping protections. Cheerio Scraper integrates with [Apify Proxy](https://apify.com/proxy): datacenter proxies are included in the Free plan; residential proxies are available on paid plans. Configure them under [**Proxy configuration**](#proxy-configuration).
 
-### Can I integrate Cheerio Scraper with other apps?
-
-Yes. Cheerio Scraper can be connected with almost any cloud service or web app thanks to [integrations on the Apify platform](https://apify.com/integrations). You can integrate with Make, Zapier, ChatGPT, Slack, Airbyte, GitHub, Google Sheets, Asana, Google Drive, Keboola, MCP Servers, and more.
-
-You can also use [webhooks](https://docs.apify.com/integrations/webhooks) to carry out an action whenever an event occurs, e.g., get a notification whenever a Cheerio Scraper run successfully finishes.
-
 ### Is it legal to scrape with Cheerio Scraper?
 
 Cheerio Scraper extracts whatever the target site serves over public HTTP — your responsibility is to scrape ethically and respect the site's terms of service, `robots.txt`, and applicable law. You should not scrape personal data unless you have a legitimate reason to do so. Read more on the [legality of web scraping](https://blog.apify.com/is-web-scraping-legal/) and [ethical scraping](https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/).
From 625cb5c636c068e2881826eb9426cf5ec0220b9a Mon Sep 17 00:00:00 2001
From: Marcel Rebro
Date: Wed, 6 May 2026 19:39:09 +0200
Subject: [PATCH 5/8] docs(cheerio-scraper): apply brief's literal FAQ Q2/Q3 link targets
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Realigns FAQ Q2 ("Puppeteer vs Cheerio?") and Q3 ("Playwright vs
Cheerio?") with the PMM brief's literal instructions, now that the
Actor/library relationship is verified:

- Web Scraper uses Puppeteer (confirmed via package.json + README
  internal references in
  repos/actor-scraper/packages/actor-scraper/web-scraper/), so the
  brief's instruction to link Web Scraper for the "Puppeteer vs
  Cheerio?" question is internally consistent.
- Q2 now links Web Scraper as the primary on-ramp (per brief), with
  Puppeteer Scraper mentioned as the lower-level option.
- Headings switched to library-level phrasing ("Puppeteer instead of
  Cheerio") to match the brief's wording for both Q2 and Q3.
- Q3 simplified — there's only one Playwright-based Actor.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 packages/actor-scraper/cheerio-scraper/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md
index b1181ae8..cc82965c 100644
--- a/packages/actor-scraper/cheerio-scraper/README.md
+++ b/packages/actor-scraper/cheerio-scraper/README.md
@@ -606,13 +606,13 @@ The fastest way is the step-by-step [**Scraping with Cheerio Scraper**](https://
 
 If you'd rather skip the page function entirely, try the [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — you describe what to extract in plain English and the Actor handles the rest.
 
-### When should I use Puppeteer Scraper instead of Cheerio Scraper?
+### When should I use Puppeteer instead of Cheerio?
 
-Use **Cheerio Scraper** for static HTML pages — it's faster and cheaper because it doesn't run a browser. Use [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) (or the simpler [**Web Scraper**](https://apify.com/apify/web-scraper)) when the content you need is rendered by client-side JavaScript and isn't present in the raw HTML response.
+Use **Cheerio Scraper** for static HTML — it's faster and cheaper because no browser is involved. When the content needs client-side JavaScript to render, reach for a Puppeteer-based browser scraper: [**Web Scraper**](https://apify.com/apify/web-scraper) is the easy on-ramp (simpler API, runs in the browser context), while [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) gives you lower-level control over the Puppeteer library directly.
 
-### When should I use Playwright Scraper instead of Cheerio Scraper?
+### When should I use Playwright instead of Cheerio?
 
-Same trade-off as above: if the page needs a real browser to render its content, reach for [**Playwright Scraper**](https://apify.com/apify/playwright-scraper). Playwright also has stronger support for Firefox and WebKit than Puppeteer if your target site behaves differently across browsers.
+If the page needs a real browser to render its content — and you want stronger support for Firefox and WebKit than Puppeteer offers — reach for [**Playwright Scraper**](https://apify.com/apify/playwright-scraper).
 
 ### Can I build my own Actor with Cheerio?
 

From 149aa09936f0d94c545b22ff017380f60ad141f9 Mon Sep 17 00:00:00 2001
From: Marcel Rebro
Date: Wed, 6 May 2026 19:43:28 +0200
Subject: [PATCH 6/8] docs(cheerio-scraper): link Cheerio (not Cheerio Scraper source) in FAQ Q4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The FAQ question "Can I build my own Actor with Cheerio?" is asking
about building a custom Actor on top of the Cheerio library — not
about forking Cheerio Scraper. Swap the Cheerio Scraper source link
for a link to cheerio.js.org so the answer points at what the user
actually needs.
Cheerio Scraper's open-source link is still referenced in the intro section's "Open source" bullet. Co-Authored-By: Claude Opus 4.7 (1M context) --- packages/actor-scraper/cheerio-scraper/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md index cc82965c..6124e5dc 100644 --- a/packages/actor-scraper/cheerio-scraper/README.md +++ b/packages/actor-scraper/cheerio-scraper/README.md @@ -616,7 +616,7 @@ If the page needs a real browser to render its content — and you want stronger ### Can I build my own Actor with Cheerio? -Yes. Cheerio Scraper is open source — see the [source on GitHub](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper). To build a custom Actor with the same engine, use Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class directly — you get full control over the crawl while keeping Cheerio's parsing and Apify's platform features. +Yes. Build a custom Actor on top of [Cheerio](https://cheerio.js.org) using Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class — you get full control over the crawl while keeping Cheerio's parsing API and Apify's platform features. ### Can I export Cheerio Scraper data using the Apify API? From 53e90a3c4eb814ab2f3a0c6ab18e706006367d33 Mon Sep 17 00:00:00 2001 From: Marcel Rebro Date: Thu, 7 May 2026 13:06:49 +0200 Subject: [PATCH 7/8] docs(cheerio-scraper): mention forking the Actor in FAQ Q4 Fabian flagged that the open-source/fork path matters for users asking "Can I build my own Actor with Cheerio?". Restores the link to Cheerio Scraper's source as the primary "fork and adjust" path, keeping the build-from-scratch route (Crawlee + Cheerio) as the secondary option. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- packages/actor-scraper/cheerio-scraper/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md index 6124e5dc..3da32644 100644 --- a/packages/actor-scraper/cheerio-scraper/README.md +++ b/packages/actor-scraper/cheerio-scraper/README.md @@ -616,7 +616,7 @@ If the page needs a real browser to render its content — and you want stronger ### Can I build my own Actor with Cheerio? -Yes. Build a custom Actor on top of [Cheerio](https://cheerio.js.org) using Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class — you get full control over the crawl while keeping Cheerio's parsing API and Apify's platform features. +Yes. The Cheerio Scraper Actor is open source — you can [fork it](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper) and adjust it to your needs. Or build a custom Actor from scratch on top of [Cheerio](https://cheerio.js.org) using Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class — you get full control over the crawl with Cheerio's parsing API and Apify's platform features. ### Can I export Cheerio Scraper data using the Apify API? From e478220ea96835263e9e323ddd3cb9d731828603 Mon Sep 17 00:00:00 2001 From: Marcel Rebro Date: Thu, 7 May 2026 16:28:55 +0200 Subject: [PATCH 8/8] docs(cheerio-scraper): apply Fabian's review feedback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round of edits after PMM review session: - Intro emoji bullets: drop Apify-jargon list from the page-function bullet ("the request, the response, the dataset, the request queue") and shorten to "extract data and steer the crawl". Verb-led "Fork" bullet links the GitHub repo (where forking is a real action). 
- Intro closing paragraph: add JS-heavy redirect to Web/Puppeteer/Playwright Scraper, in addition to the existing AI Web Scraper redirect for non-developers. - FAQ: combine "Puppeteer vs Cheerio" and "Playwright vs Cheerio" into a single entry. Lead with the advantage of browser-based scrapers over Cheerio (dynamic content, interactions, login flows), then quickly cover the Puppeteer vs Playwright difference (browser support). - FAQ "Can I build my own Actor with Cheerio?": split source-code links by intent — view on Apify, fork on GitHub. Replace generic templates link with the Cheerio-filtered query. - Additional resources: drop the Puppeteer Scraper and Playwright Scraper bullets — they're already covered in the combined FAQ entry and the intro closing paragraph. - Lowercase npm in 3 places per Apify docs convention. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../actor-scraper/cheerio-scraper/README.md | 29 +++++++++---------- 1 file changed, 13 insertions(+), 16 deletions(-) diff --git a/packages/actor-scraper/cheerio-scraper/README.md b/packages/actor-scraper/cheerio-scraper/README.md index 3da32644..7b332997 100644 --- a/packages/actor-scraper/cheerio-scraper/README.md +++ b/packages/actor-scraper/cheerio-scraper/README.md @@ -8,15 +8,15 @@ It's a fast, server-side scraper that pulls plain HTML over HTTP and parses it w 🔗 **Crawl recursively** with Link selector, Glob Patterns, and Pseudo-URLs — pagination, sitemaps, full-site crawls -🛠 Write a **custom page function** in JavaScript with full access to Cheerio, the request, the response, the dataset, and the request queue +🛠 Write a **custom page function** in JavaScript to extract data and steer the crawl 📦 Export results as **JSON, CSV, XML, Excel, or HTML**, or pull them via the [Apify API](https://docs.apify.com/api/v2) 🔌 Plug into **Make, Zapier, webhooks, MCP servers**, and the rest of [Apify's integrations](https://apify.com/integrations) -🪪 **Open source** — see the [source on 
GitHub](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper), or build your own with Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) +🪪 Fork the [**open-source Actor**](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper) on GitHub, or build your own with Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) -Cheerio Scraper is built for technical users comfortable with [jQuery](https://jquery.com) and Cheerio. If you're not a developer, you'll likely have a better experience with [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — describe what you want to extract in plain English, no page function required. To learn how Cheerio Scraper works step by step, follow the [**Scraping with Cheerio Scraper**](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper) tutorial in the Apify Academy. +Cheerio Scraper is built for technical users comfortable with [jQuery](https://jquery.com) and Cheerio, and works on **static HTML pages**. For pages that render content with client-side JavaScript, reach for a browser-based scraper instead — [**Web Scraper**](https://apify.com/apify/web-scraper), [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper), or [**Playwright Scraper**](https://apify.com/apify/playwright-scraper). If you're not a developer, [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) lets you describe what to extract in plain English — no page function required. To learn Cheerio Scraper step by step, follow the [**Scraping with Cheerio Scraper**](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper) tutorial in the Apify Academy. ## Cost of usage @@ -67,7 +67,7 @@ Since Cheerio Scraper's **Page function** is executed in the context of the serv [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) (`apify/puppeteer-scraper`). 
If you prefer Firefox and/or [Playwright](https://github.com/microsoft/playwright), check out [**Playwright Scraper**](https://apify.com/apify/playwright-scraper) (`apify/playwright-scraper`). For even more flexibility and control, you might develop a new Actor from scratch in Node.js using [Apify SDK](https://sdk.apify.com/) and [Crawlee](https://crawlee.dev). In the [**Page function**](#page-function) and **Prepare request function**, -you can only use NPM modules that are already installed in this Actor. +you can only use npm modules that are already installed in this Actor. If you require other modules for your scraping, you'll need to develop a completely new Actor. You can use the [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) class from Crawlee to get most of the functionality of Cheerio Scraper out of the box. @@ -177,7 +177,7 @@ thus the `context` parameter of the [**Page function**](#page-function) will hav | Other | `Buffer` | `null` | `null` | The `Content-Type` HTTP header of the web page is parsed using the -content-type NPM package +content-type npm package and the result is stored in the [`context.contentType`](#contenttype-object) object. ### Page function @@ -606,21 +606,23 @@ The fastest way is the step-by-step [**Scraping with Cheerio Scraper**](https:// If you'd rather skip the page function entirely, try the [**AI Web Scraper**](https://apify.com/apify/ai-web-scraper) — you describe what to extract in plain English and the Actor handles the rest. -### When should I use Puppeteer instead of Cheerio? +### When should I use Puppeteer or Playwright instead of Cheerio? -Use **Cheerio Scraper** for static HTML — it's faster and cheaper because no browser is involved. 
When the content needs client-side JavaScript to render, reach for a Puppeteer-based browser scraper: [**Web Scraper**](https://apify.com/apify/web-scraper) is the easy on-ramp (simpler API, runs in the browser context), while [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) gives you lower-level control over the Puppeteer library directly. +Use **Cheerio Scraper** for static HTML — it's faster and cheaper because no browser is involved. Cheerio only sees the raw HTML response, so it can't reach content rendered by client-side JavaScript (single-page apps, infinite scroll, lazy-loaded content). Puppeteer- and Playwright-based scrapers run a real browser, so they handle dynamic content, click and scroll interactions, and login flows that Cheerio can't. -### When should I use Playwright instead of Cheerio? +The two libraries are similar; the main difference is browser support. **Puppeteer** is Chrome-only. **Playwright** also supports Firefox and WebKit. On Apify, you can choose: -If the page needs a real browser to render its content — and you want stronger support for Firefox and WebKit than Puppeteer offers — reach for [**Playwright Scraper**](https://apify.com/apify/playwright-scraper). +- [**Web Scraper**](https://apify.com/apify/web-scraper) — the simplest browser-based scraper, runs in the browser context, uses Puppeteer under the hood. +- [**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) — lower-level control over the Puppeteer library. +- [**Playwright Scraper**](https://apify.com/apify/playwright-scraper) — same idea, with Playwright. ### Can I build my own Actor with Cheerio? -Yes. The Cheerio Scraper Actor is open source — you can [fork it](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper) and adjust it to your needs. 
Or build a custom Actor from scratch on top of [Cheerio](https://cheerio.js.org) using Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) class — you get full control over the crawl with Cheerio's parsing API and Apify's platform features. +Yes. The Cheerio Scraper Actor is open source — [view the source on Apify](https://apify.com/apify/cheerio-scraper/source-code) or [fork it on GitHub](https://github.com/apify/actor-scraper/tree/master/packages/actor-scraper/cheerio-scraper) to adjust it to your needs. Or build a custom Actor from scratch — start from one of the [Cheerio-based Apify Actor templates](https://apify.com/templates?search=cheerio) and use Crawlee's [`CheerioCrawler`](https://crawlee.dev/js/api/cheerio-crawler) for full control over the crawl, with [Cheerio](https://cheerio.js.org)'s parsing API and Apify's platform features. ### Can I export Cheerio Scraper data using the Apify API? -Yes. The Apify API gives you programmatic access to your runs and datasets. To access the API using Node.js, use the `apify-client` [NPM package](https://apify.com/apify/cheerio-scraper/api/javascript). To access the API using Python, use the `apify-client` [PyPI package](https://apify.com/apify/cheerio-scraper/api/python). Check out the [Apify API reference](https://docs.apify.com/api/v2) docs or click on the [API tab](https://apify.com/apify/cheerio-scraper/api) for code examples. +Yes. The Apify API gives you programmatic access to your runs and datasets. To access the API using Node.js, use the `apify-client` [npm package](https://apify.com/apify/cheerio-scraper/api/javascript). To access the API using Python, use the `apify-client` [PyPI package](https://apify.com/apify/cheerio-scraper/api/python). Check out the [Apify API reference](https://docs.apify.com/api/v2) docs or click on the [API tab](https://apify.com/apify/cheerio-scraper/api) for code examples. ### Can I use Cheerio Scraper through an MCP server? 
@@ -651,11 +653,6 @@ You might also want to see these other resources: Apify's basic tool for web crawling and scraping. It uses a full Chrome browser to render dynamic content. A similar web scraping Actor to Puppeteer Scraper, but is simpler to use and only runs in the context of the browser. Uses the [Puppeteer](https://github.com/GoogleChrome/puppeteer) library. -- **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)) - - An Actor similar to Web Scraper, which provides lower-level control of the underlying - [Puppeteer](https://github.com/GoogleChrome/puppeteer) library and the ability to use server-side libraries. -- **Playwright Scraper** ([apify/playwright-scraper](https://apify.com/apify/playwright-scraper)) - - A similar web scraping Actor to Puppeteer Scraper, but using the [Playwright](https://github.com/microsoft/playwright) library instead. - [Actors documentation](https://docs.apify.com/actors) - Documentation for the Apify Actors cloud computing platform. - [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify Actors.