23 changes: 22 additions & 1 deletion docs/guides/architecture_overview.mdx
@@ -53,6 +53,8 @@ class PlaywrightCrawler

class AdaptivePlaywrightCrawler

class StagehandCrawler

%% ========================
%% Inheritance arrows
%% ========================
@@ -63,6 +65,7 @@ BasicCrawler --|> AdaptivePlaywrightCrawler
AbstractHttpCrawler --|> HttpCrawler
AbstractHttpCrawler --|> ParselCrawler
AbstractHttpCrawler --|> BeautifulSoupCrawler
PlaywrightCrawler --|> StagehandCrawler
```

### HTTP crawlers
@@ -79,7 +82,10 @@ You can learn more about HTTP crawlers in the [HTTP crawlers guide](./http-crawl

### Browser crawlers

Browser crawlers use a real browser to render pages, enabling scraping of sites that require JavaScript. They manage browser instances, pages, and context lifecycles. Currently, the only browser crawler is <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, which utilizes the [Playwright](https://playwright.dev/) library. Playwright provides a high-level API for controlling and navigating browsers. You can learn more about <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, its features, and how it internally manages browser instances in the [Playwright crawler guide](./playwright-crawler).
Browser crawlers use a real browser to render pages, enabling scraping of sites that require JavaScript. They manage browser instances, pages, and context lifecycles. Crawlee provides two browser crawlers:

- <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> utilizes the [Playwright](https://playwright.dev/) library and provides a high-level API for controlling and navigating browsers. You can learn more about it in the [Playwright crawler guide](./playwright-crawler).
- <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends `PlaywrightCrawler` with AI-powered browser automation via [Stagehand](https://github.com/browserbase/stagehand). It adds natural-language methods (`act`, `extract`, `observe`, `execute`) directly on the page object. You can learn more about it in the [Stagehand crawler guide](./stagehand-crawler).

### Adaptive crawler

@@ -122,6 +128,12 @@ class AdaptivePlaywrightPreNavCrawlingContext

class AdaptivePlaywrightCrawlingContext

class StagehandPreNavCrawlingContext

class StagehandPostNavCrawlingContext

class StagehandCrawlingContext

%% ========================
%% Inheritance arrows
%% ========================
@@ -143,6 +155,12 @@ PlaywrightPreNavCrawlingContext --|> PlaywrightCrawlingContext
BasicCrawlingContext --|> AdaptivePlaywrightPreNavCrawlingContext

ParsedHttpCrawlingContext --|> AdaptivePlaywrightCrawlingContext

PlaywrightPreNavCrawlingContext --|> StagehandPreNavCrawlingContext

StagehandPreNavCrawlingContext --|> StagehandPostNavCrawlingContext

StagehandPostNavCrawlingContext --|> StagehandCrawlingContext
```

They have a similar inheritance structure as the crawlers, with the base class being <ApiLink to="class/BasicCrawlingContext">`BasicCrawlingContext`</ApiLink>. The specific crawling contexts are:
@@ -154,6 +172,9 @@ They have a similar inheritance structure as the crawlers, with the base class b
- <ApiLink to="class/PlaywrightCrawlingContext">`PlaywrightCrawlingContext`</ApiLink> for Playwright crawlers.
- <ApiLink to="class/AdaptivePlaywrightPreNavCrawlingContext">`AdaptivePlaywrightPreNavCrawlingContext`</ApiLink> for Adaptive Playwright crawlers before the page is navigated.
- <ApiLink to="class/AdaptivePlaywrightCrawlingContext">`AdaptivePlaywrightCrawlingContext`</ApiLink> for Adaptive Playwright crawlers.
- <ApiLink to="class/StagehandPreNavCrawlingContext">`StagehandPreNavCrawlingContext`</ApiLink> for Stagehand crawlers before the page is navigated.
- <ApiLink to="class/StagehandPostNavCrawlingContext">`StagehandPostNavCrawlingContext`</ApiLink> for Stagehand crawlers after the page is navigated.
- <ApiLink to="class/StagehandCrawlingContext">`StagehandCrawlingContext`</ApiLink> for Stagehand crawlers.

## Storages

Empty file.

This file was deleted.

This file was deleted.

This file was deleted.

47 changes: 47 additions & 0 deletions docs/guides/code_examples/stagehand_crawler/basic_example.py
@@ -0,0 +1,47 @@
import asyncio
from typing import cast

from crawlee.browsers import StagehandOptions
from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext


async def main() -> None:
crawler = StagehandCrawler(
stagehand_options=StagehandOptions(
model_api_key='your-openai-api-key',
model='openai/gpt-4.1-mini',
),
max_requests_per_crawl=5,
)

@crawler.router.default_handler
async def handler(context: StagehandCrawlingContext) -> None:
context.log.info(f'Processing {context.request.url} ...')

# Dismiss overlays or interact with the page using natural language.
await context.page.act(input='Click the accept cookies button if present')

# Extract data from the page using AI.
extracted = await context.page.extract(
instruction='Get the page title and the main heading text',
schema={
'type': 'object',
'properties': {
'title': {'type': 'string'},
'heading': {'type': 'string'},
},
},
)

extract_result = extracted.data.result

if isinstance(extract_result, dict):
# Push extracted data to the dataset
# Use `cast()` to provide a more specific type hint for the extracted data.
await context.push_data(cast('dict[str, str | None]', extract_result))

await crawler.run(['https://example.com'])


if __name__ == '__main__':
asyncio.run(main())
37 changes: 37 additions & 0 deletions docs/guides/code_examples/stagehand_crawler/browserbase_example.py
@@ -0,0 +1,37 @@
import asyncio
from typing import cast

from crawlee.browsers import StagehandOptions
from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext


async def main() -> None:
# Use Browserbase cloud browser instead of a local Chromium instance.
crawler = StagehandCrawler(
stagehand_options=StagehandOptions(
env='BROWSERBASE',
browserbase_api_key='your-browserbase-api-key',
project_id='your-project-id',
model_api_key='your-openai-api-key',
model='openai/gpt-4.1-mini',
),
max_requests_per_crawl=5,
)

@crawler.router.default_handler
async def handler(context: StagehandCrawlingContext) -> None:
context.log.info(f'Processing {context.request.url} ...')

extracted = await context.page.extract(
instruction='Get the main content of the page',
)

extract_result = extracted.data.result

await context.push_data(cast('dict[str, str | None]', extract_result))
Comment on lines +29 to +31

Collaborator: maybe explicit type rather than cast?

Suggested change:

    extract_result: dict[str, str | None] = extracted.data.result
    await context.push_data(extract_result)

Collaborator (Author): Unfortunately, in Stagehand, `extracted.data.result` is typed as `object`. That's why I used `cast`:

    extract_result: dict[str, str | None] = extracted.data.result
                    ---------------------   ^^^^^^^^^^^^^^^^^^^^^ Incompatible value of type `object`


await crawler.run(['https://example.com'])


if __name__ == '__main__':
asyncio.run(main())