feat: Add StagehandCrawler with AI-powered browser automation
#1854
Open
Mantisus wants to merge 16 commits into apify:master from Mantisus:crawlee-stagehand-crawler
Commits
1febb65 add stagehand plugin (Mantisus)
85f18de update typing for stagehand (Mantisus)
a47b836 update plugin (Mantisus)
edefde7 Merge branch 'master' into crawlee-stagehand-crawler (Mantisus)
62b0c66 synchronize params between modules (Mantisus)
15ad00a Merge branch 'master' into crawlee-stagehand-crawler (Mantisus)
0424afc fix docs (Mantisus)
bd45915 add tests (Mantisus)
33503cb add docs and fingerprint headers (Mantisus)
4115ae2 fix docs (Mantisus)
476aa4f fixes (Mantisus)
9645808 fix test (Mantisus)
fb34b74 resolve conflict and update stagehand (Mantisus)
a86c049 fix docstring (Mantisus)
ed6ea6f Merge branch 'master' into crawlee-stagehand-crawler (Mantisus)
95cc773 fix docs style (Mantisus)
101 changes: 0 additions & 101 deletions
docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py
This file was deleted.
66 changes: 0 additions & 66 deletions
docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py
This file was deleted.
57 changes: 0 additions & 57 deletions
docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py
This file was deleted.
47 changes: 47 additions & 0 deletions
docs/guides/code_examples/stagehand_crawler/basic_example.py
@@ -0,0 +1,47 @@
+import asyncio
+from typing import cast
+
+from crawlee.browsers import StagehandOptions
+from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext
+
+
+async def main() -> None:
+    crawler = StagehandCrawler(
+        stagehand_options=StagehandOptions(
+            model_api_key='your-openai-api-key',
+            model='openai/gpt-4.1-mini',
+        ),
+        max_requests_per_crawl=5,
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: StagehandCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url} ...')
+
+        # Dismiss overlays or interact with the page using natural language.
+        await context.page.act(input='Click the accept cookies button if present')
+
+        # Extract data from the page using AI.
+        extracted = await context.page.extract(
+            instruction='Get the page title and the main heading text',
+            schema={
+                'type': 'object',
+                'properties': {
+                    'title': {'type': 'string'},
+                    'heading': {'type': 'string'},
+                },
+            },
+        )
+
+        extract_result = extracted.data.result
+
+        if isinstance(extract_result, dict):
+            # Push extracted data to the dataset
+            # Use `cast()` to provide a more specific type hint for the extracted data.
+            await context.push_data(cast('dict[str, str | None]', extract_result))
+
+    await crawler.run(['https://example.com'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
37 changes: 37 additions & 0 deletions
docs/guides/code_examples/stagehand_crawler/browserbase_example.py
@@ -0,0 +1,37 @@
+import asyncio
+from typing import cast
+
+from crawlee.browsers import StagehandOptions
+from crawlee.crawlers import StagehandCrawler, StagehandCrawlingContext
+
+
+async def main() -> None:
+    # Use Browserbase cloud browser instead of a local Chromium instance.
+    crawler = StagehandCrawler(
+        stagehand_options=StagehandOptions(
+            env='BROWSERBASE',
+            browserbase_api_key='your-browserbase-api-key',
+            project_id='your-project-id',
+            model_api_key='your-openai-api-key',
+            model='openai/gpt-4.1-mini',
+        ),
+        max_requests_per_crawl=5,
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: StagehandCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url} ...')
+
+        extracted = await context.page.extract(
+            instruction='Get the main content of the page',
+        )
+
+        extract_result = extracted.data.result
+
+        await context.push_data(cast('dict[str, str | None]', extract_result))
+
+    await crawler.run(['https://example.com'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
maybe explicit type rather than cast?
Unfortunately, in Stagehand, `extracted.data.result` is typed as `object`. That's why I used `cast`.
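For illustration, here is a minimal sketch of the trade-off discussed in this thread. It does not call Stagehand: `get_result()` is a hypothetical stand-in for any value that a library types as `object`, and both helper names are assumptions made for this example. A plain annotation is rejected by type checkers in that situation, which leaves `cast` (no runtime check) or explicit runtime narrowing.

```python
from __future__ import annotations

from typing import cast


def get_result() -> object:
    """Hypothetical stand-in for a value typed as `object` by a library."""
    return {'title': 'Example Domain', 'heading': None}


def with_cast() -> dict[str, str | None]:
    result = get_result()
    # A bare annotation (`data: dict[str, str | None] = result`) would not pass
    # type checking, because `object` is not assignable to `dict`.
    # `cast` only informs the type checker; it validates nothing at runtime.
    return cast('dict[str, str | None]', result)


def with_narrowing() -> dict[str, str | None]:
    result = get_result()
    # Runtime alternative: check the shape, then build a precisely typed dict.
    if not isinstance(result, dict):
        raise TypeError(f'Expected dict, got {type(result).__name__}')
    return {str(k): (None if v is None else str(v)) for k, v in result.items()}
```

The `cast` variant keeps the example short, which may be why it suits documentation code; the narrowing variant costs a few more lines but fails loudly if the value is not a dict.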