Automated scraper generation system for data collection pipelines using Claude Code. Generates type-safe, quality-checked scrapers with a tabbed question interface, version tracking, and maintenance tools.
Latest Updates (v1.13.0):
- BA Agent v2.2.0: CRITICAL - Explicit tool usage enforcement (no pseudo-code), absolute hallucination prohibition, working directory awareness
This project contains:
- Infrastructure Code: Base classes for Redis hash registry, JSON logging, Kafka notifications, and collection framework
- Claude Code Plugin: Agents that generate production-ready scrapers
- Tests: Unit tests for infrastructure components
- Examples: Sample generated scrapers
- Documentation: Usage guides and API references
claude_scraper_agent/
├── infrastructure/ # Base classes to be copied to sourcing project
│ ├── hash_registry.py
│ ├── logging_json.py
│ ├── kafka_utils.py
│ └── collection_framework.py
├── plugin/ # Claude Code plugin
│ ├── plugin.json
│ ├── agents/
│ ├── commands/
│ └── skills/
├── tests/ # Unit tests
├── examples/ # Example generated scrapers
└── docs/ # Documentation
Install the Claude Code plugin directly from the marketplace:
# Add the marketplace
claude plugin marketplace add https://github.com/MarkEnverus/claude-scraper-agent
# Install the plugin
claude plugin install scraper-dev@scraper-agent-marketplace

That's it! Infrastructure files are automatically installed the first time you run /create-scraper.
Install directly from the GitHub repository:
# Add the GitHub repository as a marketplace
claude plugin marketplace add https://github.com/MarkEnverus/claude-scraper-agent
# Install the plugin
claude plugin install scraper-dev@scraper-agent-marketplace

- Restart Claude Code (close and reopen the terminal)
- Type /create-scraper - the command should autocomplete
- If it autocompletes, the plugin is installed correctly
For detailed installation instructions and troubleshooting, see INSTALLATION.md.
# In Claude Code
/create-scraper

The agent will interview you about:
- Data source name
- Data type
- Collection method (HTTP API, Website, FTP, etc.)
- Authentication requirements
- Update frequency
# In Claude Code
/fix-scraper

Use this when a scraper stops working due to:
- API endpoint changes
- Data format changes
- Authentication updates
- Import errors
- Any other code issues
The agent will:
- Scan for existing scrapers
- Let you select which one to fix
- Diagnose the problem
- Propose and apply fixes
# Scan mode (just report what needs updating)
/update-scraper
/update-scraper --mode=scan
# Auto mode (propose and apply updates)
/update-scraper --mode=auto

Use this when infrastructure is updated to:
- Sync scrapers with new framework versions
- Add missing features (e.g., Kafka support)
- Update imports after refactoring
- Apply bug fixes and improvements
The agent will:
- Scan all scrapers for version information
- Report which scrapers need updates
- (In auto mode) Propose updates and apply with approval
- Preserve all custom business logic
User: /create-scraper
Agent: I'll help you create a new scraper. Let me gather some information...
1. Data Source Name: NYISO
2. Data Type: hourly_load
3. Collection Method: HTTP REST API
4. Data Format: JSON
5. Update Frequency: hourly
6. Historical Support: Yes
7. Authentication: API Key
Agent: [Generates complete scraper with tests and documentation]
Generated files:
✓ sourcing/scraping/nyiso/scraper_nyiso_hourly_load_http.py
✓ sourcing/scraping/nyiso/tests/test_scraper_nyiso_hourly_load_http.py
✓ sourcing/scraping/nyiso/tests/fixtures/sample_response.json
✓ sourcing/scraping/nyiso/README.md
This plugin includes an enhanced business analyst agent that can extract API documentation from JavaScript-rendered documentation sites.
# 1. Configure MCP Puppeteer (one-time setup)
echo '{
"mcpServers": {
"puppeteer": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-puppeteer"]
}
}
}' > ~/.claude/mcp.json
# 2. Restart Claude Code
# 3. Use the agent
/analyze-api https://data-exchange.misoenergy.org/api-details#api=pricing-api

- ✅ Handles JavaScript-rendered sites - Uses browser automation when needed
- ✅ Falls back gracefully - Tries WebFetch first for speed
- ✅ Comprehensive analysis - Extracts endpoints, auth, parameters, response formats
- ✅ JIRA-ready output - Provides complete developer specifications
- WebFetch first: Fast extraction for static documentation
- Browser automation: For single-page apps and JS-heavy sites
- Intelligent detection: Automatically chooses the right approach
- Fallback option: Guides users to copy/paste if browser tools unavailable
For browser automation support (optional but recommended):
- Node.js installed
- MCP Puppeteer server configured (see Quick Start above)
Without browser automation:
- Agent still works with WebFetch
- Will ask you to copy/paste content from JS-heavy sites
- Setup instructions provided when needed
Scenario 1: Modern API documentation portal
/analyze-api https://data-exchange.misoenergy.org/api-details#api=pricing-api
→ ba-enhanced agent launches
→ Detects JS-heavy site
→ Uses browser automation
→ Extracts full API documentation
→ Returns JIRA-ready specification
Scenario 2: Static documentation
/analyze-api https://api.example.com/docs
→ ba-enhanced agent launches
→ Uses WebFetch (fast)
→ Extracts documentation
→ Returns specification
Scenario 3: Browser tools not configured
/analyze-api https://js-heavy-site.com/docs
→ ba-enhanced agent launches
→ Tries WebFetch, gets minimal content
→ Checks for browser tools - not available
→ Provides setup instructions for MCP Puppeteer
→ Offers workaround: copy/paste from browser
Use the enhanced BA agent to analyze documentation, then create scrapers:
# Step 1: Analyze the API
/analyze-api https://data-exchange.misoenergy.org/api-details#api=pricing-api
# Step 2: BA agent provides detailed specification
# Step 3: Create scraper with that specification
/create-scraper
[Paste the specification from BA agent]
| Feature | WebFetch | MCP Puppeteer |
|---|---|---|
| Speed | ⚡ Fast (~1-2 sec) | 🐌 Slower (~5-10 sec) |
| JavaScript | ❌ No execution | ✅ Full JS execution |
| Static sites | ✅ Perfect | ✅ Works (just slower) |
| SPA/React sites | ❌ Returns empty | ✅ Full content |
| Setup | ✅ Built-in | ⚙️ Requires one-time MCP configuration |
The agent automatically chooses the best tool for each site.
Replaces DynamoDB-based deduplication with Redis:
# Key format: hash:{env}:{dgroup}:{sha256_hash}
hash:dev:nyiso_load_forecast:abc123...
# Automatic TTL (default 365 days)
# Environment namespacing (dev/staging/prod isolation)
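For orientation, a minimal sketch of the same scheme using the redis client directly; the actual hash_registry.py wraps this behaviour and its interface may differ:

import hashlib
import os

import redis

client = redis.Redis(
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=int(os.environ.get("REDIS_PORT", "6379")),
    db=int(os.environ.get("REDIS_DB", "0")),
)

content = b"...downloaded file bytes..."
digest = hashlib.sha256(content).hexdigest()
key = f"hash:dev:nyiso_load_forecast:{digest}"

# SET ... NX returns None when the key already exists, i.e. the file is a duplicate
if client.set(key, "1", nx=True, ex=365 * 24 * 3600):  # 365-day TTL
    print("new file -> upload to S3")
else:
    print("duplicate -> skip")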
s3://bucket/sourcing/{dgroup}/year={YYYY}/month={MM}/day={DD}/{filename}

Example:
s3://bucket/sourcing/nyiso_load_forecast/year=2025/month=01/day=20/load_forecast_20250120_14.json
Note: Files are stored in their original format (uncompressed) to preserve file integrity and enable direct access.
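A minimal sketch of assembling that key and writing it with boto3; the collection framework does this for you, and the bucket/dgroup values here are placeholders:

import datetime

import boto3

bucket = "your-s3-bucket-name"
dgroup = "nyiso_load_forecast"
filename = "load_forecast_20250120_14.json"
now = datetime.datetime.now(datetime.timezone.utc)

key = f"sourcing/{dgroup}/year={now:%Y}/month={now:%m}/day={now:%d}/{filename}"

s3 = boto3.client("s3")
s3.put_object(Bucket=bucket, Key=key, Body=b"...original file bytes...")  # stored uncompressed
print(f"s3://{bucket}/{key}")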
{
"timestamp": "2025-01-20T14:30:05Z",
"level": "INFO",
"logger": "sourcing_app",
"message": "Successfully collected",
"module": "scraper_nyiso_load",
"function": "run_collection",
"candidate": "load_forecast_20250120_14.json",
"hash": "abc123...",
"s3_path": "s3://..."
}
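Records of this shape can be produced by a logging.Formatter that serialises each record to JSON. A minimal sketch of the idea (the shipped logging_json.py may name or structure its fields differently):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON, similar to the sample above."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
        }
        # Fields passed via `extra=` (candidate, hash, s3_path, ...) appear as record attributes
        for field in ("candidate", "hash", "s3_path"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

logger = logging.getLogger("sourcing_app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Successfully collected", extra={"candidate": "load_forecast_20250120_14.json"})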
You can pre-configure scrapers using .scraper-dev.md files in your mono-repo:

Location:
sourcing/scraping/{dataSource}/{dataSet}/.scraper-dev.md
Example:
---
data_source: NYISO
data_type: load_forecast
collection_method: HTTP/REST API
api_base_url: https://api.nyiso.com/v1
api_endpoint: /load/hourly
---

Benefits:
- Agent automatically finds and uses config values
- Only prompts for missing information
- Supports multiple projects in mono-repo
- See .scraper-dev.example.md for a full example
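For reference, the block between the --- markers is plain YAML front matter. The agent parses it itself; the equivalent read in Python looks roughly like this (assumes PyYAML and an illustrative file path):

from pathlib import Path

import yaml  # PyYAML

text = Path("sourcing/scraping/nyiso/load_forecast/.scraper-dev.md").read_text()
_, front_matter, _ = text.split("---", 2)  # content between the first two --- markers
config = yaml.safe_load(front_matter)
print(config["data_source"], config["api_base_url"])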
# Redis
export REDIS_HOST=localhost
export REDIS_PORT=6379
export REDIS_DB=0
# S3
export S3_BUCKET=your-s3-bucket-name
export AWS_PROFILE=default # or use IAM role
# Kafka (optional)
export KAFKA_CONNECTION_STRING="kafka://host:port/topic?security_protocol=SASL_PLAINTEXT&X_sasl_file=/path/to/creds"
# Or use environment variables for SASL
export SASL_USERNAME=your_username
export SASL_PASSWORD=your_password
# Source-specific API keys
export NYISO_API_KEY=your_key_here
export IBM_API_KEY=your_key_here
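The KAFKA_CONNECTION_STRING above is a URL, so its parts can be pulled out with the standard library; a sketch under that assumption (the shipped kafka_utils.py may parse it differently):

import os
from urllib.parse import parse_qs, urlparse

conn = os.environ.get(
    "KAFKA_CONNECTION_STRING",
    "kafka://host:port/topic?security_protocol=SASL_PLAINTEXT&X_sasl_file=/path/to/creds",
)
parsed = urlparse(conn)
params = {k: v[0] for k, v in parse_qs(parsed.query).items()}

bootstrap_servers = parsed.netloc    # "host:port"
topic = parsed.path.lstrip("/")      # "topic"
print(bootstrap_servers, topic, params.get("security_protocol"), params.get("X_sasl_file"))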
For code generation and quality checks:

# Required for scraper execution
uv pip install redis boto3 click confluent-kafka pydantic
# Required for code quality (optional but recommended)
uv pip install mypy ruff
# Required for testing
uv pip install pytest pytest-cov fakeredis

# Install all dependencies
uv pip install pytest pytest-cov redis fakeredis boto3 confluent-kafka pydantic
# Run all tests
pytest tests/ -v --cov=infrastructure
# Run specific test
pytest tests/test_hash_registry.py -v
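The infrastructure tests can run against fakeredis instead of a live Redis; a minimal sketch of that pattern, using the hash key format shown above (illustrative, not one of the shipped tests):

import hashlib

import fakeredis

def test_duplicate_hash_is_rejected() -> None:
    client = fakeredis.FakeStrictRedis()
    digest = hashlib.sha256(b"payload").hexdigest()
    key = f"hash:dev:nyiso_load_forecast:{digest}"

    assert client.set(key, "1", nx=True, ex=365 * 24 * 3600)          # first sighting
    assert client.set(key, "1", nx=True, ex=365 * 24 * 3600) is None  # duplicate skipped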
Generated scrapers are automatically checked with mypy and ruff:

# Manual check (auto-run during generation)
mypy sourcing/scraping/*/scraper_*.py
ruff check sourcing/scraping/*/scraper_*.py
# Auto-fix ruff issues
ruff check --fix sourcing/scraping/*/scraper_*.py

Each scraper follows this pattern:
from typing import List
from collection_framework import BaseCollector, DownloadCandidate  # adjust the import to your project layout

class MyCollector(BaseCollector):
    def generate_candidates(self, **kwargs) -> List[DownloadCandidate]:
        """Create list of files to download"""

    def collect_content(self, candidate: DownloadCandidate) -> bytes:
        """Download file via HTTP/FTP/etc."""

    def validate_content(self, content: bytes, candidate: DownloadCandidate) -> bool:
        """Validate downloaded content"""

- ✅ Enhanced BA agent - Browser automation for JavaScript-rendered documentation sites
- ✅ MCP Puppeteer integration - Extracts API docs from SPAs, React sites, modern portals
- ✅ /analyze-api command - Analyze API documentation with graceful fallbacks
- ✅ JSON format preference - Defaults to JSON for structured data with override option
- ✅ All v1.4.2 features (uv package management, tabbed questions, type-safe code)
- ✅ uv package management - Modernized to use uv instead of pip for faster, more reliable dependency installation
- ✅ All v1.4.1 features (tabbed question interface, text-input fields, type-safe code, quality checks)
- ✅ Tabbed question interface - Visual tabs with radio buttons for better user experience
- ✅ Text-input fields - Data Source, Data Type, and Format use free-form text input (no suggestions)
- ✅ All v1.4.0 features (type-safe code, quality checks, auto-fix, quality reporting)
- ✅ Type-safe code generation - All generated scrapers include comprehensive type hints
- ✅ Automatic quality checks - mypy and ruff checks run after generation
- ✅ Auto-fix capability - Automatically fixes ruff style issues
- ✅ Quality reporting - Clear reports on type errors and style issues
- ✅ pyproject.toml configuration - Automatically installs quality tool configs
- ✅ Version tracking in generated scrapers
- ✅ /fix-scraper command for debugging and fixing issues
- ✅ /update-scraper command for infrastructure sync
- ✅ Anti-hallucination rules to prevent example data contamination
- ✅ Infrastructure verification and auto-install
- v1.2.0: Config file support (.scraper-dev.md)
- v1.1.0: Self-contained Kafka support
- v1.0.0: HTTP/REST API, Website parsing, FTP/SFTP, Email attachment scrapers
- v1.0.0: Redis hash deduplication, S3 date partitioning, JSON logging
- v1.0.0: 80%+ test coverage, auto-generated documentation
- ⏳ GraphQL scrapers
- ⏳ WebSocket scrapers
- ⏳ Bulk migration tool (update 1000s of existing scrapers)
- ⏳ Grafana metrics integration
- ⏳ Data catalog integration
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
- Issues: File on GitHub
- Documentation: See the docs/ directory
- Examples: See the examples/ directory
MIT License
Generated by: Claude Scraper Agent v1.4.6 | Last Updated: 2025-12-05