Claude Scraper Agent v1.13.0

Automated scraper generation system for data collection pipelines using Claude Code. Generates type-safe, quality-checked scrapers with a tabbed question interface, version tracking, and maintenance tools.

Latest Updates (v1.13.0):

  • BA Agent v2.2.0 (critical): enforces explicit tool usage (no pseudo-code), absolute prohibition on hallucinated content, and working-directory awareness

Overview

This project contains:

  • Infrastructure Code: Base classes for Redis hash registry, JSON logging, Kafka notifications, and collection framework
  • Claude Code Plugin: Agents that generate production-ready scrapers
  • Tests: Unit tests for infrastructure components
  • Examples: Sample generated scrapers
  • Documentation: Usage guides and API references

Project Structure

claude_scraper_agent/
├── infrastructure/          # Base classes to be copied to sourcing project
│   ├── hash_registry.py
│   ├── logging_json.py
│   ├── kafka_utils.py
│   └── collection_framework.py
├── plugin/                  # Claude Code plugin
│   ├── plugin.json
│   ├── agents/
│   ├── commands/
│   └── skills/
├── tests/                   # Unit tests
├── examples/                # Example generated scrapers
└── docs/                    # Documentation

Installation

Install from the Marketplace

Install the Claude Code plugin by adding the GitHub repository as a marketplace:

# Add the marketplace
claude plugin marketplace add https://github.com/MarkEnverus/claude-scraper-agent

# Install the plugin
claude plugin install scraper-dev@scraper-agent-marketplace

That's it! Infrastructure files are automatically installed when you first run /create-scraper.

Verify Installation

  1. Restart Claude Code (close and reopen terminal)
  2. Type /create-scraper - the command should autocomplete
  3. If it does, the plugin is installed correctly

For detailed installation instructions and troubleshooting, see INSTALLATION.md.

Usage

Generate a New Scraper

# In Claude Code
/create-scraper

The agent will interview you about:

  • Data source name
  • Data type
  • Collection method (HTTP API, Website, FTP, etc.)
  • Authentication requirements
  • Update frequency

Fix an Existing Scraper

# In Claude Code
/fix-scraper

Use this when a scraper stops working due to:

  • API endpoint changes
  • Data format changes
  • Authentication updates
  • Import errors
  • Any other code issues

The agent will:

  1. Scan for existing scrapers
  2. Let you select which one to fix
  3. Diagnose the problem
  4. Propose and apply fixes

Update Scrapers to New Infrastructure

# Scan mode (just report what needs updating)
/update-scraper
/update-scraper --mode=scan

# Auto mode (propose and apply updates)
/update-scraper --mode=auto

Use this after infrastructure updates to:

  • Sync scrapers with new framework versions
  • Add missing features (e.g., Kafka support)
  • Update imports after refactoring
  • Apply bug fixes and improvements

The agent will:

  1. Scan all scrapers for version information
  2. Report which scrapers need updates
  3. (In auto mode) Propose updates and apply with approval
  4. Preserve all custom business logic

Example Workflow

User: /create-scraper

Agent: I'll help you create a new scraper. Let me gather some information...

1. Data Source Name: NYISO
2. Data Type: hourly_load
3. Collection Method: HTTP REST API
4. Data Format: JSON
5. Update Frequency: hourly
6. Historical Support: Yes
7. Authentication: API Key

Agent: [Generates complete scraper with tests and documentation]

Generated files:
✓ sourcing/scraping/nyiso/scraper_nyiso_hourly_load_http.py
✓ sourcing/scraping/nyiso/tests/test_scraper_nyiso_hourly_load_http.py
✓ sourcing/scraping/nyiso/tests/fixtures/sample_response.json
✓ sourcing/scraping/nyiso/README.md

Enhanced Business Analyst Agent

This plugin includes an enhanced business analyst agent that can extract API documentation from JavaScript-rendered documentation sites.

Quick Start

# 1. Configure MCP Puppeteer (one-time setup)
echo '{
  "mcpServers": {
    "puppeteer": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
    }
  }
}' > ~/.claude/mcp.json

# 2. Restart Claude Code

# 3. Use the agent
/analyze-api https://data-exchange.misoenergy.org/api-details#api=pricing-api

Features

  • Handles JavaScript-rendered sites - Uses browser automation when needed
  • Falls back gracefully - Tries WebFetch first for speed
  • Comprehensive analysis - Extracts endpoints, auth, parameters, response formats
  • JIRA-ready output - Provides complete developer specifications

How It Works

  1. WebFetch first: Fast extraction for static documentation
  2. Browser automation: For single-page apps and JS-heavy sites
  3. Intelligent detection: Automatically chooses the right approach (a heuristic sketch follows this list)
  4. Fallback option: Guides users to copy/paste if browser tools unavailable
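
As an illustration only, detection like step 3 could be approximated with a simple visible-text heuristic (this is an assumption, not the agent's actual logic):

# Illustrative heuristic; the agent's real detection logic may differ.
import re
from urllib.request import urlopen

def looks_js_rendered(url: str) -> bool:
    html = urlopen(url).read().decode("utf-8", errors="replace")
    no_scripts = re.sub(r"<script.*?</script>", "", html, flags=re.S)  # drop inline JS
    visible = re.sub(r"<[^>]+>", " ", no_scripts)                      # strip remaining tags
    # Almost no visible text usually means the page is rendered client-side.
    return len(visible.split()) < 50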

Prerequisites

For browser automation support (optional but recommended):

  • Node.js installed
  • MCP Puppeteer server configured (see Quick Start above)

Without browser automation:

  • Agent still works with WebFetch
  • Will ask you to copy/paste content from JS-heavy sites
  • Setup instructions provided when needed

Example Usage

Scenario 1: Modern API documentation portal

/analyze-api https://data-exchange.misoenergy.org/api-details#api=pricing-api

→ ba-enhanced agent launches
→ Detects JS-heavy site
→ Uses browser automation
→ Extracts full API documentation
→ Returns JIRA-ready specification

Scenario 2: Static documentation

/analyze-api https://api.example.com/docs

→ ba-enhanced agent launches
→ Uses WebFetch (fast)
→ Extracts documentation
→ Returns specification

Scenario 3: Browser tools not configured

/analyze-api https://js-heavy-site.com/docs

→ ba-enhanced agent launches
→ Tries WebFetch, gets minimal content
→ Checks for browser tools - not available
→ Provides setup instructions for MCP Puppeteer
→ Offers workaround: copy/paste from browser

Integration with Scraper Generator

Use the enhanced BA agent to analyze documentation, then create scrapers:

# Step 1: Analyze the API
/analyze-api https://data-exchange.misoenergy.org/api-details#api=pricing-api

# Step 2: BA agent provides detailed specification

# Step 3: Create scraper with that specification
/create-scraper
[Paste the specification from BA agent]

MCP Puppeteer vs WebFetch

Feature         | WebFetch           | MCP Puppeteer
----------------|--------------------|----------------------
Speed           | ⚡ Fast (~1-2 sec)  | 🐌 Slower (~5-10 sec)
JavaScript      | ❌ No execution     | ✅ Full JS execution
Static sites    | ✅ Perfect          | ⚠️ Overkill
SPA/React sites | ❌ Returns empty    | ✅ Full content
Setup           | ✅ Built-in         | ⚠️ Requires Node.js

The agent automatically chooses the best tool for each site.

Architecture

Redis Hash Registry

Replaces DynamoDB-based deduplication with Redis:

# Key format: hash:{env}:{dgroup}:{sha256_hash}
hash:dev:nyiso_load_forecast:abc123...

# Automatic TTL (default 365 days)
# Environment namespacing (dev/staging/prod isolation)
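
A sketch of how such a check can be implemented with redis-py (the ENV variable name and function signature are illustrative; the actual implementation lives in infrastructure/hash_registry.py):

import hashlib
import os

import redis

def is_new_content(content: bytes, dgroup: str, ttl_days: int = 365) -> bool:
    """Register the content hash; return True only if it was not seen before."""
    env = os.environ.get("ENV", "dev")  # illustrative environment variable
    digest = hashlib.sha256(content).hexdigest()
    key = f"hash:{env}:{dgroup}:{digest}"
    client = redis.Redis(
        host=os.environ.get("REDIS_HOST", "localhost"),
        port=int(os.environ.get("REDIS_PORT", "6379")),
        db=int(os.environ.get("REDIS_DB", "0")),
    )
    # SET with nx=True writes only if the key is absent; ex= applies the TTL.
    return bool(client.set(key, "1", nx=True, ex=ttl_days * 86400))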

S3 Date Partitioning

s3://bucket/sourcing/{dgroup}/year={YYYY}/month={MM}/day={DD}/{filename}

Example:
s3://bucket/sourcing/nyiso_load_forecast/year=2025/month=01/day=20/load_forecast_20250120_14.json

Note: Files are stored in their original format (uncompressed) to preserve file integrity and enable direct access.
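
A sketch of building the partitioned key and uploading with boto3 (helper names are illustrative):

import os
from datetime import date

import boto3

def s3_key(dgroup: str, d: date, filename: str) -> str:
    return f"sourcing/{dgroup}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/{filename}"

def upload(content: bytes, dgroup: str, d: date, filename: str) -> str:
    bucket = os.environ["S3_BUCKET"]
    key = s3_key(dgroup, d, filename)
    # Files are uploaded as-is, matching the uncompressed storage policy above.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=content)
    return f"s3://{bucket}/{key}"

# upload(data, "nyiso_load_forecast", date(2025, 1, 20), "load_forecast_20250120_14.json")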

JSON Structured Logging

{
  "timestamp": "2025-01-20T14:30:05Z",
  "level": "INFO",
  "logger": "sourcing_app",
  "message": "Successfully collected",
  "module": "scraper_nyiso_load",
  "function": "run_collection",
  "candidate": "load_forecast_20250120_14.json",
  "hash": "abc123...",
  "s3_path": "s3://..."
}
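
A minimal formatter sketch that produces records in this shape (the shipped logging_json.py may differ in detail):

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
        }
        # Context fields arrive via logger.info(..., extra={"candidate": ..., ...}).
        for field in ("candidate", "hash", "s3_path"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("sourcing_app").addHandler(handler)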

Configuration

Optional: Scraper Configuration Files

You can pre-configure scrapers using .scraper-dev.md files in your mono-repo:

Location:

sourcing/scraping/{dataSource}/{dataSet}/.scraper-dev.md

Example:

---
data_source: NYISO
data_type: load_forecast
collection_method: HTTP/REST API
api_base_url: https://api.nyiso.com/v1
api_endpoint: /load/hourly
---
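
Frontmatter in this form can be read with a few lines of Python (a sketch, assuming PyYAML; the agent's own parsing may differ):

from pathlib import Path

import yaml

def load_scraper_config(path: str) -> dict:
    text = Path(path).read_text()
    if not text.startswith("---"):
        return {}
    # The config sits between the first two '---' delimiters.
    _, frontmatter, _ = text.split("---", 2)
    return yaml.safe_load(frontmatter) or {}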

Benefits:

  • Agent automatically finds and uses config values
  • Only prompts for missing information
  • Supports multiple projects in mono-repo
  • See .scraper-dev.example.md for full example

Environment Variables

# Redis
export REDIS_HOST=localhost
export REDIS_PORT=6379
export REDIS_DB=0

# S3
export S3_BUCKET=your-s3-bucket-name
export AWS_PROFILE=default  # or use IAM role

# Kafka (optional)
export KAFKA_CONNECTION_STRING="kafka://host:port/topic?security_protocol=SASL_PLAINTEXT&X_sasl_file=/path/to/creds"
# Or use environment variables for SASL
export SASL_USERNAME=your_username
export SASL_PASSWORD=your_password

# Source-specific API keys
export NYISO_API_KEY=your_key_here
export IBM_API_KEY=your_key_here
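
A sketch of how a generated scraper might read these variables, with the defaults shown above:

import os

REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))
REDIS_DB = int(os.environ.get("REDIS_DB", "0"))
S3_BUCKET = os.environ["S3_BUCKET"]              # required; raises KeyError if unset
NYISO_API_KEY = os.environ.get("NYISO_API_KEY")  # source-specific, may be None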

Development

Dependencies

For code generation and quality checks:

# Required for scraper execution
uv pip install redis boto3 click confluent-kafka pydantic

# Required for code quality (optional but recommended)
uv pip install mypy ruff

# Required for testing
uv pip install pytest pytest-cov fakeredis

Run Tests

# Install all dependencies
uv pip install pytest pytest-cov redis fakeredis boto3 confluent-kafka pydantic

# Run all tests
pytest tests/ -v --cov=infrastructure

# Run specific test
pytest tests/test_hash_registry.py -v

Code Quality

Generated scrapers are automatically checked with mypy and ruff:

# Manual check (auto-run during generation)
mypy sourcing/scraping/*/scraper_*.py
ruff check sourcing/scraping/*/scraper_*.py

# Auto-fix ruff issues
ruff check --fix sourcing/scraping/*/scraper_*.py

Generated Scraper Structure

Each scraper follows this pattern:

# Base classes are provided by infrastructure/collection_framework.py;
# the import path depends on where the files are installed in your project.
from typing import List

from collection_framework import BaseCollector, DownloadCandidate

class MyCollector(BaseCollector):
    def generate_candidates(self, **kwargs) -> List[DownloadCandidate]:
        """Create the list of files to download."""

    def collect_content(self, candidate: DownloadCandidate) -> bytes:
        """Download one file via HTTP/FTP/etc."""

    def validate_content(self, content: bytes, candidate: DownloadCandidate) -> bool:
        """Validate the downloaded content before storage."""

Features

v1.4.6 - Enhanced Business Analysis

  • Enhanced BA agent - Browser automation for JavaScript-rendered documentation sites
  • MCP Puppeteer integration - Extracts API docs from SPAs, React sites, modern portals
  • /analyze-api command - Analyze API documentation with graceful fallbacks
  • JSON format preference - Defaults to JSON for structured data with override option
  • ✅ All v1.4.2 features (uv package management, tabbed questions, type-safe code)

v1.4.2 - Modern Python Tooling

  • uv package management - Modernized to use uv instead of pip for faster, more reliable dependency installation
  • ✅ All v1.4.1 features (tabbed question interface, text-input fields, type-safe code, quality checks)

v1.4.1 - Improved UX

  • Tabbed question interface - Visual tabs with radio buttons for better user experience
  • Text-input fields - Data Source, Data Type, and Format use free-form text input (no suggestions)
  • ✅ All v1.4.0 features (type-safe code, quality checks, auto-fix, quality reporting)

v1.4.0 - Code Quality

  • Type-safe code generation - All generated scrapers include comprehensive type hints
  • Automatic quality checks - mypy and ruff checks run after generation
  • Auto-fix capability - Automatically fixes ruff style issues
  • Quality reporting - Clear reports on type errors and style issues
  • pyproject.toml configuration - Automatically installs quality tool configs

v1.3.0 Features

  • ✅ Version tracking in generated scrapers
  • /fix-scraper command for debugging and fixing issues
  • /update-scraper command for infrastructure sync
  • ✅ Anti-hallucination rules to prevent example data contamination
  • ✅ Infrastructure verification and auto-install

v1.0-1.2 Features

  • v1.2.0: Config file support (.scraper-dev.md)
  • v1.1.0: Self-contained Kafka support
  • v1.0.0: HTTP/REST API, Website parsing, FTP/SFTP, Email attachment scrapers
  • v1.0.0: Redis hash deduplication, S3 date partitioning, JSON logging
  • v1.0.0: 80%+ test coverage, auto-generated documentation

v2.0 (Future)

  • ⏳ GraphQL scrapers
  • ⏳ WebSocket scrapers
  • ⏳ Bulk migration tool (update thousands of existing scrapers)
  • ⏳ Grafana metrics integration
  • ⏳ Data catalog integration

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

Support

  • Issues: File on GitHub
  • Documentation: See docs/ directory
  • Examples: See examples/ directory

License

MIT License


Generated by: Claude Scraper Agent v1.4.6 · Last Updated: 2025-12-05
