A service that turns SaaS and internal APIs into OpenTelemetry metrics and logs, feeding business and process observability directly into your existing OTEL stack. Filtering and deduplication options come out of the box, so repeated records don't pollute your backends.
API2OTEL is a YAML-driven async scraper that turns uninstrumented HTTP/JSON APIs into first-class OpenTelemetry metrics and logs, without building one-off exporters or wiring custom code. Point it at the APIs that hide your operational or business state and it will poll, extract, shape, deduplicate, and emit telemetry through the OTEL pipeline you already run.
Most API surfaces (SaaS, internal platforms, scheduled batch endpoints) already contain answers to questions teams ask in dashboards: queue depth, job runtimes, sync failures, external SLAs, integration throughput. They rarely expose native OTEL or Prometheus signals. The usual "solution" becomes a patchwork of cron scripts, throwaway Python, or bespoke collectors that are hard to extend and impossible to standardize.
API2OTEL focuses on turning that glue work into a declarative layer:
- Define sources, auth, scrape cadence, and time windows in one file.
- Map raw JSON fields to gauges, counters, histograms, and structured logs.
- Apply record filtering, volume caps, and fingerprint-based deduplication so backends stay lean.
- Run historical backfills (range scrapes) and ongoing incremental polls side by side.
- Observe the scraper itself (self-telemetry) to catch stalls, slow scrapes, or ineffective dedupe.
Instead of "write a mini integration for every API", you version a config, commit it, and gain portable, reviewable observability coverage.
This service is designed to integrate cleanly into an existing observability stack (or to stand up a new one using the local testing stack below).
Most teams run critical flows on systems they don't control:
- SaaS platforms: Workday, ServiceNow, Jira, GitHub, Salesforce, …
- Internal tools: Only expose REST/HTTP APIs or "download report" endpoints
- Batch runners: Emit JSON, not OTEL signals
They already have an observability stack built on OpenTelemetry, but bridging those APIs typically ends up as messy one-offs:
- Python scripts + cron that nobody owns
- SaaS-specific "exporters" that can't be reused across products
- JSON dumps and screenshots instead of real metrics
API2OTEL makes this reusable and standard:

API data → extract records → emit OTLP → your collector
No code changes. No vendor lock-in. Everything flows through your existing OTEL stack.
otel-api-scraper is a config-driven async service that:
- Polls any HTTP API or data endpoint
- Extracts records from JSON responses
- Maps them to OTEL metrics (gauges, counters, histograms) and logs
- Emits everything via OTLP to your collector
```
[ APIs / data endpoints ]
        ↓ HTTP
otel-api-scraper (this)
        ↓ OTLP (gRPC/HTTP)
OpenTelemetry Collector
        ↓
Prometheus / Grafana / Loki / …
```
**Config-driven sources**

- Entirely YAML-driven: add or change sources by editing config, with no code changes.
- Declare every source in YAML: frequency (`5min`, `1h`, `1d`, …), scrape mode (range with start/end or relative windows, or instant snapshots), time formats (global and per-source), and query params (time keys, extra args, URL encoding rules).
- Check out the config template to learn more about the configuration parameters; a sketch of a range-mode source follows below.
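As a rough illustration, a range-mode source might look like the sketch below. Only `name`, `baseUrl`, `endpoint`, `frequency`, and `scrape.type` mirror the documented example; the range, time-format, and query-param key names are assumptions, so check the config template for the real schema.

```yaml
sources:
  - name: ticket-system
    baseUrl: https://tickets.example.com    # illustrative API
    endpoint: /api/v1/incidents
    frequency: 15m
    scrape:
      type: range                    # poll a time window instead of an instant snapshot
      relativeWindow: 2h             # hypothetical: "last 2 hours" on every run
    timeFormat: "%Y-%m-%dT%H:%M:%SZ" # hypothetical per-source override of the global format
    queryParams:                     # hypothetical: how window bounds and extra args are passed
      startKey: from
      endKey: to
      extra:
        status: open
      urlEncode: true
```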
**Authentication**

- Built-in: Basic (env creds), API key headers, OAuth (static token or runtime fetch via HTTP GET/POST with configurable body/headers and response key), and Azure AD client credentials.
- Tokens are fetched asynchronously and reused per source.
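As a hedged sketch (the `auth` block and its key names below are assumptions, not the exact schema), a runtime-fetched OAuth token and an Azure AD client-credentials source might be declared along these lines:

```yaml
sources:
  - name: saas-with-oauth
    baseUrl: https://api.vendor.example
    endpoint: /v2/jobs
    frequency: 1h
    auth:                                  # hypothetical block and key names
      type: oauth
      tokenUrl: https://auth.vendor.example/token
      method: POST                         # runtime fetch via HTTP POST
      body: { grant_type: client_credentials }
      responseKey: access_token            # where the token lives in the token response

  - name: azure-backed-api
    baseUrl: https://internal.example
    endpoint: /reports/latest
    frequency: 1d
    auth:                                  # hypothetical block and key names
      type: azure_ad                       # client-credentials flow against the Azure token endpoint
      tenantIdEnv: AZURE_TENANT_ID         # secrets resolved from environment variables
      clientIdEnv: AZURE_CLIENT_ID
      clientSecretEnv: AZURE_CLIENT_SECRET
```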
**Concurrency**

- Asyncio/httpx end-to-end.
- Global concurrency limit plus per-source limits.
- Range scrapes can split into sub-windows and run in parallel within those limits, so you stay under rate caps while scraping multiple systems.
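A sketch of how those limits might be expressed; `parallelWindow` is a real option (see the architecture notes), but its placement here and the other limit key names are illustrative guesses:

```yaml
scraper:
  maxConcurrentRequests: 10        # hypothetical global cap shared by all sources
sources:
  - name: bulk-history
    baseUrl: https://api.example.com
    endpoint: /v1/events
    frequency: 1d
    scrape:
      type: range
      parallelWindow: 2h           # split a 24-hour range into 2-hour sub-windows scraped in parallel
    maxConcurrentScrapes: 4        # hypothetical per-source limit
```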
**Filtering and limits**

- Drop rules, keep rules, and per-scrape caps: "don't emit INFO", "only these IDs", "cap at N records".
- Protects metrics backends and logging costs from noisy sources.
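A minimal sketch of what such rules could look like; the matcher names (`equals`, `in`, and friends) follow the pipeline description, while the surrounding block and key names are assumptions:

```yaml
sources:
  - name: noisy-job-feed
    baseUrl: https://jobs.example.com
    endpoint: /runs
    frequency: 5m
    filters:                       # hypothetical block and key names
      drop:
        any:                       # drop a record if any predicate matches
          - field: level
            equals: INFO           # "don't emit INFO"
      keep:
        all:                       # keep only records where all predicates match
          - field: team_id
            in: ["payments", "billing"]   # "only these IDs"
    maxRecordsPerScrape: 500       # hypothetical cap ("cap at N records")
```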
**Deduplication**

- Fingerprints are stored in SQLite or Valkey (Redis-compatible) with configurable TTLs and key/mode options.
- Enables historical scrapes and frequent "last N hours" polls without duplicate spam.
- The scheduler's last-success state shares the same backend.
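A sketch under assumed key names (the backends, TTLs, and per-source overrides are documented; the exact keys below are guesses):

```yaml
scraper:
  fingerprintStore:                # hypothetical key names
    backend: valkey                # or sqlite
    url: valkey://valkey:6379/0
    ttl: 72h                       # how long a fingerprint suppresses re-emission

sources:
  - name: audit-events
    baseUrl: https://audit.example.com
    endpoint: /events
    frequency: 10m
    dedupe:                        # hypothetical per-source overrides
      keys: [event_id]             # fingerprint only these fields instead of the full record
      ttl: 7d
```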
**Metrics and logs**

- Metrics live in config: gauges/counters/histograms read values from `dataKey` or use a `fixedValue`; attributes can emit counters via `asMetric`; logs are toggled per source with `emitLogs`; severity is mapped from record fields.
- Labels come from attributes and optional metric labels as configured.
- Records become OTEL logs with severity derived from a configured field; attributes align with metrics for easy pivots.
- Per-source `emitLogs` lets you opt out where logs aren't useful.
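Putting that together, a source that emits all three metric kinds plus logs might look roughly like this; `counterReadings`, `dataKey`, `attributes`, `asMetric`, and `emitLogs` appear in the documented example, while `gaugeReadings`, `histogramReadings`, `buckets`, and `severityKey` are illustrative names:

```yaml
sources:
  - name: queue-stats
    baseUrl: https://queue.example.com
    endpoint: /stats
    frequency: 5m
    gaugeReadings:                 # hypothetical key: each record sets the gauge to its value
      - name: queue_depth
        dataKey: depth
    counterReadings:
      - name: processed_total
        dataKey: processed
        unit: "1"
    histogramReadings:             # hypothetical key: explicit bucket boundaries
      - name: job_duration_seconds
        dataKey: duration
        buckets: [1, 5, 30, 120, 600]
    attributes:
      - name: queue_name
        dataKey: queue
        asMetric: true             # also emit this attribute as a counter
    emitLogs: true
    severityKey: level             # hypothetical: record field that maps to log severity
```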
Use this when:
- ✅ You need metrics/logs about business processes or integrations that only exist as API responses
- ✅ You already have an OTEL collector and want to feed it more sources
- ✅ You need real auth (OAuth, Azure AD) and time windows (historical backfills, relative ranges)
- ✅ You want to deduplicate data or cap volumes with filtering
You probably don't need this when:
- ❌ The system already emits OTLP or Prometheus natively: just scrape it directly
- ❌ You only need simple uptime checks: use the collector's `httpcheck` receiver
- ❌ You're fine writing a one-off Go receiver for a single vendor
Prerequisites
- Python 3.10+
- A running OTEL collector listening for OTLP (gRPC or HTTP)
- `uv` or `pip` for Python dependencies
- **Install**

  - Using uv (recommended): `uv sync`
  - Or with plain pip: `pip install .`
- **Create a config**

  - Copy the template: `cp config.yaml.template config.yaml`
  - Set at least:
    - `scraper.otelCollectorEndpoint` → your collector's OTLP endpoint.
    - One simple source pointing at an HTTP endpoint you control.

  Example minimal source (simplified):

  ```yaml
  scraper:
    otelCollectorEndpoint: "http://otel-collector:4318"
    otelTransport: "http"   # or "grpc"

  sources:
    - name: JSON-Placeholder
      baseUrl: https://jsonplaceholder.typicode.com
      endpoint: /posts
      frequency: 5m
      scrape:
        type: instant
      counterReadings:
        - name: invoke_counts
          fixedValue: 1
        - name: sum_of_ids
          dataKey: id
          unit: "1"
      attributes:
        - name: user_id
          dataKey: userId
        - name: post_id
          dataKey: id
      emitLogs: true
      runFirstScrape: false
  ```

  (Use your real API instead of JSONPlaceholder; full config semantics are documented in the configuration docs.)
- **Run the scraper**

  - With uv: `uv run otel-api-scraper --config /app/config.yaml`
  - Or with the installed console script: `otel-api-scraper --config /app/config.yaml`

  By default, it will schedule the configured source(s), scrape the API, and emit metrics/logs via OTLP to the collector.
- **Check your telemetry**

  - In your collector logs, look for incoming metrics/logs from the service `otel-api-scraper`.
  - In Prometheus/Grafana/Loki, query for the metric/log names you configured.
Get the scraper + OTEL collector + Prometheus + Grafana + Loki running in one command:
Prerequisites
- Docker & Docker Compose
- No Python installation needed
- **Start the full stack**

  ```bash
  cd "docs/LOCAL_TESTING"
  docker-compose up -d
  ```
- **Update your config (optional)**

  - Edit `config.yaml` in the repo root
  - The compose setup mounts it into the scraper container
  - Restart the scraper to apply changes: `docker-compose restart scraper`
- **Access the dashboards**

  - Grafana: http://localhost:3000 (default user: `admin` / `admin`)
  - Prometheus: http://localhost:9090
  - Loki: http://localhost:3100
- **View scraper logs**

  `docker-compose logs -f scraper`
- **Stop everything**

  `docker-compose down -v`
For more details, see LOCAL_TESTING.md and the LOCAL_TESTING/ config directory.
The scraper includes an optional FastAPI-based Admin API for runtime control and monitoring.
```yaml
scraper:
  enableAdminApi: true
  servicePort: 8080               # Port for admin API (default: 80)
  adminSecretEnv: "ADMIN_SECRET"  # Environment variable containing the bearer token
```

Set the admin token via environment variable:

```bash
export ADMIN_SECRET="your-secure-token-here"
```

Once enabled, interactive API documentation is available at:
- Swagger UI: `http://localhost:8080/docs` (or `http://<hostname>:<port>/docs`)
- ReDoc: `http://localhost:8080/redoc`
All admin endpoints require bearer token authentication:

```bash
curl -H "Authorization: Bearer your-secure-token-here" http://localhost:8080/health
```

| Endpoint | Method | Auth Required | Description |
|---|---|---|---|
| `/health` | GET | ❌ No | Health check; returns 200 OK if the service is running |
| `/config` | GET | ✅ Yes | Returns the effective configuration as JSON (with sensitive values redacted) |
| `/sources` | GET | ✅ Yes | Lists all configured sources with their settings |
| `/scrape/{source_name}` | POST | ✅ Yes | Triggers an immediate scrape for the specified source (bypasses the scheduler) |
Example Usage:

```bash
# Check health (no auth needed)
curl http://localhost:8080/health

# Get current configuration
curl -H "Authorization: Bearer ${ADMIN_SECRET}" http://localhost:8080/config

# List all sources
curl -H "Authorization: Bearer ${ADMIN_SECRET}" http://localhost:8080/sources

# Manually trigger a scrape
curl -X POST -H "Authorization: Bearer ${ADMIN_SECRET}" \
  http://localhost:8080/scrape/my-source-name
```

Admin experience enhancements are on the roadmap! See here
The scraper can emit its own operational metrics and logs when `enableSelfTelemetry: true` is configured. This allows you to monitor the scraper's health, performance, and behavior.
```yaml
scraper:
  enableSelfTelemetry: true       # Enable self-monitoring metrics
  otelCollectorEndpoint: "http://otel-collector:4318"
  serviceName: "otel-api-scraper"
```

When enabled, the following metrics are emitted:
| Metric Name | Type | Unit | Description | Attributes |
|---|---|---|---|---|
| **Scrape Execution** | | | | |
| `scraper_scrape_duration_seconds` | Histogram | `s` | Distribution of scrape execution times | `source`, `status`, `api_type` |
| `scraper_scrape_total` | Counter | `1` | Total number of scrapes executed | `source`, `status`, `api_type` |
| `scraper_last_scrape_duration_seconds` | Gauge | `s` | Duration of the most recent scrape | `source`, `status`, `api_type` |
| `scraper_last_records_emitted` | Gauge | `1` | Number of records emitted in the most recent scrape | `source`, `status`, `api_type` |
| **Deduplication** | | | | |
| `scraper_dedupe_hits_total` | Counter | `1` | Total fingerprints skipped (already seen) | `source`, `api_type` |
| `scraper_dedupe_misses_total` | Counter | `1` | Total fingerprints processed (new records) | `source`, `api_type` |
| `scraper_dedupe_total` | Counter | `1` | Total records processed through dedupe | `source`, `api_type` |
| `scraper_dedupe_hit_rate` | Gauge | `1` | Ratio of hits to total (0.0 to 1.0) | `source`, `api_type` |
| **Cleanup Jobs** | | | | |
| `scraper_cleanup_duration_seconds` | Histogram | `s` | Distribution of cleanup job execution times | `job`, `backend` |
| `scraper_cleanup_last_duration_seconds` | Gauge | `s` | Duration of the most recent cleanup | `job`, `backend` |
| `scraper_cleanup_items_total` | Counter | `1` | Total items cleaned across all jobs | `job`, `backend` |
| `scraper_cleanup_last_items` | Gauge | `1` | Number of items cleaned in the most recent run | `job`, `backend` |
Common Attributes:

- `source`: Name of the source being scraped
- `status`: `success` or `error`
- `api_type`: `instant` or `range`
- `job`: Cleanup job type (`fingerprint_cleanup`, `orphan_cleanup`)
- `backend`: Storage backend (`sqlite`, `valkey`)
```promql
# Scrape success rate
rate(scraper_scrape_total{status="success"}[5m]) / rate(scraper_scrape_total[5m])

# Average scrape duration
rate(scraper_scrape_duration_seconds_sum[5m]) / rate(scraper_scrape_duration_seconds_count[5m])

# Deduplication efficiency
scraper_dedupe_hit_rate * 100
```
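For instance, a hedged sketch of a Prometheus alerting rule built on these metrics (the threshold and durations are illustrative; TELEMETRY.md has curated rules):

```yaml
groups:
  - name: otel-api-scraper
    rules:
      - alert: ScraperSourceFailing
        # Fires when more than 10% of scrapes for a source fail over 15 minutes (illustrative threshold)
        expr: |
          sum by (source) (rate(scraper_scrape_total{status="error"}[15m]))
            / sum by (source) (rate(scraper_scrape_total[15m])) > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "otel-api-scraper is failing scrapes for {{ $labels.source }}"
```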
For detailed examples, PromQL queries, alerting rules, and best practices, see TELEMETRY.md
The scraper is built as an async-first Python application with clear separation of concerns.
Click to view details on core components
#### **Config & Validation** (`config.py`)
- Pydantic models for strict config schema validation.
- Environment variable resolution via `${VAR_NAME}` syntax.
- Supports: sources, auth types, scrape modes, metrics, filters, attributes, etc.
- Fails fast with clear errors on schema violations.
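For example, endpoints and secrets can be kept out of the file and resolved at load time; a small sketch, assuming the `${VAR_NAME}` resolution applies to string fields like these:

```yaml
scraper:
  otelCollectorEndpoint: "${OTEL_COLLECTOR_ENDPOINT}"  # resolved from the environment when the config loads
sources:
  - name: vendor-api
    baseUrl: "${VENDOR_BASE_URL}"                      # illustrative: any string value referencing an env var
    endpoint: /v1/status
    frequency: 5m
```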
#### **HTTP Client** (`http_client.py`)
- `AsyncHttpClient`: Wraps `httpx.AsyncClient` with connection pooling and global semaphore.
- Auth strategies (pluggable):
- `BasicAuth`: Encodes username/password.
- `ApiKeyAuth`: Injects header (e.g., `X-API-Key`).
- `OAuthAuth`: Static token or runtime fetch with configurable body/headers.
- `AzureADAuth`: Client credentials flow to Azure token endpoint.
- Token caching: OAuth tokens fetched once and reused until expiry.
- All requests are async and respect concurrency limits.
#### **Scraper Engine** (`scraper_engine.py`)
- **Window computation**: For range scrapes, calculates start/end based on frequency and last scrape time. Supports relative windows ("last N hours") and historical backfills.
- **Sub-window splitting**: If `parallelWindow` is configured, splits a large range into smaller chunks for parallel scraping (e.g., a 24-hour range → 12 × 2-hour chunks).
- **Concurrency orchestration**: Maintains per-source semaphores; enforces global limit via shared semaphore in `AsyncHttpClient`.
- **Response handling**: Extracts records via `dataKey` using flexible nested path syntax (dot notation, array indexing, slicing).
- **Error resilience**: Catches HTTP errors, response parsing errors, and logs them without crashing the scraper.
#### **Record Pipeline** (`pipeline.py`)
- **Filtering**: Applies drop/keep rules (any/all predicates with `equals`, `not_equals`, `in`, `regex` matchers).
- **Limits**: Caps records per scrape to prevent memory/storage spikes.
- **Delta detection**:
- Fingerprints records (MD5 hash of full record or specified keys).
- Checks fingerprint store (sqlite or Valkey).
- Only emits records with unseen fingerprints (within TTL window).
- Supports per-source TTL/max entries overrides.
#### **Fingerprint Store** (`fingerprints.py`)
- **Backend options**: SQLite (local file) or Valkey (distributed).
- **Storage**: Maps `(source_name, fingerprint)` → `(timestamp, ttl_expires_at)`.
- **Cleanup**: Background task periodically removes expired fingerprints.
- **Orphan cleanup**: Removes fingerprints for sources that have been removed from config.
#### **State Store** (`state.py`)
- Tracks last successful scrape timestamp per source.
- Persists in same backend as fingerprint store.
- Enables resumption after restarts: next scrape picks up where the last one ended (no re-scraping old data).
#### **Telemetry** (`telemetry.py`)
- **SDK initialization**: Sets up OTEL SDK with OTLP exporter (gRPC or HTTP).
- **Metric emission**:
- **Gauges**: Current values from records (each record sets gauge to its value).
- **Counters**: Aggregate (sum field values, fixed value per record, or add 1 per record).
- **Histograms**: Distributions with explicit bucket boundaries.
- Labels derived from source `attributes`; no separate label definitions.
- **Log emission**: Per-record logs with severity derived from configured field.
- **Attributes**: Added to all telemetry for pivoting/filtering in backends.
- **Dry run**: If `dryRun: true`, logs metric/log summaries to stderr instead of exporting.
- **Self-telemetry**: If `enableSelfTelemetry: true`, emits scraper's own metrics (scrape duration, record counts, errors).
#### **Scheduler** (`scheduler.py`)
- `APScheduler AsyncIOScheduler` integrated into the asyncio event loop.
- Parses frequency strings (`"5min"`, `"1h"`, `"1d"`, etc.) into cron/interval schedules.
- One job per source; each job calls the scraper engine.
- Supports `runFirstScrape: true` to scrape immediately on startup.
#### **Admin API** (`admin_api.py`)
- Optional FastAPI HTTP server on `servicePort` (default 80).
- Endpoints:
- `GET /health` → health check (always 200).
- `GET /config` → effective config as JSON (auth-gated).
- `POST /scrape/{source_name}` → trigger a manual scrape (auth-gated).
- `GET /sources` → list all configured sources (auth-gated).
- Authentication via Bearer token from environment variable (`adminSecretEnv`).
#### **Utils** (`utils.py`)
- **Path extraction** (`lookup_path`): Nested dict/list traversal with dot notation, array indexing, slicing.
- **Datetime handling**: Parse/format with per-source and global format overrides.
- **Frequency parsing** (`parse_frequency`): Convert `"5min"`, `"1h"`, etc. to timedelta.
- **Window slicing** (`window_slices`): Generate sub-windows for parallel scraping.
- **Query building** (`build_query_string`): Construct URL params with optional URL encoding.
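As an illustration of how the nested-path syntax might be used from config (the example paths are assumptions about how dot notation, indexing, and slicing combine; `lookup_path` defines the actual grammar):

```yaml
sources:
  - name: nested-report
    baseUrl: https://reports.example.com
    endpoint: /daily
    frequency: 1d
    # Hypothetical dataKey paths:
    #   data.jobs            -> the list of records under data.jobs
    #   data.jobs[0].status  -> the first job's status
    #   data.jobs[0:10]      -> a slice of the first ten jobs
    counterReadings:
      - name: failed_jobs
        dataKey: data.summary.failed
        unit: "1"
```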
#### **Runner** (`runner.py`)
- **Entrypoint**: Loads config, initializes all components, starts scheduler and optional admin API.
- **Cleanup loop**: Background task periodically runs fingerprint store cleanup.
- **Graceful shutdown**: Cancels scheduler, closes HTTP client, flushes telemetry.
Comprehensive guides and examples for every aspect of the scraper:
- Configuration Reference – Global settings, source settings, all options explained
- Authentication Examples – All 6 auth types with real API examples
- Scrape Types Examples – Range vs. instant scraping patterns
- Measurement Types Examples – Gauge/counter/histogram configuration patterns
- Self-Telemetry Guide – Complete metrics catalog, PromQL examples, alerting rules, and monitoring best practices
- Local Testing Stack – Docker Compose setup with Grafana + Loki + Prometheus + OTEL collector
Full documentation: https://aakashh242.github.io/otel-api-scraper/
Contributions welcome! Areas of interest:
- New auth strategies (SAML, Kerberos, mTLS, etc.)
- Receiver for additional data formats (XML, Parquet, Protocol Buffers, etc.)
- Built-in connector templates for popular SaaS (Salesforce, Jira, etc.)
- Performance improvements or test coverage
- Documentation and examples
- **Clone and install dependencies:**

  ```bash
  git clone <repo-url>
  cd otel-api-scraper
  uv sync --dev
  ```
- **Install pre-commit hooks:**

  ```bash
  uv run pre-commit install
  uv run pre-commit install --hook-type commit-msg
  ```
- **Follow conventional commits:** All commit messages must follow the Conventional Commits format:

  ```bash
  git commit -m "feat(auth): add SAML authentication support"
  git commit -m "fix(scraper): handle timeout errors properly"
  git commit -m "docs(readme): update installation instructions"
  ```
- **Ensure tests pass:**

  - Code must pass linting (ruff)
  - Test coverage must be ≥ 90%
  - All tests must pass
For detailed contribution guidelines, see CONTRIBUTING.md
