AssemblyAI/streaming-self-hosting-stack

Streaming Self-Hosted Services Docker Compose

This Docker Compose configuration runs the AssemblyAI streaming services as a standalone self-hosted stack.

Choosing a stack

Two compose files are shipped. Pick the one that matches the model you want to serve — they are mutually exclusive (run one at a time):

  • docker-compose.yml: serves the Universal English and Multilingual streaming models. GPU requirement: NVIDIA T4 or newer per ASR container.
  • docker-compose.u3pro.yml: serves the U3 Pro model. GPU requirement: 24 GB+ VRAM (e.g. L4, A10, A100); the image bundles ~14 GB of model weights.

To switch between stacks, run docker compose down (or docker compose -f docker-compose.u3pro.yml down) before starting the other.

Services Included

Both stacks include:

  • streaming-api: Gateway API service handling WebSocket connections.
  • streaming-asr-lb: nginx load balancer for ASR services with header-based routing.
  • license-and-usage-proxy: License validation and usage reporting service.

ASR backends differ by stack:

  • Universal stack (docker-compose.yml): streaming-asr-english and streaming-asr-multilang.
  • U3 Pro stack (docker-compose.u3pro.yml): streaming-asr-u3pro.

Connection Flow

Universal stack (docker-compose.yml):

WebSocket client → streaming-api:8080 (WebSocket)
                          │
                          ├─ Usage reporting     ───────→ license-and-usage-proxy:8080 [if usage-based billing] ────→ https://usage-tracker.assemblyai.com
                          │                               │
                          ├─ License validation  ─────────┘
                          │
                          └─ ASR requests        ───────→ streaming-asr-lb:80 → Header-based routing (X-Model-Version):
                                                                                ├── en-default → streaming-asr-english:50051 (gRPC)
                                                                                └── ml-default → streaming-asr-multilang:50051 (gRPC)

U3 Pro stack (docker-compose.u3pro.yml):

WebSocket client → streaming-api:8080 (WebSocket)
                          │
                          ├─ Usage reporting     ───────→ license-and-usage-proxy:8080 [if usage-based billing] ────→ https://usage-tracker.assemblyai.com
                          │                               │
                          ├─ License validation  ─────────┘
                          │
                          └─ ASR requests        ───────→ streaming-asr-lb:80 → Header-based routing (X-Model-Version):
                                                                                └── u3-pro → streaming-asr-u3pro:50051 (gRPC)

Both stacks share the same nginx_streaming_asr.conf, which routes by the X-Model-Version header. Each stack deploys only the backends it needs, so WebSocket clients should use a speech_model query parameter value that routes to an available backend.
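As an illustrative sketch of the routing above, a client-side helper might validate the model name and build the connection URL. The speech_model values are the ones used by the example client later in this README; the mapping to X-Model-Version backends is inferred from the routing diagrams and is an assumption, not the exact logic inside streaming-api:

```python
from urllib.parse import urlencode

# speech_model values (client side) mapped to the X-Model-Version values
# shown in the routing diagrams above. The mapping itself is an assumption
# for illustration; streaming-api performs the real translation.
MODEL_ROUTES = {
    "universal-streaming-english": "en-default",
    "universal-streaming-multilingual": "ml-default",
    "u3-rt-pro": "u3-pro",
}

def build_streaming_url(endpoint: str, speech_model: str) -> str:
    """Build the WebSocket URL a client would connect to (hypothetical helper)."""
    if speech_model not in MODEL_ROUTES:
        raise ValueError(f"no backend serves speech_model={speech_model!r}")
    return f"{endpoint}/?{urlencode({'speech_model': speech_model})}"
```

For example, `build_streaming_url("ws://localhost:8080", "u3-rt-pro")` yields a URL the U3 Pro stack can route, while an unknown model name fails fast on the client instead of at the load balancer.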

Prerequisites

  1. AssemblyAI license: Valid for the streaming self-hosted product.
  2. Docker & Docker Compose: Ensure Docker and Docker Compose are installed.
  3. GPU Support: NVIDIA Container Toolkit for GPU-enabled services.
  4. AWS Access: Valid AWS credentials to pull images from ECR.

Setup Instructions

1. Docker runtime with GPU support

1.1 Verify NVIDIA drivers are installed:

nvidia-smi

1.2 Install NVIDIA Container Toolkit:

Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.

1.3 Verify the Docker runtime has GPU access:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

2. AWS ECR Authentication

# Login to ECR to pull container images
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 344839248844.dkr.ecr.us-west-2.amazonaws.com

3. Configure Container Images

Copy the reference .env.example file to .env, then set the image variables relevant to the stack you plan to run:

# Required for both stacks:
STREAMING_API_IMAGE=<CUSTOM_IMAGE>
LICENSE_AND_USAGE_PROXY_IMAGE=<CUSTOM_IMAGE>

# Required for the universal stack (docker-compose.yml):
STREAMING_ASR_ENGLISH_IMAGE=<CUSTOM_IMAGE>
STREAMING_ASR_MULTILANG_IMAGE=<CUSTOM_IMAGE>

# Required for the U3 Pro stack (docker-compose.u3pro.yml):
STREAMING_ASR_U3PRO_IMAGE=<CUSTOM_IMAGE>
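As a quick sanity check before starting a stack, you can verify that your .env defines every variable the chosen stack needs. This is a minimal sketch: the variable names come from the listing above, but the helper itself is hypothetical and not part of the repo:

```python
# Image variables listed above, grouped by stack.
REQUIRED_VARS = {
    "common": ["STREAMING_API_IMAGE", "LICENSE_AND_USAGE_PROXY_IMAGE"],
    "universal": ["STREAMING_ASR_ENGLISH_IMAGE", "STREAMING_ASR_MULTILANG_IMAGE"],
    "u3pro": ["STREAMING_ASR_U3PRO_IMAGE"],
}

def missing_vars(env_text: str, stack: str) -> list:
    """Return required variables not assigned in the given .env content."""
    defined = {
        line.split("=", 1)[0].strip()
        for line in env_text.splitlines()
        if "=" in line and not line.lstrip().startswith("#")
    }
    needed = REQUIRED_VARS["common"] + REQUIRED_VARS[stack]
    return [v for v in needed if v not in defined]
```

Running it with `stack="universal"` or `stack="u3pro"` before `docker compose up` catches a missing image reference earlier than a failed pull would.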

4. Have the license file ready

Ensure you have your AssemblyAI license file in the current working directory as license.jwt, or modify the LICENSE_FILE_PATH environment variable in the relevant Docker Compose file to point to your license file location.

5. Run Services

Pick the stack you want to run. Both use the same streaming-api, load balancer, and license proxy — they differ only in the ASR backend.

For the U3 Pro stack, WebSocket clients should set the speech_model query parameter to "u3-rt-pro" so the load balancer routes requests to the U3 Pro backend.

Universal stack (English + Multilingual):

docker compose up -d

docker compose logs -f

# Check service status
docker compose ps

# Stop services before switching stacks
docker compose down

U3 Pro stack:

docker compose -f docker-compose.u3pro.yml up -d

docker compose -f docker-compose.u3pro.yml logs -f

# Check service status
docker compose ps

# Stop services before switching stacks
docker compose -f docker-compose.u3pro.yml down

Service Endpoints

  • WebSocket: ws://localhost:8080

Running the Streaming Example

A Python example script is provided to demonstrate how to stream audio to the self-hosted stack.

Note: You can initiate a session as soon as the relevant ASR container is healthy. streaming-asr-english and streaming-asr-multilang log "Ready to serve!" when ready (typically ~2 min). streaming-asr-u3pro logs "U3Pro ASR Server ready!" when ready (typically ~5 min).
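The readiness messages quoted above can be watched for programmatically, e.g. by scanning the output of docker compose logs. A minimal sketch (the function is illustrative; only the log markers come from the note above):

```python
# Ready markers quoted in the note above, per ASR service.
READY_MARKERS = {
    "streaming-asr-english": "Ready to serve!",
    "streaming-asr-multilang": "Ready to serve!",
    "streaming-asr-u3pro": "U3Pro ASR Server ready!",
}

def is_ready(service: str, log_lines) -> bool:
    """Return True once the given service has logged its ready marker."""
    marker = READY_MARKERS[service]
    return any(marker in line for line in log_lines)
```

For instance, piping `docker compose logs streaming-asr-u3pro` line by line into is_ready tells you when it is safe to start the example session below.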

Change the current directory to the streaming_example directory:

cd streaming_example

Create a fresh Python virtual environment and activate it:

python -m venv streaming_venv
source streaming_venv/bin/activate

Install the required packages to run the example script:

pip install -r requirements.txt

The example script (example_with_prerecorded_audio_file.py) accepts several CLI arguments:

Basic usage:

  • Universal stack English:
    python example_with_prerecorded_audio_file.py --audio-file "example_audio_file.wav" --endpoint "ws://localhost:8080" --speech-model "universal-streaming-english"
  • Universal stack Multilingual:
    python example_with_prerecorded_audio_file.py --audio-file "example_audio_file.wav" --endpoint "ws://localhost:8080" --speech-model "universal-streaming-multilingual"
  • U3 Pro stack:
    python example_with_prerecorded_audio_file.py --audio-file "example_audio_file.wav" --endpoint "ws://localhost:8080" --speech-model "u3-rt-pro"

Command-line arguments:

  • --audio-file: path to the audio file to transcribe (default: example_audio_file.wav)
  • --endpoint: WebSocket endpoint URL (default: ws://localhost:8080)
  • --speech-model: speech model to use, e.g. universal-streaming-multilingual (default: empty)

View help:

python example_with_prerecorded_audio_file.py --help

Configuration

Nginx Configuration

ASR Load Balancer (nginx_streaming_asr.conf):

  • gRPC proxying to ASR services.
  • Routes to English or Multilang model based on the X-Model-Version header value.

Usage Reporting Configuration

The license-and-usage-proxy service supports two billing modes based on your AssemblyAI license:

Flat Billing Mode

If your license is configured for flat billing, usage tracking is disabled. No additional configuration is required.

Usage-Based Billing Mode

If your license is configured for usage-based billing, the proxy will automatically report usage data to AssemblyAI's usage tracker service. You must configure the following environment variable in the docker-compose.yml for the license-and-usage-proxy service:

environment:
  - USAGE_TRACKING_API_KEY=<your-api-key>

Important Notes:

  • Any API key retrieved from the AssemblyAI dashboard can be used.
  • At startup, the proxy validates connectivity by registering with AssemblyAI's usage tracker at https://usage-tracker.assemblyai.com.
  • If connectivity validation fails, the proxy will shut down.
  • Usage data is batched and reported every few seconds.
  • The proxy automatically retries failed requests up to several times.

Critical Behavior: If https://usage-tracker.assemblyai.com becomes unreachable and all retry attempts fail (after 5-60 minutes), the license-and-usage-proxy service will terminate itself. This is a fail-safe mechanism to ensure usage data integrity. Your service orchestrator should be configured to automatically replace the container with a new one.

Monitoring Recommendations:

  • Monitor the proxy's logs for warnings about failed usage reporting attempts.
  • Set up alerts for proxy restarts, which may indicate persistent connectivity issues.
  • If the in-memory usage queue size exceeds 1000 items, the proxy will log a warning suggesting upscaling.
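The batching, retry, and fail-safe behavior described above can be sketched as follows. This is an illustration of the described behavior, not the proxy's actual code; the retry count and backoff values are placeholders:

```python
import time

def report_with_retries(send, batch, max_retries=5, backoff_s=0.0):
    """Try to deliver one usage batch; raise if every attempt fails.

    Illustrative sketch of the behavior described above: sends are retried a
    bounded number of times with backoff, and persistent failure is fatal.
    Returns the number of attempts used on success.
    """
    for attempt in range(max_retries):
        try:
            send(batch)
            return attempt + 1
        except ConnectionError:
            time.sleep(backoff_s * (2 ** attempt))
    # Mirrors the fail-safe described above: persistent failure terminates
    # the process so the orchestrator can replace the container.
    raise SystemExit("usage reporting failed after all retries")
```

The key design point is that the proxy prefers dying loudly over silently dropping usage data, which is why the monitoring recommendations above emphasize alerting on restarts.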

Monitoring & Debugging

Check Service Status

# Container status
docker compose ps

# Resource usage
docker stats

Troubleshooting

Debug Commands

# Check nginx configurations
docker compose exec streaming-asr-lb nginx -t

# Restart specific service (universal stack)
docker compose restart streaming-api
docker compose restart streaming-asr-english
docker compose restart streaming-asr-multilang

# Restart specific service (U3 Pro stack)
docker compose -f docker-compose.u3pro.yml restart streaming-asr-u3pro

Production Deployment Recommendations

streaming-api service

  • Deployment Strategy: We recommend doing Blue/Green deployments to avoid disrupting ongoing sessions. Once you fully shift the traffic to the new color, wait at least 3 hours (the max session duration) before shutting down the old color to ensure no sessions get disrupted.
  • Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it's better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
  • Autoscaling: We recommend setting up autoscaling based on the number of active sessions. A container with 1 CPU can generally handle around 32 concurrent sessions.
  • Monitoring: Always monitor the logs during deployment to catch any potential issues early.
  • Dependencies: For successful startup, the service depends on the license-and-usage-proxy service being up and running.
  • Configuration: You can enable features like TLS encryption and structured logging via environment variables.
  • Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
  • Usage Reporting Behavior: After each session completes, the streaming-api reports usage to the license-and-usage-proxy with automatic retries on failure. Monitor logs for any messages at warning level or above.
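The autoscaling guidance above reduces to a simple capacity formula. A sketch, where the 32-sessions-per-container figure is the rough estimate stated above and the headroom fraction is a hypothetical tuning knob:

```python
import math

SESSIONS_PER_API_CONTAINER = 32  # rough capacity stated above for 1 CPU / 2GB RAM

def api_containers_needed(active_sessions: int, headroom: float = 0.2) -> int:
    """Containers needed for the current load plus fractional headroom."""
    if active_sessions <= 0:
        return 1  # keep at least one container running
    return math.ceil(active_sessions * (1 + headroom) / SESSIONS_PER_API_CONTAINER)
```

For example, 100 active sessions with 20% headroom is 120 session-slots, which needs 4 containers at 32 sessions each.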

license-and-usage-proxy service

  • Deployment Strategy: Do gradual rollouts to ensure stability. Consider implementing monitoring and alerting for service restarts.
  • Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it's better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
  • Monitoring: Always monitor logs during deployment to catch any potential issues early. You can set up an alert based on the responses of the /v1/status endpoint to alert you on any license issues. For usage-based billing, also monitor for usage reporting warnings and service restarts.
  • Dependencies:
    • For successful startup, the service depends on a valid license file being mounted on the container filesystem. To mount it, set the LICENSE_FILE_PATH environment variable to point to the license file path on the host machine.
    • For usage-based billing, the service also requires connectivity to https://usage-tracker.assemblyai.com at startup. If connectivity validation fails, the container will terminate. Ensure the USAGE_TRACKING_API_KEY environment variable is properly configured.
  • Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
  • Usage Reporting Resilience:
    • Network connectivity to the https://usage-tracker.assemblyai.com endpoint must be reliable for production deployments with usage-based billing.
    • Run at least a few containers behind a load balancer to ensure high availability.

License Status Endpoint

The /v1/status endpoint provides real-time information about the license validation state:

Endpoint: GET /v1/status

Response Schema:

{
  "state": "Ready | Connected | TrustBased | Failed",
  "last_successful_checkin": "2025-01-01T12:00:00.000000Z",
  "trust_expiration": "2025-01-05T12:00:00.000000Z"
}

State Descriptions:

  • Ready: Initial state when the service starts before any license validation has occurred.
  • Connected: Last license validation check was successful.
  • TrustBased: Last license validation check failed, but the request was within the trust window grace period, so services will remain operational.
  • Failed: Last license validation check failed and the trust window has expired. streaming-api containers will shut down and stop serving requests.

Fields:

  • state: Current license validation state.
  • last_successful_checkin: ISO 8601 timestamp of the last successful license validation (null if never successful).
  • trust_expiration: ISO 8601 timestamp when the trust window expires (null if no successful validation yet).

Recommended Alerts:

  • Alert when state transitions to TrustBased (indicates license validation issues).
  • Critical alert when state is Failed (services will shut down).
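The alerting rules above can be sketched against the response schema. The state names and their meanings come from the schema and descriptions above; the severity labels are illustrative, not part of the API:

```python
import json

# License states (from the /v1/status schema above) mapped to alert
# severities. The severity names are illustrative.
ALERT_LEVELS = {
    "Ready": "none",         # startup, before any validation has occurred
    "Connected": "none",     # healthy: last validation succeeded
    "TrustBased": "warning",   # validation failing, still inside trust window
    "Failed": "critical",      # trust window expired; streaming-api shuts down
}

def alert_for_status(payload: str) -> str:
    """Return the alert severity for a /v1/status JSON response body."""
    state = json.loads(payload)["state"]
    return ALERT_LEVELS[state]
```

Polling GET /v1/status on an interval and feeding the body through alert_for_status gives a monitor that pages before streaming-api containers start shutting down.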

streaming-asr-english and streaming-asr-multilang services

  • Deployment Strategy: Do gradual rollouts to ensure stability. Both Blue/Green and rolling deployments are good strategies, as the streaming-api can reconnect to a new streaming-asr container if a persistent connection gets disrupted with minimal state loss.
  • Hardware Requirements: The services can run on NVIDIA T4 or newer GPUs. We recommend allocating at least 4 CPU and 16GB of RAM per container.
  • Autoscaling: You can set up autoscaling based on the number of active sessions. A container with recommended hardware can generally handle up to 28 concurrent sessions.
  • Monitoring: Always monitor logs during deployment to catch any potential issues early.
  • Health Checks: Use the healthcheck command provided in the Docker Compose file to monitor container health.

streaming-asr-u3pro service

  • Deployment Strategy: Do gradual rollouts to ensure stability. Both Blue/Green and rolling deployments are good strategies, as the streaming-api can reconnect to a new streaming-asr-u3pro container if a persistent connection gets disrupted with minimal state loss.
  • Hardware Requirements: NVIDIA L4 / A10 / A100 / L40S / H100 or equivalent with at least 24 GB VRAM. The container also needs ~14 GB of disk for the bundled model weights.
  • Autoscaling: You can set up autoscaling based on the number of active sessions. A container using L40S GPU can generally handle up to 40 concurrent sessions.
  • Monitoring: Always monitor logs during deployment to catch any potential issues early.
  • Health Checks: Use the healthcheck command provided in the Docker Compose file to monitor container health.

Changelog

v0.6.0

U3 Pro — New Self-Hosted Stack (NEW)

This release introduces the U3 Pro self-hosted stack (docker-compose.u3pro.yml), which serves the U3 Pro streaming model. U3 Pro delivers significant improvements over the universal English model on complex entities, short utterances, and end-of-turn (EOT) latency, and is targeted at voice agent scenarios.

Hardware: NVIDIA L4 / A10 / A100 / L40S / H100 (24 GB+ VRAM).

Highlights of U3 Pro behavior delivered with this release:

  • New transcription prompt ("Transcribe verbatim with standard punctuation. Include filler words and incomplete utterances.") — 22% reduction in voice-agent hallucinations, 10% WER and 29% short-utterance error-rate reduction on voice-agent traffic, 5% improvement on medical, and improved EP F1.
  • Continuous partials during long turns — partials are emitted incrementally instead of being delayed; turns now stitch up to 60s instead of hard-cutting at 16s/32s.
  • Early partial at 750ms of detected speech for faster UI feedback.

Streaming API — New Features

  • continuous_partials query parameter — clients can opt into continuous partials during long turns.
  • Structured logging — both the U3 Pro ASR server and the universal ASR server now honor USE_STRUCTURED_LOGGING, matching the streaming-api behavior.

Other Improvements

  • Various logging and metrics improvements across the streaming-api and ASR services.
  • Bug fixes and stability improvements.

v0.5.0

English ASR Model

A new English model is released, which produces already-formatted outputs directly and delivers large quality gains on digits, telephony, medical, and CI segments:

  • 34% improvement on digit sequence error rate (DSER)
  • 17% improvement on telephony WER
  • 12% average improvement on medical WER
  • 10% average improvement on CI segments WER
  • ~2.4% absolute F1 score improvement on keyterms prompting
  • Significantly improved timestamp accuracy — resolves overlapping and zero-duration word issues

Multilingual ASR Model

  • ~70% absolute improvement in timestamp accuracy — fixes overlapping words and zero-duration word bugs

Streaming API — New Features

  • Error and Warning WebSocket message types — Dedicated message types that let clients distinguish actionable errors from non-fatal warnings without relying on close codes.
  • Configuration echoed in SessionBegins — The SessionBegins message now includes the resolved session configuration so clients can verify applied settings.
  • Explicit speech-model selection — Clients explicitly select the speech model at session start.

Streaming API — Fixes and Improvements

  • More specific WebSocket close codes for session termination scenarios, making client-side error handling more precise.
  • Improved word_finalized events — All word finalizations are emitted (not only the last word of a turn).

Other Improvements

  • Various logging, metrics, and observability improvements across the streaming-api and ASR services.
  • Bug fixes and stability improvements.

v0.4.0

English ASR Model

Major improvements to short utterance handling and hallucination reduction:

  • 100% reduction in hallucinations
  • 12.8% improvement on short utterances - Better performance for voice agent use cases
  • 7.39% improvement on digit sequence error rate
  • 1.75% improvement on proper nouns
  • 0.46% improvement on CI segments
  • 0.39% improvement on accented speech

Multilingual ASR Model

  • Context biasing support - Customers can now use context biasing (model-based biasing) with the multilingual model

Other Improvements

  • Increased concurrent session handling per container, leading to reduced deployment costs
  • Improved observability for the license-and-usage-proxy service
  • Various bug fixes and stability improvements

About

AssemblyAI streaming self-hosting resources and examples.
