feat: add token usage metrics with OpenTelemetry integration#563

Open
ajbozarth wants to merge 8 commits into generative-computing:main from ajbozarth:feat/token-usage-metrics-v2

Conversation


@ajbozarth ajbozarth commented Feb 26, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Summary

Adds token usage metrics tracking across all Mellea backends using OpenTelemetry metrics counters, following Gen-AI Semantic Conventions for standardized observability.

Changes

Core Implementation

  • Added record_token_usage_metrics() function in mellea/telemetry/metrics.py
  • Implemented lazy initialization of token counters (mellea.llm.tokens.input, mellea.llm.tokens.output)
  • Integrated token tracking into all backends: OpenAI, Ollama, WatsonX, LiteLLM, and HuggingFace
  • Added console exporter support for debugging (MELLEA_METRICS_CONSOLE=true)

Configuration

  • New environment variable: MELLEA_METRICS_ENABLED (default: false)
  • New environment variable: MELLEA_METRICS_CONSOLE (default: false)
  • Metrics export via existing OTEL_EXPORTER_OTLP_ENDPOINT
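A sketch of how these flags might be read; the helper names are hypothetical and the real parsing in mellea may accept other truthy values (e.g. "1" or "yes"):

```python
import os


def _env_flag(name: str, default: str = "false") -> bool:
    # Treat only the literal string "true" (case-insensitive) as enabled.
    return os.environ.get(name, default).strip().lower() == "true"


def metrics_enabled() -> bool:
    # Master switch: metrics are off unless explicitly opted in.
    return _env_flag("MELLEA_METRICS_ENABLED")


def console_export_enabled() -> bool:
    # Debug aid: additionally print metrics to stdout via a console exporter.
    return _env_flag("MELLEA_METRICS_CONSOLE")
```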

Metrics Attributes

All token metrics include Gen-AI semantic convention attributes:

  • gen_ai.system - Backend system name (e.g., openai, ollama)
  • gen_ai.request.model - Model identifier
  • mellea.backend - Backend class name
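For illustration, the attribute set above could be assembled as below; the helper name is hypothetical, and the key names follow the OpenTelemetry Gen-AI semantic conventions as listed:

```python
def token_metric_attributes(
    system: str, model: str, backend_cls: type
) -> dict[str, str]:
    # One attribute dict per recorded measurement; the mellea.backend value
    # is the backend class name per the description above.
    return {
        "gen_ai.system": system,          # e.g. "openai", "ollama"
        "gen_ai.request.model": model,    # model identifier
        "mellea.backend": backend_cls.__name__,
    }
```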

Testing

  • Added comprehensive unit tests for metrics configuration and recording
  • Added integration tests for all backends (Ollama, OpenAI, WatsonX, LiteLLM, HuggingFace)
  • Tests verify proper token counting and attribute tagging

Documentation

  • Updated docs/dev/telemetry.md with complete metrics documentation
  • Added usage examples and configuration guide
  • Documented backend support matrix

Backend Support

Backend       Support   Token Source
OpenAI        ✅ Full    usage.prompt_tokens, usage.completion_tokens
Ollama        ✅ Full    prompt_eval_count, eval_count
WatsonX       ✅ Full    input_token_count, generated_token_count
LiteLLM       ✅ Full    usage.prompt_tokens, usage.completion_tokens
HuggingFace   ✅ Full    Calculated from input_ids and output sequences

Breaking Changes

None - metrics are disabled by default and require explicit opt-in via MELLEA_METRICS_ENABLED=true.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Add mellea.llm.tokens.input/output counters following Gen-AI semantic conventions with zero overhead when disabled

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…LM backends

Add record_token_usage_metrics() calls to all backend post_processing methods to track input/output tokens. Add get_value() helper in backends/utils.py to handle dict/object attribute extraction.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
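The get_value() helper described in the commit above could plausibly look like this; a sketch, not the actual implementation. Backends return usage either as plain dicts (e.g. Ollama) or as response objects with attributes (e.g. OpenAI client models), so one accessor covers both:

```python
from typing import Any


def get_value(obj: Any, key: str, default: Any = None) -> Any:
    # Uniform extraction: dict lookup for mappings, getattr otherwise,
    # with a default when the key/attribute is missing or obj is None.
    if obj is None:
        return default
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)
```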
Calculate token counts from input_ids and output sequences. Record to both tracing spans and metrics using a helper function.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
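A rough sketch of the HuggingFace calculation described above, using plain lists in place of tensors. It assumes (as is the default for decoder-only models) that generate() returns sequences that include the prompt tokens, so the output count is total length minus prompt length; the function name is hypothetical:

```python
def hf_token_counts(
    input_ids: list[list[int]], sequences: list[list[int]]
) -> tuple[int, int]:
    # Input tokens: total length of the tokenized prompts.
    input_tokens = sum(len(ids) for ids in input_ids)
    # Generated sequences include the prompt, so subtract it back out
    # to get only the newly generated tokens.
    total_tokens = sum(len(seq) for seq in sequences)
    return input_tokens, total_tokens - input_tokens
```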
- Add integration tests for Ollama, OpenAI, LiteLLM, HuggingFace, WatsonX
- Tests revealed metrics were coupled with tracing (architectural issue)
- Fixed: Metrics now record independently of tracing spans
- WatsonX: Store full response to preserve usage information
- HuggingFace: Add zero-overhead guard, optimize test model

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…ation

Use MonkeyPatch for cleanup and update Watsonx to granite-4-h-small.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
- Add Token Usage Metrics section to docs/dev/telemetry.md with metric
  definitions, backend support table, and configuration examples
- Create metrics_example.py demonstrating token tracking with tested
  console output
- Update telemetry_example.py to reference new metrics example
- Update examples/telemetry/README.md with metrics quick start guide

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth self-assigned this Feb 26, 2026
@ajbozarth ajbozarth requested a review from a team as a code owner February 26, 2026 22:45
@github-actions

The PR description has been updated. Please fill out the template for your PR to be reviewed.


mergify bot commented Feb 26, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

After opening this I had Bob and Claude do in-depth reviews, and they came back with a handful of things I want to address. I will work on fixing those tomorrow.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

I've pushed a small update to test and document streaming support, as suggested by AI review.

As of now this is ready for full review and merge


Development

Successfully merging this pull request may close these issues.

Implement counters to track token usage across all LLM backends with model and backend labels
