refactor(telemetry)!: move backend tracing onto plugin/hook pattern#1181
refactor(telemetry)!: move backend tracing onto plugin/hook pattern#1181ajbozarth wants to merge 5 commits into
Conversation
Phase 1 of the tracing epic: migrates backend span emission from inline calls scattered across five backends onto a BackendTracingPlugin that subscribes to the existing generation_* hooks for chat and new generation_batch_* hooks for raw. Spans stay live on the OTel context across the API call so nested instrumentation parents under them. - Renames MELLEA_TRACE_* env vars to plural MELLEA_TRACES_* and introduces MELLEA_TRACES_ENABLED umbrella flag, opt-in MELLEA_TRACES_OTLP, and signal-specific OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. No deprecation shim. - Drops deprecated gen_ai.system attribute in favor of gen_ai.provider.name. - Prefixes application-span attributes with mellea. for consistency with backend spans and the existing logging/metrics pillars. - Eagerly initialises the tracer provider and registers the BackendTracingPlugin at module import when MELLEA_TRACES_ENABLED is truthy. Tests reset module state and call _setup_tracing() to re-init after env-var changes, removing the need for importlib.reload. - Adds generation_batch_pre_call/post_call/error hook types so raw-path emits one span per generate_from_raw call (matching OTel GenAI semconv) rather than one per MOT. - Deletes mellea/telemetry/backend_instrumentation.py; backends no longer import from mellea.telemetry (except .context). - Removes the _telemetry_span round-trip on mot._meta and the tracing-specific error block in core/base.py; error-path span closure lives in the plugin's generation_error hook. Closes generative-computing#1045, generative-computing#1046, generative-computing#1047 (Phase 1 of generative-computing#444). BREAKING CHANGE: Public telemetry API surface narrowed: - MELLEA_TRACE_* env vars renamed to plural MELLEA_TRACES_* with no deprecation shim. - The deprecated gen_ai.system span attribute is removed; consumers should read gen_ai.provider.name instead. - The split helpers is_application_tracing_enabled() and is_backend_tracing_enabled() are replaced by a single is_tracing_enabled(). - The application-tracing helpers add_span_event, start_backend_span, end_backend_span, and trace_backend are removed from mellea.telemetry. Backend spans are now emitted by the BackendTracingPlugin automatically; application spans use trace_application (unchanged). - The mellea.telemetry.backend_instrumentation module is deleted along with its exports (start_generate_span, instrument_generate_from_raw, record_token_usage, record_response_metadata, finalize_backend_span). Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
|
Follow-up work spawned by this PR (not part of the tracing epic):
|
|
I also opened #1180 which fixes a bug in metrics that also applies here. (ie that needs to merge first) |
jakelorocco
left a comment
There was a problem hiding this comment.
looks good! I just have a few questions and a few places I think extra comments would be helpful.
- HF backend: detect stop-sequence termination as finish_reason="stop" (previously dropped silently when stopping on a configured stop_string). - Add unit tests for HF post_processing finish_reasons derivation. - Clarify generation_id docstrings/comments to distinguish the Mellea-side hook correlation ID from GenerationMetadata.response_id. - Document why BackendTracingPlugin.on_pre_call runs SEQUENTIAL while the other tracing hooks run FIRE_AND_FORGET. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
|
@jakelorocco I believe 00831a4 addresses all your actionable comments |
planetf1
left a comment
There was a problem hiding this comment.
This PR is a well-structured refactor — moving backend tracing onto the plugin/hook pattern makes it composable and avoids the tight coupling the old backend_instrumentation module had. The OTel GenAI semconv attributes are correct on the happy path. A few issues need addressing, with one blocker around the pre-computed MOT path.
Two tooling findings can't be filed as inline comments since those files aren't in this diff:
tooling/docs-autogen/audit_coverage.py (lines 30–32) — backend_instrumentation was deleted in this PR but is still listed in _CONFIRMED_INTERNAL_MODULES. The comment says this must stay in sync with generate-ast.py, so both need updating. Remove "backend_instrumentation" from the frozenset (leaving just {"json_util"}).
tooling/docs-autogen/generate-ast.py (lines 437–444) — Same deleted module, same stale entry. Remove the "mellea/telemetry/backend_instrumentation" entry and its associated comment block.
| ) | ||
| raise | ||
| # Save the ID for the post_call / error hooks. | ||
| mot._generation_id = generation_id |
There was a problem hiding this comment.
BLOCKER — When _generate_from_context returns an already-computed MOT (DummyBackend, any caching layer, any backend that pre-fills the thunk synchronously), avalue() short-circuits at base.py:583 and never calls astream(). Since generation_post_call only fires inside astream, the pre_call span is never closed, the OTel context token is never detached, and the entry in _in_flight_spans grows without bound.
The secondary effect is worse: the leaked context token means any asyncio task spawned after this point inherits a context whose current OTel span is the stale, already-exported span. Nested instrumented calls — HTTP clients, other OTel-aware libraries — will appear as children of a span that has already ended, corrupting the trace tree.
The fix goes here, right after mot._generation_id = generation_id. Check mot.is_computed() and if true, fire generation_post_call directly before returning:
if mot.is_computed() and has_plugins(HookType.GENERATION_POST_CALL):
from ..plugins.hooks.generation import GenerationPostCallPayload
await invoke_hook(
HookType.GENERATION_POST_CALL,
GenerationPostCallPayload(
model_output=mot,
generation_id=generation_id,
latency_ms=0.0, # pre-computed: no inference latency to report
),
backend=self,
)This mirrors the path astream already takes for lazy MOTs and closes the span immediately. It also makes the code honour the contract GenerationPostCallPayload's docstring already promises for pre-computed thunks ("fires before generate_from_context returns") — today that sentence describes intended, not actual, behaviour.
| return span | ||
|
|
||
|
|
||
| def _detach_token(token: Token[Context] | None) -> None: |
There was a problem hiding this comment.
The except ValueError here is dead code. opentelemetry.context.detach() in opentelemetry-api 1.41.1 wraps its body in except Exception, logs Failed to detach context via logger.exception (ERROR level, full traceback, under the opentelemetry.context logger), and returns without re-raising. The ValueError never escapes for this catch to see.
In practice: this emits an ERROR-level log with a Python traceback on every LLM call. Both generation_post_call and generation_error are FIRE_AND_FORGET, which spawns a new asyncio task. The OTel context token is bound to the pre_call task. Detaching it from a different task always hits this error path — not occasionally, unconditionally. Any production deployment with OTel diagnostics and error alerting will see a continuous flood, one traceback per generation. (Verified end-to-end against opentelemetry-api 1.41.1: the cross-task detach raises internally, OTel swallows and logs it, and detach returns.)
The root cause is structural. The simplest fix: do not store the token in _in_flight_spans at all. The token's only purpose is to restore the caller's prior context after the span ends, but FIRE_AND_FORGET callers are ephemeral tasks that do not need their context restored. Stash just the span:
_in_flight_spans: dict[str, Span] = {}Skip _detach_token entirely. The span lifecycle (start/end) is unaffected, and the ERROR spam stops. Note this does not remove the stale-current-span issue within the caller task — the pre_call attach is never undone there either way — but the failing detach was not fixing that today, so this is strictly a noise reduction. Properly restoring the caller's context would require detaching in the same task that attached.
|
|
||
| def is_application_tracing_enabled() -> bool: | ||
| """Check if application tracing is enabled. | ||
| def is_content_tracing_enabled() -> bool: |
There was a problem hiding this comment.
is_content_tracing_enabled() is defined, exported, and documented, but nothing in _tracing_setters.py or tracing_plugins.py calls it. Setting MELLEA_TRACES_CONTENT=true or OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true produces no content on any span.
The OTel GenAI semconv defines gen_ai.input.messages and gen_ai.output.messages for this case — the prompt is available in GenerationPreCallPayload.prompt and the response in mot.generation after post_processing. Whether you implement this now or defer it, the function should either have callers or a comment saying the capture path is not yet implemented, so nobody assumes it's already working.
| def _reset_tracing_state() -> None: | ||
| """Reset module state and re-run setup so env-var changes take effect.""" | ||
| tracing._tracer_provider = None | ||
| tracing._application_tracer = None | ||
| tracing._backend_tracer = None | ||
| tracing._setup_tracing() |
There was a problem hiding this comment.
_plugins_registered isn't being reset here. That flag gates whether _register_tracing_plugins() runs — once it flips to True it stays True for the rest of the process. Tests that reset module state via this helper don't get a clean slate: any subsequent _setup_tracing() call will skip re-registration entirely. Tests that need to control or verify plugin registration will see state that leaked from a previous test.
| def _reset_tracing_state() -> None: | |
| """Reset module state and re-run setup so env-var changes take effect.""" | |
| tracing._tracer_provider = None | |
| tracing._application_tracer = None | |
| tracing._backend_tracer = None | |
| tracing._setup_tracing() | |
| def _reset_tracing_state() -> None: | |
| """Reset module state and re-run setup so env-var changes take effect.""" | |
| tracing._tracer_provider = None | |
| tracing._application_tracer = None | |
| tracing._backend_tracer = None | |
| tracing._plugins_registered = False | |
| tracing._setup_tracing() |
| def _reset_tracing_state() -> None: | ||
| """Reset module state and re-run setup so env-var changes take effect.""" | ||
| tracing._tracer_provider = None | ||
| tracing._application_tracer = None | ||
| tracing._backend_tracer = None | ||
| tracing._in_flight_spans.clear() | ||
| tracing._setup_tracing() |
There was a problem hiding this comment.
Same as test_tracing.py — _plugins_registered missing from the reset.
| def _reset_tracing_state() -> None: | |
| """Reset module state and re-run setup so env-var changes take effect.""" | |
| tracing._tracer_provider = None | |
| tracing._application_tracer = None | |
| tracing._backend_tracer = None | |
| tracing._in_flight_spans.clear() | |
| tracing._setup_tracing() | |
| def _reset_tracing_state() -> None: | |
| """Reset module state and re-run setup so env-var changes take effect.""" | |
| tracing._tracer_provider = None | |
| tracing._application_tracer = None | |
| tracing._backend_tracer = None | |
| tracing._in_flight_spans.clear() | |
| tracing._plugins_registered = False | |
| tracing._setup_tracing() |
| if not payload.generation_id: | ||
| return | ||
| mot = payload.model_output | ||
| gen = mot.generation if mot is not None else None |
There was a problem hiding this comment.
The chat pre_call hook starts the span with model=None, provider=None by design — the chat path doesn't know the model until post_processing. But on the error path, if the error fires before post_processing() completes, mot.generation.model and mot.generation.provider are still None (and mot itself may be None, which this line already handles). set_request_attrs in _tracing_setters.py skips both attributes when falsy, so the error span ends with neither gen_ai.request.model nor gen_ai.provider.name set.
Error spans with no model or provider are hard to triage — you can't tell which backend threw the error at a glance. Worth noting as a known limitation here, or plumbing the model/provider from the backend config at pre_call time if it's accessible there.
| from .tracing import ( | ||
| add_span_event, | ||
| end_backend_span, | ||
| is_application_tracing_enabled, | ||
| is_backend_tracing_enabled, | ||
| is_content_tracing_enabled, | ||
| is_tracing_enabled, | ||
| set_span_attribute, | ||
| set_span_error, | ||
| start_backend_span, | ||
| trace_application, | ||
| trace_backend, | ||
| ) |
There was a problem hiding this comment.
set_span_status_error was added in this PR and is in tracing.py's __all__, but it's not in the package-level export. Users importing from mellea.telemetry won't find it there; they'd have to import from the submodule directly. Inconsistent with how the other span helpers are exposed.
Also needs adding to __all__ lower in the file (between "set_span_error" and "trace_application").
| from .tracing import ( | |
| add_span_event, | |
| end_backend_span, | |
| is_application_tracing_enabled, | |
| is_backend_tracing_enabled, | |
| is_content_tracing_enabled, | |
| is_tracing_enabled, | |
| set_span_attribute, | |
| set_span_error, | |
| start_backend_span, | |
| trace_application, | |
| trace_backend, | |
| ) | |
| from .tracing import ( | |
| is_content_tracing_enabled, | |
| is_tracing_enabled, | |
| set_span_attribute, | |
| set_span_error, | |
| set_span_status_error, | |
| trace_application, | |
| ) |
| ) | ||
|
|
||
|
|
||
| def finish_backend_span_success( |
There was a problem hiding this comment.
finish_backend_span_error has an if not _OTEL_AVAILABLE: return guard at the top, but finish_backend_span_success doesn't. Functionally harmless since both pop from _in_flight_spans first, and that dict is only populated when a tracer is available. The asymmetry is just a bit surprising when reading the two functions side by side.
| @@ -41,11 +46,16 @@ class GenerationPostCallPayload(MelleaBasePayload): | |||
| model_output: The fully-computed `ModelOutputThunk`. | |||
There was a problem hiding this comment.
The docstring body just above this (lines 38–42) contains a false statement: "For already-computed thunks (e.g. cached responses) it fires before generate_from_context returns." That code path doesn't exist — generate_from_context doesn't check mot.is_computed() and doesn't fire this hook early. This is the same gap as the blocker above, but the docstring is wrong regardless of when that gets fixed. The sentence starting "For already-computed thunks" should be removed.
| > **Tip:** In pytest, use `monkeypatch.setenv` to control tracing per-test — | ||
| > see `test/telemetry/test_tracing.py` for the canonical pattern. |
There was a problem hiding this comment.
Pointing readers to an internal test file isn't great docs practice — external users won't have it, and contributors still have to go hunting. Easier to just inline the pattern.
| > **Tip:** In pytest, use `monkeypatch.setenv` to control tracing per-test — | |
| > see `test/telemetry/test_tracing.py` for the canonical pattern. | |
| > **Tip:** In pytest, use `monkeypatch.setenv("MELLEA_TRACES_ENABLED", "true")` to enable tracing for a test. Call `mellea.telemetry.tracing._setup_tracing()` after setting the env var (and again on teardown) so the module picks up the change — env vars are read at initialisation time, not on every call. |
|
Overall take: The architectural direction here is right. Moving backend tracing onto the plugin/hook pattern decouples it cleanly from individual backends — new backends get instrumentation for free, and the separation of concerns is much better than the old Two structural issues need fixing before this merges:
The remaining comments (semconv gap, test state isolation, missing export, stale tooling refs) are all quick fixes. None of them block the approach. |
planetf1
left a comment
There was a problem hiding this comment.
Two blocking bugs introduced in this PR, one note referencing a follow-up issue.
| finish_backend_span_error, | ||
| finish_backend_span_success, | ||
| start_backend_span, | ||
| ) |
There was a problem hiding this comment.
The top-level import on lines 17–21 creates a circular import that silences the entire plugin framework at runtime.
Sequence:
MELLEA_TRACES_ENABLED=true→tracing.py:181calls_setup_tracing()at module load._setup_tracing()calls_register_tracing_plugins(), which importstracing_plugins.py.tracing_plugins.py:17tries to importfinish_backend_span_error,finish_backend_span_success,start_backend_spanfromtracing.py— which is still partially initialised at that point.- The resulting
ImportErroris caught attracing.py:132–139and replaced with: "plugin framework not installed / pip install mellea[telemetry]".
cpex is installed. The warning is wrong. All backend spans and every metrics plugin (TokenMetricsPlugin, LatencyMetricsPlugin, ErrorMetricsPlugin) are silently dead — they never register.
Empirically confirmed: running the walkthrough with all OTLP env vars set produces start_session/session_context/aact application spans but zero backend chat spans and zero mellea.* metrics.
The TYPE_CHECKING guard is already used on lines 24–32 for the payload types — the same pattern applies here. Options: move the three function imports inside each hook method (lazy, cheap since called once per generation), or bind them after tracing.py is fully initialised.
| if len(response_value) > 500: | ||
| response_value = response_value[:500] + "..." | ||
| set_span_attribute(span, "response", response_value) | ||
| set_span_attribute(span, "mellea.response", response_value) |
There was a problem hiding this comment.
mellea.response is set unconditionally on every aact span, regardless of whether the operator has opted in to content capture.
is_content_tracing_enabled() is defined in tracing.py but is never called anywhere in the codebase. The MELLEA_TRACES_CONTENT opt-in contract documented in the PR description is not enforced.
Up to 500 chars of model output (potentially including PII or confidential content) reaches every configured OTLP exporter on every call by default. An operator who has not set MELLEA_TRACES_CONTENT=true has no way to prevent this.
Fix: gate this call on is_content_tracing_enabled(). The function is already there; it just needs to be wired.
| action = payload.action | ||
| fmt = payload.format | ||
| start_backend_span( | ||
| "chat", |
There was a problem hiding this comment.
Span name "chat" diverges from the OTel GenAI semantic conventions, which specify "{operation} {gen_ai.request.model}" — e.g. "chat granite4.1:8b". The model suffix enables model-level filtering directly in trace UIs and pipelines without requiring an attribute filter.
This is one of several semconv alignment items tracked in a follow-up issue (to be linked once filed). Noting it here since start_backend_span("chat", ...) is new in this PR and the fix is a one-liner at this call site — but whether to address it here or in the follow-up is the author's call.
| finish_backend_span_error, | ||
| finish_backend_span_success, | ||
| start_backend_span, | ||
| ) |
There was a problem hiding this comment.
The top-level import on lines 17–21 creates a circular import that silences the entire plugin framework at runtime.
Sequence:
MELLEA_TRACES_ENABLED=true→tracing.py:181calls_setup_tracing()at module load._setup_tracing()calls_register_tracing_plugins(), which importstracing_plugins.py.tracing_plugins.py:17tries to importfinish_backend_span_error,finish_backend_span_success,start_backend_spanfromtracing.py— which is still partially initialised at that point.- The resulting
ImportErroris caught attracing.py:132–139and replaced with: "plugin framework not installed / pip install mellea[telemetry]".
cpex is installed. The warning is wrong. All backend spans and every metrics plugin (TokenMetricsPlugin, LatencyMetricsPlugin, ErrorMetricsPlugin) are silently dead — they never register.
Empirically confirmed: running with all OTLP env vars set produces start_session/session_context/aact application spans but zero backend chat spans and zero mellea.* metrics.
The TYPE_CHECKING guard is already used on lines 24–32 for the payload types — the same pattern applies here. Options: move the three function imports inside each hook method (lazy, cheap since called once per generation), or bind them after tracing.py is fully initialised.
| if len(response_value) > 500: | ||
| response_value = response_value[:500] + "..." | ||
| set_span_attribute(span, "response", response_value) | ||
| set_span_attribute(span, "mellea.response", response_value) |
There was a problem hiding this comment.
mellea.response is set unconditionally on every aact span, regardless of whether the operator has opted in to content capture.
is_content_tracing_enabled() is defined in tracing.py but is never called anywhere in the codebase. The MELLEA_TRACES_CONTENT opt-in contract documented in the PR description is not enforced.
Up to 500 chars of model output (potentially including PII or confidential content) reaches every configured OTLP exporter on every call by default. An operator who has not set MELLEA_TRACES_CONTENT=true has no way to prevent this.
Fix: gate this call on is_content_tracing_enabled(). The function is already there; it just needs to be wired.
| action = payload.action | ||
| fmt = payload.format | ||
| start_backend_span( | ||
| "chat", |
There was a problem hiding this comment.
Span name "chat" diverges from the OTel GenAI semantic conventions, which specify "{operation} {gen_ai.request.model}" — e.g. "chat granite4.1:8b". The model suffix enables model-level filtering directly in trace UIs and pipelines without requiring an attribute filter.
This is one of several semconv alignment items tracked in #1209. Noting it here since start_backend_span("chat", ...) is new in this PR and the fix is a one-liner at this call site — but whether to address it here or in the follow-up is the author's call.
Pull Request
Issue
Fixes #1045, fixes #1046, fixes #1047. Phase 1 of #444. Also closes out item 2 of #909.
Description
Phase 1 of the tracing epic. Migrates backend span emission from inline calls scattered across five backends onto a
BackendTracingPluginthat subscribes to the existinggeneration_*hooks (chat path) and newgeneration_batch_*hooks (raw path). Spans stay live on the OTel context across the API call so nested instrumentation (httpx, langchain) parents under the backend span.Reviewer call-outs:
MELLEA_TRACE_*→MELLEA_TRACES_*(plural), aligned withOTEL_EXPORTER_OTLP_TRACES_ENDPOINTandMELLEA_METRICS_*. AddsMELLEA_TRACES_ENABLEDumbrella flag and opt-inMELLEA_TRACES_OTLP. No deprecation shim — tracing was still pre-1.0 surface.gen_ai.systemremoved in favor ofgen_ai.provider.name. Cleans up the dual emission introduced by feat(telemetry): close five OTel GenAI semantic convention emission gaps (#1035) #1036.generation_batch_pre_call/generation_batch_post_call/generation_batch_errorfor plugin authors. Raw path now emits one span pergenerate_from_rawcall (per OTel GenAI semconv), not one per MOT.ModelOutputThunk.cancel_generation(added upstream by feat(stdlib): add stream_with_chunking() with per-chunk validation (#901) #942 in9e8a9636) had its body rewritten to fireGENERATION_ERRORinstead of calling the removed_telemetry_spanAPI. Reviewers may notice as "extra surface", but it's rebase-driven, not refactor-driven.BREAKING CHANGE: Public telemetry API surface narrowed:
MELLEA_TRACE_*env vars renamed to pluralMELLEA_TRACES_*with no deprecation shim.gen_ai.systemspan attribute is removed; consumers should readgen_ai.provider.nameinstead.is_application_tracing_enabled()andis_backend_tracing_enabled()are replaced by a singleis_tracing_enabled().add_span_event,start_backend_span,end_backend_span, andtrace_backendare removed frommellea.telemetry. Backend spans are now emitted by theBackendTracingPluginautomatically; application spans usetrace_application(unchanged).mellea.telemetry.backend_instrumentationmodule is deleted along with its exports (start_generate_span,instrument_generate_from_raw,record_token_usage,record_response_metadata,finalize_backend_span).Testing
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.