fix: add fork-safety and auto-create analytics views to BQ plugin by caohy1988 · Pull Request #4640 · google/adk-python

caohy1988 · 2026-02-27T06:58:39Z

Summary

Fork-safety (BigQueryAgentAnalyticsPlugin Deadlock on Linux Multiprocessing #4636): Adds PID tracking to BigQueryAgentAnalyticsPlugin so that forked child processes detect stale gRPC channels and re-initialize instead of deadlocking. Uses the standard os.getpid() pattern (same as SQLAlchemy, gRPC-python).
Auto-create views (Auto-create BigQuery views when BQ Agent Analytics plugin creates a new table #4639): After ensuring the agent_events table exists, automatically creates 15 per-event-type BigQuery views (v_llm_request, v_tool_completed, etc.) that unnest JSON columns into typed, queryable columns. Controlled by BigQueryLoggerConfig.create_views (default True), idempotent via CREATE OR REPLACE VIEW.
Trace-ID continuity & o11y alignment ([ADK BigQuery Plugin] Event trace_id fractures between INVOCATION_STARTING and AGENT_STARTING #4645): Fixes trace_id fracture between early events (USER_MESSAGE_RECEIVED, INVOCATION_STARTING) and later events (AGENT_STARTING onwards) when no ambient OTel span exists. Also aligns BQ rows with Cloud Trace span IDs when o11y is active.
Span-ID consistency under ambient OTel (fix: add fork-safety and auto-create analytics views to BQ plugin #4640 review): Fixes *_STARTING / *_COMPLETED events producing mismatched span IDs when an ambient OTel span is active. Completion callbacks now check for ambient spans and defer to _resolve_ids Layer 2 instead of overriding with plugin-synthetic IDs.
Stack leak safety (fix: add fork-safety and auto-create analytics views to BQ plugin #4640 review): Adds TraceManager.clear_stack() and makes ensure_invocation_span() clear stale records from different invocations, preventing span stack leaks across invocations. Uses _active_invocation_id_ctx to distinguish stale leak vs same-invocation re-entry.
Root agent name staleness fix: init_trace() now refreshes _root_agent_name_ctx unconditionally on each invocation (previously set-once-on-None). after_run_callback resets it alongside other invocation cleanup.
Exception-safe cleanup: after_run_callback uses try/finally to guarantee invocation state (clear_stack, _active_invocation_id_ctx, _root_agent_name_ctx) is always reset, even if _log_event raises.
Bonus: on_tool_error_callback span fix: Previously discarded the span_id from pop_span(), causing TOOL_ERROR events to get the wrong span. Now captures and uses the popped span_id.

Trace-ID fix details (#4645)

Root cause: When OTel is configured but the Runner's ambient span is absent (Agent Engine, custom runners), early events fell back to invocation_id while push_span("agent") created a new OTel span with a different hex trace_id.

Fix:

TraceManager.ensure_invocation_span() — called in on_user_message_callback and before_run_callback before any events fire. Attaches the ambient span if valid, or creates a new root span. Idempotent within the same invocation via _active_invocation_id_ctx.
push_span() — now creates child spans under existing stack parent via trace.set_span_in_context(), so all spans share one trace_id.
_resolve_ids() (replaces _resolve_span_ids()) — 3-layer priority for trace_id/span_id/parent_span_id:
1. EventData overrides (post-pop callbacks like after_run_callback)
2. Ambient OTel span (aligns BQ rows with Cloud Trace when o11y is active)
3. Plugin's internal span stack (fallback for no-ambient paths)
EventData.trace_id_override — captured before popping the invocation span in after_run_callback, ensuring INVOCATION_COMPLETED shares the same trace_id.

Result when o11y is enabled: BQ rows use the same trace_id and span_id as the framework's Cloud Trace spans (invocation, invoke_agent, call_llm, execute_tool).

Span-ID consistency fix details

Root cause: *_STARTING events used ambient framework span IDs via _resolve_ids Layer 2, while *_COMPLETED events bypassed Layer 2 by passing explicit span_id_override / parent_span_id_override from the plugin's internal popped span. This meant a single lifecycle (e.g. AGENT_STARTING → AGENT_COMPLETED) could mix ambient framework IDs and plugin synthetic IDs.

Fix: Six completion callbacks (after_agent, after_model, on_model_error, after_tool, on_tool_error, after_run) now check trace.get_current_span().get_span_context().is_valid and pass None overrides when ambient exists, letting _resolve_ids Layer 2 use the framework's ambient span — keeping STARTING/COMPLETED pairs consistent.

Span-ID resolution contract:

When OTel is active: BQ rows use the same trace/span/parent IDs as Cloud Trace. STARTING and COMPLETED events in the same lifecycle share the same span_id.
When OTel is not active: BQ rows use the plugin's internal span stack. STARTING gets the current top-of-stack; COMPLETED gets the popped span.

Stack leak safety details

Root cause: _span_records_ctx is not keyed by invocation and relies on balanced push/pop. If after_run_callback is skipped (e.g. exception in generator, asyncio cancellation), stale records leak into the next invocation.

Fix:

_active_invocation_id_ctx — new contextvar tracking the invocation that owns the current stack. Allows ensure_invocation_span() to distinguish same-invocation re-entry (idempotent no-op) from stale leak (clear and re-init).
TraceManager.clear_stack() — ends all owned spans and resets the stack to [].
ensure_invocation_span() — idempotent within the same invocation; clears stale records only when the invocation_id differs.
after_run_callback — uses try/finally to guarantee clear_stack(), _active_invocation_id_ctx, and _root_agent_name_ctx are always reset, even if _log_event raises (since _safe_callback would swallow the exception silently).

Known limitation

Trace context storage (_span_records_ctx, _active_invocation_id_ctx, _root_agent_name_ctx) is module-global, not plugin-instance-scoped. This is safe in practice because: (1) PluginManager enforces name-uniqueness, preventing duplicate BQ plugins on the same Runner; (2) concurrent asyncio tasks get isolated contextvar copies. Acknowledged in code comment.

Test plan

TestForkSafety (4 tests): PID mismatch triggers re-init, same PID skips reset, _reset_runtime_state clears fields, __getstate__ zeroes PID
TestAnalyticsViews (8 tests): views created on new/existing tables, disabled when create_views=False, errors logged not raised, SQL correctness, config default, explicit async method, startup failure
TestTraceIdContinuity (5 tests): no-ambient span continuity, ambient span continuity, cross-turn isolation, INVOCATION_COMPLETED continuity, full callback-chain integration
TestResolveIds (7 tests): plugin stack defaults, span_id/parent_span_id overrides, ambient OTel priority, override-beats-ambient, ambient root no self-parent, ambient span used for completed events
TestSpanIdConsistency (3 tests): STARTING/COMPLETED same span with ambient, plugin span without ambient, tool error captures span_id
TestStackLeakSafety (6 tests): ensure_invocation_span clears stale records, clear_stack ends owned spans, after_run_callback clears remaining stack, next invocation clean after incomplete previous, ensure_invocation_span idempotent within same invocation, on_user_message → before_run same trace_id regression test
TestRootAgentNameAcrossInvocations (1 test): two invocations with different root agents log correct names
TestAfterRunCleanupExceptionSafety (1 test): cleanup runs even when _log_event raises
All 173 tests pass

🤖 Generated with Claude Code

Add PID tracking to detect post-fork broken gRPC channels (google#4636) and automatically create per-event-type BigQuery views that unnest JSON columns into typed, queryable columns (google#4639). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist · 2026-02-27T06:58:43Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

google-cla · 2026-02-27T06:58:56Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

adk-bot · 2026-02-27T06:59:37Z

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

It looks like you have not yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign it.

Once you've signed the CLA, the status of this check will be updated. Thanks!

- Backfill _init_pid in __setstate__ for legacy pickle compatibility - Gate view creation on successful table creation (skip after failure) - Ensure plugin is started in public create_analytics_views() - Add tests for all three edge cases Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

create_analytics_views() now raises RuntimeError if _ensure_started() could not initialize the plugin, instead of silently proceeding with a None client and logging per-view errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Persist the init exception as _startup_error and use `raise ... from self._startup_error` so callers get actionable context instead of a generic RuntimeError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix trace_id fracture between INVOCATION_STARTING and AGENT_STARTING when no ambient OTel span exists (e.g. Agent Engine, custom runners). Also align BQ rows with Cloud Trace span IDs when o11y is active. Changes: - TraceManager.ensure_invocation_span(): ensures a root span is on the plugin stack before any events fire, preventing early events from falling back to invocation_id while later events get OTel hex IDs. - TraceManager.push_span(): create child spans under existing stack parent via trace.set_span_in_context() so all spans in an invocation share the same trace_id (google#4645). - _resolve_ids() replaces _resolve_span_ids(): 3-layer priority for trace_id/span_id/parent_span_id resolution: 1. EventData overrides (post-pop callbacks) 2. Ambient OTel span (aligns with Cloud Trace when o11y is active) 3. Plugin's internal span stack (fallback for no-ambient paths) - EventData.trace_id_override: captured before pop in after_run_callback so INVOCATION_COMPLETED shares the same trace_id as earlier events. - on_user_message_callback / before_run_callback: call ensure_invocation_span() before logging. - after_run_callback: capture trace_id before pop, pass as override. Tests: - TestTraceIdContinuity: 5 tests covering no-ambient, ambient, cross-turn isolation, completion-event continuity, and full callback-level integration. - TestResolveIds: 6 tests replacing TestResolveSpanIds, adding ambient priority and override-beats-ambient coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

adk-bot · 2026-02-28T08:26:45Z

Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution!

Your PR includes a comprehensive testing plan. Could you please also include a summary of the passed pytest results in the PR description as mentioned in the contribution guidelines?

This will help the reviewers to verify the tests and speed up the review process. Thanks!

When the ambient OTel span is a root (no parent), _resolve_ids() was leaving the stale plugin-stack parent_span_id in place, producing invalid self-parent lineage (span_id == parent_span_id). Reset parent_span_id to None when ambient takes over, then only set it if the ambient span has a valid parent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

caohy1988 · 2026-02-28T16:02:27Z

ERROR: Found 10 breaking change(s) in API surface.
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(span_id_override): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(parent_span_id_override): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(latency_ms): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(time_to_first_token_ms): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(model): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(model_version): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(usage_metadata): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(status): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(error_message): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(extra_attributes): Positional parameter was moved

…l API The trace_id_override field was added at the top of EventData, which shifted all existing positional parameters and broke the public API surface. Move it after extra_attributes to preserve backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

caohy1988 · 2026-02-28T16:05:04Z

Good catch — the API surface checker is correct.

Root cause: I added trace_id_override as the first field in the EventData dataclass, which shifted all 10 existing positional parameters (span_id_override, parent_span_id_override, latency_ms, etc.).

Fix (commit b5bdc30): Moved trace_id_override to the end of the field list (after extra_attributes), preserving the original positional parameter order. All fields have defaults so keyword usage (which is how the plugin itself and all tests use EventData) is unaffected.

All 161 unit tests pass after the fix.

Add kw_only=True to the EventData dataclass so that field ordering can never break callers. All existing usages already pass keyword arguments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

P1 — Span-ID inconsistency: *_STARTING events used ambient framework span IDs via _resolve_ids Layer 2, while *_COMPLETED events bypassed Layer 2 by passing explicit span_id_override from the plugin's popped span. Now completion callbacks (after_agent, after_model, on_model_error, after_tool, on_tool_error, after_run) check for ambient OTel and pass None overrides when present, letting _resolve_ids use the framework's ambient span — keeping STARTING/COMPLETED pairs consistent. P2 — Stack leak on abnormal exit: Added TraceManager.clear_stack() to end owned spans and reset the stack. ensure_invocation_span() now clears stale records instead of no-op'ing, and after_run_callback calls clear_stack() as a safety net. Bonus: on_tool_error_callback previously discarded the span_id from pop_span(); now captures it for correct TOOL_ERROR span attribution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

adk-bot · 2026-02-28T16:52:12Z

Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution!

I see that this PR addresses a number of fixes and improvements. Could you please create a GitHub issue that describes the problems this PR is solving and associate it with this pull request? As per our contribution guidelines, all PRs, other than small documentation or typo fixes, should have an Issue associated.

This will help us to track the feature/bug fix and to maintain a clear record of the changes in the repository.

Thanks!

The previous clear-on-non-empty behavior broke trace continuity between on_user_message_callback and before_run_callback in the no-ambient OTel path: both call ensure_invocation_span(), and the second call would clear the stack and create a new root span with a different trace_id. Fix: track the active invocation_id in a contextvar. If the stack has records belonging to the current invocation, return early (idempotent). If the stack has records from a different invocation (stale leak), clear and re-init. Reset the active invocation_id in after_run_callback. Also updates the docstring to match the new behavior and adds regression tests for the USER_MESSAGE_RECEIVED → INVOCATION_STARTING boundary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

init_trace() previously only set _root_agent_name_ctx when it was None, so the second invocation with a different root agent would inherit the first's name. Now it sets unconditionally. after_run_callback also resets _root_agent_name_ctx alongside the other invocation cleanup. Also adds a NOTE comment acknowledging that trace contextvars are module-global (not plugin-instance-scoped) as a known limitation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PluginManager enforces name-uniqueness (no duplicate BQ plugins on same Runner), and concurrent asyncio tasks get isolated contextvar copies. Updated the NOTE comment to explain both guards. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The cleanup code (clear_stack, reset _active_invocation_id_ctx and _root_agent_name_ctx) ran after _log_event, so a failure in _log_event would skip cleanup silently (_safe_callback swallows the exception). Wrapping with try/finally ensures invocation state is always reset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…in tests Import CallbackContext and InvocationContext directly from their modules instead of importing the module and using qualified names. This matches the import pattern used throughout the test suite. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace all relative imports (e.g. from ..agents.callback_context) with absolute imports (e.g. from google.adk.agents.callback_context) for clarity and consistency with the test file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Feb 27, 2026

caohy1988 and others added 4 commits February 26, 2026 23:12

fix: chain root cause in create_analytics_views() startup failure

b82b263

Persist the init exception as _startup_error and use `raise ... from self._startup_error` so callers get actionable context instead of a generic RuntimeError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

adk-bot added the tracing [Component] This issue is related to OpenTelemetry tracing label Feb 28, 2026

caohy1988 and others added 2 commits February 28, 2026 08:20

fix: make EventData kw_only to prevent positional parameter breakage

5da0421

Add kw_only=True to the EventData dataclass so that field ordering can never break callers. All existing usages already pass keyword arguments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

caohy1988 and others added 6 commits February 28, 2026 09:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add fork-safety and auto-create analytics views to BQ plugin#4640

fix: add fork-safety and auto-create analytics views to BQ plugin#4640
caohy1988 wants to merge 15 commits intogoogle:mainfrom
caohy1988:fix/bq-lazy-init-and-auto-views

caohy1988 commented Feb 27, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Feb 27, 2026

Uh oh!

google-cla bot commented Feb 27, 2026

Uh oh!

adk-bot commented Feb 27, 2026

Uh oh!

adk-bot commented Feb 28, 2026

Uh oh!

caohy1988 commented Feb 28, 2026

Uh oh!

caohy1988 commented Feb 28, 2026

Uh oh!

adk-bot commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

caohy1988 commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Trace-ID fix details (#4645)

Span-ID consistency fix details

Stack leak safety details

Known limitation

Test plan

Uh oh!

gemini-code-assist bot commented Feb 27, 2026

Uh oh!

google-cla bot commented Feb 27, 2026

Uh oh!

adk-bot commented Feb 27, 2026

Uh oh!

adk-bot commented Feb 28, 2026

Uh oh!

caohy1988 commented Feb 28, 2026

Uh oh!

caohy1988 commented Feb 28, 2026

Uh oh!

adk-bot commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

caohy1988 commented Feb 27, 2026 •

edited

Loading