Skip to content

fix: add fork-safety and auto-create analytics views to BQ plugin#4640

Draft
caohy1988 wants to merge 15 commits intogoogle:mainfrom
caohy1988:fix/bq-lazy-init-and-auto-views
Draft

fix: add fork-safety and auto-create analytics views to BQ plugin#4640
caohy1988 wants to merge 15 commits intogoogle:mainfrom
caohy1988:fix/bq-lazy-init-and-auto-views

Conversation

@caohy1988
Copy link

@caohy1988 caohy1988 commented Feb 27, 2026

Summary

  • Fork-safety (BigQueryAgentAnalyticsPlugin Deadlock on Linux Multiprocessing #4636): Adds PID tracking to BigQueryAgentAnalyticsPlugin so that forked child processes detect stale gRPC channels and re-initialize instead of deadlocking. Uses the standard os.getpid() pattern (same as SQLAlchemy, gRPC-python).
  • Auto-create views (Auto-create BigQuery views when BQ Agent Analytics plugin creates a new table #4639): After ensuring the agent_events table exists, automatically creates 15 per-event-type BigQuery views (v_llm_request, v_tool_completed, etc.) that unnest JSON columns into typed, queryable columns. Controlled by BigQueryLoggerConfig.create_views (default True), idempotent via CREATE OR REPLACE VIEW.
  • Trace-ID continuity & o11y alignment ([ADK BigQuery Plugin] Event trace_id fractures between INVOCATION_STARTING and AGENT_STARTING #4645): Fixes trace_id fracture between early events (USER_MESSAGE_RECEIVED, INVOCATION_STARTING) and later events (AGENT_STARTING onwards) when no ambient OTel span exists. Also aligns BQ rows with Cloud Trace span IDs when o11y is active.
  • Span-ID consistency under ambient OTel (fix: add fork-safety and auto-create analytics views to BQ plugin #4640 review): Fixes *_STARTING / *_COMPLETED events producing mismatched span IDs when an ambient OTel span is active. Completion callbacks now check for ambient spans and defer to _resolve_ids Layer 2 instead of overriding with plugin-synthetic IDs.
  • Stack leak safety (fix: add fork-safety and auto-create analytics views to BQ plugin #4640 review): Adds TraceManager.clear_stack() and makes ensure_invocation_span() clear stale records from different invocations, preventing span stack leaks across invocations. Uses _active_invocation_id_ctx to distinguish stale leak vs same-invocation re-entry.
  • Root agent name staleness fix: init_trace() now refreshes _root_agent_name_ctx unconditionally on each invocation (previously set-once-on-None). after_run_callback resets it alongside other invocation cleanup.
  • Exception-safe cleanup: after_run_callback uses try/finally to guarantee invocation state (clear_stack, _active_invocation_id_ctx, _root_agent_name_ctx) is always reset, even if _log_event raises.
  • Bonus: on_tool_error_callback span fix: Previously discarded the span_id from pop_span(), causing TOOL_ERROR events to get the wrong span. Now captures and uses the popped span_id.

Trace-ID fix details (#4645)

Root cause: When OTel is configured but the Runner's ambient span is absent (Agent Engine, custom runners), early events fell back to invocation_id while push_span("agent") created a new OTel span with a different hex trace_id.

Fix:

  • TraceManager.ensure_invocation_span() — called in on_user_message_callback and before_run_callback before any events fire. Attaches the ambient span if valid, or creates a new root span. Idempotent within the same invocation via _active_invocation_id_ctx.
  • push_span() — now creates child spans under existing stack parent via trace.set_span_in_context(), so all spans share one trace_id.
  • _resolve_ids() (replaces _resolve_span_ids()) — 3-layer priority for trace_id/span_id/parent_span_id:
    1. EventData overrides (post-pop callbacks like after_run_callback)
    2. Ambient OTel span (aligns BQ rows with Cloud Trace when o11y is active)
    3. Plugin's internal span stack (fallback for no-ambient paths)
  • EventData.trace_id_override — captured before popping the invocation span in after_run_callback, ensuring INVOCATION_COMPLETED shares the same trace_id.

Result when o11y is enabled: BQ rows use the same trace_id and span_id as the framework's Cloud Trace spans (invocation, invoke_agent, call_llm, execute_tool).

Span-ID consistency fix details

Root cause: *_STARTING events used ambient framework span IDs via _resolve_ids Layer 2, while *_COMPLETED events bypassed Layer 2 by passing explicit span_id_override / parent_span_id_override from the plugin's internal popped span. This meant a single lifecycle (e.g. AGENT_STARTING → AGENT_COMPLETED) could mix ambient framework IDs and plugin synthetic IDs.

Fix: Six completion callbacks (after_agent, after_model, on_model_error, after_tool, on_tool_error, after_run) now check trace.get_current_span().get_span_context().is_valid and pass None overrides when ambient exists, letting _resolve_ids Layer 2 use the framework's ambient span — keeping STARTING/COMPLETED pairs consistent.

Span-ID resolution contract:

  • When OTel is active: BQ rows use the same trace/span/parent IDs as Cloud Trace. STARTING and COMPLETED events in the same lifecycle share the same span_id.
  • When OTel is not active: BQ rows use the plugin's internal span stack. STARTING gets the current top-of-stack; COMPLETED gets the popped span.

Stack leak safety details

Root cause: _span_records_ctx is not keyed by invocation and relies on balanced push/pop. If after_run_callback is skipped (e.g. exception in generator, asyncio cancellation), stale records leak into the next invocation.

Fix:

  • _active_invocation_id_ctx — new contextvar tracking the invocation that owns the current stack. Allows ensure_invocation_span() to distinguish same-invocation re-entry (idempotent no-op) from stale leak (clear and re-init).
  • TraceManager.clear_stack() — ends all owned spans and resets the stack to [].
  • ensure_invocation_span() — idempotent within the same invocation; clears stale records only when the invocation_id differs.
  • after_run_callback — uses try/finally to guarantee clear_stack(), _active_invocation_id_ctx, and _root_agent_name_ctx are always reset, even if _log_event raises (since _safe_callback would swallow the exception silently).

Known limitation

Trace context storage (_span_records_ctx, _active_invocation_id_ctx, _root_agent_name_ctx) is module-global, not plugin-instance-scoped. This is safe in practice because: (1) PluginManager enforces name-uniqueness, preventing duplicate BQ plugins on the same Runner; (2) concurrent asyncio tasks get isolated contextvar copies. Acknowledged in code comment.

Test plan

  • TestForkSafety (4 tests): PID mismatch triggers re-init, same PID skips reset, _reset_runtime_state clears fields, __getstate__ zeroes PID
  • TestAnalyticsViews (8 tests): views created on new/existing tables, disabled when create_views=False, errors logged not raised, SQL correctness, config default, explicit async method, startup failure
  • TestTraceIdContinuity (5 tests): no-ambient span continuity, ambient span continuity, cross-turn isolation, INVOCATION_COMPLETED continuity, full callback-chain integration
  • TestResolveIds (7 tests): plugin stack defaults, span_id/parent_span_id overrides, ambient OTel priority, override-beats-ambient, ambient root no self-parent, ambient span used for completed events
  • TestSpanIdConsistency (3 tests): STARTING/COMPLETED same span with ambient, plugin span without ambient, tool error captures span_id
  • TestStackLeakSafety (6 tests): ensure_invocation_span clears stale records, clear_stack ends owned spans, after_run_callback clears remaining stack, next invocation clean after incomplete previous, ensure_invocation_span idempotent within same invocation, on_user_message → before_run same trace_id regression test
  • TestRootAgentNameAcrossInvocations (1 test): two invocations with different root agents log correct names
  • TestAfterRunCleanupExceptionSafety (1 test): cleanup runs even when _log_event raises
  • All 173 tests pass

🤖 Generated with Claude Code

Add PID tracking to detect post-fork broken gRPC channels (google#4636) and
automatically create per-event-type BigQuery views that unnest JSON
columns into typed, queryable columns (google#4639).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Copy link
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@google-cla
Copy link

google-cla bot commented Feb 27, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Feb 27, 2026
@adk-bot
Copy link
Collaborator

adk-bot commented Feb 27, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

It looks like you have not yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign it.

Once you've signed the CLA, the status of this check will be updated. Thanks!

caohy1988 and others added 4 commits February 26, 2026 23:12
- Backfill _init_pid in __setstate__ for legacy pickle compatibility
- Gate view creation on successful table creation (skip after failure)
- Ensure plugin is started in public create_analytics_views()
- Add tests for all three edge cases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
create_analytics_views() now raises RuntimeError if _ensure_started()
could not initialize the plugin, instead of silently proceeding with
a None client and logging per-view errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Persist the init exception as _startup_error and use `raise ... from
self._startup_error` so callers get actionable context instead of a
generic RuntimeError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix trace_id fracture between INVOCATION_STARTING and AGENT_STARTING
when no ambient OTel span exists (e.g. Agent Engine, custom runners).
Also align BQ rows with Cloud Trace span IDs when o11y is active.

Changes:
- TraceManager.ensure_invocation_span(): ensures a root span is on the
  plugin stack before any events fire, preventing early events from
  falling back to invocation_id while later events get OTel hex IDs.
- TraceManager.push_span(): create child spans under existing stack
  parent via trace.set_span_in_context() so all spans in an invocation
  share the same trace_id (google#4645).
- _resolve_ids() replaces _resolve_span_ids(): 3-layer priority for
  trace_id/span_id/parent_span_id resolution:
    1. EventData overrides (post-pop callbacks)
    2. Ambient OTel span (aligns with Cloud Trace when o11y is active)
    3. Plugin's internal span stack (fallback for no-ambient paths)
- EventData.trace_id_override: captured before pop in after_run_callback
  so INVOCATION_COMPLETED shares the same trace_id as earlier events.
- on_user_message_callback / before_run_callback: call
  ensure_invocation_span() before logging.
- after_run_callback: capture trace_id before pop, pass as override.

Tests:
- TestTraceIdContinuity: 5 tests covering no-ambient, ambient,
  cross-turn isolation, completion-event continuity, and full
  callback-level integration.
- TestResolveIds: 6 tests replacing TestResolveSpanIds, adding
  ambient priority and override-beats-ambient coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adk-bot adk-bot added the tracing [Component] This issue is related to OpenTelemetry tracing label Feb 28, 2026
@adk-bot
Copy link
Collaborator

adk-bot commented Feb 28, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution!

Your PR includes a comprehensive testing plan. Could you please also include a summary of the passed pytest results in the PR description as mentioned in the contribution guidelines?

This will help the reviewers to verify the tests and speed up the review process. Thanks!

When the ambient OTel span is a root (no parent), _resolve_ids() was
leaving the stale plugin-stack parent_span_id in place, producing
invalid self-parent lineage (span_id == parent_span_id).

Reset parent_span_id to None when ambient takes over, then only set
it if the ambient span has a valid parent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988
Copy link
Author

ERROR: Found 10 breaking change(s) in API surface.
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(span_id_override): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(parent_span_id_override): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(latency_ms): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(time_to_first_token_ms): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(model): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(model_version): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(usage_metadata): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(status): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(error_message): Positional parameter was moved
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:0: EventData.init(extra_attributes): Positional parameter was moved

…l API

The trace_id_override field was added at the top of EventData, which
shifted all existing positional parameters and broke the public API
surface. Move it after extra_attributes to preserve backward
compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988
Copy link
Author

Good catch — the API surface checker is correct.

Root cause: I added trace_id_override as the first field in the EventData dataclass, which shifted all 10 existing positional parameters (span_id_override, parent_span_id_override, latency_ms, etc.).

Fix (commit b5bdc30): Moved trace_id_override to the end of the field list (after extra_attributes), preserving the original positional parameter order. All fields have defaults so keyword usage (which is how the plugin itself and all tests use EventData) is unaffected.

All 161 unit tests pass after the fix.

caohy1988 and others added 2 commits February 28, 2026 08:20
Add kw_only=True to the EventData dataclass so that field ordering
can never break callers. All existing usages already pass keyword
arguments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
P1 — Span-ID inconsistency: *_STARTING events used ambient framework
span IDs via _resolve_ids Layer 2, while *_COMPLETED events bypassed
Layer 2 by passing explicit span_id_override from the plugin's popped
span. Now completion callbacks (after_agent, after_model, on_model_error,
after_tool, on_tool_error, after_run) check for ambient OTel and pass
None overrides when present, letting _resolve_ids use the framework's
ambient span — keeping STARTING/COMPLETED pairs consistent.

P2 — Stack leak on abnormal exit: Added TraceManager.clear_stack() to
end owned spans and reset the stack. ensure_invocation_span() now clears
stale records instead of no-op'ing, and after_run_callback calls
clear_stack() as a safety net.

Bonus: on_tool_error_callback previously discarded the span_id from
pop_span(); now captures it for correct TOOL_ERROR span attribution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adk-bot
Copy link
Collaborator

adk-bot commented Feb 28, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution!

I see that this PR addresses a number of fixes and improvements. Could you please create a GitHub issue that describes the problems this PR is solving and associate it with this pull request? As per our contribution guidelines, all PRs, other than small documentation or typo fixes, should have an Issue associated.

This will help us to track the feature/bug fix and to maintain a clear record of the changes in the repository.

Thanks!

caohy1988 and others added 6 commits February 28, 2026 09:04
The previous clear-on-non-empty behavior broke trace continuity between
on_user_message_callback and before_run_callback in the no-ambient OTel
path: both call ensure_invocation_span(), and the second call would
clear the stack and create a new root span with a different trace_id.

Fix: track the active invocation_id in a contextvar. If the stack has
records belonging to the current invocation, return early (idempotent).
If the stack has records from a different invocation (stale leak), clear
and re-init. Reset the active invocation_id in after_run_callback.

Also updates the docstring to match the new behavior and adds regression
tests for the USER_MESSAGE_RECEIVED → INVOCATION_STARTING boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
init_trace() previously only set _root_agent_name_ctx when it was None,
so the second invocation with a different root agent would inherit the
first's name. Now it sets unconditionally. after_run_callback also
resets _root_agent_name_ctx alongside the other invocation cleanup.

Also adds a NOTE comment acknowledging that trace contextvars are
module-global (not plugin-instance-scoped) as a known limitation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PluginManager enforces name-uniqueness (no duplicate BQ plugins on
same Runner), and concurrent asyncio tasks get isolated contextvar
copies. Updated the NOTE comment to explain both guards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cleanup code (clear_stack, reset _active_invocation_id_ctx and
_root_agent_name_ctx) ran after _log_event, so a failure in _log_event
would skip cleanup silently (_safe_callback swallows the exception).
Wrapping with try/finally ensures invocation state is always reset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in tests

Import CallbackContext and InvocationContext directly from their
modules instead of importing the module and using qualified names.
This matches the import pattern used throughout the test suite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace all relative imports (e.g. from ..agents.callback_context)
with absolute imports (e.g. from google.adk.agents.callback_context)
for clarity and consistency with the test file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc tracing [Component] This issue is related to OpenTelemetry tracing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants