fix: add fork-safety and auto-create analytics views to BQ plugin#4640
fix: add fork-safety and auto-create analytics views to BQ plugin#4640caohy1988 wants to merge 15 commits intogoogle:mainfrom
Conversation
Add PID tracking to detect post-fork broken gRPC channels (google#4636) and automatically create per-event-type BigQuery views that unnest JSON columns into typed, queryable columns (google#4639). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
|
Response from ADK Triaging Agent Hello @caohy1988, thank you for creating this PR! It looks like you have not yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign it. Once you've signed the CLA, the status of this check will be updated. Thanks! |
- Backfill _init_pid in __setstate__ for legacy pickle compatibility - Gate view creation on successful table creation (skip after failure) - Ensure plugin is started in public create_analytics_views() - Add tests for all three edge cases Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
create_analytics_views() now raises RuntimeError if _ensure_started() could not initialize the plugin, instead of silently proceeding with a None client and logging per-view errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Persist the init exception as _startup_error and use `raise ... from self._startup_error` so callers get actionable context instead of a generic RuntimeError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix trace_id fracture between INVOCATION_STARTING and AGENT_STARTING when no ambient OTel span exists (e.g. Agent Engine, custom runners). Also align BQ rows with Cloud Trace span IDs when o11y is active. Changes: - TraceManager.ensure_invocation_span(): ensures a root span is on the plugin stack before any events fire, preventing early events from falling back to invocation_id while later events get OTel hex IDs. - TraceManager.push_span(): create child spans under existing stack parent via trace.set_span_in_context() so all spans in an invocation share the same trace_id (google#4645). - _resolve_ids() replaces _resolve_span_ids(): 3-layer priority for trace_id/span_id/parent_span_id resolution: 1. EventData overrides (post-pop callbacks) 2. Ambient OTel span (aligns with Cloud Trace when o11y is active) 3. Plugin's internal span stack (fallback for no-ambient paths) - EventData.trace_id_override: captured before pop in after_run_callback so INVOCATION_COMPLETED shares the same trace_id as earlier events. - on_user_message_callback / before_run_callback: call ensure_invocation_span() before logging. - after_run_callback: capture trace_id before pop, pass as override. Tests: - TestTraceIdContinuity: 5 tests covering no-ambient, ambient, cross-turn isolation, completion-event continuity, and full callback-level integration. - TestResolveIds: 6 tests replacing TestResolveSpanIds, adding ambient priority and override-beats-ambient coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Response from ADK Triaging Agent Hello @caohy1988, thank you for your contribution! Your PR includes a comprehensive testing plan. Could you please also include a summary of the passed This will help the reviewers to verify the tests and speed up the review process. Thanks! |
When the ambient OTel span is a root (no parent), _resolve_ids() was leaving the stale plugin-stack parent_span_id in place, producing invalid self-parent lineage (span_id == parent_span_id). Reset parent_span_id to None when ambient takes over, then only set it if the ambient span has a valid parent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
ERROR: Found 10 breaking change(s) in API surface. |
…l API The trace_id_override field was added at the top of EventData, which shifted all existing positional parameters and broke the public API surface. Move it after extra_attributes to preserve backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Good catch — the API surface checker is correct. Root cause: I added Fix (commit b5bdc30): Moved All 161 unit tests pass after the fix. |
Add kw_only=True to the EventData dataclass so that field ordering can never break callers. All existing usages already pass keyword arguments. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
P1 — Span-ID inconsistency: *_STARTING events used ambient framework span IDs via _resolve_ids Layer 2, while *_COMPLETED events bypassed Layer 2 by passing explicit span_id_override from the plugin's popped span. Now completion callbacks (after_agent, after_model, on_model_error, after_tool, on_tool_error, after_run) check for ambient OTel and pass None overrides when present, letting _resolve_ids use the framework's ambient span — keeping STARTING/COMPLETED pairs consistent. P2 — Stack leak on abnormal exit: Added TraceManager.clear_stack() to end owned spans and reset the stack. ensure_invocation_span() now clears stale records instead of no-op'ing, and after_run_callback calls clear_stack() as a safety net. Bonus: on_tool_error_callback previously discarded the span_id from pop_span(); now captures it for correct TOOL_ERROR span attribution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Response from ADK Triaging Agent Hello @caohy1988, thank you for your contribution! I see that this PR addresses a number of fixes and improvements. Could you please create a GitHub issue that describes the problems this PR is solving and associate it with this pull request? As per our contribution guidelines, all PRs, other than small documentation or typo fixes, should have an Issue associated. This will help us to track the feature/bug fix and to maintain a clear record of the changes in the repository. Thanks! |
The previous clear-on-non-empty behavior broke trace continuity between on_user_message_callback and before_run_callback in the no-ambient OTel path: both call ensure_invocation_span(), and the second call would clear the stack and create a new root span with a different trace_id. Fix: track the active invocation_id in a contextvar. If the stack has records belonging to the current invocation, return early (idempotent). If the stack has records from a different invocation (stale leak), clear and re-init. Reset the active invocation_id in after_run_callback. Also updates the docstring to match the new behavior and adds regression tests for the USER_MESSAGE_RECEIVED → INVOCATION_STARTING boundary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
init_trace() previously only set _root_agent_name_ctx when it was None, so the second invocation with a different root agent would inherit the first's name. Now it sets unconditionally. after_run_callback also resets _root_agent_name_ctx alongside the other invocation cleanup. Also adds a NOTE comment acknowledging that trace contextvars are module-global (not plugin-instance-scoped) as a known limitation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PluginManager enforces name-uniqueness (no duplicate BQ plugins on same Runner), and concurrent asyncio tasks get isolated contextvar copies. Updated the NOTE comment to explain both guards. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cleanup code (clear_stack, reset _active_invocation_id_ctx and _root_agent_name_ctx) ran after _log_event, so a failure in _log_event would skip cleanup silently (_safe_callback swallows the exception). Wrapping with try/finally ensures invocation state is always reset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in tests Import CallbackContext and InvocationContext directly from their modules instead of importing the module and using qualified names. This matches the import pattern used throughout the test suite. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace all relative imports (e.g. from ..agents.callback_context) with absolute imports (e.g. from google.adk.agents.callback_context) for clarity and consistency with the test file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
BigQueryAgentAnalyticsPluginso that forked child processes detect stale gRPC channels and re-initialize instead of deadlocking. Uses the standardos.getpid()pattern (same as SQLAlchemy, gRPC-python).agent_eventstable exists, automatically creates 15 per-event-type BigQuery views (v_llm_request,v_tool_completed, etc.) that unnest JSON columns into typed, queryable columns. Controlled byBigQueryLoggerConfig.create_views(defaultTrue), idempotent viaCREATE OR REPLACE VIEW.*_STARTING/*_COMPLETEDevents producing mismatched span IDs when an ambient OTel span is active. Completion callbacks now check for ambient spans and defer to_resolve_idsLayer 2 instead of overriding with plugin-synthetic IDs.TraceManager.clear_stack()and makesensure_invocation_span()clear stale records from different invocations, preventing span stack leaks across invocations. Uses_active_invocation_id_ctxto distinguish stale leak vs same-invocation re-entry.init_trace()now refreshes_root_agent_name_ctxunconditionally on each invocation (previously set-once-on-None).after_run_callbackresets it alongside other invocation cleanup.after_run_callbackusestry/finallyto guarantee invocation state (clear_stack,_active_invocation_id_ctx,_root_agent_name_ctx) is always reset, even if_log_eventraises.on_tool_error_callbackspan fix: Previously discarded the span_id frompop_span(), causing TOOL_ERROR events to get the wrong span. Now captures and uses the popped span_id.Trace-ID fix details (#4645)
Root cause: When OTel is configured but the Runner's ambient span is absent (Agent Engine, custom runners), early events fell back to
invocation_idwhilepush_span("agent")created a new OTel span with a different hex trace_id.Fix:
TraceManager.ensure_invocation_span()— called inon_user_message_callbackandbefore_run_callbackbefore any events fire. Attaches the ambient span if valid, or creates a new root span. Idempotent within the same invocation via_active_invocation_id_ctx.push_span()— now creates child spans under existing stack parent viatrace.set_span_in_context(), so all spans share one trace_id._resolve_ids()(replaces_resolve_span_ids()) — 3-layer priority for trace_id/span_id/parent_span_id:EventDataoverrides (post-pop callbacks likeafter_run_callback)EventData.trace_id_override— captured before popping the invocation span inafter_run_callback, ensuring INVOCATION_COMPLETED shares the same trace_id.Result when o11y is enabled: BQ rows use the same trace_id and span_id as the framework's Cloud Trace spans (invocation, invoke_agent, call_llm, execute_tool).
Span-ID consistency fix details
Root cause:
*_STARTINGevents used ambient framework span IDs via_resolve_idsLayer 2, while*_COMPLETEDevents bypassed Layer 2 by passing explicitspan_id_override/parent_span_id_overridefrom the plugin's internal popped span. This meant a single lifecycle (e.g. AGENT_STARTING → AGENT_COMPLETED) could mix ambient framework IDs and plugin synthetic IDs.Fix: Six completion callbacks (
after_agent,after_model,on_model_error,after_tool,on_tool_error,after_run) now checktrace.get_current_span().get_span_context().is_validand passNoneoverrides when ambient exists, letting_resolve_idsLayer 2 use the framework's ambient span — keeping STARTING/COMPLETED pairs consistent.Span-ID resolution contract:
Stack leak safety details
Root cause:
_span_records_ctxis not keyed by invocation and relies on balanced push/pop. Ifafter_run_callbackis skipped (e.g. exception in generator, asyncio cancellation), stale records leak into the next invocation.Fix:
_active_invocation_id_ctx— new contextvar tracking the invocation that owns the current stack. Allowsensure_invocation_span()to distinguish same-invocation re-entry (idempotent no-op) from stale leak (clear and re-init).TraceManager.clear_stack()— ends all owned spans and resets the stack to[].ensure_invocation_span()— idempotent within the same invocation; clears stale records only when the invocation_id differs.after_run_callback— usestry/finallyto guaranteeclear_stack(),_active_invocation_id_ctx, and_root_agent_name_ctxare always reset, even if_log_eventraises (since_safe_callbackwould swallow the exception silently).Known limitation
Trace context storage (
_span_records_ctx,_active_invocation_id_ctx,_root_agent_name_ctx) is module-global, not plugin-instance-scoped. This is safe in practice because: (1)PluginManagerenforces name-uniqueness, preventing duplicate BQ plugins on the same Runner; (2) concurrent asyncio tasks get isolated contextvar copies. Acknowledged in code comment.Test plan
TestForkSafety(4 tests): PID mismatch triggers re-init, same PID skips reset,_reset_runtime_stateclears fields,__getstate__zeroes PIDTestAnalyticsViews(8 tests): views created on new/existing tables, disabled whencreate_views=False, errors logged not raised, SQL correctness, config default, explicit async method, startup failureTestTraceIdContinuity(5 tests): no-ambient span continuity, ambient span continuity, cross-turn isolation, INVOCATION_COMPLETED continuity, full callback-chain integrationTestResolveIds(7 tests): plugin stack defaults, span_id/parent_span_id overrides, ambient OTel priority, override-beats-ambient, ambient root no self-parent, ambient span used for completed eventsTestSpanIdConsistency(3 tests): STARTING/COMPLETED same span with ambient, plugin span without ambient, tool error captures span_idTestStackLeakSafety(6 tests): ensure_invocation_span clears stale records, clear_stack ends owned spans, after_run_callback clears remaining stack, next invocation clean after incomplete previous, ensure_invocation_span idempotent within same invocation, on_user_message → before_run same trace_id regression testTestRootAgentNameAcrossInvocations(1 test): two invocations with different root agents log correct namesTestAfterRunCleanupExceptionSafety(1 test): cleanup runs even when _log_event raises🤖 Generated with Claude Code