From a9941109d7c0c79dbc40ba95a06e6de5090beb07 Mon Sep 17 00:00:00 2001 From: Luke Zhang Date: Sun, 24 May 2026 12:31:33 -0700 Subject: [PATCH] [EXPORTER ETW] Fix Span::End() use-after-free when caller drops Tracer Fixes #3849. etw::TracerProvider::GetTracer() was newing up a fresh Tracer per call and returning the only shared_ptr to it. The catch is that etw::Span stores its parent as a raw reference (Tracer &owner_), so the moment a caller drops or reassigns that shared_ptr, every Span that came from that Tracer is left with a dangling owner_. Span::End() then crashes inside owner_.EndSpan(). The snippet from the issue makes it obvious: auto provider = std::make_unique(); auto tracer = provider->GetTracer("Foo"); auto spanFoo = tracer->StartSpan("Span-Foo"); tracer = provider->GetTracer("Bar"); // Tracer "Foo" goes away auto spanBar = tracer->StartSpan("Span-Bar"); spanFoo->End(); // crash spanBar->End(); Under MSVC Debug + cdb this lands as an AV in Span::End dereferencing 0xFEEEFEEE -- the debug heap's freed-memory fill -- which is about as unambiguous as UAF gets. The fix is to let TracerProvider keep the Tracers it hands out, keyed by name and guarded by a mutex, and have GetTracer() return the cached entry on repeat calls. That makes the Tracer outlive any Span derived from it, which is the lifetime contract Span's raw owner_ already assumes. It is also the same pattern sdk::trace::TracerProvider has always used, which is why the issue reporter noted the SDK provider does not crash under the same usage. There was an earlier attempt in PR #4070 that switched Span to take a shared_ptr via shared_from_this(). That was turned down because StartSpan() is the ETW exporter's hot path and nobody wants an extra atomic refcount bump on every span just to keep the parent alive. Doing the lifetime work once in GetTracer() keeps owner_ as a plain reference and costs nothing per span. While putting this together I tripped over a second bug that the cache exposes -- worth describing here because the fix lives in the same file. etw_tracer.h and etw_provider.h between them have three independent function-local statics that get tangled at process shutdown: 1. etw::TracerProvider itself (now owning the tracer cache). 2. Tracer::etwProvider() returns a `static ETWProvider instance`. 3. ETWProvider::providers() returns a `static std::map<...> providers` that lives inside that singleton's method, not inside the singleton object, and is lazily created on the first open()/is_registered() call. Before this change, all three were initialized in that order on the very first GetTracer() call: the TracerProvider was already constructed, then the Tracer ctor called etwProvider().open(...), which constructed #2 and #3. Reverse-order destruction therefore took out #3 first, while the TracerProvider (and its cached Tracers, which hold references into #3) was still alive. ~Tracer then ran etwProvider().close(provHandle) against a dangling reference and we got an AV at exit. With the old "fresh Tracer per call, caller drops it immediately" pattern this was hidden because the Tracer (and its provHandle reference) was almost always gone long before shutdown. The cache extends Tracer lifetime to match TracerProvider lifetime, which is exactly what tripped this over locally and on the Windows CI legs. The fix is small: both TracerProvider constructors now do (void)Tracer::etwProvider().is_registered(std::string{}); which forces both ETWProvider singletons to be constructed before the TracerProvider's body finishes. Reverse destruction then guarantees they outlive the cached Tracers. is_registered() is a public, side- effect-free read on ETWProvider that internally touches the providers map, which is exactly the static-init we need. It requires Tracer to befriend TracerProvider so the constructor can call the private etwProvider() accessor. A few things worth calling out: - No API or ABI change. Only private members are added to TracerProvider, and // are already included by etw_tracer.h. The friend declaration is purely internal. - The cache deliberately does not evict. Span holds its parent by raw reference, so dropping a Tracer that still has live Spans would put the original bug right back. In practice the ETW exporter is built around "one Tracer per provider name, reused for the lifetime of the process", and Windows itself caps registered ETW providers per process well below anything you would worry about for memory. - Keying on name only is consistent with the existing etw::Tracer constructor, which already declares args and schema_url UNREFERENCED_PARAMETER. - etw::LoggerProvider was checked for the same shape and does not currently cache Loggers, so the destruction-order fix is not needed there today. If a similar cache is ever added to LoggerProvider it will want the same one-line force-init in its constructor. - Updated ETWTracer.GlobalSingletonTracer: it previously asserted that two GetTracer() calls with the same provider name produce different trace ids, which only held because of the bug being fixed here. GetTracer(name) is now idempotent per name, so the assertion is flipped to EXPECT_EQ and the stale sample-event comment updated to match. All other ETW tracer tests continue to pass. Repro before the change: AV in Span::End with the freed-fill operand, plus exit-time AV in ~Tracer on ctest runs that exercise a single test in isolation. Same repros after: both run cleanly to completion. Verified locally with the full ctest suite on a Windows abiv2 maintainer-mode build (1104/1104 passing) and on an abiv1 Debug build (43/43 ETW tests passing). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../opentelemetry/exporters/etw/etw_tracer.h | 67 ++++++++++++++++++- exporters/etw/test/etw_tracer_test.cc | 10 +-- 2 files changed, 70 insertions(+), 7 deletions(-) diff --git a/exporters/etw/include/opentelemetry/exporters/etw/etw_tracer.h b/exporters/etw/include/opentelemetry/exporters/etw/etw_tracer.h index 2b975e239c..b22b6873f9 100644 --- a/exporters/etw/include/opentelemetry/exporters/etw/etw_tracer.h +++ b/exporters/etw/include/opentelemetry/exporters/etw/etw_tracer.h @@ -169,6 +169,10 @@ class Tracer : public opentelemetry::trace::Tracer, bool IsClosed() const noexcept { return isClosed_.load(); } private: + // Allow TracerProvider to force-construct the ETWProvider singleton ahead + // of itself; see TracerProvider's constructor for the rationale. + friend class TracerProvider; + /** * @brief Parent provider of this Tracer */ @@ -1106,6 +1110,17 @@ class TracerProvider : public opentelemetry::trace::TracerProvider id_generator_{std::move(id_generator)}, tail_sampler_{std::move(tail_sampler)} { + // Force-initialize the process-wide ETWProvider singleton -- and its + // internal providers() map (lazily created on first ETWProvider method + // call) -- before this TracerProvider finishes constructing. Static + // destruction is reverse-of-construction order, so initializing both + // here guarantees that they outlive this provider's tracer cache: + // cached Tracer destructors call etwProvider().close(provHandle), and + // provHandle is a reference into that providers() map. Without this, + // either singleton could be destroyed first, leaving cached Tracer + // destructors with dangling state at process shutdown (#3849). + (void)Tracer::etwProvider().is_registered(std::string{}); + // By default we ensure that all events carry their with TraceId and SpanId GetOption(options, "enableTraceId", config_.enableTraceId, true); GetOption(options, "enableSpanId", config_.enableSpanId, true); @@ -1143,6 +1158,17 @@ class TracerProvider : public opentelemetry::trace::TracerProvider tail_sampler_{ std::unique_ptr(new AlwaysOnTailSampler())} { + // Force-initialize the process-wide ETWProvider singleton -- and its + // internal providers() map (lazily created on first ETWProvider method + // call) -- before this TracerProvider finishes constructing. Static + // destruction is reverse-of-construction order, so initializing both + // here guarantees that they outlive this provider's tracer cache: + // cached Tracer destructors call etwProvider().close(provHandle), and + // provHandle is a reference into that providers() map. Without this, + // either singleton could be destroyed first, leaving cached Tracer + // destructors with dangling state at process shutdown (#3849). + (void)Tracer::etwProvider().is_registered(std::string{}); + config_.enableTraceId = true; config_.enableSpanId = true; config_.enableActivityId = false; @@ -1152,6 +1178,31 @@ class TracerProvider : public opentelemetry::trace::TracerProvider config_.encoding = ETWProvider::EventFormat::ETW_MANIFEST; } + /** + * @brief Cache of Tracers handed out by GetTracer(), keyed by name. + * + * The ETW Span class stores its parent Tracer as a raw reference + * (Tracer &owner_); if the only strong refcount on the Tracer is the + * nostd::shared_ptr returned by GetTracer(), the Tracer is destroyed + * as soon as the caller drops it, leaving every Span derived from it + * with a dangling owner_ that crashes on End(). Cache here so the + * provider keeps the Tracer alive as long as it is alive itself, + * matching the lifetime contract used by sdk::trace::TracerProvider. + * + * Growth: this cache is unbounded by design and mirrors + * sdk::trace::TracerProvider's tracers_ vector. Eviction is unsafe + * because the ETW Span stores its parent Tracer by raw reference, so + * dropping a Tracer that still has outstanding Spans would + * re-introduce the use-after-free this cache exists to prevent. + * + * In practice the cache stays small because the ETW exporter is + * designed around "one Tracer per ETW provider name, reused for the + * process/component lifetime" -- and Windows itself caps the number + * of registered ETW providers per process well below any RAM concern. + */ + std::map> tracers_; + std::mutex tracers_lock_; + /** * @brief Obtain ETW Tracer. * @param name ProviderId (instrumentation name) - Name or GUID @@ -1177,9 +1228,19 @@ class TracerProvider : public opentelemetry::trace::TracerProvider UNREFERENCED_PARAMETER(args); UNREFERENCED_PARAMETER(schema_url); ETWProvider::EventFormat evtFmt = config_.encoding; - std::shared_ptr tracer{new (std::nothrow) - Tracer(*this, name, evtFmt)}; - return nostd::shared_ptr{tracer}; + // Cache Tracers by name. The exporter Span class holds a raw + // Tracer& as owner_, so the Tracer must live at least as long as + // the longest-lived Span derived from it. Without caching, each + // GetTracer() call mints a fresh heap Tracer that is destroyed the + // moment the caller drops the returned shared_ptr, orphaning any + // outstanding Spans (use-after-free in Span::End()). + std::lock_guard lock(tracers_lock_); + auto &cached = tracers_[std::string{name.data(), name.size()}]; + if (!cached) + { + cached.reset(new (std::nothrow) Tracer(*this, name, evtFmt)); + } + return nostd::shared_ptr{cached}; } }; diff --git a/exporters/etw/test/etw_tracer_test.cc b/exporters/etw/test/etw_tracer_test.cc index 7d32a2ffba..dce9d7ead5 100644 --- a/exporters/etw/test/etw_tracer_test.cc +++ b/exporters/etw/test/etw_tracer_test.cc @@ -392,12 +392,14 @@ TEST(ETWTracer, GlobalSingletonTracer) } */ - // Obtain a different tracer withs its own trace-id. + // GetTracer(name) is now idempotent per name: a second call with the same + // provider name returns the same cached Tracer (see #3849), so the trace + // id matches the global tracer's. auto localTracer = GetGlobalTracerProvider().GetTracer(kGlobalProviderName); auto s2 = localTracer->StartSpan("Span2"); auto traceId2 = s2->GetContext().trace_id(); s2->End(); -/* === Span 2 - "TraceId": "334bf9a1eed98d40a873a606295a9368" +/* === Span 2 - "TraceId": "182a64258fb1864ca4e1a542eecbd9bf" (same trace as Span 1) { "Timestamp": "2021-05-10T11:45:27.0289654-07:00", "ProviderName": "OpenTelemetry-ETW-TLD", @@ -418,7 +420,7 @@ TEST(ETWTracer, GlobalSingletonTracer) "StatusCode": 0, "StatusMessage": "", "Success": "True", - "TraceId": "334bf9a1eed98d40a873a606295a9368", + "TraceId": "182a64258fb1864ca4e1a542eecbd9bf", "_name": "Span" } } @@ -455,7 +457,7 @@ TEST(ETWTracer, GlobalSingletonTracer) } } */ - EXPECT_NE(traceId1, traceId2); + EXPECT_EQ(traceId1, traceId2); EXPECT_EQ(traceId1, traceId3); # if OPENTELEMETRY_ABI_VERSION_NO == 1