Skip to content

feat(tracing): migrate from OpenTracing/Jaeger to OpenTelemetry#5456

Open
martinconic wants to merge 16 commits into
masterfrom
feat/opentelemetry-migration
Open

feat(tracing): migrate from OpenTracing/Jaeger to OpenTelemetry#5456
martinconic wants to merge 16 commits into
masterfrom
feat/opentelemetry-migration

Conversation

@martinconic

@martinconic martinconic commented May 11, 2026

Copy link
Copy Markdown
Contributor

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Migrates bee's tracing stack from the archived OpenTracing API and
deprecated jaeger-client-go to the OpenTelemetry Go SDK with an
OTLP exporter. The wrapper at pkg/tracing keeps a familiar API
shape so the call-site sweep is bounded to ~8 production files plus
the test suite.

Closes #5010.

⚠️ Breaking change for operators using tracing

The Jaeger UDP agent on port :6831 is no longer the receiver. Operators
must run an OTLP-capable collector (Jaeger v2, OpenTelemetry Collector, or any
backend with an OTLP receiver) and point bee at it.

The old tracing-endpoint / tracing-host / tracing-port keys are
deprecated, not rejected: a node that still carries them starts normally,
the values are ignored, and a warning is logged pointing operators at the new
tracing-otlp-* options. This keeps existing configs bootable across the
upgrade. The old values are not translated — operators relying on tracing must
still migrate to the tracing-otlp-* options. The deprecated keys are hidden
from --help and will be removed in a future release.

Configuration flags

Deprecated (accepted but ignored):

  • tracing-endpoint (replaced by tracing-otlp-endpoint)
  • tracing-host / tracing-port (folded into tracing-otlp-endpoint)

Added:

Flag Default Meaning
tracing-otlp-endpoint 127.0.0.1:4318 OTLP collector address host:port (default port is 4318 for http, 4317 for grpc)
tracing-otlp-insecure false Disable TLS for the OTLP exporter (suitable for a local collector). When false, set tracing-otlp-ca-file to verify the collector certificate against a private CA (otherwise the system root CAs are used)
tracing-otlp-ca-file "" Path to a PEM-encoded CA bundle used to verify the OTLP collector certificate; ignored when tracing-otlp-insecure=true
tracing-otlp-protocol http OTLP transport: http or grpc
tracing-sampling-ratio 1.0 Head-based sampling ratio in [0, 1]; 0 samples nothing, 1 samples everything

Unchanged: tracing-enable, tracing-service-name.

Nested YAML form

Tracing keys now also accept the nested form (mirroring blockchain-rpc.*):

tracing:
  enable: true
  otlp-endpoint: 127.0.0.1:4318
  otlp-insecure: true
  otlp-protocol: http
  sampling-ratio: 1.0
  service-name: bee

Flat tracing-* keys still work; if both are present in the same config file,
the nested form wins and a warning is logged.

Wire-format compatibility

Spans propagated across libp2p streams use a new versioned binary carrier, and
the standard W3C traceparent header is used over HTTP. Mixed-version peers
will start fresh traces at the boundary rather than continuing the old Jaeger
binary format.

Coordination

Requires a paired beekeeper update — the workflow at
.github/workflows/beekeeper.yml temporarily points at
BEEKEEPER_BRANCH: feat/bee-otlp-tracing-config and must be flipped back to
master before merging this PR (and after the beekeeper PR lands).

Motivation and Context (Optional)

OpenTracing is archived and jaeger-client-go is deprecated; OpenTelemetry is
the supported successor. Migrating now lets bee plug into any OTLP-capable
backend (Jaeger v2, OTel Collector, vendor agents) and unlocks adopting
semconv and tail-sampling at the collector layer in follow-up work.

AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

@martinconic martinconic marked this pull request as ready for review May 11, 2026 15:48
@martinconic martinconic requested review from acud, akrem-chabchoub, gacevicljubisa, janos and sbackend123 and removed request for gacevicljubisa May 11, 2026 15:49

@sbackend123 sbackend123 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explicitly call out in the PR description and in the docs which configuration flags were added / removed and that the tracing endpoint semantics changed. Without that, anyone still on the old configuration may be caught off guard.

Comment thread .github/workflows/beekeeper.yml Outdated
Comment thread pkg/tracing/tracing.go Outdated
Comment thread cmd/bee/cmd/cmd.go

@gacevicljubisa gacevicljubisa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explicitly call out in the PR description and in the docs which configuration flags were added / removed and that the tracing endpoint semantics changed. Without that, anyone still on the old configuration may be caught off guard.

Yes, I agree. Updating the node with old config will result with "unknown flag" error.
Also, this is breaking change for the tracing for the node operators who are using tracing, they will need to switch to OTLP receiver, instead of jaeger UDP agent.

Additionally, OTel has a second exporter, otlptracegrpc. It might be worth to expose a tracing-otlp-protocol: http|grpc flag (defaulting to http), so operators can choose.

Comment thread cmd/bee/cmd/cmd.go Outdated
Comment thread pkg/tracing/tracing.go Outdated
@martinconic martinconic self-assigned this May 13, 2026
@martinconic martinconic added this to the 2026 milestone May 13, 2026
Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/tracing.go Outdated
@martinconic martinconic requested a review from janos May 15, 2026 12:36

@gacevicljubisa gacevicljubisa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have suggested some change, comments are mostly suggestions, not blockers, but overal looks good. It would be great if certificate loading logic is tested and validated.

Also, please edit the PR desc, to match the implementation for tracing-otlp-insecure flag, and add tracing-otlp-ca-file to table.

Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/tracing.go
Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/tracing.go
Comment thread pkg/tracing/tracing.go Outdated

@gacevicljubisa gacevicljubisa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change overall look good. Configuration backwards compatibility is there.

But I think we should change the p2p.HeaderNameTracingSpanContext = "tracing-span-context-v2", because it can happen that "new nodes" will communicate with "old ones", and we should have different key becuase of the difference in the payload.

Comment thread pkg/tracing/tracing.go Outdated

@gacevicljubisa gacevicljubisa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before merging, it would be good to test the changes with OTEL collector (HTTP or GRPC), to check if traces are as we expect them. But overall it looks good.

@acud

acud commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

@martinconic please let us know the results of manual testing here before merging

@darkobas2

Copy link
Copy Markdown
Contributor

🤖 AI-assisted review (Claude Code, with a Gemini second opinion). Suggestions for the visibility work — not blockers; the migration itself looks clean.

Trace visibility / observability — prioritized:

High — enrich the OTel resource (pkg/tracing/tracing.go, NewTracer)
The resource currently carries only service.name, so in any multi-node deployment every node's spans look identical in an OTLP backend — per-node trace search and the service graph collapse together. Worth adding:

  • service.instance.id = the node's overlay address — highest value; identifies which node a trace came from. Note NewTracer is constructed before the overlay is known in NewBee, so it'd need threading in (or constructing the tracer after the overlay is derived).
  • service.version (version + commit) — correlate trace behavior with releases.
  • deployment.environment — derivable from the network id.
  • resource.WithFromEnv() (+ WithTelemetrySDK/WithHost) — lets operators enrich via OTEL_RESOURCE_ATTRIBUTES (e.g. cluster/region) without a rebuild.

High/Med — batch span processor uses all defaults (WithBatcher(exporter))
Default queue (2048)/batch/timeout can silently drop spans under load on hot paths (pushsync/retrieval), especially with the default sampling ratio of 1.0 — and a dropped span gives no signal. Consider exposing max-queue/timeout knobs, or documenting that OTEL_BSP_* env vars are honored.

Med — harden the OTLP exporter (newOTLPClient)
Add WithCompression(gzip) (cheap bandwidth win at scale); optionally WithTimeout/WithRetry for resilience to collector blips.

Med — a few bare spans
pkg/storer/netstore.go get/put spans carry only success — add the chunk address. pingpong spans lack a peer-address attribute. (Error recording via RecordError/SetStatus across pushsync/retrieval/pusher/api is consistent — nicely done.)

Low / follow-up — no spans yet on postage/kademlia/salud (postage would be the most useful next add for an end-to-end "where did this upload go"); baggage isn't propagated (only W3C TraceContext) — adding it would let a key like batch_id follow an upload across hops.

Nice touches: the versioned binary p2p span carrier (clean rolling-upgrade story), the configurable parent-based sampler with correct cross-node sampled-flag propagation, and a genuinely zero-cost no-op path when tracing is disabled.

Overall solid and ready — the one I'd suggest landing with this PR is the resource-attribute set (service.instance.id / service.version / WithFromEnv), since it's what makes traces distinguishable per-node in the backend.

@gacevicljubisa

gacevicljubisa commented Jun 12, 2026

Copy link
Copy Markdown
Member

@martinconic maybe it would help if we have a log of info level that confirms tracing is actually enabled and wired up

@gacevicljubisa

Copy link
Copy Markdown
Member

@darkobas2 any suggestion regarding the flags for this setup? would it be better to leave the same name tracing-endpoint as it was till now, because currently in this PR, we are deprecating tracing-endpoint and introducing the new set of flags. Currently in this PR:

  • "tracing-otlp-endpoint"
  • "tracing-otlp-insecure"
  • "tracing-otlp-ca-file"
  • "tracing-otlp-protocol"
  • "tracing-sampling-ratio"

@darkobas2

Copy link
Copy Markdown
Contributor

@gacevicljubisa — I'd drop the otlp infix and go with a generic tracing-* family instead of tracing-otlp-*:

tracing-endpoint, tracing-insecure, tracing-ca-file, tracing-protocol, tracing-sampling-ratio

Why:

  • OTLP is a transport detail; the flag concept is just "where traces go". Generic names are future-proof if the transport ever changes again, and avoid the awkward mix of tracing-endpoint (no prefix) sitting next to tracing-otlp-protocol (prefixed).
  • Keeps the familiar tracing-endpoint name — less doc/muscle-memory churn.

The one catch to handle: tracing-endpoint changes meaning — it used to be a Jaeger agent address (UDP, e.g. :6831), now it's an OTLP collector endpoint (gRPC :4317 / HTTP :4318). Same name, different required value = silent-breakage risk (a node keeps starting but points OTLP at a dead Jaeger port and quietly drops traces). So if we reuse the name, please pair it with:

  1. A startup info log confirming tracing is wired up — tracing enabled → endpoint=… protocol=… sampling=…, plus a clear error if the exporter can't reach the endpoint. (This also covers your other comment about confirming tracing is actually enabled.)
  2. A migration note in the changelog/release notes (Jaeger UDP → OTLP; ports change).

That keeps the naming clean and makes the semantic change loud instead of silent. No strong opinion on the tracing-protocol default (grpc vs http) — either's fine; http/4318 is marginally easier to front with standard proxies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs: use of opentracing spans and jaeger with bee

7 participants