Skip to content

fix: Remove initial-exec TLS storage to enable dlopen#522

Open
r1viollet wants to merge 4 commits intomainfrom
r1viollet/fix/loader-rtld-global-constructor
Open

fix: Remove initial-exec TLS storage to enable dlopen#522
r1viollet wants to merge 4 commits intomainfrom
r1viollet/fix/loader-rtld-global-constructor

Conversation

@r1viollet
Copy link
Collaborator

@r1viollet r1viollet commented Mar 23, 2026

What does this PR do?

This is a revert of the following PR in favour of changes proposed here.

Fixes the failure reported in #520 / #521: libdd_profiling.so fails to load its embedded .so when opened via dlopen.
Musl hard-rejects initial-exec TLS cross-library references when the defining library is dynamically loaded (ldso/dynlink.c:456: tls_id > static_tls_cnt), regardless of symbol visibility.

Motivation

Allowing dlopen use case.

How

Replace direct initial-exec TLS storage of TrackerThreadLocalState with pthread_key_t + pthread_getspecific/pthread_setspecific:

  • AllocationTracker::_tl_state_key is an std::atomic<pthread_key_t>, initialized to a sentinel kInvalidKey
  • ensure_key_initialized() creates the key once via CAS — called from allocation_tracking_init()
  • get_tl_state() is now a single pthread_getspecific call with no initialized flag check (null pointer is the sentinel)
  • init_tl_state() heap-allocates the state and registers it with pthread_setspecific; the destructor is called automatically on thread exit

This eliminates the cross-library TLS reference entirely, so the embedded .so no longer requires any symbol from the loader at load time. The RTLD_NOLOAD self-promotion mechanism (which also used dladdr, unavailable in the -nolibc universal build) is removed as dead code.

How to test the change?

A new test loader_rtld_global loads libdd_profiling.so via dlopen(..., RTLD_GLOBAL) and calls ddprof_start_profiling(). Without the fix it returns -1 (embedded library not loaded). With the fix it passes on both glibc (Ubuntu 24.04) and musl (Alpine 3.22).

Performance

Claude measured before and after accessing TLS state
Benchmark results: pthread_getspecific (branch) vs __thread initial-exec (main)

  ┌────────────────────────┬───────────────────────────────────────────┬─────────┬─────────┬───────────────────────────────┐                                                                                
  │       Benchmark        │         What one iteration covers         │  main   │ branch  │        Per-call delta         │
  ├────────────────────────┼───────────────────────────────────────────┼─────────┼─────────┼───────────────────────────────┤                                                                                
  │ BM_GetTlState          │ 4 threads × 10k get_tl_state() calls      │ 10.3 µs │ 27.1 µs │ +1.7 ns/call                  │
  ├────────────────────────┼───────────────────────────────────────────┼─────────┼─────────┼───────────────────────────────┤
  │ BM_ShortLived_Tracking │ 4 short-lived threads × 2k alloc/free ops │ 208 µs  │ 220 µs  │ ~1.5 ns/call (consistent)     │                                                                                
  ├────────────────────────┼───────────────────────────────────────────┼─────────┼─────────┼───────────────────────────────┤                                                                                
  │ BM_LongLived_Tracking  │ 4 persistent threads × 2k alloc/free ops  │ 315 µs  │ 315 µs  │ 0 (TLS lost in sampling work) │                                                                                
  └────────────────────────┴───────────────────────────────────────────┴─────────┴─────────┴───────────────────────────────┘                                                                                

So the TLS access is slower. Considering this is still O(nanoseconds) in the hot path, I hope this does not impact performance significantly.

@r1viollet
Copy link
Collaborator Author

WIP: Still some alpine / glibc differences to consider

@r1viollet
Copy link
Collaborator Author

There is an issue with the initial-exec + musl dlopen that needs a bigger change

@pr-commenter
Copy link

pr-commenter bot commented Mar 23, 2026

Benchmark results for collatz

Parameters

Baseline Candidate
config baseline candidate
profiler-version ddprof 0.24.0+bf23f795.103772975 ddprof 0.24.0+e8caf60b.104153116

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-collatz --preset cpu_only collatz_runner.sh same

@pr-commenter
Copy link

pr-commenter bot commented Mar 23, 2026

Benchmark results for BadBoggleSolver_run

Parameters

Baseline Candidate
config baseline candidate
profiler-version ddprof 0.24.0+bf23f795.103772975 ddprof 0.24.0+e8caf60b.104153116

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-bad-boggle-solver BadBoggleSolver_run work 1000 same

…er state

The embedded .so referenced ddprof_lib_state (defined in the loader) using
the initial-exec TLS model. On musl, initial-exec TLS cross-library references
are rejected when the defining library is loaded via dlopen, because musl
freezes static TLS slot allocation at program startup (ldso/dynlink.c:456).

Switch TrackerThreadLocalState storage to pthread_key_t + pthread_getspecific:
- Eliminates the cross-library initial-exec TLS reference entirely
- get_tl_state() returns nullptr when key is uninitialized (no initialized flag)
- init_tl_state() heap-allocates TrackerThreadLocalState via new/pthread_setspecific
- ensure_key_initialized() uses CAS for thread-safe one-time key creation
- Remove ddprof_lib_state from loader.c (no longer referenced by embedded .so)
- Remove tls_state_storage.h include (constants no longer needed)
- loader_rtld_global test now passes on both glibc and musl/Alpine

Co-Authored-By: Xavier Roche <xroche@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@r1viollet r1viollet force-pushed the r1viollet/fix/loader-rtld-global-constructor branch from 27d1c7a to e16987f Compare March 23, 2026 14:12
- Update loader_rtld_global_test.c comment to accurately describe both
  failure modes (glibc RTLD_GLOBAL timing and musl initial-exec TLS
  rejection) and the pthread_key_t fix
- Add comment on kInvalidKey sentinel explaining the POSIX assumption
- Add comment in free() explaining why pthread_key_delete is skipped
  (race with concurrent get_tl_state() callers)
- Document why ensure_key_initialized() is public (fork test needs it
  without a full ring-buffer init)
- Add BM_GetTlState microbenchmark to measure raw TLS access overhead
  (pthread_getspecific vs initial-exec __thread)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@r1viollet r1viollet marked this pull request as ready for review March 24, 2026 11:22
r1viollet and others added 2 commits March 24, 2026 12:29
…initialized() release

relaxed load could transiently return kInvalidKey on a thread that races
with ensure_key_initialized(), causing init_tl_state() to return nullptr
and silently drop all initial allocations for that thread.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@r1viollet r1viollet changed the title fix: promote loader symbols to global scope before loading embedded .so fix: Remove initial-exec TLS storage to enable dlopen Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants