Skip to content

Make the retained dep graph deterministic under the parallel frontend#157352

Open
xmakro wants to merge 2 commits into
rust-lang:mainfrom
xmakro:fix/parallel-dep-graph-determinism
Open

Make the retained dep graph deterministic under the parallel frontend#157352
xmakro wants to merge 2 commits into
rust-lang:mainfrom
xmakro:fix/parallel-dep-graph-determinism

Conversation

@xmakro
Copy link
Copy Markdown

@xmakro xmakro commented Jun 3, 2026

Part of #154314.

The in-memory dep graph kept for -Zquery-dep-graph is built by GraphEncoder::record, which pushed each node using try_lock and silently dropped the node when the lock was contended. Single-threaded the only contention is a query forced re-entrantly from within with_retained_dep_graph, so dropping was harmless. Under the parallel frontend several encoder threads record nodes at the same time, so a contended try_lock dropped nodes and edges at random, leaving the retained graph nondeterministic. The dep-graph ui tests, which assert on its contents, then failed intermittently.

The reentrancy came from check_paths emitting its diagnostics while holding the retained graph: the error path calls def_path_str, which can create new dep nodes and re-enter the encoder. This reads the path results into owned values while the lock is held and emits the diagnostics afterwards, so record never re-enters while the lock is held. record can then block on the lock unconditionally instead of using try_lock, so concurrent encoders never drop a node or edge.

The retained graph has only one other reader, dump_graph (-Zdump-dep-graph), which also holds the lock while it formats nodes. Formatting a DepNode goes through its Debug impl, which uses def_path_debug_str; that path is deliberately query-free (it reads the untracked definitions and crate store), so it never records a node and cannot re-enter record. Concurrent encoder threads blocking on the lock is ordinary contention rather than a deadlock, since a thread holding the lock never waits on another thread to record.

This only affects compilation with -Zquery-dep-graph; with the flag off the retained graph is absent and nothing changes.

The 7 dep-graph ui tests are re-enabled under the parallel frontend. Validated with the directives neutralized: all 7 fail at --parallel-frontend-threads=16 --iteration-count=30; with the fix they pass at 300 iterations, single-threaded exact comparison, and across tests/incremental.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 3, 2026
@xmakro xmakro force-pushed the fix/parallel-dep-graph-determinism branch from 0a6b168 to e4aedca Compare June 3, 2026 03:04
The in-memory dep graph kept for `-Zquery-dep-graph` is built by
`GraphEncoder::record`, which pushed each node using `try_lock` and
silently dropped the node when the lock was contended. Single-threaded
the only contention is a query forced re-entrantly from within
`with_retained_dep_graph`, so dropping was harmless. Under the parallel
frontend several encoder threads record nodes at the same time, so a
contended `try_lock` dropped nodes and edges at random, leaving the
retained graph nondeterministic and making the dep-graph ui tests, which
assert on its contents, fail intermittently.

Remove that reentrancy in `check_paths`: read the graph into owned
results while the lock is held, then emit the diagnostics afterwards.
The only query forced inside the old closure was `def_path_str` (for the
error messages), which can create dep nodes and re-enter the encoder;
doing the emission after the lock is released means `record` never
re-enters while the lock is held. `record` can then block on the lock
unconditionally instead of using `try_lock`, so concurrent encoders
never drop a node or edge.

This only affects compilation with `-Zquery-dep-graph`; with the flag
off the retained graph is absent and nothing changes.
@xmakro xmakro force-pushed the fix/parallel-dep-graph-determinism branch from e4aedca to 4f45cde Compare June 3, 2026 03:19
@xmakro xmakro marked this pull request as ready for review June 3, 2026 04:30
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 3, 2026
@rustbot rustbot removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Jun 3, 2026
@rustbot
Copy link
Copy Markdown
Collaborator

rustbot commented Jun 3, 2026

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Kivooeo (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue
Why was this reviewer chosen?

The reviewer was selected based on:

  • Owners of files modified in this PR: compiler
  • compiler expanded to 73 candidates
  • Random selection from 17 candidates

@zetanumbers
Copy link
Copy Markdown
Contributor

You see the problem is that we are trying to keep dep graph's reflexive-transitive closure confluent. As such for parallel frontend we need to emit errors while holding a mutex lock. However looking at your PR made me think of a different way of keeping this property. You can sort the result vector to have an optimal operation order.

@Kivooeo
Copy link
Copy Markdown
Member

Kivooeo commented Jun 3, 2026

I'm not a good reviewer for that

r? nnethercote

@rustbot rustbot assigned nnethercote and unassigned Kivooeo Jun 3, 2026
@petrochenkov petrochenkov self-assigned this Jun 3, 2026
These tests assert on the contents of the retained dep graph, which is now
built deterministically under the parallel frontend, so they no longer need
to be ignored there. Replace the ignore-parallel-frontend directives with
blank lines rather than deleting them, so the following line numbers stay the
same and the expected output is unchanged.
@xmakro xmakro force-pushed the fix/parallel-dep-graph-determinism branch from 4f45cde to 6238c17 Compare June 3, 2026 15:54
@xmakro
Copy link
Copy Markdown
Author

xmakro commented Jun 3, 2026

You see the problem is that we are trying to keep dep graph's reflexive-transitive closure confluent. As such for parallel frontend we need to emit errors while holding a mutex lock. However looking at your PR made me think of a different way of keeping this property. You can sort the result vector to have an optimal operation order.

Thanks for the comment. My understanding is that assert_dep_graph is single-threaded. It runs in save_dep_graph before the par_join, and after all queries have finished, so nothing else is recording into the retained graph by then. But what could happen is reentrancy where check_paths holds the lock, emits an error, that calls def_path_str, which creates a dep node and re-enters record on the same thread, which then blocks on the lock it's already holding. So I collect the path results under the lock and emit afterwards. I don't think the emit order matters for the closure here, since every has_path is decided under the lock before anything is printed and nothing reads reachability afterwards, is this correct? The other option I see to make this more clear is to move the graph out of the lock, then we can avoid the lock during this phase, and keep the emit inside the for loop.

The important part is that the record() is called concurrently during analysis and codegen, and on a failed try_lock, the old code skipped the push, and then dropped that node/edge from the retained graph. So I think the change to lock() is needed. And then the other change to store items in the results vector becomes the follow up to make sure we don't deadlock on reentrancy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants