
Signal before unlocking proxy context mutex in emscripten_proxy_finish #26582

Open
dschuff wants to merge 6 commits into emscripten-core:main from dschuff:proxy_cond

Conversation

@dschuff
Member

@dschuff dschuff commented Mar 31, 2026

When finishing a proxied call, the following race condition can happen:

thread 1 in emscripten_proxy_finish:

    pthread_mutex_lock(&ctx->sync.mutex);
    ctx->sync.state = DONE;
    remove_active_ctx(ctx);
    pthread_mutex_unlock(&ctx->sync.mutex);
--> thread is preempted or suspends here <---
    pthread_cond_signal(&ctx->sync.cond);

thread 2 in emscripten_proxy_sync_with_ctx:
(ctx is on this thread's stack)

  pthread_mutex_lock(&ctx.sync.mutex); <-- locks after unlock above
  while (ctx.sync.state == PENDING) { <--- reads sync.state == DONE
    pthread_cond_wait(&ctx.sync.cond, &ctx.sync.mutex); <-- doesn't run
  }
  pthread_mutex_unlock(&ctx.sync.mutex);
  int ret = ctx.sync.state == DONE;
  em_proxying_ctx_deinit(&ctx); <--- frees ctx and returns

Then thread 1 tries to run pthread_cond_signal on the freed ctx.

This same logic applies to cancel_ctx which is also called on the target
thread.

This may be what is causing flake in the pselect/ppoll tests on the CI
waterfall.

@dschuff dschuff requested review from sbc100 and tlively March 31, 2026 16:58
@sbc100
Collaborator

sbc100 commented Mar 31, 2026

The same pattern is repeated in cancel_ctx.

For some reason I think this must have been a deliberate pattern by @tlively when he wrote this code in #18810?

@sbc100
Collaborator

sbc100 commented Mar 31, 2026

I'm suspicious of the AI's root-cause analysis, even though this does indeed look like a real bug.

@sbc100
Collaborator

sbc100 commented Mar 31, 2026

The reason I'm suspicious is that I can't see how the addition of the ppoll/pselect tests would tickle this bug in a way that the existing tests for poll and select do not.

@sbc100
Collaborator

sbc100 commented Mar 31, 2026

I think the rationale here might be: "In many cases, it is desirable to signal after unlocking the mutex to avoid the "woken" thread immediately blocking again while trying to acquire the lock that the signaler still holds."

However, this might not be safe in this case due to the deallocation sequence.

@sbc100
Collaborator

sbc100 commented Mar 31, 2026

For the codesize failure I think you might just need to rebase.

@dschuff
Member Author

dschuff commented Mar 31, 2026

> I think the rationale here might be: "In many cases, it is desirable to signal after unlocking the mutex to avoid the "woken" thread immediately blocking again while trying to acquire the lock that the signaler still holds."
>
> However, this might not be safe in this case due to the deallocation sequence.

Yeah that makes sense. I am also not sure whether this is affecting the current flake, but it looked like enough of a real possibility that it seemed worth doing. I'll check the other places in this code too.

@dschuff
Member Author

dschuff commented Mar 31, 2026

I think we need to update cancel_ctx too, since it's called on the target thread, and could race the same way.

@sbc100
Collaborator

sbc100 commented Mar 31, 2026

LGTM, but I'm curious to see what @tlively has to say about the optimization here.

@dschuff
Member Author

dschuff commented Mar 31, 2026

FTR, it looks like #26586 is more likely the cause of the flake, but I still think the reasoning here is sound and we should apply this fix.

@tlively
Member

tlively commented Mar 31, 2026

I don't think there is a problem here. The actual communication happens via `ctx->sync.state = DONE`, which is correctly done before releasing the mutex. When thread 2 reads DONE, it knows it is safe to continue and does not need to wait for anything. Thread 1 still fires off a pthread_cond_signal, but it's ok that it doesn't wake anything up in this case; thread 2 is already safely proceeding with the work it would have started after receiving the notification.

@sbc100
Collaborator

sbc100 commented Mar 31, 2026

We chatted offline and agreed (IIRC) that there is an issue here. I think @tlively is going to look into a solution that maybe keeps the performance optimization here?

@dschuff
Member Author

dschuff commented Apr 1, 2026

We actually decided it wasn't worth trying to engineer something more performant right now. The problem can only happen if the proxying thread can wake up and try to acquire the lock faster than the target thread can continue to the next statement and unlock it, which can realistically only happen when the target thread gets suspended at the wrong time. (Actually this is exactly the same situation where we have the UAF in the current code, and we don't know of any real problems caused by it currently.)

dschuff added 2 commits April 1, 2026 00:14
This is an automatic change generated by tools/maint/rebaseline_tests.py.

The following (2) test expectation files were updated by
running the tests with `--rebaseline`:

```
codesize/test_codesize_minimal_pthreads.json: 26409 => 26409 [+0 bytes / +0.00%]
codesize/test_codesize_minimal_pthreads_memgrowth.json: 26812 => 26812 [+0 bytes / +0.00%]

Average change: +0.00% (+0.00% - +0.00%)
```
Collaborator

@sbc100 sbc100 left a comment


1 byte code size saving too! Let's go!

@dschuff dschuff enabled auto-merge (squash) April 1, 2026 00:21