Skip to content

fix: improve reconnect recovery with async lock fix and immediate redo trigger#323

Open
cxhello wants to merge 2 commits into
nacos-group:masterfrom
cxhello:fix/improve-reconnect-recovery
Open

fix: improve reconnect recovery with async lock fix and immediate redo trigger#323
cxhello wants to merge 2 commits into
nacos-group:masterfrom
cxhello:fix/improve-reconnect-recovery

Conversation

@cxhello
Copy link
Copy Markdown
Member

@cxhello cxhello commented May 21, 2026

What type of PR is this?

  • Bug fix
  • New feature
  • Enhancement
  • Refactoring
  • Documentation
  • CI/CD
  • Other

What does this PR do?

Two small fixes to improve reconnect recovery after gRPC connection drops:

  1. Fix async lock bug in ConnectResetRequestHandler: The handler used synchronous with on an asyncio.Lock, which does not actually acquire the lock (asyncio.Lock.__enter__ is a no-op without await acquire()). Changed to async with to properly guard the status check and server switch when handling ConnectResetRequest from the server.

  2. Trigger redo immediately on reconnect: After gRPC reconnection, on_connected() only set self._connected = True. The redo task polled _execute_redo_channel with a 30-second timeout, so instance re-registration could be delayed up to 30 seconds. Now on_connected() also calls the existing start_redo_task() method to signal the channel immediately, waking the redo task to re-register instances without delay.

Which issue(s) does this PR fix?

Related to alibaba/nacos#14106

How has this been tested?

  • Code review and static analysis of the async lock usage
  • Traced the redo task flow to confirm start_redo_task()_execute_redo_channel.put(None)_execute_redo_task wakes up → checks _connected (already True) → calls redo_task() to re-register instances
  • Verified no other synchronous lock usages remain in rpc_client.py

Checklist

  • Unit tests added/updated
  • Documentation updated (if needed)
  • No breaking changes (or described above)

cxhello added 2 commits May 21, 2026 10:36
…tHandler

The handler used synchronous 'with' on an asyncio.Lock, which does not
actually acquire the lock. Change to 'async with' to properly guard the
status check and server switch when handling ConnectResetRequest.

Signed-off-by: cxhello <caixiaohuichn@gmail.com>
…p to 30s

After gRPC reconnection, on_connected() only set self._connected = True.
The redo task polled _execute_redo_channel with a 30-second timeout, so
instance re-registration could be delayed up to 30 seconds.

Now on_connected() also calls start_redo_task() to signal the channel
immediately, waking the redo task to re-register instances without delay.

Signed-off-by: cxhello <caixiaohuichn@gmail.com>
@cxhello cxhello requested a review from Sunrisea May 21, 2026 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant