test: fix intermittent IS lock timeouts by syncing mempools before waiting #7241

UdjinM6 wants to merge 4 commits into dashpay:develop from
Conversation
…waiting

When mocktime is set, NodeClock::now() returns a frozen value. After bump_mocktime(), the trickle relay timer fires once, then m_next_inv_send_time is set ahead of the frozen clock and never fires again. If a TX inv misses that single relay window, it stays stuck in the sender's mempool indefinitely — regardless of real-time elapsed.

Add sync_mempools() after bump_mocktime() in all test paths that send a TX and immediately wait for an IS lock. This ensures all nodes (especially MNs that need to sign) have the TX before polling begins.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
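The frozen-clock failure mode described above can be illustrated with a small, self-contained simulation. This is a toy model only: the real logic lives in net_processing, and the reschedule interval used here is an invented stand-in.

```python
import random

class TrickleTimer:
    """Toy model of the trickle relay timer under mocktime.

    Illustration only: not the actual NodeClock or net_processing code,
    and the reschedule interval is an invented stand-in."""
    def __init__(self, now):
        self.next_inv_send_time = now  # due immediately on first poll

    def maybe_relay(self, now):
        """Relay pending invs if the timer is due, then reschedule ahead of 'now'."""
        if now >= self.next_inv_send_time:
            self.next_inv_send_time = now + random.randint(1, 5)
            return True
        return False

mocktime = 1_000_000          # frozen clock: only explicit bumps advance it
timer = TrickleTimer(mocktime)

mocktime += 30                # bump_mocktime(30): the timer fires once...
assert timer.maybe_relay(mocktime)

# ...then the clock stays frozen, so polling without further bumps never
# fires again, no matter how much real time passes.
assert not any(timer.maybe_relay(mocktime) for _ in range(100))

# sync_mempools() bumps mocktime on every poll iteration, which keeps
# unfreezing the timer.
fired = []
for _ in range(10):
    mocktime += 3             # per-iteration bump during sync_mempools()
    fired.append(timer.maybe_relay(mocktime))
assert any(fired)
```

A single bump buys exactly one relay window; repeated small bumps during polling are what keep the timer live, which is why the fix moves the sync after the bump.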
Walkthrough

The test framework's wait_for_instantlock was generalized from a single txid/node call to accept multiple txids, an explicit nodes list, a timeout, and a skip_sync flag. The implementation now optionally bumps mock time and syncs mempools before polling, and it requires InstantLock=true for all provided txids on each node in the nodes list. Multiple functional tests were updated to remove per-node wait loops, individual mock-time bumps, and some mempool syncs, replacing them with centralized calls to the new wait_for_instantlock signature (sometimes using skip_sync=True).

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TH as TestHarness
    participant N as Node(s)
    participant M as Mempools
    participant IS as InstantSend/Quorum
    TH->>N: send/broadcast tx(s)
    alt skip_sync is False
        TH->>TH: wait_for_instantlock(*txids, nodes=..., timeout, skip_sync=False)
        TH->>N: bump_mocktime(+30s)
        TH->>M: sync_mempools()
    else skip_sync is True
        TH->>TH: wait_for_instantlock(*txids, nodes=..., skip_sync=True)
    end
    N->>IS: propagate tx(s)
    IS-->>N: instantlock produced
    TH->>N: poll getrawtransaction/getmempool for InstantLock on all txids/nodes
```
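The generalized helper described in the walkthrough might look roughly like this. It is a sketch reconstructed from the walkthrough and review comments, not the actual dashpay code; `FakeNode` is invented so the example is self-contained, and the mocktime bump/sync step is only noted in a comment.

```python
import time

class FakeNode:
    """Invented stand-in for a TestNode; just records which txids are locked."""
    def __init__(self):
        self.locked = set()

    def getrawtransaction(self, txid, verbose):
        return {"instantlock": txid in self.locked}

def wait_for_instantlock(*txids, nodes, timeout=60, skip_sync=False, poll=0.05):
    """Poll until every txid reports instantlock=True on every node.

    In the real helper, skip_sync=False would first bump mocktime and sync
    mempools so all masternodes hold the tx before polling begins; that
    step is omitted in this sketch."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(node.getrawtransaction(txid, True)["instantlock"]
               for txid in txids for node in nodes):
            return True
        time.sleep(poll)
    raise AssertionError("InstantSend lock not seen on all nodes in time")

nodes = [FakeNode(), FakeNode()]
for n in nodes:
    n.locked.update({"txid_a", "txid_b"})
assert wait_for_instantlock("txid_a", "txid_b", nodes=nodes, timeout=1)
```

Note how a single call covers the full cross-product of txids and nodes, which is exactly what replaces the per-node wait loops in the updated tests.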
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62d47f7476
```python
# 3 nodes should be enough to create an IS lock even if nodes 4 and 5 (which have no tx itself)
# are the only "neighbours" in intra-quorum connections for one of them.
self.bump_mocktime(30)
self.sync_mempools(self.nodes[:3])
```
Sync three masternodes before waiting for the normal IS lock
In this test node 0 is only the control node while nodes 1-4 are the masternodes (feature_llmq_is_retroactive.py:11-13), and mine_cycle_quorum() uses the DIP0024 test quorum, which needs 3 members to sign. self.nodes[:3] only covers node0, node1, and node2, so this wait guarantees the tx reached at most two signers. If node3 is the third quorum member and it misses the relay window, wait_for_instantlock() can still time out here, leaving the intermittent failure in the main "trying normal IS lock" path.
thepastaclaw left a comment

Code Review
The PR correctly fixes the identified intermittent IS lock timeouts by replacing wait_for_tx with sync_mempools, which auto-advances mocktime during polling and thus unfreezes the trickle relay timer. However, the same unpatched send→bump_mocktime→wait_for_instantlock pattern exists in feature_llmq_is_cl_conflicts.py (no whitelist, no mitigation), and the now-unused wait_for_tx method should be removed.
Reviewed commit: 62d47f7
🟡 2 suggestion(s) | 💬 1 nitpick(s)
3 additional findings
🟡 suggestion: Same unpatched send→bump_mocktime→wait_for_instantlock pattern
test/functional/feature_llmq_is_cl_conflicts.py (lines 244-248)
This test has no -whitelist on any node, yet follows the exact pattern the PR is fixing: sendrawtransaction(rawtx2) → bump_mocktime(60) → wait_for_instantlock on all nodes. Under frozen mocktime, the trickle relay timer fires once after the bump then freezes, so rawtx2 may never reach masternodes on other nodes. The larger bump (60 vs 30) doesn't help because m_next_inv_send_time still gets set ahead of the frozen clock after the single fire. sync_mempools() would fix this because it auto-advances mocktime by 3s on each poll iteration (line 854-855 of test_framework.py), repeatedly unfreezing the relay timer.
💡 Suggested change

```python
self.nodes[0].sendrawtransaction(rawtx2)
# bump mocktime to force tx relay
self.bump_mocktime(60)
self.sync_mempools()
for node in self.nodes:
    self.wait_for_instantlock(rawtx2_txid, node)
```
🟡 suggestion: Same pattern in ZMQ test at three locations (lines 298, 351, 385)
test/functional/interface_zmq_dash.py (lines 298-300)
Three instances of send→bump_mocktime(30)→wait_for_instantlock without sync_mempools exist in this file. Node 0 has -whitelist=127.0.0.1 (line 120), which bypasses trickle relay for relay from node 0 to directly connected peers, significantly reducing the risk. However, the PR itself adds sync_mempools to feature_llmq_is_retroactive.py which also has -whitelist=127.0.0.1 on node 0, suggesting whitelist alone may not be fully sufficient (e.g., if relay between masternodes over intra-quorum connections is needed). Worth considering for consistency, but lower priority since the whitelist provides substantial mitigation.
💬 nitpick: wait_for_tx is now dead code
test/functional/feature_llmq_is_retroactive.py (lines 39-46)
All four call sites of wait_for_tx have been replaced with sync_mempools. Grep confirms the method name only appears at its definition (line 39). It should be removed to avoid confusion.
…ead code

Add sync_mempools() to two more tests with the same vulnerable pattern:
- feature_llmq_is_cl_conflicts.py: no whitelist, waits on all nodes
- interface_zmq_dash.py: three locations (low risk due to whitelist, but fixed for consistency)

Remove dead wait_for_tx() method from feature_llmq_is_retroactive.py whose call sites were all replaced with sync_mempools in the prior commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thepastaclaw
left a comment
Incremental Review (62d47f7 → a0e4260)
Reviewed the incremental changes — no new issues found.
Changes reviewed:
- `feature_llmq_is_cl_conflicts.py` — `sync_mempools()` correctly added between `bump_mocktime(60)` and the `wait_for_instantlock` loop at line 247
- `feature_llmq_is_retroactive.py` — removed `wait_for_tx` method confirmed as dead code (no remaining callers)
- `interface_zmq_dash.py` — three new `sync_mempools()` calls at lines 300, 354, 389 follow the established `bump_mocktime → sync_mempools → wait_for_instantlock` pattern
The single-node test (feature_llmq_singlenode.py) correctly omits sync_mempools() as there's nothing to sync.
✅ All changes look correct.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c79c2d981d
thepastaclaw left a comment

Code Review
The refactoring cleanly centralizes the bump+sync+wait pattern into wait_for_instantlock and all callers are correctly converted to the new variadic signature. Prior findings (missing sync in feature_llmq_is_cl_conflicts.py, dead wait_for_tx code) are fully addressed. However, the nodes parameter now conflates sync scope with check scope: callers passing nodes=[single_node] make sync_mempools a no-op (one node is trivially synced with itself), potentially reintroducing the propagation gap this PR aims to fix.
Reviewed commit: 003e50b
🟡 3 suggestion(s)
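The single-node no-op described in this review can be demonstrated with a toy version of the sync loop. The mempool sets and bump counting here are invented for illustration; the framework's real sync_mempools is more involved.

```python
def sync_mempools_sketch(mempools, bump, max_iters=100):
    """Poll until all mempool sets match, bumping mocktime each iteration,
    as the review describes the framework doing. Returns iterations used.

    Toy model for illustration; not the real test_framework implementation."""
    for i in range(max_iters):
        if all(m == mempools[0] for m in mempools):
            return i
        bump(3)  # the per-iteration mocktime bump that unfreezes relay
    raise AssertionError("mempools never converged")

bumps = []
# With a single node the check is trivially satisfied on the first pass,
# so no mocktime bumps ever happen: the relay timer stays frozen and the
# tx may never reach the quorum members that must sign the IS lock.
assert sync_mempools_sketch([{"tx1"}], bumps.append) == 0
assert bumps == []
```

This is why passing `nodes=[self.nodes[0]]` quietly removes the repeated-bump behavior the PR relies on, even though the call still "succeeds".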
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `test/functional/test_framework/test_framework.py`:
- [SUGGESTION] lines 1949-1954: `nodes` parameter conflates sync scope with check scope — single-node callers skip effective mempool sync
The `nodes` parameter controls both which nodes are passed to `sync_mempools(nodes)` and which nodes are checked for the IS lock. When callers pass `nodes=[self.nodes[0]]`, `sync_mempools` with a single node returns immediately (trivially satisfied — one node's mempool always equals itself). No multi-hop mocktime bumping occurs.
The original code at sites like `rpc_verifyislock.py:44` and `interface_zmq_dash.py:384` did `sync_mempools()` (all nodes), which kept bumping mocktime by 3s each iteration until full propagation. Since `check_instantlock`'s `wait_until` loop does NOT bump mocktime, quorum members that didn't receive the tx in the single `bump_mocktime(30)` relay window will never get it — the exact failure mode this PR addresses.
Consider always syncing all connected nodes' mempools regardless of which nodes are checked for the IS lock — e.g., `self.sync_mempools()` instead of `self.sync_mempools(nodes)`.
- [SUGGESTION] lines 1958-1968: Combined `all()` check gives multi-tx callers a shared timeout budget instead of per-pair budgets
The refactored helper waits for `all(node.getrawtransaction(txid, True)["instantlock"] for txid in txids for node in nodes)` inside a single `wait_until(..., timeout=60)`. Before this change, call sites like `feature_llmq_is_cl_conflicts.py` waited each tx/node pair independently, so each pair got its own 60-second budget. Now `self.wait_for_instantlock(rawtx1_txid, rawtx4_txid)` gives every node and both txids only one shared 60-second window.
In practice this is unlikely to cause issues since IS locks typically arrive within seconds of each other, but it's a semantic change worth noting. If timing becomes a problem on slow CI, consider iterating per-txid inside the helper.
In `test/functional/interface_zmq_dash.py`:
- [SUGGESTION] line 384: `nodes=[self.nodes[0]]` drops all-node sync that was in original code
The original code did `self.sync_mempools()` (all 6 nodes) then `self.wait_for_instantlock(collateral, self.nodes[0])`. The new code does `self.wait_for_instantlock(collateral, nodes=[self.nodes[0]])` which only syncs node0's mempool (a no-op). The governance collateral tx needs to reach quorum members for the IS lock to form. Simplest fix: use the default `nodes=self.nodes`.
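The shared-versus-per-txid timeout budget difference noted for the refactored helper can be sketched as follows. `wait_until`, `locks`, and `is_locked` are invented for illustration; they only mimic the shape of the framework's polling.

```python
import time

def wait_until(predicate, timeout):
    """Minimal polling helper in the spirit of the framework's wait_until."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(0.01)
    raise TimeoutError("predicate not satisfied in time")

# Invented lock table: which (txid, node) pairs have an IS lock.
locks = {("tx1", "n1"): True, ("tx2", "n1"): True}
is_locked = lambda t, n: locks.get((t, n), False)
txids, nodes = ["tx1", "tx2"], ["n1"]

# Refactored style: one shared budget covers every (txid, node) pair.
wait_until(lambda: all(is_locked(t, n) for t in txids for n in nodes), timeout=1)

# Pre-refactor style: each txid gets its own full timeout window.
for t in txids:
    wait_until(lambda t=t: all(is_locked(t, n) for n in nodes), timeout=1)
```

With N txids, the second style grants up to N full windows while the first grants one; as the review says, that rarely matters in practice since IS locks tend to arrive within seconds of each other.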
…[0]]

wait_for_instantlock(nodes=[self.nodes[0]]) makes the internal sync_mempools a no-op since it only syncs the sender with itself. The MN signer nodes still need the TX, so drop the nodes= restriction to sync all nodes including dynamically added masternodes.

Fixes: feature_llmq_singlenode.py (2 sites), feature_notifications.py, interface_zmq_dash.py (governance collateral)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 003e50b to d8ac781
Incremental Review (003e50b → d8ac781)

Reviewed the incremental changes — no new issues found. Changes reviewed: a single commit removing the nodes= restriction. Prior findings resolved.
Reviewed commit: d8ac781
Issue being fixed or feature implemented

When mocktime is set, NodeClock::now() returns a frozen value. After bump_mocktime(), the trickle relay timer fires once, then m_next_inv_send_time is set ahead of the frozen clock and never fires again. If a TX inv misses that single relay window, it stays stuck in the sender's mempool indefinitely — regardless of real-time elapsed.

What was done?

Add sync_mempools() after bump_mocktime() in all test paths that send a TX and immediately wait for an IS lock. This ensures all nodes (especially MNs that need to sign) have the TX before polling begins.

How Has This Been Tested?

Run the tests; they should no longer fail in these parts.

Breaking Changes

n/a

Checklist: