Avoid false-positive settlement cancellations during node receipt lag#4510
Avoid false-positive settlement cancellations during node receipt lag#4510squadgazzz wants to merge 11 commits into
Conversation
4c41108 to
87d39e9
Compare
6924402 to
cd96b8e
Compare
cd96b8e to
e7efe37
Compare
|
Claude finished @squadgazzz's task in 3m 6s —— View job PR Review
The diagnosis (Avalanche reports a new block before its receipt index catches up; re-simulation reverts on the already-mined tx and the driver cancels it) matches what's been seen in logs, and the fix — looking up the Correctness
Performance
Nits
Tests
VerdictApprove with the cancel-on-double-error fix (item 1). Items 3, 6, and the chain-specific comment (5) are quick wins worth bundling in. |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to handle submitted settlements whose receipts are not yet available by querying the Settlement event on both latest and pending blocks to classify the re-simulation outcome. The reviewer identified a critical issue where many Ethereum JSON-RPC providers do not support eth_getLogs on the pending block tag, which would cause the query to fail and lead to false-positive transaction cancellations. To address this, the reviewer suggested updating classify_resimulation to accept the Result of the pending query and check for known false-positive reverts (like GPv2: order filled) when the pending log query fails, along with updating the corresponding unit tests.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
@claude review |
|
Claude finished @squadgazzz's task in 3m 59s —— View job PR Review
The latest revision is a solid improvement over the version I reviewed earlier. Pinning the cancel decision to the signer's A few things I'd want addressed before merging: Correctness
Carry-over from the previous review
Tests / coverage
Nits
VerdictThe cancel logic is correctly hardened — items 1 (block-pinning the log query) and 4 (docstring swap) are the two I'd want before merging. Items 2, 3, 5, 6 are quick-win polish; the rest are nits. |
|
Claude finished @squadgazzz's task in 2m 25s —— View job PR Review
The latest revision lands all the blocking items from the prior reviews:
Nice cleanup. A few non-blocking things I'd consider folding in: Non-blocking observations
VerdictLooks good to merge as-is. Items 1 and 4 are worth a follow-up; 2 and 3 are pure optimisations that can wait. |
There was a problem hiding this comment.
Code Review
This pull request updates the mempool transaction monitoring logic to prevent false-positive simulation reverts. It does this by checking if a transaction has already been mined (via eth_getLogs) or is still queued in the mempool (via pending nonce) before attempting cancellation. Feedback points out that unconditionally querying logs and pending nonces on every block is highly inefficient and suggests lazily evaluating these queries only when estimate_gas actually returns a revert error.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| let (gas, mined, pending_nonce) = futures::join!( | ||
| self.ethereum.estimate_gas(tx.clone()), | ||
| self.ethereum.successful_settlement_block( | ||
| hash, | ||
| submission_block.0, | ||
| block.number, | ||
| ), | ||
| self.ethereum.pending_transaction_count(signer), | ||
| ); | ||
|
|
||
| if let Ok(Some(included_in_block)) = mined { | ||
| // Found on-chain in [submission_block, head]. The receipt-by-hash | ||
| // lookup is just lagging, and scanning the whole range catches it | ||
| // even if the block stream coalesced past the block it mined in. | ||
| // Report success now rather than wait on the receipt and risk the | ||
| // deadline. | ||
| tracing::info!( | ||
| ?hash, | ||
| ?included_in_block, | ||
| "settlement found on-chain via getLogs, treating as success" | ||
| ); | ||
| return Ok(SubmissionSuccess { | ||
| tx_hash: hash, | ||
| submitted_at_block: submission_block, | ||
| included_in_block, | ||
| }); | ||
| } | ||
|
|
||
| let resimulation_reverted = matches!(&gas, Err(err) if err.is_revert()); | ||
| if requires_cancellation(resimulation_reverted, &pending_nonce, nonce) { | ||
| tracing::info!( | ||
| settle_tx_hash = ?hash, | ||
| err = ?gas.as_ref().err(), | ||
| "tx dropped from the mempool without mining, cancelling" | ||
| ); | ||
| let _ = self.cancel(mempool, final_gas_price, signer, nonce).await; | ||
| return Err(Error::SimulationRevert { | ||
| submitted_at_block: submission_block, | ||
| reverted_at_block: block.number.into(), | ||
| }); | ||
| } else if let Err(err) = &pending_nonce { | ||
| tracing::warn!( | ||
| ?hash, | ||
| ?err, | ||
| "couldn't fetch the pending nonce, not cancelling" | ||
| ); | ||
| } else if resimulation_reverted { | ||
| tracing::debug!( | ||
| ?hash, | ||
| "tx still queued in the mempool, waiting for inclusion" | ||
| ); | ||
| } else if let Err(err) = &gas { | ||
| tracing::warn!(?hash, ?err, "couldn't re-simulate tx"); | ||
| } |
There was a problem hiding this comment.
Unconditionally querying eth_getLogs (successful_settlement_block) and eth_getTransactionCount (pending_transaction_count) on every single block is highly inefficient and will hammer the RPC node, especially on fast-block chains like Avalanche or Arbitrum. Since these checks are only needed to distinguish between a real revert and a false positive when the transaction re-simulation actually fails, they should be lazily evaluated only when estimate_gas returns a revert error.
let gas = self.ethereum.estimate_gas(tx.clone()).await;
let resimulation_reverted = matches!(&gas, Err(err) if err.is_revert());
if resimulation_reverted {
let (mined, pending_nonce) = futures::join!(
self.ethereum.successful_settlement_block(
hash,
submission_block.0,
block.number,
),
self.ethereum.pending_transaction_count(signer),
);
if let Ok(Some(included_in_block)) = mined {
tracing::info!(
?hash,
?included_in_block,
"settlement found on-chain via getLogs, treating as success"
);
return Ok(SubmissionSuccess {
tx_hash: hash,
submitted_at_block: submission_block,
included_in_block,
});
}
if requires_cancellation(resimulation_reverted, &pending_nonce, nonce) {
tracing::info!(
settle_tx_hash = ?hash,
err = ?gas.as_ref().err(),
"tx dropped from the mempool without mining, cancelling"
);
let _ = self.cancel(mempool, final_gas_price, signer, nonce).await;
return Err(Error::SimulationRevert {
submitted_at_block: submission_block,
reverted_at_block: block.number.into(),
});
} else if let Err(err) = &pending_nonce {
tracing::warn!(
?hash,
?err,
"couldn't fetch the pending nonce, not cancelling"
);
} else {
tracing::debug!(
?hash,
"tx still queued in the mempool, waiting for inclusion"
);
}
} else if let Err(err) = &gas {
tracing::warn!(?hash, ?err, "couldn't re-simulate tx");
}There was a problem hiding this comment.
That loop runs per in-flight settlement, not globally per block, so I think it is fine to fire everything at once.
|
Once you tested this on avalanche please share a graph of the improved submission rate or some other proof that this does indeed work correctly in prod. |
MartinquaXD
left a comment
There was a problem hiding this comment.
LGTM with some minor details.

Description
On some chains (Avalanche especially) the node reports a new block before its by-hash receipt lookup catches up. Right after our settlement mines, the driver asks "did my tx land?" by hash, gets "not found", and re-simulates the tx against the pending state to check it still works. That state already includes our just-mined (or still-queued) settlement, so the re-run reverts (
GPv2: order filledon a full fill,transfer amount exceeds balanceon a partial). The driver reads that as a real revert, cancels (the cancel then fails withnonce too low), and reports the settlement as failed even though it landed. This is the main source of the Avalanche "too many failing settlements" alert, and the settlements actually succeed on-chain.Instead of trusting the lagging by-hash lookup, the driver now leans on two signals that do not lag:
Settlementevent in the latest block. The contract emits it only on a successful settle, andeth_getLogsreads block logs directly, so it sees the event while the receipt-by-hash lookup is still catching up. If it is there, the tx mined, so report success.eth_getTransactionCountwith thependingtag). While our tx is still in the mempool the pending nonce sits above the nonce we submitted with, so the re-simulation revert is just our own queued tx re-applied, a false positive, and we keep waiting. Only when the pending nonce shows our tx has left the mempool, and the re-simulation still reverts, do we cancel.An earlier version of this PR used
eth_getLogson thependingblock for the second signal, but measuring our nodes showedpendinglog queries are rejected on 4 of 11 chains (polygon, bnb, arbitrum, linea, with "pending logs are not supported").eth_getTransactionCountonpendingis accepted on all 11, and since the re-simulation already runs against thependingstate, the two signals read the same state.Old code cancelled whenever the re-simulation reverted, minus a brittle
GPv2: order filledstring match that only caught full fills. New code cancels only when the tx is provably gone from the mempool, so a settlement that is mined-but-receipt-lagging or still queued is never cancelled.Changes
Ethereum::contains_successful_tx, which checks for ourSettlementevent in the latest block viaeth_getLogs.Ethereum::pending_transaction_count, the signer's nonce including the mempool.GPv2: order filledrevert-byte special case.How to test
New unit tests cover the cancel decision (
requires_cancellation): a revert with the tx gone from the mempool cancels, a revert while the tx is still queued waits, and an unknown pending nonce never cancels.The log and nonce lookups and the receipt-lag race itself have no automated coverage. The e2e harness runs anvil, which returns receipts immediately, so it cannot reproduce the lag. Verify on Avalanche: settlements that land on-chain should stop being counted as
driver_settlements{result="SubmissionError"}and should stop logging a cancel right after a re-simulation revert, while still showing up as on-chain settlements.