Skip to content

[improve][broker]Reduce the lock range of SimpleCache to enhance performance#25293

Open
poorbarcode wants to merge 12 commits intoapache:masterfrom
poorbarcode:improve/txn_simple_cache
Open

[improve][broker]Reduce the lock range of SimpleCache to enhance performance#25293
poorbarcode wants to merge 12 commits intoapache:masterfrom
poorbarcode:improve/txn_simple_cache

Conversation

@poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Mar 5, 2026

Motivation

#23062 improved the behaviour of initialising the Transaction Buffer: Synchronise the reader creation, read loop and the following process on its result. Maintain only one reader for each namespace. The reader is now not closed unless there is no snapshot read request in 1 minute.

However, SimpleCache is a global lock, and the lock force is too strong, causing all pulsar-transaction-snapshot-recover threads to get stuck in the creation of the first reader. When system resources are insufficient, the problem will be magnified infinitely

Screenshot 2026-03-05 at 22 31 03

Q1: The creation of the reader that reads __transaction_buffer_snapshot and the reading of existing messages are both executed synchronously. Is it necessary to change it to be completed asynchronously?

A1: It is not necessary, since the initialisation of all TransactionBuffers under the same namespace requires waiting for the reader to complete the processing of all messages, whether it is asynchronous or not is not important

Q2. Will these synchronisation operations cause the thread pulsar-transaction-snapshot-recover to get stuck and affect other functions

A2: No. The function of this thread is quite simple: 1. Initialise Transaction Buffer. 2. After the Transaction Buffer recovery is completed, handle the accumulated transaction write operations (subsequent writes will no longer use this thread).
So the stuck functions actually all need to wait for the reader's messages to be processed, and thus, no other impacts have been caused

Modifications

No longer wait for the initialisation of the reader to complete within the lock code block

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 4.2.0 milestone Mar 5, 2026
@poorbarcode poorbarcode self-assigned this Mar 5, 2026
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 5, 2026
@poorbarcode poorbarcode changed the title [improve][txn]Reduce the lock range of SimpleCache to enhance performce [improve][broker]Reduce the lock range of SimpleCache to enhance performce Mar 5, 2026
@BewareMyPower BewareMyPower changed the title [improve][broker]Reduce the lock range of SimpleCache to enhance performce [improve][broker]Reduce the lock range of SimpleCache to enhance performance Mar 5, 2026
Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poorbarcode Is it better just fix SimpleCache so get() does not run valueSupplier under a global synchronized lock? It will avoids per-thread TableView growth/leak risk.

@codecov-commenter
Copy link

codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.77%. Comparing base (3936ce4) to head (452d8d5).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...lsar/broker/transaction/buffer/impl/TableView.java 89.47% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #25293      +/-   ##
============================================
+ Coverage     72.25%   72.77%   +0.52%     
- Complexity    33963    34267     +304     
============================================
  Files          1954     1954              
  Lines        154639   154757     +118     
  Branches      17698    17709      +11     
============================================
+ Hits         111732   112623     +891     
+ Misses        33884    33096     -788     
- Partials       9023     9038      +15     
Flag Coverage Δ
inttests 25.76% <70.83%> (-0.04%) ⬇️
systests 22.57% <0.00%> (+0.01%) ⬆️
unittests 73.75% <91.66%> (+0.59%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...main/java/org/apache/pulsar/utils/SimpleCache.java 100.00% <100.00%> (ø)
...lsar/broker/transaction/buffer/impl/TableView.java 93.18% <89.47%> (+8.33%) ⬆️

... and 144 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@Denovo1998 Denovo1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add targeted tests for the new future lifecycle here?

The risks in this change lie not in the expected path, but in the ownership and expiration transitions:

  • the cache entry is removed or expired before the future completes
  • the future completes successfully after expiration and should close the reader exactly once
  • the future completes exceptionally and should not try to close anything

Denovo1998

This comment was marked as off-topic.

@poorbarcode poorbarcode requested a review from Denovo1998 March 9, 2026 03:47
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the comment about using ConcurrentHasMap and moving locking to ExpirableValue class to avoid race conditions with expiration.

@lhotari lhotari requested review from Copilot and removed request for Denovo1998 March 9, 2026 09:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces the lock contention in SimpleCache by moving the blocking reader creation out of the synchronized block. Previously, the get() method held the global lock while synchronously creating and waiting for a reader, which blocked all other threads. Now, the cache stores CompletableFuture<Reader<T>> instead of Reader<T>, and the lock is only held to insert/retrieve the future — the actual wait() call happens outside the lock.

Changes:

  • SimpleCache gains a new getWithCacheInfo() method that exposes the ExpirableValue wrapper, allowing callers to update the deadline after async operations complete outside the lock.
  • TableView now caches CompletableFuture<Reader<T>> and waits on the future outside the synchronized block, with updated expiration logic to handle incomplete futures.
  • A new test TableViewTest.testFailedCreateReader validates behavior when reader creation fails.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
SimpleCache.java Adds getWithCacheInfo() method and makes ExpirableValue and its members public.
TableView.java Changes cache type to CompletableFuture<Reader<T>>, moves blocking wait outside the lock, adds expiration handling for futures, and extracts closeReader() helper.
TableViewTest.java New test for failed reader creation scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@poorbarcode poorbarcode force-pushed the improve/txn_simple_cache branch from 7ef90da to 9b800b2 Compare March 16, 2026 13:34
@poorbarcode poorbarcode requested review from Denovo1998, lhotari and shibd and removed request for Denovo1998 and lhotari March 17, 2026 10:13
Comment on lines +65 to +67
this.readers = new SimpleCache<>(executor,
Math.max(clientOperationTimeoutMs + 30 * 1000, CACHE_EXPIRE_TIMEOUT_MS),
CACHE_EXPIRE_CHECK_FREQUENCY_MS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it risky to close the reader immediately after the cache expires? If the recovery task takes longer than the expiration period, it will consistently fail with a ReaderAlreadyClosedException.

try {
return wait(cachedReaderFuture.value, "create reader");
} catch (Exception e) {
readers.remove(ns);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to compare the value here to avoid removing a new, valid entry for the same namespace

Comment on lines 66 to 68
if (value.tryExpire()) {
keys.add(key);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to catch exception for this block, to avoid any exception happened from cancel() method caused expiration processing stops silently


private void closeReader(Reader<T> reader) {
reader.closeAsync().exceptionally(ex -> {
log.warn("Failed to close reader {}", ex.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.warn("Failed to close reader {}", ex.getMessage());
log.warn("Failed to close reader ", ex);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants