refactor: Eliminate redundant cleanup logic in entry log extraction#4599

Draft
nodece wants to merge 2 commits into apache:master from nodece:eliminate-redundant-from-extractMetaFromEntryLogs

Conversation


@nodece nodece commented Apr 29, 2025

Motivation

The purpose of extractMetaFromEntryLogs is to extract metadata, not perform garbage collection. The logic for removing entry logs with no active ledgers is handled separately in doGcEntryLogs().

Changes

This change removes the redundant cleanup code from extractMetaFromEntryLogs to better separate concerns and avoid duplication.

@StevenLuMT (Member) commented:

rerun failure checks

3 similar comments

nodece commented May 6, 2025

rerun failure checks


nodece commented May 7, 2025

rerun failure checks


nodece commented May 13, 2025

rerun failure checks

```diff
  * Garbage collect those entry loggers which are not associated with any active ledgers.
  */
- private void doGcEntryLogs() throws EntryLogMetadataMapException {
+ protected void doGcEntryLogs() throws EntryLogMetadataMapException {
```
A Contributor commented on the diff:

`protected` could be removed. The default access modifier is enough.

@zymap zymap left a comment


Here is the reason why it did the cleanup: #1124. You could declare a new method to avoid the duplicated code.
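The suggestion above can be sketched as follows: factor the "remove entry logs with no active ledgers" cleanup into one shared helper that both `extractMetaFromEntryLogs` and `doGcEntryLogs` could call, instead of duplicating it. This is a minimal illustrative sketch; the class and method names here are hypothetical stand-ins, not BookKeeper's actual APIs.

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper: one place for the cleanup logic that is currently
// duplicated between the extraction path and the GC path. All names are
// illustrative, not BookKeeper's real types.
public class CleanupHelperSketch {

    // Removes entries whose active-ledger count is zero and returns how
    // many were removed. Both callers could share this method.
    public static int removeEmptyEntryLogs(Map<Long, Integer> activeLedgerCounts) {
        int removed = 0;
        Iterator<Map.Entry<Long, Integer>> it = activeLedgerCounts.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == 0) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```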

@nodece nodece marked this pull request as draft May 19, 2025 03:24

nodece commented May 19, 2025

@StevenLuMT @zymap

In the current implementation, all metadata is first extracted and stored in a map before performing a full GC pass over the entire dataset. This approach can result in high memory usage and delayed cleanup, especially in environments with large or fragmented metadata stores.

I suggest replacing the current bulk extraction + GC logic with an incremental streaming approach, as follows:

Current approach:

```java
Map<String, Metadata> allMetadata = extractAllMetadata();
gc(allMetadata);
```

Proposed approach:

```java
for (String key : listAllMetadataKeys()) {
    Metadata metadata = extractMetadata(key);
    gc(metadata);
}
```

Benefits:

  • Reduced memory footprint: Avoids holding all metadata in memory at once.
  • More timely cleanup: GC is performed immediately after extraction.
  • Improved scalability: Handles large metadata sets more gracefully.
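The proposed loop can be sketched end to end as below. This is a hedged sketch under simplifying assumptions: the store, `listAllMetadataKeys`, `extractMetadata`, and `gc` steps are simulated with a plain map and are not BookKeeper's actual extraction or GC hooks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the incremental extract-then-GC loop proposed
// above: only one entry log's metadata is examined at a time, and cleanup
// happens immediately after each extraction instead of after a bulk pass.
public class StreamingGcSketch {

    // Simulated metadata store: entry log id -> active ledger count.
    private final Map<Long, Integer> store = new LinkedHashMap<>();

    public StreamingGcSketch(Map<Long, Integer> initial) {
        store.putAll(initial);
    }

    // One streaming pass: list keys, extract each key's metadata, GC it
    // right away. Returns the ids that were garbage-collected.
    public List<Long> runIncrementalGc() {
        List<Long> collected = new ArrayList<>();
        for (Long logId : new ArrayList<>(store.keySet())) { // "listAllMetadataKeys()"
            int activeLedgers = store.get(logId);            // "extractMetadata(key)"
            if (activeLedgers == 0) {                        // "gc(metadata)"
                store.remove(logId);
                collected.add(logId);
            }
        }
        return collected;
    }

    public int remaining() {
        return store.size();
    }
}
```

The key design point is that the full metadata map never exists at once; memory use is bounded by a single entry's metadata rather than the whole store.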


@hangc0276 hangc0276 left a comment


The current logic can immediately clean up entry log files that have no active ledgers.
