Add a memory bound FileStatisticsCache for the Listing Table #20047

mkleen wants to merge 85 commits into apache:main
Conversation
@kosiew Thank you for the feedback!

@kosiew Anything else needed to get this merged? Another approval, maybe?
```rust
impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores the weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
```
This won't be accurate:

```rust
let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();
// this should be true because all `a`s point to the same object in memory
// but the current implementation does not detect this and counts them separately
assert_eq!(a4.heap_size(), a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size());
```

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been "sized" and to ignore any `Arc` that points to an address which has been "sized" earlier.
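One way to realize the pointer-tracking idea described above is to thread a set of already-counted allocation addresses through the size computation. The sketch below is purely illustrative: the `TrackedHeapSize` trait and its impls are hypothetical names, not DataFusion's actual `DFHeapSize` API.

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical heap-size trait; `seen` records the addresses of Arc
/// allocations that were already counted, so shared clones count once.
trait TrackedHeapSize {
    fn heap_size(&self, seen: &mut HashSet<usize>) -> usize;
}

impl TrackedHeapSize for Vec<i32> {
    fn heap_size(&self, _seen: &mut HashSet<usize>) -> usize {
        // Heap payload of the vector itself (its capacity in elements).
        self.capacity() * size_of::<i32>()
    }
}

impl<T: TrackedHeapSize> TrackedHeapSize for Arc<T> {
    fn heap_size(&self, seen: &mut HashSet<usize>) -> usize {
        // Use the allocation address as identity: each Arc payload is
        // counted at most once, no matter how many clones exist.
        let addr = Arc::as_ptr(self) as usize;
        if !seen.insert(addr) {
            return 0; // already counted via another clone
        }
        // strong + weak counts live on the heap next to the T instance
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size(seen)
    }
}

fn main() {
    let a1 = Arc::new(vec![1, 2, 3]);
    let a2 = a1.clone();
    let mut seen = HashSet::new();
    let total = a1.heap_size(&mut seen) + a2.heap_size(&mut seen);
    // The shared allocation is counted exactly once.
    assert_eq!(
        total,
        2 * size_of::<usize>() + size_of::<Vec<i32>>() + 3 * size_of::<i32>()
    );
    println!("total heap size: {total}");
}
```

`Arc::as_ptr` is used here only as a cheap identity for the shared allocation; the same approach extends to any other shared-ownership type in the statistics structures.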
Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102 . I would suggest a follow-up here as well. We are planning to restructure the whole heap-size estimation anyway.
@martin-g Thanks for this great review! I am on it.
@nuno-faria and @kosiew -- shall we merge this PR? Or is it waiting on anything else?

@alamb No, there is a blocker. While merging in the main branch I noticed one failing test. This is related to commit d09a919. If I revert this commit, everything passes. How should I proceed here?

I recommend merging up from main and hopefully it passes; I'll click the button and see what happens.

We'll probably have to figure out what is causing that failure, and whether we should update the expected output or whether there is a bug to fix.

The query in the test is executed twice, once for the output and once for the plan (EXPLAIN), and now the caching kicks in, which wasn't the case before. I will look into the details.

I would say it's a bug that wasn't exposed before.

The filter pushdown no longer works as expected when the stats are cached. Commit d09a919 changes the stats handling for this case. I will look into it.
Which issue does this PR close?

This change introduces a default `FileStatisticsCache` implementation for the Listing Table with a size limit, implementing the following steps following #19052 (comment):

- Add heap size estimation for file statistics and the relevant data types used in caching (this is temporary until "Add a crate for HeapSize trait" arrow-rs#9138 is resolved)
- Redesign `DefaultFileStatisticsCache` to use a `LruQueue` to make it memory-bound, following "Adds memory-bound DefaultListFilesCache" #18855
- Introduce a size limit and use it together with the heap size to limit the memory usage of the cache
- Move `FileStatisticsCache` creation into `CacheManager`, making it session-scoped and shared across statements and tables

Closes "Add a default `FileStatisticsCache` implementation for the `ListingTable`" #19217
Closes "Add limit to `DefaultFileStatisticsCache`" #19052

Rationale for this change
See above.
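For illustration, the memory-bound LRU design described in the steps above can be sketched roughly as follows. All names here (`MemoryBoundLru`, the `usize` entry sizes standing in for estimated heap sizes) are hypothetical and do not reflect DataFusion's actual `DefaultFileStatisticsCache` or `LruQueue` API.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal sketch of a memory-bound LRU cache keyed by file path.
/// The value stored is just the entry's estimated size in bytes.
struct MemoryBoundLru {
    limit: usize,                    // max total estimated bytes
    used: usize,                     // current total estimated bytes
    entries: HashMap<String, usize>, // path -> estimated entry size
    order: VecDeque<String>,         // front = least recently used
}

impl MemoryBoundLru {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    fn put(&mut self, key: String, size: usize) {
        // Replace an existing entry, reclaiming its budget first.
        if let Some(old) = self.entries.remove(&key) {
            self.used -= old;
            self.order.retain(|k| k != &key);
        }
        // Evict least recently used entries until the new entry fits.
        while self.used + size > self.limit {
            let Some(lru) = self.order.pop_front() else { break };
            if let Some(sz) = self.entries.remove(&lru) {
                self.used -= sz;
            }
        }
        if size <= self.limit {
            self.used += size;
            self.entries.insert(key.clone(), size);
            self.order.push_back(key);
        }
    }

    fn get(&mut self, key: &str) -> Option<usize> {
        if self.entries.contains_key(key) {
            // Touch: move to the back, marking it most recently used.
            self.order.retain(|k| k != key);
            self.order.push_back(key.to_string());
        }
        self.entries.get(key).copied()
    }
}

fn main() {
    let mut cache = MemoryBoundLru::new(100);
    cache.put("a.parquet".into(), 40);
    cache.put("b.parquet".into(), 40);
    cache.get("a.parquet"); // touch "a"; "b" is now the LRU entry
    cache.put("c.parquet".into(), 40); // exceeds the limit, evicts "b"
    assert!(cache.get("b.parquet").is_none());
    assert!(cache.get("a.parquet").is_some());
    println!("used: {} bytes", cache.used);
}
```

The key point is that eviction is driven by the summed heap-size estimates against a byte limit rather than by an entry count, which is what the `cache_limit` setting bounds.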
What changes are included in this PR?
See above.
Are these changes tested?
Yes.
Are there any user-facing changes?
A new runtime setting: `datafusion.runtime.file_statistics.cache_limit`.