Add a memory bound FileStatisticsCache for the Listing Table #20047

mkleen wants to merge 85 commits into apache:main
Conversation
@kosiew Thank you for the feedback!

@kosiew Anything else needed to get this merged? Another approval, maybe?
```rust
impl<T: DFHeapSize> DFHeapSize for Arc<T> {
    fn heap_size(&self) -> usize {
        // Arc stores the weak and strong counts on the heap alongside an instance of T
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size()
    }
}
```
This won't be accurate:

```rust
let a1 = Arc::new(vec![1, 2, 3]);
let a2 = a1.clone();
let a3 = a1.clone();
let a4 = a3.clone();
// this should be true because all `a`s point to the same object in memory
// but the current implementation does not detect this and counts them separately
assert_eq!(a4.heap_size(), a1.heap_size() + a2.heap_size() + a3.heap_size() + a4.heap_size());
```

The only solution I can imagine is for the caller to keep track of the pointer addresses that have already been "sized" and to ignore any `Arc` that points to an address which has been "sized" earlier.
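One way to realize the pointer-tracking idea described above is to thread a set of already-counted allocation addresses through the size computation. The sketch below is purely illustrative: the `TrackedHeapSize` trait and its impls are hypothetical names, not DataFusion's actual `DFHeapSize` API.

```rust
use std::collections::HashSet;
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical heap-size trait; `seen` records the addresses of Arc
/// allocations that were already counted, so shared clones count once.
trait TrackedHeapSize {
    fn heap_size(&self, seen: &mut HashSet<usize>) -> usize;
}

impl TrackedHeapSize for Vec<i32> {
    fn heap_size(&self, _seen: &mut HashSet<usize>) -> usize {
        // Heap payload of the vector itself (its capacity in elements).
        self.capacity() * size_of::<i32>()
    }
}

impl<T: TrackedHeapSize> TrackedHeapSize for Arc<T> {
    fn heap_size(&self, seen: &mut HashSet<usize>) -> usize {
        // Use the allocation address as identity: each Arc payload is
        // counted at most once, no matter how many clones exist.
        let addr = Arc::as_ptr(self) as usize;
        if !seen.insert(addr) {
            return 0; // already counted via another clone
        }
        // strong + weak counts live on the heap next to the T instance
        2 * size_of::<usize>() + size_of::<T>() + self.as_ref().heap_size(seen)
    }
}

fn main() {
    let a1 = Arc::new(vec![1, 2, 3]);
    let a2 = a1.clone();
    let mut seen = HashSet::new();
    let total = a1.heap_size(&mut seen) + a2.heap_size(&mut seen);
    // The shared allocation is counted exactly once.
    assert_eq!(
        total,
        2 * size_of::<usize>() + size_of::<Vec<i32>>() + 3 * size_of::<i32>()
    );
    println!("total heap size: {total}");
}
```

`Arc::as_ptr` is used here only as a cheap identity for the shared allocation; the same approach extends to any other shared-ownership type in the statistics structures.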
Good catch! I took this implementation from https://github.com/apache/arrow-rs/blob/main/parquet/src/file/metadata/memory.rs#L97-L102 . I would suggest a follow-up here as well. We are planning to restructure the whole heap-size estimation anyway.
@martin-g Thanks for this great review! I am on it.
@nuno-faria and @kosiew -- shall we merge this PR? Or is it waiting on anything else?

@alamb No, there is a blocker. While merging in the main branch I noticed one failing test. This is related to commit d09a919. If I revert this commit, everything passes. How should I proceed here?

I recommend merging up from main and hopefully it passes; I'll click the button and see what happens.

We'll probably have to figure out what is causing that failure, and whether we should update the expected output or whether there is a bug to fix.

The query in the test is executed twice, once for the output and once for the plan (EXPLAIN), and now the caching kicks in, which wasn't the case before. I will look into the details.

I would say it's a bug that wasn't exposed before.

The filter pushdown no longer works as expected when the stats are cached. Commit d09a919 changes the stats handling for this case. I will look into it.
Which issue does this PR close?

This change introduces a default `FileStatisticsCache` implementation for the Listing Table with a size limit, implementing the following steps following #19052 (comment):

- Add heap size estimation for file statistics and the relevant data types used in caching (this is temporary until "Add a crate for HeapSize trait" arrow-rs#9138 is resolved)
- Redesign `DefaultFileStatisticsCache` to use a `LruQueue` to make it memory-bound, following "Adds memory-bound DefaultListFilesCache" #18855
- Introduce a size limit and use it together with the heap size to limit the memory usage of the cache
- Move `FileStatisticsCache` creation into `CacheManager`, making it session-scoped and shared across statements and tables

Closes "Add a default `FileStatisticsCache` implementation for the `ListingTable`" #19217
Closes "Add limit to `DefaultFileStatisticsCache`" #19052

Rationale for this change
See above.
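For illustration, the memory-bound LRU design described in the steps above can be sketched roughly as follows. All names here (`MemoryBoundLru`, the `usize` entry sizes standing in for estimated heap sizes) are hypothetical and do not reflect DataFusion's actual `DefaultFileStatisticsCache` or `LruQueue` API.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal sketch of a memory-bound LRU cache keyed by file path.
/// The value stored is just the entry's estimated size in bytes.
struct MemoryBoundLru {
    limit: usize,                    // max total estimated bytes
    used: usize,                     // current total estimated bytes
    entries: HashMap<String, usize>, // path -> estimated entry size
    order: VecDeque<String>,         // front = least recently used
}

impl MemoryBoundLru {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    fn put(&mut self, key: String, size: usize) {
        // Replace an existing entry, reclaiming its budget first.
        if let Some(old) = self.entries.remove(&key) {
            self.used -= old;
            self.order.retain(|k| k != &key);
        }
        // Evict least recently used entries until the new entry fits.
        while self.used + size > self.limit {
            let Some(lru) = self.order.pop_front() else { break };
            if let Some(sz) = self.entries.remove(&lru) {
                self.used -= sz;
            }
        }
        if size <= self.limit {
            self.used += size;
            self.entries.insert(key.clone(), size);
            self.order.push_back(key);
        }
    }

    fn get(&mut self, key: &str) -> Option<usize> {
        if self.entries.contains_key(key) {
            // Touch: move to the back, marking it most recently used.
            self.order.retain(|k| k != key);
            self.order.push_back(key.to_string());
        }
        self.entries.get(key).copied()
    }
}

fn main() {
    let mut cache = MemoryBoundLru::new(100);
    cache.put("a.parquet".into(), 40);
    cache.put("b.parquet".into(), 40);
    cache.get("a.parquet"); // touch "a"; "b" is now the LRU entry
    cache.put("c.parquet".into(), 40); // exceeds the limit, evicts "b"
    assert!(cache.get("b.parquet").is_none());
    assert!(cache.get("a.parquet").is_some());
    println!("used: {} bytes", cache.used);
}
```

The key point is that eviction is driven by the summed heap-size estimates against a byte limit rather than by an entry count, which is what the `cache_limit` setting bounds.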
What changes are included in this PR?
See above.
Are these changes tested?
Yes.
Are there any user-facing changes?
A new runtime setting: `datafusion.runtime.file_statistics.cache_limit`.