-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Wrap immutable plan parts into Arc #19893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b9caf1d to
bb97763
Compare
|
run benchmark reset_plan_states |
|
🤖 Hi @askalt, thanks for the request (#19893 (comment)). |
|
@xudong963 could you please |
|
run benchmark reset_plan_states |
|
🤖 |
|
🤖: Benchmark completed Details
|
70a396a to
3614687
Compare
3614687 to
bb97763
Compare
bb97763 to
f2d829a
Compare
f2d829a to
050e6bb
Compare
|
I understand that we wanna avoid breaking changes and hence are kinda liberal with Hence I would favor if we could avoid these |
Actually, I also think that using a generic here isn't justified, since |
Indeed my concern is the impact on (all) downstream users during the upgrade. I believe the initial PR from @askalt was substantially larger because it required changing a bunch of callsites. If we can avoid the use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @askalt and for the review @crepererum and @martin-g
I think the performance results are quite compelling and this is quite close.
I agree it would be find to avoid the AsRef impls, if you are willing to do so @askalt . I also think it would be fine to leave this as is.
In any event, I think we should add an example to project_schema that shows how to call it with Option<&Vec<usize>> (what is passed by TableProvider::scan)
Before we merge this PR I think we
- should also add a note to the upgrade guide with a note about this change (I can help with this, but I didn't want to push changes to this PR before we have consensus approach)
- Clean up the API for
OptionProjectionRef(move it, and makeOption<ProjectionRef>)
datafusion/common/src/utils/mod.rs
Outdated
| pub fn project_schema( | ||
| schema: &SchemaRef, | ||
| projection: Option<&Vec<usize>>, | ||
| projection: Option<&impl AsRef<[usize]>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @crepererum 's suggestion, I tried out what it would look like and came up with
It does seem to be reasonable
The original signature is Option<&Vec<..>> I think to align with TableProvider::scan (which also shouldn't have a owned Vec, but that I think is a historic accident)
| mode: AggregateMode, | ||
| /// Group by expressions | ||
| group_by: PhysicalGroupBy, | ||
| group_by: Arc<PhysicalGroupBy>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here I think it would be good to note they are Arc to make clone/plan rewriting faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| /// Returns an internal error if existing projection contains index that is | ||
| /// greater than len of the passed `projection`. | ||
| /// | ||
| pub fn apply_projection<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is basically refactored out of NestedLoops join (and maybe elsewhere)
| } | ||
| } | ||
|
|
||
| /// Describes an option immutable reference counted shared projection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why you need something like this (as it changes the signture of TableProvider) but this seems inconsistent with the other representations of projections in DataFusion/Arrow I am familiar with, which are represented with an Option rather than having the Option internally
I recommend:
- Move this into
datafusion/physical-expr/src/projection.rs(not in physical plan) so it is nearProjectionExprs - Move the Option outside (so the signature is
Option<ProjectionRef>rather thanOptionProjectionRef)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend documenting why this structure exists (because it is cheap to clone)
Something like
/// This structure represents projecting a set of columns by index.
/// It uses an `Arc` internally to make it cheap to clone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Move this into datafusion/physical-expr/src/projection.rs (not in physical plan) so it is near ProjectionExprs
Done
I also recommend documenting why this structure exists (because it is cheap to clone)
Done
- Move the Option outside (so the signature is Option rather than OptionProjectionRef)
Initially I done in this way, but in this case, take a look at possible HashJoinExec::try_new signature. It seems we want something like
fn try_new(
...
projection: Option<impl Into<ProjectionRef>>,
...
)To be able to pass Some(vec![1,2,3]) and Some(projection_ref) as well. But Option<impl ...> forces us to explicitly specify None type, like None::<ProjectionRef> which looks annoying for me. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should try it -- #19893 (review)
Specotying the None type does look annoying -- maybe we could make a version of the function that takes no projecton -- like try_new_without_projection 🤔
|
A slightly more verbose trick to avoid the code bloat but still get the pub fn foo(&self, param: String) -> ... {
...
}would become pub fn foo(&self, param: impl Into<String>) -> ... {
self.foo_impl(param.into())
}
fn foo_impl(&self, param: String) -> ... {
// that's just the "old" `foo` method
...
}i.e. you basically control the generic part a bit more manual. That was what once was planned (but no longer is) as an automatic step, see rust-lang/rust#124962 |
050e6bb to
20b3042
Compare
20b3042 to
e6fed67
Compare
|
I decided to simplify |
5768265 to
1e27378
Compare
- Closes apache#19852 Improve performance of query planning and plan state re-set by making node clone cheap. - Store projection as `Option<Arc<[usize]>>` instead of `Option<Vec<usize>>` in `FilterExec`, `HashJoinExec`, `NestedLoopJoinExec`. - Store exprs as `Arc<[ProjectionExpr]>` instead of Vec in `ProjectionExprs`. - Store arced aggregation, filter, group by expressions within `AggregateExec`.
1e27378 to
2c0746c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry this is taking so long, but I a very worried about API churn and it takes me a non trivial amount of time to review the potential implications of this change
I have two major points:
- I really think it is important to not force downstream users to change how they call
project_schemaas it is a very common operation and there is no technical reason we can't keep their code the same (though it is harder in Datafusion). I made a PR to do so (and backout many of the changes currently required in this PR): askalt#5 as suggested by @crepererum - As I mentioned previously
OptionProjectionRefseems overly complicated to me to add to the public API -- and because theOptionis wrapped, it means that the normal functions forOption(like match and .and_then, etc) don't work. I think it would be simpler, and follow the rest of the code more closely if it wereOption<ProjectionRef>and the special projection application code is put into ProjectionRef
Thank you for your work on this @askalt
Ok, I will do it. |
Which issue does this PR close?
Arc#19852Rationale for this change
Improve performance of query planning and plan state re-set by making node clone cheap.
What changes are included in this PR?
Option<Arc<[usize]>>instead ofOption<Vec<usize>>inFilterExec,HashJoinExec,NestedLoopJoinExec.Arc<[ProjectionExpr]>instead of Vec inProjectionExprs.AggregateExec.