[spark] support persist source data to avoid loading data repeatedly #8081
Stefanietry wants to merge 1 commit into
Force-pushed from 4e99876 to cdf7d4c.
        + "outweighs the benefit of pruning untouched files.");

public static final ConfigOption<Boolean> DATA_EVOLUTION_DATA_SOURCE_PERSIST_ENABLED =
        key("data-evolution.data.source.persist.enabled")
Suggestion: rename the key to data-evolution.merge-into.source-persist.
Done, the conf has been modified as suggested.
Force-pushed from a3b196c to abfcebd.
Force-pushed from abfcebd to 95e9f9b.
        + " 'manifest.compression' = 'snappy',\n"
        + " 'row-tracking.enabled' = 'true',\n"
        + " 'data-evolution.enabled' = 'true',\n"
        + " 'data-evolution.data.source.persist.enabled' = 'true',\n"
This still uses the old option name. The PR adds data-evolution.merge-into.source-persist, so this table keeps the new option at its default false and the test never exercises the persist path. Please switch this to the new key.
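To illustrate the point of this comment, here is a minimal sketch in plain Java. It does not use Paimon's option classes: the `sourcePersistEnabled` helper and the map-based lookup are hypothetical stand-ins, and the only assumption taken from the thread is that the new key defaults to false. A table that sets only the old key leaves the feature disabled.

```java
import java.util.HashMap;
import java.util.Map;

public class OptionKeyLookup {
    // Key name after the rename requested in review.
    public static final String NEW_KEY = "data-evolution.merge-into.source-persist";
    // Key name from the first revision; nothing reads it after the rename.
    public static final String OLD_KEY = "data-evolution.data.source.persist.enabled";

    // Hypothetical stand-in for a typed option lookup: a missing key falls back to false.
    public static boolean sourcePersistEnabled(Map<String, String> tableOptions) {
        return Boolean.parseBoolean(tableOptions.getOrDefault(NEW_KEY, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> staleTable = new HashMap<>();
        staleTable.put(OLD_KEY, "true");

        Map<String, String> fixedTable = new HashMap<>();
        fixedTable.put(NEW_KEY, "true");

        System.out.println(sourcePersistEnabled(staleTable)); // prints false
        System.out.println(sourcePersistEnabled(fixedTable)); // prints true
    }
}
```

A test that wants to cover the persist path therefore has to set the new key explicitly, since the old key is silently ignored rather than rejected.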
val sourceTableProjExprs =
  allReadFieldsOnSource.toSeq :+ Alias(TrueLiteral, ROW_FROM_SOURCE)()
val sourceTableProj = Project(sourceTableProjExprs, sourceTable)
val sourceChild = persistSourceDss.map(_.queryExecution.logical).getOrElse(sourceTable)
This only wires the cached source into the matched/update path. For a MERGE that has both matched and not-matched clauses, insertActionInvoke still builds its left-anti join from sourceTable, so the source is scanned again after the update path. Could you pass the persisted source into the insert path too, so the new option avoids repeated source loading for the whole merge action?
The Spark 4.0 implementation also needs the same change.
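The concern about the insert path can be sketched outside Spark. The example below is plain Java, not the PR's implementation: `memoize` is a hypothetical stand-in for persisting the source, and the two filters stand in for the matched (update) join and the not-matched (left-anti) join. It only shows that both paths must read from the same cached handle for the source to be loaded once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class PersistedSourceSketch {

    // Hypothetical stand-in for persisting a source: load once, then reuse the result.
    public static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded;

            @Override
            public T get() {
                if (!loaded) {
                    value = delegate.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    // Returns {source loads, matched rows, not-matched rows} for one simulated merge.
    public static int[] runMerge(boolean persist) {
        AtomicInteger loads = new AtomicInteger();
        Supplier<List<Integer>> rawSource = () -> {
            loads.incrementAndGet(); // each call models one scan of the source
            return Arrays.asList(1, 2, 3, 4);
        };
        Supplier<List<Integer>> source = persist ? memoize(rawSource) : rawSource;
        Set<Integer> targetKeys = new HashSet<>(Arrays.asList(2, 3, 9));

        // Matched (update) path: source keys that exist in the target.
        List<Integer> matched = source.get().stream()
                .filter(targetKeys::contains)
                .collect(Collectors.toList());

        // Not-matched (insert) path: source keys absent from the target.
        List<Integer> notMatched = source.get().stream()
                .filter(k -> !targetKeys.contains(k))
                .collect(Collectors.toList());

        return new int[] {loads.get(), matched.size(), notMatched.size()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runMerge(false))); // prints [2, 2, 2]
        System.out.println(Arrays.toString(runMerge(true)));  // prints [1, 2, 2]
    }
}
```

If the insert path kept reading `rawSource` while only the update path used the memoized handle, the load count would stay at two even with persistence enabled, which is the situation the comment describes.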
Purpose
In UpdateAction mode, persisting the source data avoids recomputing it, both when computing the dataSplits and when performing the join.
Linked issue: #8080
Tests
Add SparkDataEvolutionITCase