HIVE-29536: Stabilize rebalance compaction tests#6487
HIVE-29536: Stabilize rebalance compaction tests#6487InvisibleProgrammer wants to merge 4 commits into
Conversation
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for working on the fix! I've added some suggestions and requests for improving it.
| .reduce(0, Integer::sum); | ||
|
|
||
| int optimalRecordsInBucket = allRecordCount / bucketCount; | ||
| int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1; |
There was a problem hiding this comment.
bba153e to
9f5cedf
Compare
|
zabetak
left a comment
There was a problem hiding this comment.
Many thanks for the PR @InvisibleProgrammer ! Left a bunch of minor comments that we don't necessarily need to address.
However, I would really like to understand which tests should verify data and which tests should verify buckets and how we should choose one or the other.
No need to apply code changes right now just reply to the comments and we can decide how to advance based on the answers.
| // Check if the compaction succeed | ||
| verifyCompaction(1, TxnStore.CLEANING_RESPONSE); | ||
|
|
||
| String[][] expectedBuckets = new String[][] { |
There was a problem hiding this comment.
In some rebalance tests like this one we use explicit buckets (exptectedBuckets) along with verifyRebalance and in others we use just data (expectedData) together with verifyDataAfterCompaction. How do we determine if a test should use one or the other?
| conf.setBoolVar(HiveConf.ConfVars.HIVE_COMPACTOR_GATHER_STATS, false); | ||
| conf.setBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER, false); | ||
|
|
||
| //set grouping size to have 3 buckets, and re-create driver with the new config |
There was a problem hiding this comment.
Is the comment still relevant? Are we going to have 3 buckets at the end of the compaction?
| /* | ||
| Validate the data after the test case | ||
| - the table is balanced (or if not, only numberOfDeletedRows amount of rows are missing | ||
| - there is only one writeId | ||
| - buckets has unique bucketId and the bucketId doesn't change inside a bucket | ||
| - data is sorted by column b (so the order of column a is not predictable) | ||
| - all the required value present | ||
| */ |
There was a problem hiding this comment.
This could be a Javadoc comment since it seems to be more than just an implementation detail.
| fs.globStatus(searchPath, AcidUtils.baseFileFilter)) | ||
| .map(FileStatus::getPath) | ||
| .map(Path::getName) | ||
| .sorted() |
There was a problem hiding this comment.
Can we have more than one base? If yes is it a valid scenario?
| int optimalRecordsInBucket = allRecordCount / bucketCount; | ||
| int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount; | ||
|
|
||
| for (int i = 0; i < bucketCount; i++) { | ||
| if (bucketData[i].size() > maximumRecordCountInABucket || bucketData[i].size() < optimalRecordsInBucket) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
nit: As far as I understand, optimalRecordsInBucket is a lowerBound and maximumRecordCountInAAbucket is an upperBound for the bucket size. Using the lower/upperBound naming could make the code a bit more easier to follow.
| record RowData(String colA, Long colB) {} | ||
|
|
||
| record RowInfo(long writeId, long bucketId, long rowId, RowData rowData) { | ||
| private static final ObjectMapper MAPPER = new ObjectMapper(); | ||
|
|
||
| static RowInfo fromRawString(String row) throws JsonProcessingException { | ||
| // Example row data to parse: "{\"writeid\":7,\"bucketid\":537001984,\"rowid\":10}\t5\t4", | ||
|
|
||
| String[] parts = row.split("\t", 3); | ||
|
|
||
| JsonNode json = MAPPER.readTree(parts[0]); | ||
|
|
||
| return new RowInfo( | ||
| json.get("writeid").asLong(), | ||
| json.get("bucketid").asLong(), | ||
| json.get("rowid").asLong(), | ||
| new RowData( | ||
| parts[1], // colA | ||
| Long.parseLong(parts[2]) // colB | ||
| ) | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
This record classes could potentially be used by other compaction tests but putting them here makes them bit harder to find. Possibly a better fit would be TestDataProvider associated with APIs that return RowInfo objects instead of strings. Anyways just an idea, I am fine to leave them here as well.
| expectedData.addAll(List.of( | ||
| new RowData("6", 4L), | ||
| new RowData("3", 4L), | ||
| new RowData("4", 4L), | ||
| new RowData("2", 4L), | ||
| new RowData("5", 4L) | ||
| )); |
There was a problem hiding this comment.
It's a bit strange that for some data we use directly Set#add and for other we pass through Set#addAll and List.
| AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf); | ||
|
|
||
| /* | ||
| Validate the data after the test case |
There was a problem hiding this comment.
The rowId should be checked as well. It has to be increasing within a file, otherwise the delete operation won't work.
| verifyCompaction(1, TxnStore.CLEANING_RESPONSE); | ||
|
|
||
| // Populate expected data | ||
| Set<RowData> expectedData = new HashSet<>(); |
There was a problem hiding this comment.
Why do you hard-code the expected values? Why not just run a select before and after the compaction and compare the results?
| "{\"writeid\":7,\"bucketid\":537067520,\"rowid\":17}\t17\t17", | ||
| }, | ||
| }; | ||
| verifyRebalance(testDataProvider, tableName, null, expectedBuckets, |
There was a problem hiding this comment.
I thought that the idea of this fix is to have one universal way of validating the result of the rebalance compaction and get rid of the hard-coded data. Why did you keep this? Now we have some tests which using the new way of validation and some tests which using the old way of validation. I don't really like it. We should use one approach to validate the data and use it in all rebalance tests.
|
|
||
| @Test | ||
| public void testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder() throws Exception { | ||
| conf.setBoolVar(HiveConf.ConfVars.COMPACTOR_CRUD_QUERY_BASED, true); |
There was a problem hiding this comment.
Would it make sense to extract these config settings into one place?
| "{\"writeid\":1,\"bucketid\":537001984,\"rowid\":3}\t1\t4\ttomorrow", | ||
| }, | ||
| }; | ||
| for(int i = 0; i < 3; i++) { |
There was a problem hiding this comment.
I am just wondering why we need this data validation before the compaction? Do you know anything about the reason? Does it matter how the rows look like before the compaction or the intention here is rather to check if the data is imbalanced?



Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:
...
There are two causes identified:
What changes were proposed in this pull request?
The goal of the change is to stabilize those tests by doing two things:
Please note: I also refactored the code little bit and extracted rebalance compaction tests into a new class.
Why are the changes needed?
We experienced regular and serious regression issues due to the effect of the orc version number.
Does this PR introduce any user-facing change?
No
How was this patch tested?
With the existing tests.