Skip to content

HIVE-29536: Stabilize rebalance compaction tests#6487

Open
InvisibleProgrammer wants to merge 4 commits into
apache:masterfrom
InvisibleProgrammer:fix_rebalance_tests_flakyness
Open

HIVE-29536: Stabilize rebalance compaction tests#6487
InvisibleProgrammer wants to merge 4 commits into
apache:masterfrom
InvisibleProgrammer:fix_rebalance_tests_flakyness

Conversation

@InvisibleProgrammer
Copy link
Copy Markdown
Contributor

Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:

There are two causes identified:

What changes were proposed in this pull request?

The goal of the change is to stabilize those tests by doing two things:

  • Rebalance assertions are not hard-coded. Instead of that, we can check if the buckets are balanced or not and if all the data is available.
  • Base folder can be searched dinamically

Please note: I also refactored the code little bit and extracted rebalance compaction tests into a new class.

Why are the changes needed?

We experienced regular and serious regression issues due to the effect of the orc version number.

Does this PR introduce any user-facing change?

No

How was this patch tested?

With the existing tests.

Copy link
Copy Markdown
Contributor

@thomasrebele thomasrebele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on the fix! I've added some suggestions and requests for improving it.

.reduce(0, Integer::sum);

int optimalRecordsInBucket = allRecordCount / bucketCount;
int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for the PR @InvisibleProgrammer ! Left a bunch of minor comments that we don't necessarily need to address.

However, I would really like to understand which tests should verify data and which tests should verify buckets and how we should choose one or the other.

No need to apply code changes right now just reply to the comments and we can decide how to advance based on the answers.

// Check if the compaction succeed
verifyCompaction(1, TxnStore.CLEANING_RESPONSE);

String[][] expectedBuckets = new String[][] {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some rebalance tests like this one we use explicit buckets (exptectedBuckets) along with verifyRebalance and in others we use just data (expectedData) together with verifyDataAfterCompaction. How do we determine if a test should use one or the other?

conf.setBoolVar(HiveConf.ConfVars.HIVE_COMPACTOR_GATHER_STATS, false);
conf.setBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER, false);

//set grouping size to have 3 buckets, and re-create driver with the new config
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment still relevant? Are we going to have 3 buckets at the end of the compaction?

Comment on lines +516 to +523
/*
Validate the data after the test case
- the table is balanced (or if not, only numberOfDeletedRows amount of rows are missing
- there is only one writeId
- buckets has unique bucketId and the bucketId doesn't change inside a bucket
- data is sorted by column b (so the order of column a is not predictable)
- all the required value present
*/
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a Javadoc comment since it seems to be more than just an implementation detail.

fs.globStatus(searchPath, AcidUtils.baseFileFilter))
.map(FileStatus::getPath)
.map(Path::getName)
.sorted()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have more than one base? If yes is it a valid scenario?

Comment on lines +633 to +640
int optimalRecordsInBucket = allRecordCount / bucketCount;
int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount;

for (int i = 0; i < bucketCount; i++) {
if (bucketData[i].size() > maximumRecordCountInABucket || bucketData[i].size() < optimalRecordsInBucket) {
return false;
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: As far as I understand, optimalRecordsInBucket is a lowerBound and maximumRecordCountInAAbucket is an upperBound for the bucket size. Using the lower/upperBound naming could make the code a bit more easier to follow.

Comment on lines +479 to +501
record RowData(String colA, Long colB) {}

record RowInfo(long writeId, long bucketId, long rowId, RowData rowData) {
private static final ObjectMapper MAPPER = new ObjectMapper();

static RowInfo fromRawString(String row) throws JsonProcessingException {
// Example row data to parse: "{\"writeid\":7,\"bucketid\":537001984,\"rowid\":10}\t5\t4",

String[] parts = row.split("\t", 3);

JsonNode json = MAPPER.readTree(parts[0]);

return new RowInfo(
json.get("writeid").asLong(),
json.get("bucketid").asLong(),
json.get("rowid").asLong(),
new RowData(
parts[1], // colA
Long.parseLong(parts[2]) // colB
)
);
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This record classes could potentially be used by other compaction tests but putting them here makes them bit harder to find. Possibly a better fit would be TestDataProvider associated with APIs that return RowInfo objects instead of strings. Anyways just an idea, I am fine to leave them here as well.

Comment on lines +108 to +114
expectedData.addAll(List.of(
new RowData("6", 4L),
new RowData("3", 4L),
new RowData("4", 4L),
new RowData("2", 4L),
new RowData("5", 4L)
));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit strange that for some data we use directly Set#add and for other we pass through Set#addAll and List.

AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf);

/*
Validate the data after the test case
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rowId should be checked as well. It has to be increasing within a file, otherwise the delete operation won't work.

verifyCompaction(1, TxnStore.CLEANING_RESPONSE);

// Populate expected data
Set<RowData> expectedData = new HashSet<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you hard-code the expected values? Why not just run a select before and after the compaction and compare the results?

"{\"writeid\":7,\"bucketid\":537067520,\"rowid\":17}\t17\t17",
},
};
verifyRebalance(testDataProvider, tableName, null, expectedBuckets,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that the idea of this fix is to have one universal way of validating the result of the rebalance compaction and get rid of the hard-coded data. Why did you keep this? Now we have some tests which using the new way of validation and some tests which using the old way of validation. I don't really like it. We should use one approach to validate the data and use it in all rebalance tests.


@Test
public void testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder() throws Exception {
conf.setBoolVar(HiveConf.ConfVars.COMPACTOR_CRUD_QUERY_BASED, true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to extract these config settings into one place?

"{\"writeid\":1,\"bucketid\":537001984,\"rowid\":3}\t1\t4\ttomorrow",
},
};
for(int i = 0; i < 3; i++) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering why we need this data validation before the compaction? Do you know anything about the reason? Does it matter how the rows look like before the compaction or the intention here is rather to check if the data is imbalanced?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants