HIVE-29536: Stabilize rebalance compaction tests by InvisibleProgrammer · Pull Request #6487 · apache/hive

InvisibleProgrammer · 2026-05-15T07:34:17Z

Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:

There are two causes identified:

Firstly, the number of buckets and even the order of the elements inside a bucket depend on the version string of ORC: https://issues.apache.org/jira/browse/HIVE-29536?focusedCommentId=18080335&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-18080335 (Thanks @thomasrebele for digging into it)
Secondly, the base directory can change as well (like here: 1a90d27#diff-dedd154465fd42855d9d6710d54553660dae87405ce2e4ea931475de1d5bb816L199)

What changes were proposed in this pull request?

The goal of the change is to stabilize those tests by doing two things:

Rebalance assertions are no longer hard-coded. Instead, the tests check whether the buckets are balanced and whether all the data is present.
The base folder is located dynamically.
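
To illustrate the "all the data is available" part, here is a minimal sketch of a completeness check that does not depend on which bucket a row lands in or on the order inside a bucket. The class name `CompletenessCheck` and the method `containsExactly` are hypothetical and are not taken from the patch; rows are modelled as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompletenessCheck {

  // Hypothetical helper: returns true when the buckets together hold exactly
  // the expected rows (duplicates included), regardless of bucket assignment
  // or ordering inside a bucket.
  static boolean containsExactly(List<List<String>> buckets, List<String> expectedRows) {
    List<String> actual = new ArrayList<>();
    for (List<String> bucket : buckets) {
      actual.addAll(bucket);
    }
    List<String> expected = new ArrayList<>(expectedRows);
    // Sorting both sides turns the comparison into a multiset comparison.
    Collections.sort(actual);
    Collections.sort(expected);
    return actual.equals(expected);
  }
}
```

Sorting copies of both lists keeps duplicates significant, which a plain `Set` comparison would hide.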

Please note: I also refactored the code a little bit and extracted the rebalance compaction tests into a new class.

Why are the changes needed?

We experienced regular and serious regressions in these tests caused by changes in the ORC version number.

Does this PR introduce any user-facing change?

No

How was this patch tested?

With the existing tests.

thomasrebele

Thank you for working on the fix! I've added some suggestions and requests for improving it.

thomasrebele · 2026-05-15T11:58:56Z

+        .reduce(0, Integer::sum);
+
+    int optimalRecordsInBucket = allRecordCount / bucketCount;
+    int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;


See comment https://github.com/apache/hive/pull/6487/changes#r3248007538
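
For context, a small worked example of how the upper bound in the snippet above behaves, compared with a ceiling-division bound. This is only an arithmetic sketch; the class and method names are hypothetical, and it makes no claim about what the linked comment actually requested.

```java
public class UpperBoundComparison {

  // Bound as written in the snippet above: floor(n / k) + k - 1.
  static int looseUpperBound(int allRecordCount, int bucketCount) {
    return allRecordCount / bucketCount + bucketCount - 1;
  }

  // Ceiling division: the smallest bucket size that still lets k buckets hold n records.
  static int ceilingUpperBound(int allRecordCount, int bucketCount) {
    return (allRecordCount + bucketCount - 1) / bucketCount;
  }

  public static void main(String[] args) {
    // 11 records in 3 buckets: 11 / 3 = 3, so the first bound is 3 + 3 - 1 = 5.
    System.out.println(looseUpperBound(11, 3));
    // Ceiling division gives (11 + 3 - 1) / 3 = 13 / 3 = 4.
    System.out.println(ceilingUpperBound(11, 3));
  }
}
```

With 11 records and 3 buckets, a layout of 5, 3 and 3 rows satisfies the first bound but not the ceiling bound, whereas 4, 4 and 3 satisfies both.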

sonarqubecloud · 2026-05-26T19:39:06Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

zabetak

Many thanks for the PR @InvisibleProgrammer ! Left a bunch of minor comments that we don't necessarily need to address.

However, I would really like to understand which tests should verify data and which tests should verify buckets and how we should choose one or the other.

No need to apply code changes right now; just reply to the comments and we can decide how to advance based on the answers.

zabetak · 2026-06-02T09:32:50Z

+    // Check if the compaction succeed
+    verifyCompaction(1, TxnStore.CLEANING_RESPONSE);
+
+    String[][] expectedBuckets = new String[][] {


In some rebalance tests like this one we use explicit buckets (expectedBuckets) along with verifyRebalance, and in others we use just data (expectedData) together with verifyDataAfterCompaction. How do we determine if a test should use one or the other?

zabetak · 2026-06-02T09:33:50Z

+    conf.setBoolVar(HiveConf.ConfVars.HIVE_COMPACTOR_GATHER_STATS, false);
+    conf.setBoolVar(HiveConf.ConfVars.HIVE_STATS_AUTOGATHER, false);
+
+    //set grouping size to have 3 buckets, and re-create driver with the new config


Is the comment still relevant? Are we going to have 3 buckets at the end of the compaction?

zabetak · 2026-06-02T09:35:54Z

+    /*
+      Validate the data after the test case
+        - the table is balanced (or if not, only numberOfDeletedRows amount of rows are missing
+        - there is only one writeId
+        - buckets has unique bucketId and the bucketId doesn't change inside a bucket
+        - data is sorted by column b (so the order of column a is not predictable)
+        - all the required value present
+     */


This could be a Javadoc comment since it seems to be more than just an implementation detail.

zabetak · 2026-06-02T09:39:05Z

+        fs.globStatus(searchPath, AcidUtils.baseFileFilter))
+        .map(FileStatus::getPath)
+        .map(Path::getName)
+        .sorted()


Can we have more than one base? If yes, is it a valid scenario?
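
If more than one base directory can show up in the listing, a plain lexicographic sort only works as long as the write IDs are zero-padded to the same width. A minimal sketch that parses the write ID instead is shown below. It works on directory names only, the class `LatestBase` is hypothetical, and the name format (`base_<writeId>` with an optional `_v<id>` suffix) is an assumption rather than something confirmed in this thread.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LatestBase {

  // Assumed name format: "base_0000007" or "base_0000007_v0000020".
  static long writeIdOf(String baseDirName) {
    String rest = baseDirName.substring("base_".length());
    int suffixStart = rest.indexOf('_');
    return Long.parseLong(suffixStart < 0 ? rest : rest.substring(0, suffixStart));
  }

  // Returns the name with the highest write ID, or empty when no base exists.
  static Optional<String> latestBase(List<String> baseDirNames) {
    return baseDirNames.stream().max(Comparator.comparingLong(LatestBase::writeIdOf));
  }
}
```

Returning an `Optional` leaves it to the test to decide whether a missing base, or more than one base, should be treated as a failure.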

zabetak · 2026-06-02T09:48:13Z

+    int optimalRecordsInBucket = allRecordCount / bucketCount;
+    int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount;
+
+    for (int i = 0; i < bucketCount; i++) {
+      if (bucketData[i].size() > maximumRecordCountInABucket || bucketData[i].size() < optimalRecordsInBucket) {
+        return false;
+      }
+    }


nit: As far as I understand, optimalRecordsInBucket is a lower bound and maximumRecordCountInABucket is an upper bound for the bucket size. Using lowerBound/upperBound naming could make the code a bit easier to follow.
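
A sketch of how the check might read with that naming, assuming only the bucket sizes are needed. The class `BalanceCheck` and the `List<Integer>` signature are hypothetical simplifications of the code in the diff, not the patch itself.

```java
import java.util.List;

public class BalanceCheck {

  // Returns true when every bucket size lies between floor(n / k) and ceil(n / k).
  static boolean isBalanced(List<Integer> bucketSizes) {
    int bucketCount = bucketSizes.size();
    if (bucketCount == 0) {
      return true; // nothing to balance
    }
    int allRecordCount = bucketSizes.stream().mapToInt(Integer::intValue).sum();
    int lowerBound = allRecordCount / bucketCount;
    int upperBound = (allRecordCount + bucketCount - 1) / bucketCount;

    for (int size : bucketSizes) {
      if (size < lowerBound || size > upperBound) {
        return false;
      }
    }
    return true;
  }
}
```

For example, sizes 4, 4 and 3 pass (bounds 3 and 4), while 5, 3 and 3 fail because 5 exceeds the upper bound.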

zabetak · 2026-06-02T09:53:13Z

+  record RowData(String colA, Long colB) {}
+
+  record RowInfo(long writeId, long bucketId, long rowId, RowData rowData) {
+    private static final ObjectMapper MAPPER = new ObjectMapper();
+
+    static RowInfo fromRawString(String row) throws JsonProcessingException {
+      // Example row data to parse: "{\"writeid\":7,\"bucketid\":537001984,\"rowid\":10}\t5\t4",
+
+      String[] parts = row.split("\t", 3);
+
+      JsonNode json = MAPPER.readTree(parts[0]);
+
+      return new RowInfo(
+          json.get("writeid").asLong(),
+          json.get("bucketid").asLong(),
+          json.get("rowid").asLong(),
+          new RowData(
+              parts[1], // colA
+              Long.parseLong(parts[2])  // colB
+          )
+      );
+    }
+  }


These record classes could potentially be used by other compaction tests, but putting them here makes them a bit harder to find. Possibly a better fit would be TestDataProvider, associated with APIs that return RowInfo objects instead of strings. Anyway, just an idea; I am fine with leaving them here as well.

zabetak · 2026-06-02T09:54:40Z

+    expectedData.addAll(List.of(
+        new RowData("6", 4L),
+        new RowData("3", 4L),
+        new RowData("4", 4L),
+        new RowData("2", 4L),
+        new RowData("5", 4L)
+    ));


It's a bit strange that for some data we use Set#add directly and for others we go through Set#addAll and List.

kuczoram · 2026-06-01T12:35:59Z

+    AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf);
+
+    /*
+      Validate the data after the test case


The rowId should be checked as well. It has to be increasing within a file, otherwise the delete operation won't work.
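
A minimal sketch of such a check, assuming the rowIds of one file are available as a list in read order. The class `RowIdOrderCheck` is hypothetical, and treating "increasing" as strictly increasing (no duplicates) is an assumption.

```java
import java.util.List;

public class RowIdOrderCheck {

  // Returns true when each rowId is larger than the one before it.
  static boolean isStrictlyIncreasing(List<Long> rowIds) {
    for (int i = 1; i < rowIds.size(); i++) {
      if (rowIds.get(i) <= rowIds.get(i - 1)) {
        return false;
      }
    }
    return true;
  }
}
```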

kuczoram · 2026-06-01T12:37:11Z

+    verifyCompaction(1, TxnStore.CLEANING_RESPONSE);
+
+    // Populate expected data
+    Set<RowData> expectedData = new HashSet<>();


Why do you hard-code the expected values? Why not just run a select before and after the compaction and compare the results?

kuczoram · 2026-06-01T12:39:50Z

+            "{\"writeid\":7,\"bucketid\":537067520,\"rowid\":17}\t17\t17",
+        },
+    };
+    verifyRebalance(testDataProvider, tableName, null, expectedBuckets,


I thought the idea of this fix was to have one universal way of validating the result of the rebalance compaction and to get rid of the hard-coded data. Why did you keep this? Now we have some tests that use the new way of validation and some that use the old way. I don't really like it. We should use one approach to validate the data and use it in all rebalance tests.

kuczoram · 2026-06-01T12:41:30Z

+
+  @Test
+  public void testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder() throws Exception {
+    conf.setBoolVar(HiveConf.ConfVars.COMPACTOR_CRUD_QUERY_BASED, true);


Would it make sense to extract these config settings into one place?

kuczoram · 2026-06-01T12:47:15Z

+            "{\"writeid\":1,\"bucketid\":537001984,\"rowid\":3}\t1\t4\ttomorrow",
+        },
+    };
+    for(int i = 0; i < 3; i++) {


I am just wondering why we need this data validation before the compaction. Do you know the reason? Does it matter what the rows look like before the compaction, or is the intention here rather to check that the data is imbalanced?

asf-ci-hive added tests pending tests passed and removed tests pending labels May 15, 2026

thomasrebele suggested changes May 15, 2026


asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels May 20, 2026

thomasrebele suggested changes May 21, 2026


Comment thread .../hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestRebalanceCompactor.java

Comment thread .../hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestRebalanceCompactor.java Outdated

asf-ci-hive added tests passed tests pending and removed tests pending tests passed labels May 21, 2026

zsmiskolczi added 4 commits May 26, 2026 15:33

HIVE-29536: Stabilize rebalance compaction tests

4d244b8

Address review comments

70b046c

Address SonarQube issues

bf000be

Address review comments

9f5cedf

InvisibleProgrammer force-pushed the fix_rebalance_tests_flakyness branch from bba153e to 9f5cedf Compare May 26, 2026 13:33

asf-ci-hive added tests failed and removed tests pending labels May 26, 2026

thomasrebele approved these changes May 26, 2026


asf-ci-hive added tests pending and removed tests failed labels May 26, 2026

asf-ci-hive added tests unstable and removed tests pending labels May 26, 2026

zabetak reviewed Jun 2, 2026


kuczoram reviewed Jun 3, 2026



6 participants