Skip to content

PECOBLR-1121 Tests to validate Arrow patch across JVM versions.#1144

Open
tejassp-db wants to merge 33 commits intoPECOBLR-1121/arrow-patch/stack-0from
PECOBLR-1121/arrow-patch/stack-1
Open

PECOBLR-1121 Tests to validate Arrow patch across JVM versions.#1144
tejassp-db wants to merge 33 commits intoPECOBLR-1121/arrow-patch/stack-0from
PECOBLR-1121/arrow-patch/stack-1

Conversation

@tejassp-db
Copy link
Collaborator

Description

Added tests to validate Arrow patch code paths. Added Maven profiles to validate the behaviour across JVM versions and with/without "--add-opens=java.base/java.nio=ALL-UNNAMED" JVM arguments.

By default, JVM version 11 is assumed. To use other JVM versions, the toolchain needs to be setup to point to the correct Java versions on the local machine in .m2/toolchains.xml.

Testing

Tests to validate Arrow patch behaviour.

Additional Notes to the Reviewer

Stacked PR.

Added tests to validate Arrow patch code paths. Added Maven profiles
to validate the behaviour across JVM versions and with/without
"--add-opens=java.base/java.nio=ALL-UNNAMED" JVM arguments.

By default, JVM version 11 is assumed. To use other JVM versions,
the toolchain needs to be setup to point to the correct Java versions
on the local machine in .m2/toolchains.xml.
@tejassp-db tejassp-db self-assigned this Dec 22, 2025
Test validates the patched allocator properly releases memory by repeatedly
parsing Arrow streams (1000 iterations). Designed to run with constrained
heap (-Xmx100m) to verify no resource leaks or OOM errors occur under GC
pressure.
Fix memory leaks and parse lesser rows to be within the test heap space.
* Test the patched allocator does not put the JVM into GC pressure and cause it to OOM
* (OutOfMemoryError).
*
* TODO - add a mvn profile to run this test with low max heap size -Xmx100m.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be current behavior withot using this mvn profile? Will this pick higher memory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a profile for this in a later PR stack-6.

<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<!-- JaCoCo args + module opens -->
<argLine>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed for JDK 11 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow behaves fine with JDK 11, so I have not added a profile for it. Also JDK 11 is the default profile.

/** Test all Arrow supported data types can be read and written with the patched Arrow classes. */
@Tag("Jvm17PlusAndArrowToNioReflectionDisabled")
@EnabledOnJre({JRE.JAVA_17, JRE.JAVA_21})
public class DatabricksArrowPatchReaderWriterTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we write this class in a modular way for maintainability?

  1. Extract base class for DataTester implementations:
  abstract class BaseDataTester implements DataTester {
      protected void assertNullAtEvenIndices(ValueVector vector, int rowCount,
              java.util.function.Function<Integer, Object> expectedValueFn) {
          for (int i = 0; i < rowCount; i++) {
              if (i % 2 == 0) {
                  assertTrue(vector.isNull(i));
              } else {
                  assertEquals(expectedValueFn.apply(i), vector.getObject(i));
              }
          }
      }
  }
  1. Split into category-specific test files:
  • ArrowNumericTypesTest.java - integers, floats, decimals
  • ArrowTemporalTypesTest.java - timestamps, dates, intervals
  • ArrowBinaryStringTypesTest.java - binary, UTF-8, views
  • ArrowComplexTypesTest.java - list, struct, map, union

Also minor typo date → data appears 20+ times:
byte[] date = writeData(...); // should be "data"

Not blocking, but would improve long-term maintainability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split it into smaller clasess . Also renamed date -> data.

* manager type when the creation of {@code RootAllocator} is not possible in the current JVM.
*
* <p>This test is in a separate class to ensure it runs in a fresh JVM with the system property set
* before Arrow's static initialization.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure this is true? I think maven surefire defaults to reusing the same jvm fork: https://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#reuseForks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set it up in Github actions (PR - #1180) to run it with these args -DforkCount=1 -DreuseForks=false

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not add these to the pom?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find a way to do it for a single test class.

Throwable throwable = null;
try {
RootAllocator allocator = new RootAllocator();
allocator.close(); // Unreachable code.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think similar to the first PR we should try to write here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we evaluate what is the overhead to generate these files when running the test itself? If it is too much overhead, can we add a readme on how these are generated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a class to generate these files. Should we check in the generator to the code base as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should. Even if do not generate on every test run, it would be useful to retain the generator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments