A Maven + Java port of the bigdata-test Spark JUnit 5 example
(bigdata-test/example/spark). It runs the same end-to-end
Spark scenario against two dependency lines — Apache Spark/Hadoop and Cloudera
Spark/Hadoop — using the bigdata-test framework published locally as
0.1.1-SNAPSHOT.
SparkBigDataScenario spins up the bigdata-test containers (HDFS, Hive Metastore, Kafka,
LocalStack S3, fake-gcs, Kerberos KDC) and a local Spark session, then verifies:
- an HDFS-backed S3 JCEKS credential store exists;
- an Avro Kafka source can be read (Kerberos/SASL + TLS aware);
- Iceberg tables can be created and queried on S3, a local GCS warehouse, and the HMS catalog (with the Iceberg table also asserted directly against the Hive Metastore);
- Hive external Parquet tables can be written/read on S3 and GCS, asserted against the Hive Metastore.
The concrete SparkBigDataTestExample runs with the cloudera-hms-kerberos configuration, matching
the default test in the original Gradle example.
| Module | Purpose |
|---|---|
| `spark-test-common` | The shared scenario, the SparkBigDataTestExample test class, and the test resources (TOML/Avro/log4j2). Holds the test code exactly once. |
| `spark-apache` | Runs the shared test against the Apache line (Spark 3.5.7 / Hadoop 3.4.2 / Iceberg 1.11.0). |
| `spark-cloudera` | Runs the shared test against the Cloudera line (Spark 3.3.2.3.3.7190.9-1 / Hadoop 3.1.1.7.1.9.14-2 / Iceberg 1.8.1). |
`spark-test-common` declares its Spark/Hadoop dependencies as `provided` (compile-only). Each
runtime module supplies the real versions and re-runs the shared test via the Surefire
`dependenciesToScan` mechanism, so the two lines never collide on the classpath.
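
The wiring described above might look roughly like the following excerpt from a runtime module's POM. This is a sketch, not the project's actual file: the `com.example` groupId is a placeholder, and it assumes the shared test classes are packaged in the main jar of `spark-test-common` (a `test-jar` would need a matching type on the dependency). `dependenciesToScan` is a standard Surefire parameter that takes `groupId:artifactId` entries.

```xml
<!-- spark-apache/pom.xml (illustrative excerpt; com.example is a placeholder groupId) -->
<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>spark-test-common</artifactId>
    <version>${project.version}</version>
    <scope>test</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- Also scan this dependency's jar for test classes,
             so the shared test runs on this module's classpath -->
        <dependenciesToScan>
          <dependency>com.example:spark-test-common</dependency>
        </dependenciesToScan>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Because the shared module only compiles against `provided` Spark/Hadoop artifacts, the versions the test actually runs on are whichever ones the scanning module puts on its own test classpath.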
The dependency versions were taken from the original Gradle project:
GRADLE_USER_HOME=/data/.gradle ./gradlew :example:spark:dependencies \
--configuration apacheSparkRuntimeClasspath
GRADLE_USER_HOME=/data/.gradle ./gradlew :example:spark:dependencies \
--configuration clouderaSparkRuntimeClasspath

- `hadoop-native-loader` `0.1.4`: its `extract` goal unpacks the bundled Hadoop native libraries and prepends `-Djava.library.path` / `-Dhadoop.home.dir` to the Surefire `argLine`, so the native Hadoop code is used instead of the pure-Java fallback.
- `java-dns` `0.1.1`: the agent jar (`org.openprojectx.java.dns:core`) is attached to the Surefire JVM via `-javaagent:…=hosts=fake-gcs/127.0.0.1`. This redirects the `fake-gcs` host to the local container, so the GCS connector can use the stable URL `http://fake-gcs:4443/` and the test no longer needs to read the dynamic GCS endpoint. The fake-gcs container is bound to the fixed host port `4443` via `[ports] fakeGcs = 4443` in `spark-bigdata-test-common.toml` so the redirected host:port lines up.
- JDK 17
- Maven 3.9.x
- Docker (Testcontainers spins up the bigdata-test services)
- The `bigdata-test` framework installed locally as `0.1.1-SNAPSHOT` (`./gradlew publishToMavenLocal` in the bigdata-test checkout). The released `java-dns` and `hadoop-native-loader` plugin versions resolve from the configured remote repositories.
This build honours your user ~/.m2/settings.xml (proxy + mirrors); run Maven outside any sandbox
so it can reach the network.
Gradle resolves version conflicts to the highest requested version; Maven uses nearest-wins. A few places therefore need an explicit pin/exclusion in Maven that Gradle handled automatically:
- `jackson-annotations` → 2.15.2 (parent `dependencyManagement`). Cloudera Spark requests `2.12.7`, whose `JsonFormat.Feature` lacks the `READ_UNKNOWN_ENUM_VALUES_USING_DEFAULT_VALUE` constant that Testcontainers 2.0.4's shaded Jackson needs. 2.15.2 is Apache Spark 3.5's Jackson line. `jackson-databind` / `jackson-core` are left at each Spark line's own version.
- `spark-apache`: pin `slf4j-api` → 2.0.17. The Apache line pulls both the SLF4J 1.7 binding (`log4j-slf4j-impl`) and the 2.0 binding (`log4j-slf4j2-impl`); nearest-wins would keep `slf4j-api:1.7.25`, and the 2.0 binding then fails on `org.slf4j.spi.LoggingEventBuilder`.
- `spark-cloudera`: exclude `hadoop-client-api` / `hadoop-client-runtime` from `extensions`. Those are Apache Hadoop 3.4.2 shaded clients whose `core-default.xml` uses duration strings (`fs.s3a.threads.keepalivetime=60s`) that Cloudera's Hadoop 3.1.1 S3A code parses as a plain number. Excluding them leaves Cloudera Spark's own Hadoop 3.1.1 as the single Hadoop on the line.
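
As an illustration of the first and third items, the two POM excerpts below show the general shape of a managed version pin and a pair of exclusions. They are a sketch under assumptions: the groupId of the `extensions` artifact is a placeholder, its version is assumed to match the locally published framework version, and the two excerpts belong to different POMs, so they are not meant to be pasted together as one file.

```xml
<!-- Excerpt 1: parent pom.xml. A managed version overrides nearest-wins
     mediation for every module that inherits from this parent. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-annotations</artifactId>
      <version>2.15.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Excerpt 2: spark-cloudera/pom.xml, inside <dependencies>.
     The groupId is hypothetical; the version is an assumption. -->
<dependency>
  <groupId>com.example.bigdatatest</groupId>
  <artifactId>extensions</artifactId>
  <version>0.1.1-SNAPSHOT</version>
  <exclusions>
    <!-- Keep the Apache Hadoop 3.4.2 shaded clients off the Cloudera line -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Running `mvn dependency:tree` on each module is one way to confirm which version or artifact actually ends up on the classpath after a pin or exclusion.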
Both modules pass the full scenario (Tests run: 1, Failures: 0, Errors: 0) against a local Docker.
# compile everything and install spark-test-common into the local repo
mvn install -DskipTests
# run the full scenario on each line (requires Docker)
mvn -pl spark-apache test
mvn -pl spark-cloudera test
# or both
mvn test

The Dockerfile builds the whole project inside the image and keeps everything:
all downloaded Maven dependencies (/root/.m2/repository), every built module jar
(**/target/*.jar), the project source (/workspace/spark-test), and a source tarball
(/workspace/spark-test-src.tgz, captured before the build so it is pure source with no
target/). The source is copied except for the files excluded by .dockerignore,
which mirrors .gitignore (so target/, .idea/, .git, etc. are not brought in). Tests are
skipped during the image build because Testcontainers needs a Docker daemon that isn't available
there.
docker build -t spark-test .
# Behind an HTTP proxy (e.g. local dev), pass it through:
docker build --build-arg HTTPS_PROXY=http://host.docker.internal:10809 -t spark-test .

.github/workflows/build.yml builds that image (installing the deps
and building the jars in the image build) and publishes it to GitHub Container Registry
(ghcr.io/<owner>/spark-test) on pushes to the default branch and on v* tags. It needs no extra
secrets — it authenticates with the built-in GITHUB_TOKEN (packages: write). Pull requests build
the image but do not push.
.github/workflows/windows.yml runs on windows-latest to check
Windows compatibility: it builds every module with Maven and verifies the hadoop-native-loader
plugin extracts the Windows-native Hadoop artifacts (winutils.exe, hadoop.dll). The full
scenario test is not run there — it needs Testcontainers (Linux containers), which GitHub-hosted
Windows runners can't run; run mvn test on Linux, or on a self-hosted Windows machine backed by a
Linux-container Docker engine.