spark-test

A Maven + Java port of the bigdata-test Spark JUnit 5 example (bigdata-test/example/spark). It runs the same end-to-end Spark scenario against two dependency lines — Apache Spark/Hadoop and Cloudera Spark/Hadoop — using the bigdata-test framework published locally as 0.1.1-SNAPSHOT.

What the test does

SparkBigDataScenario spins up the bigdata-test containers (HDFS, Hive Metastore, Kafka, LocalStack S3, fake-gcs, Kerberos KDC) and a local Spark session, then verifies:

an HDFS-backed S3 JCEKS credential store exists;
an Avro Kafka source can be read (Kerberos/SASL + TLS aware);
Iceberg tables can be created and queried on S3, a local GCS warehouse, and the HMS catalog (with the Iceberg table also asserted directly against the Hive Metastore);
Hive external Parquet tables can be written/read on S3 and GCS, asserted against the Hive Metastore.

The concrete SparkBigDataTestExample runs with the cloudera-hms-kerberos configuration, matching the default test in the original Gradle example.

Module layout

| Module | Purpose |
| --- | --- |
| `spark-test-common` | The shared scenario, the `SparkBigDataTestExample` test class, and the test resources (TOML/Avro/log4j2). Holds the test code exactly once. |
| `spark-apache` | Runs the shared test against the Apache line (Spark 3.5.7 / Hadoop 3.4.2 / Iceberg 1.11.0). |
| `spark-cloudera` | Runs the shared test against the Cloudera line (Spark 3.3.2.3.3.7190.9-1 / Hadoop 3.1.1.7.1.9.14-2 / Iceberg 1.8.1). |

spark-test-common declares its Spark/Hadoop dependencies as provided (compile-only). Each runtime module supplies the real versions and re-runs the shared test via the Surefire dependenciesToScan mechanism, so the two lines never collide on the classpath.

The dependency versions were taken from the original Gradle project:

GRADLE_USER_HOME=/data/.gradle ./gradlew :example:spark:dependencies \
  --configuration apacheSparkRuntimeClasspath
GRADLE_USER_HOME=/data/.gradle ./gradlew :example:spark:dependencies \
  --configuration clouderaSparkRuntimeClasspath

Plugins

hadoop-native-loader 0.1.4 — its extract goal unpacks the bundled Hadoop native libraries and prepends -Djava.library.path / -Dhadoop.home.dir to the Surefire argLine, so the native Hadoop code is used instead of the pure-Java fallback.
java-dns 0.1.1 — the agent jar (org.openprojectx.java.dns:core) is attached to the Surefire JVM via -javaagent:…=hosts=fake-gcs/127.0.0.1. This redirects the fake-gcs host to the local container, so the GCS connector can use the stable URL http://fake-gcs:4443/ and the test no longer needs to read the dynamic GCS endpoint. The fake-gcs container is bound to the fixed host port 4443 via [ports] fakeGcs = 4443 in spark-bigdata-test-common.toml so the redirected host:port lines up.
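If the scenario fails early, it can help to confirm that the two plugins actually took effect in the forked test JVM. The following is a minimal sketch of such a preflight check, not part of bigdata-test or either plugin: the class and method names are hypothetical, and it only uses standard JDK calls. It checks `hadoop.home.dir` rather than `java.library.path`, because the JVM always sets a default `java.library.path`, so its mere presence proves nothing.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical preflight helper; not part of bigdata-test, hadoop-native-loader, or java-dns. */
final class PluginPreflight {

    /** Returns hadoop.home.dir as a path if it is set and names an existing directory. */
    static Optional<Path> hadoopHome(Properties props) {
        String value = props.getProperty("hadoop.home.dir");
        if (value == null || value.isBlank()) {
            return Optional.empty();
        }
        Path dir = Path.of(value);
        return Files.isDirectory(dir) ? Optional.of(dir) : Optional.empty();
    }

    /** True if the host name resolves to a loopback address such as 127.0.0.1. */
    static boolean resolvesToLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            // The name is not resolvable at all, e.g. because the agent is not attached.
            return false;
        }
    }
}
```

Inside a test, `PluginPreflight.hadoopHome(System.getProperties())` should be non-empty when the extract goal ran, and `PluginPreflight.resolvesToLoopback("fake-gcs")` should be true when the java-dns agent is attached with the `hosts=fake-gcs/127.0.0.1` mapping.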

Prerequisites

JDK 17
Maven 3.9.x
Docker (Testcontainers spins up the bigdata-test services)
The bigdata-test framework installed locally as 0.1.1-SNAPSHOT (./gradlew publishToMavenLocal in the bigdata-test checkout). The released java-dns and hadoop-native-loader plugin versions resolve from the configured remote repositories.

The build honours your user-level ~/.m2/settings.xml (proxy and mirrors); run Maven outside any sandbox so it can reach the network.

Dependency convergence notes

Gradle resolves version conflicts to the highest requested version; Maven uses nearest-wins. A few places therefore need an explicit pin/exclusion in Maven that Gradle handled automatically:

jackson-annotations → 2.15.2 (parent dependencyManagement). Cloudera Spark requests 2.12.7, whose JsonFormat.Feature lacks READ_UNKNOWN_ENUM_VALUES_USING_DEFAULT_VALUE that Testcontainers 2.0.4's shaded Jackson needs. 2.15.2 is Apache Spark 3.5's Jackson line. jackson-databind/jackson-core are left at each Spark line's own version.
spark-apache: pin slf4j-api → 2.0.17. The Apache line pulls both the SLF4J 1.7 binding (log4j-slf4j-impl) and the 2.0 binding (log4j-slf4j2-impl); nearest-wins would keep slf4j-api:1.7.25, and the 2.0 binding then fails because org.slf4j.spi.LoggingEventBuilder exists only in slf4j-api 2.x.
spark-cloudera: exclude hadoop-client-api/hadoop-client-runtime from extensions. Those are Apache Hadoop 3.4.2 shaded clients whose core-default.xml uses duration strings (fs.s3a.threads.keepalivetime=60s) that Cloudera's Hadoop 3.1.1 S3A code parses as a plain number. Excluding them leaves Cloudera Spark's own Hadoop 3.1.1 as the single Hadoop on the line.
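The difference between the two resolution strategies can be illustrated with a toy model. This is a simplified sketch, not Maven's or Gradle's actual resolver: the `Request` record, the method names, and the depths used in the usage note are all hypothetical, and the version comparison handles only purely numeric dotted versions.

```java
import java.util.List;

/** Toy model of dependency conflict resolution; not the real Maven or Gradle algorithm. */
final class ConflictResolution {

    /** One request for an artifact version, 'depth' hops away from the root POM. */
    record Request(String version, int depth) {}

    /** Maven-style: the request closest to the root wins; ties go to the first declared. */
    static String nearestWins(List<Request> requests) {
        Request best = requests.get(0);
        for (Request r : requests) {
            if (r.depth() < best.depth()) {
                best = r;
            }
        }
        return best.version();
    }

    /** Gradle-style default: the highest requested version wins, regardless of depth. */
    static String highestWins(List<Request> requests) {
        return requests.stream()
                .map(Request::version)
                .max(ConflictResolution::compareVersions)
                .orElseThrow();
    }

    /** Compares purely numeric dotted versions such as 1.7.25 and 2.0.17. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }
}
```

With slf4j-api requested as 1.7.25 at an assumed depth of 1 and as 2.0.17 at an assumed depth of 2, `nearestWins` returns 1.7.25 while `highestWins` returns 2.0.17, which is why the Maven build needs the explicit pin.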

Both modules pass the full scenario (Tests run: 1, Failures: 0, Errors: 0) against a local Docker daemon.

Build & run

# compile everything and install spark-test-common into the local repo
mvn install -DskipTests

# run the full scenario on each line (requires Docker)
mvn -pl spark-apache test
mvn -pl spark-cloudera test

# or both
mvn test

Docker image

The Dockerfile builds the whole project inside the image and keeps everything: all downloaded Maven dependencies (/root/.m2/repository), every built module jar (**/target/*.jar), the project source (/workspace/spark-test), and a source tarball (/workspace/spark-test-src.tgz, captured before the build so it is pure source with no target/). The source is copied except for the files excluded by .dockerignore, which mirrors .gitignore (so target/, .idea/, .git, etc. are not brought in). Tests are skipped during the image build because Testcontainers needs a Docker daemon that isn't available there.

docker build -t spark-test .

# Behind an HTTP proxy (e.g. local dev), pass it through:
docker build --build-arg HTTPS_PROXY=http://host.docker.internal:10809 -t spark-test .

CI / GitHub Container Registry

.github/workflows/build.yml builds that image (installing the deps and building the jars in the image build) and publishes it to GitHub Container Registry (ghcr.io/<owner>/spark-test) on pushes to the default branch and on v* tags. It needs no extra secrets — it authenticates with the built-in GITHUB_TOKEN (packages: write). Pull requests build the image but do not push.

.github/workflows/windows.yml runs on windows-latest to check Windows compatibility: it builds every module with Maven and verifies the hadoop-native-loader plugin extracts the Windows-native Hadoop artifacts (winutils.exe, hadoop.dll). The full scenario test is not run there — it needs Testcontainers (Linux containers), which GitHub-hosted Windows runners can't run; run mvn test on Linux, or on a self-hosted Windows machine backed by a Linux-container Docker engine.
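The kind of check the Windows workflow performs can be sketched in a few lines of Java. This is an illustration only, not the workflow's actual verification step: the class and method names are hypothetical, and the assumption that the artifacts sit under a `bin` directory of the Hadoop home follows the usual Hadoop-on-Windows convention rather than anything stated about the plugin's output layout.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check for the Windows-native Hadoop artifacts. */
final class WindowsNativeCheck {

    // File names are taken from the README; the bin/ location is an assumption.
    static final List<String> EXPECTED = List.of("winutils.exe", "hadoop.dll");

    /** Returns the expected artifact names that are not regular files under hadoopHome/bin. */
    static List<String> missingArtifacts(Path hadoopHome) {
        Path bin = hadoopHome.resolve("bin");
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isRegularFile(bin.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

An empty result from `WindowsNativeCheck.missingArtifacts(Path.of(System.getProperty("hadoop.home.dir")))` would indicate that both files were found under the assumed layout.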
