Spark Platform centralizes dependency and image management for Spark applications built with Gradle. Application projects choose a Spark line and runtime variants; the platform owns Spark, Hadoop, Scala, and variant runtime artifacts so local, CI, and production builds resolve the same stack.
The project provides:
- a Gradle plugin: org.openprojectx.spark.platform
- a platform BOM: org.openprojectx.spark.platform:platform-bom
- a platform base-image module that packages selected runtime jars
- a version matrix for Apache Spark and Cloudera Spark lines, Hadoop, Iceberg, Hudi, Paimon, and OpenLineage
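
As a rough illustration of what a platform BOM means in Gradle terms, the sketch below imports the BOM coordinates listed above with Gradle's standard `platform(...)` notation. It rests on two assumptions: that the BOM is published under the same version as the `platformVersion` used elsewhere in this README, and that a repository hosting it is configured. The plugin is the documented entry point, so treat this as background rather than the recommended setup.

```kotlin
// build.gradle.kts: hedged sketch of consuming a Java Platform BOM directly.
plugins {
    `java-library`
}

repositories {
    mavenCentral()
    // Assumption: add whichever repository actually hosts the platform BOM.
}

dependencies {
    // Import the BOM's constraints; the version is assumed to match platformVersion.
    implementation(platform("org.openprojectx.spark.platform:platform-bom:0.1.1-SNAPSHOT"))

    // Versionless: the version would come from the imported BOM.
    implementation("org.apache.spark:spark-sql_2.13")
}
```
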
The plugin adds platform-owned constraints from the selected line and variants.
Application projects opt into the Spark or variant APIs they compile against by
adding versionless dependencies to sparkPlatform. Local builds expose those
dependencies through implementation so the app can run from an IDE or Gradle.
Official builds expose them through compileOnly because the platform image
provides them at runtime. Gradle-run JVM tests and smoke runs still receive a
platform runtime classpath in CI, so tests do not depend on whatever happened to
be installed on a developer machine. Application builds should manage only their
own classes, application-specific libraries, and the platform APIs they actually
use.
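
The local versus official scope switch described above can be pictured in plain Gradle. This is a minimal sketch, not the plugin's actual implementation: the `platformProvided` configuration and the `officialBuild` property are hypothetical names chosen for illustration, and the plugin exposes its own `sparkPlatform` scope instead.

```kotlin
// build.gradle.kts: illustration only; the Spark Platform plugin does this wiring for you.
plugins {
    java
}

// Hypothetical switch, e.g. -PofficialBuild=true in CI (not a real plugin property).
val officialBuild = providers.gradleProperty("officialBuild")
    .map { it.toBoolean() }
    .getOrElse(false)

// Hypothetical declaration-only bucket standing in for the plugin's sparkPlatform scope.
val platformProvided by configurations.creating {
    isCanBeConsumed = false
    isCanBeResolved = false
}

// Local builds: implementation, so the jars are on the runtime classpath.
// Official builds: compileOnly, because the platform image provides them at runtime.
configurations.named(if (officialBuild) "compileOnly" else "implementation") {
    extendsFrom(platformProvided)
}

// Tests see the platform jars on their classpath in both modes.
configurations.named("testImplementation") {
    extendsFrom(platformProvided)
}
```

The point of the design is that the application declares each platform API once, and the build mode decides whether that declaration contributes to the packaged runtime classpath.
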
managed is a Gradle resolution contract: it adds strict constraints, but it
does not package jars into the user application image by itself. Runtime jars
are owned by image layers:
| Layer | Contains |
|---|---|
| Spark base image | Spark, Scala, Hadoop, and line-managed runtime jars such as Spark SQL Kafka and Spark Avro. |
| Platform image | Selected variant/addon jars such as Iceberg, Hudi, Paimon, OpenLineage, Hadoop AWS, Hadoop GCS, and Iceberg AWS. |
| Application image | User classes, resources, and application-owned libraries. |
Any scope listed in managedConfigurations can use managed dependencies without
versions, including api, implementation, and testImplementation.
Official builds manage test scopes by default so versionless test dependencies
resolve in CI.
Packaging still follows that scope: implementation(...) is packaged into the
app image, while sparkPlatform(...) is the provided-platform scope. A
production ClassNotFoundException for a Spark/Hadoop/variant class means the
selected base/platform image is missing that platform-owned jar; the fix is to
update the platform image contract, not to package another Spark jar inside the
app image.
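
For readers unfamiliar with strict constraints, the following sketch shows the underlying Gradle mechanism by hand. It is not what the plugin generates verbatim; the version `4.0.1` is illustrative (taken from the Spark 4 base image tag mentioned later in this README), and in a real project the plugin supplies the constraint from the selected line.

```kotlin
// build.gradle.kts: hand-written equivalent of one platform-owned strict constraint.
plugins {
    `java-library`
}

repositories {
    mavenCentral()
}

dependencies {
    constraints {
        // A constraint only pins the version; it does not add the dependency by itself.
        implementation("org.apache.spark:spark-sql_2.13") {
            version { strictly("4.0.1") } // illustrative version
            because("the platform image owns the Spark runtime")
        }
    }

    // Versionless declaration; resolution picks up the strict version above.
    implementation("org.apache.spark:spark-sql_2.13")
}
```

A strict constraint affects resolution only. Whether the jar ends up in the application image is still decided by the scope it is declared on.
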
Apply the plugin and select the platform contract:

    plugins {
        application
        java
        id("org.openprojectx.spark.platform")
    }

    sparkPlatform {
        line.set("spark4")
        variants.set(listOf("iceberg"))
        platformVersion.set("0.1.1-SNAPSHOT")
    }

    dependencies {
        sparkPlatform("org.apache.spark:spark-sql_2.13")
        sparkPlatform("org.apache.iceberg:iceberg-spark-runtime-4.0_2.13")
    }

For BOM-style use on a normal Gradle scope, target that scope and keep the dependency versionless:

    sparkPlatform {
        line.set("spark4")
        variants.set(listOf("iceberg"))
        managedConfigurations.set(listOf("api", "testImplementation"))
    }

    dependencies {
        api("org.apache.spark:spark-sql_2.13")
        testImplementation("org.apache.iceberg:iceberg-spark-runtime-4.0_2.13")
    }

See the standalone examples Gradle build for runnable applications.

    cd examples
    env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark4-iceberg:run --no-configuration-cache
    env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark3-paimon:run --no-configuration-cache

| Module | Purpose |
|---|---|
| core | Shared catalog naming and normalization logic. |
| plugin | Gradle plugin implementation and TestKit coverage. |
| platform-bom | Java Platform BOM generated from the version catalog. |
| platform-image | Jib-based platform image with selected runtime jars. |
| examples | Standalone multi-project build for runnable examples. |

- User reference: docs/user-reference.adoc
- Contribution guide: CONTRIBUTING.md
Use the existing Gradle cache when working in this repository:

    env GRADLE_USER_HOME=/data/.gradle ./gradlew test --no-configuration-cache

Run the Spark 4 example with:

    cd examples
    env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark4-iceberg:run --no-configuration-cache

Run the Spark 3 Paimon example with:

    cd examples
    env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark3-paimon:run --no-configuration-cache

Build the default Spark 4 platform images locally with:

    env GRADLE_USER_HOME=/data/.gradle ./gradlew :platform-image:jibDockerBuildPlatformImages \
      -PsparkPlatform.line=spark4

Platform images use project-owned clean Spark base images such as
ghcr.io/openprojectx/spark:3.5.8-scala2.12-java17-python3-r-ubuntu and
ghcr.io/openprojectx/spark:4.0.1-scala2.13-java17-python3-r-ubuntu, then layer
only the selected variant and addon jars into /opt/spark/jars. Spark, Scala,
Hadoop, and the core runtime jars are assembled by spark-base-image from the
Gradle version catalog and BOM, so hadoopSpark3 and hadoopSpark4 are real
base image contents rather than classpath overrides of jars bundled by an
upstream Spark image. Curated images use profile tags such as
spark4-lakehouse-0.1.1-SNAPSHOT; explicit custom images include selected
variants and addons in the tag.
The spark-base-image module publishes project-owned base images to
ghcr.io/openprojectx/spark for every supported Spark line. It first builds a
layout image from the verified Apache Spark distribution with /opt/spark/jars
stripped in the same Docker layer, then uses Gradle and Jib to add the
catalog-managed runtime jars. Base images are released separately through the
Base Images workflow; platform image release tasks consume the published GHCR
images and do not rebuild them.
The aggregate jibDockerBuildPlatformImages task builds each selected variant
with the selected addons individually, then builds one combined image for each
Scala-compatible variant group.
Build one explicit variant set with:

    env GRADLE_USER_HOME=/data/.gradle ./gradlew :platform-image:jibDockerBuild \
      -PsparkPlatform.line=spark4 \
      -PsparkPlatform.variants=iceberg,hudi

jibDockerBuild writes to the local Docker daemon. Inspect a built image with:

    docker inspect ghcr.io/openprojectx/spark-platform:spark4-iceberg-0.1.1-SNAPSHOT
    docker run --rm --entrypoint sh ghcr.io/openprojectx/spark-platform:spark4-iceberg-0.1.1-SNAPSHOT \
      -c 'ls -1 /opt/spark/jars | sort'

The Jib image tasks are not compatible with Gradle configuration-cache reuse in the current toolchain, so the build marks those tasks incompatible and Gradle discards their configuration-cache entries. This does not disable Gradle's build cache or Jib's image layer reuse.

Application jibDockerBuild tasks use the local Docker platform image as their
base image, even in CI. Registry publishing with jib keeps the registry base
image reference.
Application images also get the platform jar directory, /opt/spark/jars/*, on
the Jib runtime classpath. Spark and variant runtime jars remain owned by the
platform image rather than being redeclared or repackaged by application
projects.
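
A hand-written Jib configuration with roughly the same effect might look like the sketch below. It uses the Jib Gradle plugin's `from.image` and `container.extraClasspath` settings and Jib's `docker://` prefix for images in the local Docker daemon. Whether the Spark Platform plugin configures Jib in exactly this way is an assumption, and the Jib plugin version shown is illustrative. In a project that applies the platform plugin, none of this should need to be written by hand.

```kotlin
// build.gradle.kts: illustrative Jib setup; the platform plugin is expected to manage this.
plugins {
    application
    id("com.google.cloud.tools.jib") version "3.4.4" // illustrative version
}

jib {
    from {
        // docker:// makes Jib read the base image from the local Docker daemon
        // instead of pulling it from the registry.
        image = "docker://ghcr.io/openprojectx/spark-platform:spark4-iceberg-0.1.1-SNAPSHOT"
    }
    container {
        // Put the platform-owned jar directory on the runtime classpath so the
        // application image does not need to repackage Spark or variant jars.
        extraClasspath = listOf("/opt/spark/jars/*")
    }
}
```
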
For aggregate tasks such as integration or release that invoke
jibDockerBuild indirectly, set sparkPlatform.localPlatformImage=true in the
application build or pass -PsparkPlatform.localPlatformImage=true.