OpenProjectX/spark-platform

Spark Platform

Spark Platform centralizes dependency and image management for Spark applications built with Gradle. Application projects choose a Spark line and runtime variants; the platform owns Spark, Hadoop, Scala, and variant runtime artifacts so local, CI, and production builds resolve the same stack.

The project provides:

  • a Gradle plugin: org.openprojectx.spark.platform
  • a platform BOM: org.openprojectx.spark.platform:platform-bom
  • a platform base-image module that packages selected runtime jars
  • a version matrix for Apache Spark and Cloudera Spark lines, Hadoop, Iceberg, Hudi, Paimon, and OpenLineage

The plugin adds platform-owned constraints from the selected line and variants. Application projects opt into the Spark or variant APIs they compile against by adding versionless dependencies to sparkPlatform. Local builds expose those dependencies through implementation so the app can run from an IDE or Gradle. Official builds expose them through compileOnly because the platform image provides them at runtime. Gradle-run JVM tests and smoke runs still receive a platform runtime classpath in CI, so tests do not depend on whatever happened to be installed on a developer machine. Application builds should manage only their own classes, application-specific libraries, and the platform APIs they actually use.
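The local-versus-official split described above can be pictured with plain Gradle configuration wiring. This is a minimal sketch, not the plugin's source: the `providedPlatform` configuration and the `officialBuild` Gradle property are hypothetical names, and the explicit Spark version appears only because the sketch does not apply the platform constraints.

```kotlin
// build.gradle.kts: illustration only, not the plugin's implementation.
plugins {
    java
}

repositories {
    mavenCentral()
}

// Hypothetical switch; the real plugin decides local vs. official on its own.
val officialBuild = providers.gradleProperty("officialBuild").isPresent

// Hypothetical bucket standing in for the plugin's sparkPlatform configuration.
val providedPlatform = configurations.create("providedPlatform") {
    isCanBeConsumed = false
    isCanBeResolved = false
}

// Official builds compile against the platform APIs and expect the image to supply them.
// Local builds also place them on the runtime classpath so IDE and Gradle runs work.
val target = if (officialBuild) "compileOnly" else "implementation"
configurations.named(target) {
    extendsFrom(providedPlatform)
}

dependencies {
    // Explicit version because this sketch has no platform constraints applied.
    providedPlatform("org.apache.spark:spark-sql_2.13:4.0.1")
}
```

Running the sketch with `-PofficialBuild` moves the Spark dependency from the runtime classpath to compile-only, which mirrors the behavior the plugin is described as providing.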

managed is a Gradle resolution contract: it adds strict constraints, but it does not package jars into the user application image by itself. Runtime jars are owned by image layers:

| Layer | Contains |
| --- | --- |
| Spark base image | Spark, Scala, Hadoop, and line-managed runtime jars such as Spark SQL Kafka and Spark Avro. |
| Platform image | Selected variant/addon jars such as Iceberg, Hudi, Paimon, OpenLineage, Hadoop AWS, Hadoop GCS, and Iceberg AWS. |
| Application image | User classes, resources, and application-owned libraries. |
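For readers unfamiliar with strict constraints, the following shows what one looks like in plain Gradle. It is a sketch under assumptions: the plugin is described as adding such constraints for you, and the version `4.0.1` is taken from the Spark 4 base image tag mentioned in this README rather than from the plugin's catalog.

```kotlin
// build.gradle.kts: a hand-written strict constraint, for illustration only.
plugins {
    java
}

repositories {
    mavenCentral()
}

dependencies {
    constraints {
        // Pins the version during resolution; it adds no dependency by itself.
        implementation("org.apache.spark:spark-sql_2.13") {
            version { strictly("4.0.1") }
            because("pinned by the selected platform line (assumed version)")
        }
    }

    // The versionless declaration resolves through the constraint above.
    implementation("org.apache.spark:spark-sql_2.13")
}
```

The constraint only affects which version resolves. Whether the jar ends up in an image still depends on the scope the dependency is declared in.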

Any scope listed in managedConfigurations can use managed dependencies without versions, including api, implementation, and testImplementation. Official builds manage test scopes by default so versionless test dependencies resolve in CI. Packaging still follows that scope: implementation(...) is packaged into the app image, while sparkPlatform(...) is the provided-platform scope. A production ClassNotFoundException for a Spark/Hadoop/variant class means the selected base/platform image is missing that platform-owned jar; the fix is to update the platform image contract, not to package another Spark jar inside the app image.
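The packaging rule can be read directly off a dependency block. The sketch below reuses the Quick Start contract; the `com.typesafe:config` library and its version are illustrative stand-ins for an application-owned dependency, not something the platform prescribes.

```kotlin
// build.gradle.kts: which declarations end up in the application image.
plugins {
    application
    java
    id("org.openprojectx.spark.platform")
}

sparkPlatform {
    line.set("spark4")
    variants.set(listOf("iceberg"))
}

dependencies {
    // Provided-platform scope: supplied by the base/platform image at runtime,
    // so it is not packaged into the application image.
    sparkPlatform("org.apache.spark:spark-sql_2.13")
    sparkPlatform("org.apache.iceberg:iceberg-spark-runtime-4.0_2.13")

    // Application-owned library (illustrative): packaged into the application image.
    implementation("com.typesafe:config:1.4.3")
}
```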

Quick Start

Apply the plugin and select the platform contract:

plugins {
    application
    java
    id("org.openprojectx.spark.platform")
}

sparkPlatform {
    line.set("spark4")
    variants.set(listOf("iceberg"))
    platformVersion.set("0.1.1-SNAPSHOT")
}

dependencies {
    sparkPlatform("org.apache.spark:spark-sql_2.13")
    sparkPlatform("org.apache.iceberg:iceberg-spark-runtime-4.0_2.13")
}

For BOM-style use on a normal Gradle scope, target that scope and keep the dependency versionless:

sparkPlatform {
    line.set("spark4")
    variants.set(listOf("iceberg"))
    managedConfigurations.set(listOf("api", "testImplementation"))
}

dependencies {
    api("org.apache.spark:spark-sql_2.13")
    testImplementation("org.apache.iceberg:iceberg-spark-runtime-4.0_2.13")
}

See the standalone Gradle build in examples for runnable applications:

cd examples
env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark4-iceberg:run --no-configuration-cache
env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark3-paimon:run --no-configuration-cache

Modules

| Module | Purpose |
| --- | --- |
| core | Shared catalog naming and normalization logic. |
| plugin | Gradle plugin implementation and TestKit coverage. |
| platform-bom | Java Platform BOM generated from the version catalog. |
| platform-image | Jib-based platform image with selected runtime jars. |
| examples | Standalone multi-project build for runnable examples. |

Documentation

  • User reference: docs/user-reference.adoc
  • Contribution guide: CONTRIBUTING.md

Development

Use the existing Gradle cache when working in this repository:

env GRADLE_USER_HOME=/data/.gradle ./gradlew test --no-configuration-cache

Run the Spark 4 example with:

cd examples
env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark4-iceberg:run --no-configuration-cache

Run the Spark 3 Paimon example with:

cd examples
env GRADLE_USER_HOME=/data/.gradle ../gradlew :spark3-paimon:run --no-configuration-cache

Build the default Spark 4 platform images locally with:

env GRADLE_USER_HOME=/data/.gradle ./gradlew :platform-image:jibDockerBuildPlatformImages \
  -PsparkPlatform.line=spark4

Platform images use project-owned clean Spark base images such as ghcr.io/openprojectx/spark:3.5.8-scala2.12-java17-python3-r-ubuntu and ghcr.io/openprojectx/spark:4.0.1-scala2.13-java17-python3-r-ubuntu, then layer only the selected variant and addon jars into /opt/spark/jars. Spark, Scala, Hadoop, and the core runtime jars are assembled by spark-base-image from the Gradle version catalog and BOM, so hadoopSpark3 and hadoopSpark4 are real base image contents rather than classpath overrides of jars bundled by an upstream Spark image. Curated images use profile tags such as spark4-lakehouse-0.1.1-SNAPSHOT; explicit custom images include selected variants and addons in the tag.

The spark-base-image module publishes project-owned base images to ghcr.io/openprojectx/spark for every supported Spark line. It first builds a layout image from the verified Apache Spark distribution with /opt/spark/jars stripped in the same Docker layer, then uses Gradle and Jib to add the catalog-managed runtime jars. Base images are released separately through the Base Images workflow; platform image release tasks consume the published GHCR images and do not rebuild them.

The aggregate jibDockerBuildPlatformImages task builds each selected variant with the selected addons individually, then builds one combined image for each Scala-compatible variant group. Build one explicit variant set with:

env GRADLE_USER_HOME=/data/.gradle ./gradlew :platform-image:jibDockerBuild \
  -PsparkPlatform.line=spark4 \
  -PsparkPlatform.variants=iceberg,hudi

jibDockerBuild writes to the local Docker daemon. Inspect a built image with:

docker inspect ghcr.io/openprojectx/spark-platform:spark4-iceberg-0.1.1-SNAPSHOT
docker run --rm --entrypoint sh ghcr.io/openprojectx/spark-platform:spark4-iceberg-0.1.1-SNAPSHOT \
  -c 'ls -1 /opt/spark/jars | sort'

The Jib image tasks are not compatible with Gradle configuration-cache reuse in the current toolchain, so the build marks those tasks incompatible and Gradle discards their configuration-cache entries. This does not disable Gradle's build cache or Jib's image layer reuse.
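Gradle exposes a task-level API for this kind of opt-out. The sketch below shows how a build could mark Jib tasks as incompatible. It is an assumption about the mechanism rather than a copy of this repository's build logic, and the Jib plugin version is illustrative.

```kotlin
// build.gradle.kts: marking Jib tasks as configuration-cache incompatible.
plugins {
    java
    id("com.google.cloud.tools.jib") version "3.4.0" // illustrative version
}

// Covers jib, jibDockerBuild, and jibBuildTar without listing each task.
tasks.matching { it.name.startsWith("jib") }.configureEach {
    // Gradle reports the reason and discards the configuration-cache entry
    // whenever one of these tasks is part of the requested build.
    notCompatibleWithConfigurationCache(
        "Jib image tasks do not support configuration-cache reuse in this toolchain"
    )
}
```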

Application jibDockerBuild tasks use the local Docker platform image as their base image, even in CI. Registry publishing with jib keeps the registry base image reference.

Application images also get the platform jar directory, /opt/spark/jars/*, on the Jib runtime classpath. Spark and variant runtime jars remain owned by the platform image rather than being redeclared or repackaged by application projects.
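In plain Jib terms, the behavior described above corresponds roughly to the configuration below. The platform plugin is described as applying this for you, so treat it as an explanatory sketch: the image tag is the example tag used earlier in this README, and the Jib plugin version is illustrative.

```kotlin
// build.gradle.kts: hand-written equivalent of the described Jib wiring.
plugins {
    java
    id("com.google.cloud.tools.jib") version "3.4.0" // illustrative version
}

jib {
    from {
        // The docker:// prefix tells Jib to read the base image from the local
        // Docker daemon instead of pulling it from a registry.
        image = "docker://ghcr.io/openprojectx/spark-platform:spark4-iceberg-0.1.1-SNAPSHOT"
    }
    container {
        // Appends the platform jar directory to the container's Java classpath,
        // so Spark and variant jars come from the platform image layers.
        extraClasspath = listOf("/opt/spark/jars/*")
    }
}
```

For registry publishing, dropping the `docker://` prefix leaves a normal registry reference as the base image.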

For aggregate tasks such as integration or release that invoke jibDockerBuild indirectly, set sparkPlatform.localPlatformImage=true in the application build or pass -PsparkPlatform.localPlatformImage=true.
