[Feature Request]: Add S3-compatible storage support for Iceberg managed I/O #37614

@dejii

What would you like to happen?

The Apache Beam Iceberg connector currently supports only Google Cloud Storage (GCS) out of the box. The expansion service JAR bundles iceberg-gcp but not iceberg-aws, which is required for AWS S3 and S3-compatible storage backends (e.g., MinIO or Supabase Storage).

Attempting to write to an S3-compatible destination using the current expansion service results in the following error when running on Dataflow:

Error message from worker: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Cannot initialize FileIO implementation org.apache.iceberg.aws.s3.S3FileIO: Cannot find constructor for interface org.apache.iceberg.io.FileIO

Missing org.apache.iceberg.aws.s3.S3FileIO [java.lang.ClassNotFoundException: org.apache.iceberg.aws.s3.S3FileIO]
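
For anyone trying to reproduce or verify this, a reflective lookup is one quick way to check whether a given classpath (for example, a custom expansion service JAR) actually contains the class named in the error. The following is a minimal sketch, not part of Beam or Iceberg; the class name S3FileIoClasspathCheck and the helper isOnClasspath are hypothetical names chosen for illustration.

```java
class S3FileIoClasspathCheck {
    // Class name copied from the error message above.
    static final String S3_FILE_IO = "org.apache.iceberg.aws.s3.S3FileIO";

    /** Returns true if the named class can be located, without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, S3FileIoClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(S3_FILE_IO + " on classpath: " + isOnClasspath(S3_FILE_IO));
    }
}
```

Running this with the expansion service JAR on the classpath should report whether iceberg-aws was bundled; it does not check the AWS SDK dependencies that S3FileIO needs at runtime.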

Current workarounds:

  • Building and deploying a custom expansion service JAR that includes iceberg-aws
  • Using IcebergIO directly (which is generally discouraged in favor of using Managed IO)

Proposal

Bundle iceberg-aws with the official Iceberg expansion service JAR to enable native S3 and S3-compatible storage support.

This would allow writing to S3-compatible destinations using a REST-based catalog configuration such as:

ImmutableMap<String, String> catalogProperties = ImmutableMap.<String, String>builder()
    .put("type", "rest")
    .put("uri", options.getCatalogUri())
    .put("token", options.getCatalogToken())
    .put("warehouse", options.getWarehouse())
    .put("client.region", "us-east-1")
    .put("s3.endpoint", options.getS3Endpoint())
    .put("s3.access-key-id", options.getS3AccessKeyId())
    .put("s3.secret-access-key", options.getS3SecretAccessKey())
    .put("s3.path-style-access", "true")
    .put("s3.remote-signing-enabled", "false")
    .build();
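
For context, these catalog properties would be nested inside the configuration map handed to Managed IO. The sketch below uses only java.util so it stays dependency-free; it assumes the Managed Iceberg config keys table, catalog_name, and catalog_properties, and the class name, table name, catalog name, and URLs are hypothetical placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ManagedIcebergConfigSketch {
    /** Wraps catalog properties in the nested map layout assumed for Managed Iceberg. */
    public static Map<String, Object> buildConfig(
            String table, String catalogName, Map<String, String> catalogProperties) {
        Map<String, Object> config = new LinkedHashMap<>();
        config.put("table", table);
        config.put("catalog_name", catalogName);
        config.put("catalog_properties", catalogProperties);
        return config;
    }

    public static void main(String[] args) {
        // Placeholder values; a real pipeline would read these from pipeline options.
        Map<String, String> catalogProperties = new LinkedHashMap<>();
        catalogProperties.put("type", "rest");
        catalogProperties.put("uri", "https://catalog.example.com");
        catalogProperties.put("s3.endpoint", "https://s3.example.com");
        catalogProperties.put("s3.path-style-access", "true");

        Map<String, Object> config = buildConfig("db.events", "my_catalog", catalogProperties);
        System.out.println(config);
        // In a Beam pipeline this map would be passed along the lines of:
        //   rows.apply(Managed.write(Managed.ICEBERG).withConfig(config));
    }
}
```

With iceberg-aws bundled, a config shaped like this should be all that is needed on the user side; no custom expansion service would be required.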

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
