GCSFileSystem requires gcp extra at lookup time while S3FileSystem does not #38751

Open
wilmerdooley wants to merge 1 commit into apache:master from wilmerdooley:gcs-lazy-filesystem-lookup

@wilmerdooley wilmerdooley commented May 30, 2026

Resolves #37445.

FileSystems.get_filesystem() handled missing optional dependencies inconsistently between GCS and S3. S3FileSystem is returned even without the aws extra installed, deferring the dependency error until the filesystem is actually used. GCSFileSystem, by contrast, failed to import without the gcp extra, so it never registered for the gs:// scheme and get_filesystem('gs://...') raised at lookup time.

This change makes GCSFileSystem behave like S3FileSystem:

  • The gcsio import (which pulls in google-cloud-storage and related packages) is now lazy. GCSFileSystem still imports and registers for the gs:// scheme without the gcp extra, so get_filesystem('gs://...') returns it.
  • The gcp dependency is only required when the filesystem is actually used. At that point a clear ImportError is raised pointing at pip install apache-beam[gcp].
  • CHUNK_SIZE previously read gcsio.MAX_BATCH_OPERATION_SIZE at class-definition time. It is now resolved lazily via a class-level property, so it stays accessible both on the class and on instances (matching how S3FileSystem exposes it as a plain class attribute).
  • report_lineage now treats ImportError the same fail-safe way it already treated ValueError.
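
A minimal sketch of the lazy-import pattern described in the bullets above, not the actual Beam code. The class `LazyGCSFileSystem`, the registry `_REGISTRY`, the function `get_filesystem`, and the module name `hypothetical_gcs_backend_not_installed` are all hypothetical stand-ins; only the `pip install apache-beam[gcp]` hint comes from this PR.

```python
import importlib


class LazyGCSFileSystem:
  """Hypothetical stand-in for GCSFileSystem; not Beam's real class."""

  # Assumed module name for illustration only; it is deliberately not
  # installed, which simulates running without the gcp extra.
  _BACKEND_MODULE = 'hypothetical_gcs_backend_not_installed'

  @classmethod
  def scheme(cls):
    return 'gs'

  @classmethod
  def _get_backend_module(cls):
    # The import happens on first use, not when this file is imported,
    # so the class can always be defined and registered.
    try:
      return importlib.import_module(cls._BACKEND_MODULE)
    except ImportError as e:
      raise ImportError(
          'GCS dependencies are missing. '
          'Try: pip install apache-beam[gcp]') from e

  def open(self, path):
    # Any real operation goes through the helper and fails clearly here.
    return self._get_backend_module().open(path)


# Hypothetical scheme registry: registration needs only the class itself.
_REGISTRY = {LazyGCSFileSystem.scheme(): LazyGCSFileSystem}


def get_filesystem(path):
  """Returns a filesystem instance for the scheme of `path`."""
  scheme = path.split('://', 1)[0]
  return _REGISTRY[scheme]()
```

With this shape, `get_filesystem('gs://bucket/object')` succeeds without the optional dependency, and the `ImportError` only surfaces when `open()` is called.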

Added a regression test that runs in a subprocess with the gcsio import blocked, confirming get_filesystem('gs://...') returns GCSFileSystem and that using it then raises a clear ImportError.
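
For reviewers unfamiliar with the technique, here is a rough, self-contained sketch of how an import can be blocked inside a subprocess. It is not the test added in this PR: the stdlib module `csv` stands in for `gcsio`, and `FakeFileSystem`, `CHILD_SCRIPT`, and `run_child` are hypothetical names. It relies on the documented behaviour that a `None` entry in `sys.modules` makes a later import of that name raise `ImportError`.

```python
import subprocess
import sys

# Script executed in a fresh interpreter so the blocked import cannot leak
# into (or be masked by) modules already loaded in the parent process.
CHILD_SCRIPT = '''
import sys

# A None entry in sys.modules makes "import csv" raise ImportError.
# Here csv is only a stand-in for the optional dependency.
sys.modules["csv"] = None


class FakeFileSystem:
  def use(self):
    try:
      import csv  # deferred import, attempted only on use
    except ImportError as e:
      raise ImportError("install apache-beam[gcp]") from e


fs = FakeFileSystem()  # "lookup" succeeds without the dependency
print("lookup ok")
try:
  fs.use()
except ImportError as e:
  print("ImportError:", e)
'''


def run_child():
  """Runs CHILD_SCRIPT in a subprocess and returns its stdout lines."""
  result = subprocess.run(
      [sys.executable, '-c', CHILD_SCRIPT],
      capture_output=True,
      text=True,
      check=True)
  return result.stdout.splitlines()
```

Running the check in a subprocess keeps the `sys.modules` manipulation from affecting other tests in the same process.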



@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where GCSFileSystem required the gcp extra dependencies to be present at import time, preventing it from being looked up via FileSystems.get_filesystem() in environments where those dependencies were not installed. By refactoring the module-level imports to be lazy and deferring validation until the filesystem is accessed, the change aligns the behavior of GCSFileSystem with S3FileSystem and improves flexibility for users who do not require GCP functionality.

Highlights

  • Lazy Loading of GCS Dependencies: Implemented lazy loading for the gcsio module in GCSFileSystem to allow the filesystem to be registered without requiring the gcp extra dependencies to be installed at import time.
  • Deferred Dependency Validation: Added a helper method _get_gcsio_module to ensure that ImportError is only raised when the filesystem is actually used, matching the behavior of S3FileSystem.
  • Regression Testing: Added a new test case in GCSFileSystemTest that simulates the absence of GCP dependencies using a subprocess to verify that GCSFileSystem remains functional for lookup even when the underlying libraries are missing.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request makes the import of gcsio lazy in GCSFileSystem so that the filesystem can be registered and looked up without requiring the gcp extra dependencies. Feedback points out that changing CHUNK_SIZE from a class attribute to an instance property is a breaking change for any external code or subclasses that access GCSFileSystem.CHUNK_SIZE directly on the class, and suggests using a custom class property descriptor to preserve class-level access.

Comment on lines +76 to +79
@property
def CHUNK_SIZE(self):
  """Chunk size in batch operations."""
  return self._get_gcsio_module().MAX_BATCH_OPERATION_SIZE
medium

Changing CHUNK_SIZE from a class attribute to an instance @property is a breaking change for any external code or subclasses that access GCSFileSystem.CHUNK_SIZE directly on the class (which is a common pattern for capitalized constants).

If class-level access needs to be preserved, you can implement a simple class property descriptor to support both class and instance-level access:

class classproperty(object):
  def __init__(self, fget):
    self.fget = fget
  def __get__(self, instance, owner):
    return self.fget(owner)

And then decorate CHUNK_SIZE with @classproperty (using cls instead of self).
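
To make the difference concrete, the following is an illustrative sketch under stated assumptions, not the code in this PR. `_FakeBackend`, `WithPlainProperty`, and `FileSystemSketch` are hypothetical, and the value `100` is a placeholder rather than Beam's actual `MAX_BATCH_OPERATION_SIZE`.

```python
class classproperty:
  """Read-only descriptor that computes its value from the owning class."""

  def __init__(self, fget):
    self.fget = fget

  def __get__(self, instance, owner):
    # `owner` is the class for both Class.attr and instance.attr access.
    return self.fget(owner)


class _FakeBackend:
  # Placeholder value for illustration; not taken from gcsio.
  MAX_BATCH_OPERATION_SIZE = 100


class WithPlainProperty:
  @property
  def CHUNK_SIZE(self):
    return _FakeBackend.MAX_BATCH_OPERATION_SIZE


class FileSystemSketch:
  @classmethod
  def _get_backend(cls):
    # Stand-in for a lazy module lookup such as _get_gcsio_module().
    return _FakeBackend

  @classproperty
  def CHUNK_SIZE(cls):
    return cls._get_backend().MAX_BATCH_OPERATION_SIZE
```

Here `WithPlainProperty.CHUNK_SIZE` evaluates to the `property` object itself rather than a number, whereas `FileSystemSketch.CHUNK_SIZE` and `FileSystemSketch().CHUNK_SIZE` both resolve to the backend's value.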

@wilmerdooley (author) replied:

Thanks, good catch. Fixed in the latest push: CHUNK_SIZE is now exposed via a small class-property descriptor (the pattern you suggested), so GCSFileSystem.CHUNK_SIZE resolves at both the class and instance level, matching S3FileSystem's class attribute, while staying lazy. I also added a test (test_chunk_size_on_class_and_instance) covering both the class and instance access paths.

@wilmerdooley force-pushed the gcs-lazy-filesystem-lookup branch from d5858e6 to a050b13 on May 30, 2026, 15:13
FileSystems.get_filesystem("gs://...") raised immediately when the gcp
extra was not installed, because gcsfilesystem.py imported gcsio (and its
google-cloud-storage dependency) at module load time. When that import
failed, GCSFileSystem was never registered, unlike S3FileSystem whose
s3io imports boto3 lazily.

Import gcsio lazily so GCSFileSystem can still be looked up without the
gcp extra, deferring the dependency error to usage time (matching S3). A
single _get_gcsio_module() helper raises a clear ImportError when the
module is unavailable; CHUNK_SIZE, _gcsIO and report_lineage go through
it. Add a regression test that simulates the missing extra in a
subprocess.

Fixes apache#37445

Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>
@github-actions

Assigning reviewers:

R: @shunping for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


try:
  from apache_beam.io.gcp import gcsio
except ImportError:
There have been several proposed fixes, including this one, that rely on unnecessarily complex workarounds. I have proposed a proper fix in #37445 (comment).
