GCSFileSystem requires gcp extra at lookup time while S3FileSystem does not#38751
GCSFileSystem requires gcp extra at lookup time while S3FileSystem does not#38751wilmerdooley wants to merge 1 commit into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request makes the import of gcsio lazy in GCSFileSystem so that the filesystem can be registered and looked up without requiring the gcp extra dependencies. Feedback points out that changing CHUNK_SIZE from a class attribute to an instance property is a breaking change for any external code or subclasses that access GCSFileSystem.CHUNK_SIZE directly on the class, and suggests using a custom class property descriptor to preserve class-level access.
| @property | ||
| def CHUNK_SIZE(self): | ||
| """Chunk size in batch operations.""" | ||
| return self._get_gcsio_module().MAX_BATCH_OPERATION_SIZE |
There was a problem hiding this comment.
Changing CHUNK_SIZE from a class attribute to an instance @property is a breaking change for any external code or subclasses that access GCSFileSystem.CHUNK_SIZE directly on the class (which is a common pattern for capitalized constants).
If class-level access needs to be preserved, you can implement a simple class property descriptor to support both class and instance-level access:
class classproperty(object):
def __init__(self, fget):
self.fget = fget
def __get__(self, instance, owner):
return self.fget(owner)And then decorate CHUNK_SIZE with @classproperty (using cls instead of self).
There was a problem hiding this comment.
Thanks, good catch. Fixed in the latest push: CHUNK_SIZE is now exposed via a small class-property descriptor (the pattern you suggested), so GCSFileSystem.CHUNK_SIZE resolves at both the class and instance level, matching S3FileSystem's class attribute, while staying lazy. I also added a test (test_chunk_size_on_class_and_instance) covering both the class and instance access paths.
d5858e6 to
a050b13
Compare
FileSystems.get_filesystem("gs://...") raised immediately when the gcp
extra was not installed, because gcsfilesystem.py imported gcsio (and its
google-cloud-storage dependency) at module load time. When that import
failed, GCSFileSystem was never registered, unlike S3FileSystem whose
s3io imports boto3 lazily.
Import gcsio lazily so GCSFileSystem can still be looked up without the
gcp extra, deferring the dependency error to usage time (matching S3). A
single _get_gcsio_module() helper raises a clear ImportError when the
module is unavailable; CHUNK_SIZE, _gcsIO and report_lineage go through
it. Add a regression test that simulates the missing extra in a
subprocess.
Fixes apache#37445
Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>
a050b13 to
7bc5927
Compare
|
Assigning reviewers: R: @shunping for label python. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
|
||
| try: | ||
| from apache_beam.io.gcp import gcsio | ||
| except ImportError: |
There was a problem hiding this comment.
There have been several proposed fixes, including this one, with unncessarily complex workarounds. I have proposed a proper fix #37445 (comment)
Resolves #37445.
FileSystems.get_filesystem()handled missing optional dependencies inconsistently between GCS and S3.S3FileSystemis returned even without theawsextra installed, deferring the dependency error until the filesystem is actually used.GCSFileSystem, by contrast, failed to import without thegcpextra, so it never registered for thegs://scheme andget_filesystem('gs://...')raised at lookup time.This change makes
GCSFileSystembehave likeS3FileSystem:gcsioimport (which pulls ingoogle-cloud-storageand related packages) is now lazy.GCSFileSystemstill imports and registers for thegs://scheme without thegcpextra, soget_filesystem('gs://...')returns it.gcpdependency is only required when the filesystem is actually used. At that point a clearImportErroris raised pointing atpip install apache-beam[gcp].CHUNK_SIZEpreviously readgcsio.MAX_BATCH_OPERATION_SIZEat class-definition time. It is now resolved lazily via a class-level property, so it stays accessible both on the class and on instances (matching howS3FileSystemexposes it as a plain class attribute).report_lineagenow treatsImportErrorthe same fail-safe way it already treatedValueError.Added a regression test that runs in a subprocess with the
gcsioimport blocked, confirmingget_filesystem('gs://...')returnsGCSFileSystemand that using it then raises a clearImportError.addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.