Summary
Propose adding a manifest-only (YAML) configuration DSL for file-based source connectors, analogous to the existing declarative CDK for HTTP API connectors. This would allow file-based connectors (Google Drive, S3, Azure Blob Storage, GCS, SFTP, OneDrive, SharePoint) to be defined entirely in YAML without custom Python code.
Motivation
Today, all 7 file-based connectors are Python connectors because the file-based source framework (FileBasedSource / AbstractFileBasedStreamReader) has no declarative equivalent. Each connector must implement Python classes for:
- Stream reader (AbstractFileBasedStreamReader): authentication, file listing, file reading, file upload/download
- Source (FileBasedSource): wiring up the stream reader, spec class, and optional permissions reader
- Spec (AbstractFileBasedSpec): connector-specific config (auth credentials, folder/bucket paths, delivery method)
- Permissions reader (optional, AbstractFileBasedStreamPermissionsReader): ACL/identity loading
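To make the per-connector boilerplate concrete, here is a minimal sketch of the stream reader contract and a toy implementation. The classes below are simplified stand-ins written for this issue, not the real CDK classes: the actual AbstractFileBasedStreamReader has richer signatures, and RemoteFile, StreamReaderSketch, and InMemoryReader are hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from fnmatch import fnmatch
from io import BytesIO
from typing import IO, Dict, Iterable, List, Optional


@dataclass
class RemoteFile:
    """Hypothetical, simplified stand-in for the CDK's remote file model."""
    uri: str


class StreamReaderSketch(ABC):
    """Simplified stand-in for AbstractFileBasedStreamReader (assumption, not the CDK API)."""

    def __init__(self) -> None:
        self._config: Optional[dict] = None

    @property
    def config(self) -> Optional[dict]:
        return self._config

    @config.setter
    @abstractmethod
    def config(self, value: dict) -> None:
        """Validate and store connector-specific configuration."""

    @abstractmethod
    def get_matching_files(self, globs: List[str]) -> Iterable[RemoteFile]:
        """List remote files whose URI matches any of the globs."""

    @abstractmethod
    def open_file(self, file: RemoteFile) -> IO[bytes]:
        """Return a readable binary handle for the given file."""


class InMemoryReader(StreamReaderSketch):
    """Toy reader over a dict of uri -> bytes, standing in for a storage SDK."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        super().__init__()
        self._files = files

    @property
    def config(self) -> Optional[dict]:
        return self._config

    @config.setter
    def config(self, value: dict) -> None:
        if not isinstance(value, dict):
            raise TypeError("config must be a dict")
        self._config = value

    def get_matching_files(self, globs: List[str]) -> Iterable[RemoteFile]:
        for uri in self._files:
            if any(fnmatch(uri, pattern) for pattern in globs):
                yield RemoteFile(uri=uri)

    def open_file(self, file: RemoteFile) -> IO[bytes]:
        return BytesIO(self._files[file.uri])
```

Every file-based connector repeats this shape with a different SDK behind it, which is the part a declarative layer would aim to absorb.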
By contrast, the HTTP declarative CDK has enabled hundreds of API connectors to be manifest-only YAML.
Sonar/Coral context
Of the 21 Sonar agent connectors, source-google-drive is the only file-based connector with a corresponding Airbyte replication source. Being able to define file-based connectors declaratively would reduce the maintenance burden and make it easier to add new file storage integrations.
First target: Google Drive
source-google-drive is the proposed first case because it exercises all major file-based abstractions:
Google Drive Python modules and their responsibilities
source.py — SourceGoogleDrive(FileBasedSource)
- Wires up SourceGoogleDriveStreamReader, SourceGoogleDriveSpec, and SourceGoogleDriveStreamPermissionsReader
- Defines OAuth AdvancedAuth spec (consent URL, token URL, scopes, output mappings)
stream_reader.py — SourceGoogleDriveStreamReader(AbstractFileBasedStreamReader)
- Authentication: OAuth2 or Service Account credentials → google.oauth2.credentials / service_account
- File listing: files().list() with recursive folder traversal, pagination (1000 per page), shared drive support, glob matching
- File reading: files().get_media() for regular files, files().export_media() for Google Docs/Sheets/Presentations/Drawings (with MIME type conversion)
- File upload/download: MediaIoBaseDownload chunked downloads with progress tracking and a 1.5 GB size limit
- File size: files().get() metadata retrieval
- Custom GoogleDriveRemoteFile model with id, original_mime_type, view_link, drive_id, created_at
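The file listing behavior is the piece a DSL would most need to capture, so a minimal sketch of it may help. It pages through files().list() and descends into subfolders. The function name iter_files and the fake service classes are hypothetical and exist only so the sketch runs without credentials; the keyword arguments are Drive v3 files.list parameters, but this is not the connector's actual code.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def iter_files(service: Any, folder_id: str) -> Iterator[Dict[str, Any]]:
    """Yield non-folder files under folder_id, descending into subfolders."""
    pending = [folder_id]
    seen = set()
    while pending:
        current = pending.pop()
        if current in seen:  # guard against visiting a folder twice
            continue
        seen.add(current)
        page_token: Optional[str] = None
        while True:
            response = service.files().list(
                q=f"'{current}' in parents and trashed = false",
                pageSize=1000,
                pageToken=page_token,
                fields="nextPageToken, files(id, name, mimeType)",
                supportsAllDrives=True,
                includeItemsFromAllDrives=True,
            ).execute()
            for item in response.get("files", []):
                if item["mimeType"] == FOLDER_MIME_TYPE:
                    pending.append(item["id"])
                else:
                    yield item
            page_token = response.get("nextPageToken")
            if not page_token:
                break


# --- Hypothetical in-memory fake of the Drive service, for local experimentation ---
class FakeRequest:
    def __init__(self, payload: Dict[str, Any]) -> None:
        self._payload = payload

    def execute(self) -> Dict[str, Any]:
        return self._payload


class FakeFiles:
    def __init__(self, pages: Dict[Tuple[str, Optional[str]], Dict[str, Any]]) -> None:
        self._pages = pages

    def list(self, q: str, pageToken: Optional[str] = None, **_: Any) -> FakeRequest:
        parent_id = q.split("'")[1]  # the fake only understands "'<id>' in parents ..."
        return FakeRequest(self._pages[(parent_id, pageToken)])


class FakeService:
    def __init__(self, pages: Dict[Tuple[str, Optional[str]], Dict[str, Any]]) -> None:
        self._files = FakeFiles(pages)

    def files(self) -> FakeFiles:
        return self._files
```

With google-api-python-client, the service argument would typically come from build("drive", "v3", credentials=...) rather than the fake shown here.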
stream_permissions_reader.py — SourceGoogleDriveStreamPermissionsReader(AbstractFileBasedStreamPermissionsReader)
- File permissions: permissions().list() for per-file ACLs, public access detection
- Identity groups: Google Admin Directory API (users().list(), groups().list(), members().list())
- Separate Google service client for Admin API with specific scopes
spec.py — SourceGoogleDriveSpec(AbstractFileBasedSpec)
- folder_url config field with URL pattern validation
- Two auth modes: OAuth (OAuthCredentials) and Service Account (ServiceAccountCredentials)
- Three delivery methods: DeliverRecords, DeliverRawFiles, DeliverPermissions
- Schema customization (removes legacy fields, hides API processing option)
utils.py
- get_folder_id(): URL parsing to extract the folder ID from a Google Drive URL
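For reference, URL-to-ID parsing of this kind could look roughly like the sketch below. It assumes folder URLs contain a /folders/<id> path segment; extract_folder_id is a hypothetical name and this is not the connector's actual get_folder_id() implementation.

```python
import re

# Assumption: the folder ID is the path segment that follows "/folders/".
_FOLDER_ID_PATTERN = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def extract_folder_id(url: str) -> str:
    """Return the folder ID embedded in a Google Drive folder URL."""
    match = _FOLDER_ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"Could not extract a folder ID from URL: {url!r}")
    return match.group(1)
```

In a declarative design, this step is what the `# -> parsed to folder ID` note on root_path in the strawman would have to express, either as a built-in transformation or as a named helper.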
exceptions.py
- ErrorFetchingMetadata, ErrorDownloadingFile: custom error types extending BaseFileBasedSourceError
Proposed YAML DSL shape (strawman)
    version: "1.0.0"
    type: FileBasedSource

    spec:
      type: GoogleDriveSpec
      folder_url:
        type: string
        pattern: "^https://drive.google.com/.+"
      credentials:
        type: oneOf
        options:
          - type: OAuthCredentials
            fields: [client_id, client_secret, refresh_token]
          - type: ServiceAccountCredentials
            fields: [service_account_info]

    stream_reader:
      type: GoogleDriveStreamReader
      # Or, if we want to be more generic:
      # type: SDKBasedStreamReader
      # sdk: google-drive
      authentication:
        type: oneOf
        options:
          - type: oauth2
            credentials_path: "$.credentials"
          - type: service_account
            credentials_path: "$.credentials"
      file_listing:
        api: "drive.files.list"
        root_path: "$.folder_url"  # -> parsed to folder ID
        recursive: true
        page_size: 1000
        shared_drives: true
      file_reading:
        default: "drive.files.get_media"
        exportable_types:
          - mime_type: "application/vnd.google-apps.document"
            export_as: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
          # ...

This is intentionally a rough sketch. The actual design would need to balance generality (supporting S3, Azure, GCS, SFTP, etc.) with specificity (Google Drive API quirks like exportable documents).
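As one possible reading of the file_reading section, a runtime could resolve which API call to use per file from the declared exportable_types. The sketch below is an assumption about how such a manifest might be interpreted: resolve_read_call is a hypothetical helper, and the "drive.files.export_media" string simply follows the strawman's naming style for the export call.

```python
from typing import Any, Dict, Optional, Tuple

# The file_reading section of the strawman, as it would look once parsed into a dict.
FILE_READING: Dict[str, Any] = {
    "default": "drive.files.get_media",
    "exportable_types": [
        {
            "mime_type": "application/vnd.google-apps.document",
            "export_as": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        },
    ],
}


def resolve_read_call(file_reading: Dict[str, Any], mime_type: str) -> Tuple[str, Optional[str]]:
    """Return (api_call, export_mime_type) for a file with the given MIME type.

    Files whose MIME type is listed under exportable_types are exported to the
    declared target type; everything else falls back to the default call.
    """
    for entry in file_reading.get("exportable_types", []):
        if entry["mime_type"] == mime_type:
            return "drive.files.export_media", entry["export_as"]
    return file_reading["default"], None
```

Keeping the export mapping as data rather than code is what would let other storage backends omit the section entirely.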
Key design questions
- Scope of abstraction: Should the DSL abstract over the storage SDK entirely (like HttpRequester does for HTTP), or should it be a thinner layer that references pre-built stream reader implementations?
- Authentication: File storage systems use diverse auth mechanisms (OAuth, service accounts, IAM roles, connection strings, SSH keys). How much of this can be declaratively configured?
- File type handling: Google Drive has unique export semantics for Google Docs/Sheets/Presentations. How should storage-specific file handling be expressed?
- Permissions: Some connectors (Google Drive, SharePoint) support permissions/identity streams. Should this be part of the DSL?
- Incremental approach: Should we start with a minimal DSL that covers the common case (list + read files with simple auth) and extend over time?
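To illustrate the "thinner layer" option from the first question, a manifest's stream_reader.type could be resolved against a registry of pre-built reader implementations. This is only a sketch under that assumption: register_reader, build_reader, and GoogleDriveReaderStub are hypothetical names and nothing here exists in the CDK today.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping manifest "type" strings to reader classes.
_READER_REGISTRY: Dict[str, type] = {}


def register_reader(type_name: str) -> Callable[[type], type]:
    """Class decorator that makes a reader available under a manifest type name."""
    def decorator(cls: type) -> type:
        _READER_REGISTRY[type_name] = cls
        return cls
    return decorator


@register_reader("GoogleDriveStreamReader")
class GoogleDriveReaderStub:
    """Placeholder for a pre-built reader; a real one would wrap the Drive SDK."""

    def __init__(self, options: Dict[str, Any]) -> None:
        self.options = options


def build_reader(stream_reader_manifest: Dict[str, Any]) -> Any:
    """Instantiate the reader named by the manifest, passing the remaining keys as options."""
    options = dict(stream_reader_manifest)  # copy so the caller's manifest is not mutated
    type_name = options.pop("type")
    try:
        reader_cls = _READER_REGISTRY[type_name]
    except KeyError:
        raise ValueError(f"Unknown stream reader type: {type_name!r}") from None
    return reader_cls(options)
```

This approach keeps SDK-specific logic in Python while moving configuration into YAML, at the cost of needing one pre-built reader per storage system.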
Affected file-based connectors (all currently Python)
| Connector | Key complexity |
|---|---|
| source-google-drive | OAuth + Service Account, exportable docs, permissions, shared drives |
| source-s3 | IAM auth, bucket listing, multiple file formats |
| source-azure-blob-storage | Connection string / SAS auth, container listing |
| source-gcs | Service account auth, bucket listing |
| source-sftp-bulk | SSH key / password auth, directory traversal |
| source-microsoft-onedrive | OAuth, Graph API file listing |
| source-microsoft-sharepoint | OAuth, Graph API, site/drive discovery, permissions |
Related issues
- #714 — Survey of Manifest-Only Connectors Using Custom Components (Feature Gaps) — surveys API connector feature gaps; file-based connectors are a separate category not yet addressed
- #713 — Custom components use case analysis: HTTP Requests for Configuration Determination
- #837 — Declarative: GitHub App authentication, #838 — Declarative: Pattern-based partition routing, #835 — Declarative: Multi-token authenticator — related declarative CDK feature gaps for API connectors
CDK classes and modules involved
- airbyte_cdk.sources.file_based.file_based_source.FileBasedSource: base class for all file-based sources
- airbyte_cdk.sources.file_based.file_based_stream_reader.AbstractFileBasedStreamReader: abstract stream reader (3 abstract methods: config setter, open_file, get_matching_files)
- airbyte_cdk.sources.file_based.file_based_stream_permissions_reader.AbstractFileBasedStreamPermissionsReader: abstract permissions reader
- airbyte_cdk.sources.file_based.config.abstract_file_based_spec.AbstractFileBasedSpec: abstract config spec
Requested by Aaron ("AJ") Steers (@aaronsteers)