Proposal: Manifest-only (YAML) configuration DSL for file-based connectors #901

@devin-ai-integration

Summary

Propose adding a manifest-only (YAML) configuration DSL for file-based source connectors, analogous to the existing declarative CDK for HTTP API connectors. This would allow file-based connectors (Google Drive, S3, Azure Blob Storage, GCS, SFTP, OneDrive, SharePoint) to be defined entirely in YAML without custom Python code.

Motivation

Today, all 7 file-based connectors are Python connectors because the file-based source framework (FileBasedSource / AbstractFileBasedStreamReader) has no declarative equivalent. Each connector must implement Python classes for:

  1. Stream reader (AbstractFileBasedStreamReader) — authentication, file listing, file reading, file upload/download
  2. Source (FileBasedSource) — wiring up the stream reader, spec class, and optional permissions reader
  3. Spec (AbstractFileBasedSpec) — connector-specific config (auth credentials, folder/bucket paths, delivery method)
  4. Permissions reader (optional, AbstractFileBasedStreamPermissionsReader) — ACL/identity loading
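
To make the per-connector surface concrete, here is a minimal sketch of the stream reader contract using simplified stand-in classes. StreamReaderSketch, RemoteFile, and InMemoryStreamReader are hypothetical names for illustration; they are not the airbyte_cdk classes, and the real signatures take more arguments.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from fnmatch import fnmatch
from io import BytesIO
from typing import IO, Dict, Iterable, List, Optional


@dataclass
class RemoteFile:
    """Hypothetical, simplified stand-in for the CDK's remote file model."""
    uri: str


class StreamReaderSketch(ABC):
    """Stand-in for the abstract stream reader: config setter, listing, opening."""

    def __init__(self) -> None:
        self._config: Optional[dict] = None

    @property
    def config(self) -> Optional[dict]:
        return self._config

    @config.setter
    @abstractmethod
    def config(self, value: dict) -> None:
        ...

    @abstractmethod
    def get_matching_files(self, globs: List[str]) -> Iterable[RemoteFile]:
        ...

    @abstractmethod
    def open_file(self, file: RemoteFile) -> IO[bytes]:
        ...


class InMemoryStreamReader(StreamReaderSketch):
    """Toy implementation backed by a dict instead of a storage SDK."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        super().__init__()
        self._files = files

    @property
    def config(self) -> Optional[dict]:
        return self._config

    @config.setter
    def config(self, value: dict) -> None:
        self._config = value

    def get_matching_files(self, globs: List[str]) -> Iterable[RemoteFile]:
        # Yield files whose URI matches at least one glob, in a stable order.
        for uri in sorted(self._files):
            if any(fnmatch(uri, pattern) for pattern in globs):
                yield RemoteFile(uri=uri)

    def open_file(self, file: RemoteFile) -> IO[bytes]:
        return BytesIO(self._files[file.uri])
```

Every file-based connector repeats this shape today with a real SDK behind it, which is the duplication a declarative layer would aim to absorb.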

By contrast, the HTTP declarative CDK has enabled hundreds of API connectors to be manifest-only YAML.

Sonar/Coral context

Of the 21 Sonar agent connectors, source-google-drive is the only file-based connector with a corresponding Airbyte replication source. Being able to define file-based connectors declaratively would reduce the maintenance burden and make it easier to add new file storage integrations.

First target: Google Drive

source-google-drive is the proposed first case because it exercises all major file-based abstractions:

Google Drive Python modules and their responsibilities

source.py: SourceGoogleDrive(FileBasedSource)

  • Wires up SourceGoogleDriveStreamReader, SourceGoogleDriveSpec, and SourceGoogleDriveStreamPermissionsReader
  • Defines OAuth AdvancedAuth spec (consent URL, token URL, scopes, output mappings)

stream_reader.py: SourceGoogleDriveStreamReader(AbstractFileBasedStreamReader)

  • Authentication: OAuth2 or Service Account credentials → google.oauth2.credentials / service_account
  • File listing: files().list() with recursive folder traversal, pagination (1000 per page), shared drive support, glob matching
  • File reading: files().get_media() for regular files, files().export_media() for Google Docs/Sheets/Presentations/Drawings (with MIME type conversion)
  • File upload/download: MediaIoBaseDownload chunked downloads with progress tracking and 1.5GB size limit
  • File size: files().get() metadata retrieval
  • Custom GoogleDriveRemoteFile model with id, original_mime_type, view_link, drive_id, created_at
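
The file listing responsibility above (recursive traversal plus pagination) can be sketched independently of the Google client. In this sketch, list_page is a hypothetical callable standing in for one files().list() request; the response is assumed to carry a "files" list and an optional "nextPageToken", and make_fake_list_page is a test double, not part of any SDK.

```python
from typing import Callable, Dict, Iterator, List, Optional

FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"

# (folder_id, page_token) -> one page of results (hypothetical adapter signature)
ListPage = Callable[[str, Optional[str]], dict]


def iter_files(list_page: ListPage, root_folder_id: str) -> Iterator[dict]:
    """Yield every non-folder item under root_folder_id, following all pages."""
    pending = [root_folder_id]
    seen = set()
    while pending:
        folder_id = pending.pop()
        if folder_id in seen:  # guard against visiting a folder twice
            continue
        seen.add(folder_id)
        page_token: Optional[str] = None
        while True:
            page = list_page(folder_id, page_token)
            for item in page.get("files", []):
                if item["mimeType"] == FOLDER_MIME_TYPE:
                    pending.append(item["id"])  # traverse subfolders later
                else:
                    yield item
            page_token = page.get("nextPageToken")
            if not page_token:
                break


def make_fake_list_page(tree: Dict[str, List[dict]], page_size: int = 2) -> ListPage:
    """Build an in-memory list_page for trying the traversal without network access."""

    def list_page(folder_id: str, page_token: Optional[str]) -> dict:
        start = int(page_token or 0)
        children = tree.get(folder_id, [])
        page = {"files": children[start:start + page_size]}
        if start + page_size < len(children):
            page["nextPageToken"] = str(start + page_size)
        return page

    return list_page
```

A declarative file_listing section would mostly need to supply the inputs to a loop like this one (root folder, page size, whether to recurse), while the traversal itself could live in the CDK.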

stream_permissions_reader.py: SourceGoogleDriveStreamPermissionsReader(AbstractFileBasedStreamPermissionsReader)

  • File permissions: permissions().list() for per-file ACLs, public access detection
  • Identity groups: Google Admin Directory API (users().list(), groups().list(), members().list())
  • Separate Google service client for Admin API with specific scopes

spec.py: SourceGoogleDriveSpec(AbstractFileBasedSpec)

  • folder_url config field with URL pattern validation
  • Two auth modes: OAuth (OAuthCredentials) and Service Account (ServiceAccountCredentials)
  • Three delivery methods: DeliverRecords, DeliverRawFiles, DeliverPermissions
  • Schema customization (removes legacy fields, hides API processing option)

utils.py

  • get_folder_id() — URL parsing to extract folder ID from Google Drive URL
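
As an illustration of the kind of logic that a root_path: "$.folder_url" mapping would have to cover, here is a hedged sketch of folder ID extraction. get_folder_id_sketch is a hypothetical re-implementation, not the connector's actual get_folder_id(); the assumed URL shape is a drive.google.com path containing /folders/<id>.

```python
import re
from urllib.parse import urlparse

# Assumption: folder IDs consist of letters, digits, underscores, and hyphens.
_FOLDER_PATH = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def get_folder_id_sketch(url: str) -> str:
    """Extract the folder ID from a Google Drive folder URL, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "drive.google.com":
        raise ValueError(f"Not a Google Drive URL: {url!r}")
    match = _FOLDER_PATH.search(parsed.path)
    if match is None:
        raise ValueError(f"No folder ID found in URL: {url!r}")
    return match.group(1)
```

Query strings such as ?usp=sharing are ignored because only the parsed path is searched.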

exceptions.py

  • ErrorFetchingMetadata, ErrorDownloadingFile — custom error types extending BaseFileBasedSourceError

Proposed YAML DSL shape (strawman)

version: "1.0.0"
type: FileBasedSource

spec:
  type: GoogleDriveSpec
  folder_url:
    type: string
    pattern: "^https://drive.google.com/.+"
  credentials:
    type: oneOf
    options:
      - type: OAuthCredentials
        fields: [client_id, client_secret, refresh_token]
      - type: ServiceAccountCredentials
        fields: [service_account_info]

stream_reader:
  # Option A: reference a pre-built reader implementation:
  #   type: GoogleDriveStreamReader
  # Option B: a more generic, SDK-driven reader (assumed for the keys below):
  type: SDKBasedStreamReader
  sdk: google-drive
  authentication:
    type: oneOf
    options:
      - type: oauth2
        credentials_path: "$.credentials"
      - type: service_account
        credentials_path: "$.credentials"
  file_listing:
    api: "drive.files.list"
    root_path: "$.folder_url"  # -> parsed to folder ID
    recursive: true
    page_size: 1000
    shared_drives: true
  file_reading:
    default: "drive.files.get_media"
    exportable_types:
      - mime_type: "application/vnd.google-apps.document"
        export_as: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      # ...

This is intentionally a rough sketch. The actual design would need to balance generality (supporting S3, Azure, GCS, SFTP, etc.) with specificity (Google Drive API quirks like exportable documents).
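
One way a runtime might interpret the file_reading section is sketched below: look up the file's MIME type in exportable_types and fall back to the default call otherwise. ReadPlan, plan_read, and the "drive.files.export_media" identifier are assumptions made for illustration; only the default value and the document mapping come from the strawman above.

```python
from dataclasses import dataclass
from typing import Optional

# Parsed form of the strawman's file_reading section (only the entry shown above).
FILE_READING = {
    "default": "drive.files.get_media",
    "exportable_types": [
        {
            "mime_type": "application/vnd.google-apps.document",
            "export_as": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        },
    ],
}

# Assumed identifier for the export call; the strawman does not name it.
EXPORT_API = "drive.files.export_media"


@dataclass(frozen=True)
class ReadPlan:
    """Which API call to use for a file and, for exports, the target MIME type."""
    api: str
    export_mime_type: Optional[str] = None


def plan_read(file_reading: dict, mime_type: str) -> ReadPlan:
    """Pick the export call for listed MIME types, the default call otherwise."""
    for entry in file_reading.get("exportable_types", []):
        if entry["mime_type"] == mime_type:
            return ReadPlan(api=EXPORT_API, export_mime_type=entry["export_as"])
    return ReadPlan(api=file_reading["default"])
```

Keeping the mapping as data means a storage system without export semantics can simply omit exportable_types.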

Key design questions

  1. Scope of abstraction: Should the DSL abstract over the storage SDK entirely (like HttpRequester does for HTTP), or should it be a thinner layer that references pre-built stream reader implementations?
  2. Authentication: File storage systems use diverse auth mechanisms (OAuth, service accounts, IAM roles, connection strings, SSH keys). How much of this can be declaratively configured?
  3. File type handling: Google Drive has unique export semantics for Google Docs/Sheets/Presentations. How should storage-specific file handling be expressed?
  4. Permissions: Some connectors (Google Drive, SharePoint) support permissions/identity streams. Should this be part of the DSL?
  5. Incremental approach: Should we start with a minimal DSL that covers the common case (list + read files with simple auth) and extend over time?
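
For question 1, the "thinner layer" option could look roughly like the following: the manifest's type field selects a pre-built reader from a registry, and the remaining keys are passed through as options. READER_REGISTRY, register_reader, build_stream_reader, and GoogleDriveReaderStub are all hypothetical names; nothing like this exists in the CDK today as far as this proposal states.

```python
from typing import Callable, Dict

# Maps a manifest `type` string to a factory that accepts the remaining options.
READER_REGISTRY: Dict[str, Callable[[dict], object]] = {}


def register_reader(type_name: str):
    """Class decorator that registers a reader under a manifest type name."""

    def decorator(factory):
        READER_REGISTRY[type_name] = factory
        return factory

    return decorator


@register_reader("GoogleDriveStreamReader")
class GoogleDriveReaderStub:
    """Placeholder for a pre-built reader; a real one would wrap the Drive SDK."""

    def __init__(self, options: dict) -> None:
        self.options = options


def build_stream_reader(manifest: dict) -> object:
    """Instantiate the reader named by manifest['stream_reader']['type']."""
    section = dict(manifest["stream_reader"])  # copy so the manifest is not mutated
    type_name = section.pop("type")
    try:
        factory = READER_REGISTRY[type_name]
    except KeyError:
        raise ValueError(f"Unknown stream_reader type: {type_name!r}") from None
    return factory(section)
```

This keeps storage-specific code in Python while still letting a connector ship as a manifest, which also fits the incremental approach in question 5.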

Affected file-based connectors (all currently Python)

Connector                     Key complexity
source-google-drive           OAuth + Service Account, exportable docs, permissions, shared drives
source-s3                     IAM auth, bucket listing, multiple file formats
source-azure-blob-storage     Connection string / SAS auth, container listing
source-gcs                    Service account auth, bucket listing
source-sftp-bulk              SSH key / password auth, directory traversal
source-microsoft-onedrive     OAuth, Graph API file listing
source-microsoft-sharepoint   OAuth, Graph API, site/drive discovery, permissions

Related issues

CDK classes and modules involved

  • airbyte_cdk.sources.file_based.file_based_source.FileBasedSource — base class for all file-based sources
  • airbyte_cdk.sources.file_based.file_based_stream_reader.AbstractFileBasedStreamReader — abstract stream reader (3 abstract methods: config setter, open_file, get_matching_files)
  • airbyte_cdk.sources.file_based.file_based_stream_permissions_reader.AbstractFileBasedStreamPermissionsReader — abstract permissions reader
  • airbyte_cdk.sources.file_based.config.abstract_file_based_spec.AbstractFileBasedSpec — abstract config spec

Requested by Aaron ("AJ") Steers (@aaronsteers)
