load_table consumes enormous amounts of memory on large metadata file #3162

@thomas-pfeiffer

Description

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Apologies, this is a bit of a fuzzy one right now, but I thought it worth reporting anyway.

Context:
We're using Iceberg with AWS Glue and AWS S3 as storage. In S3 there are roughly speaking 3 kinds of files (metadata, manifests, and data files). The first one that is read when loading a table via catalog.load_table() is the metadata file. The metadata file contains information on all current* snapshots and schema versions of the table. py-iceberg seems to load these completely into memory.

Issue:
As we worked with the Iceberg table, many snapshots were created over time, and with them many schema versions. This caused the latest metadata file to grow to ~10 MB gzip-compressed (~250 MB of uncompressed JSON). When we load this table via catalog.load_table(), it consumes ~4 GB of memory (total usage of the Python process, as measured with memray). That is a lot, especially since we only need the latest snapshot and its schema version, which is probably true for most users.

Semi-Workaround:
One could try to expire some snapshots, e.g. via Spark's expire_snapshots procedure [https://iceberg.apache.org/docs/1.10.0/spark-procedures/#expire_snapshots], but this will not remove the old/unused schemas unless you set clean_expired_metadata as well (which is only supported since 1.10.x, so relatively new).
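For reference, the workaround could look roughly like the following Spark SQL call (catalog and table names are placeholders; the clean_expired_metadata argument requires Iceberg 1.10+ as noted above):

```sql
-- Expire old snapshots AND drop the schemas/partition specs that
-- no remaining snapshot references (clean_expired_metadata).
CALL my_catalog.system.expire_snapshots(
  table => 'db.sample_table',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  clean_expired_metadata => true
);
```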

(Preliminary) Root-Cause:
I believe the issue is that we leverage Pydantic's model_validate_json in

    return TableMetadataWrapper.model_validate_json(data).root

which loads the whole JSON into memory, and we then seem to keep the full TableMetadata object around.

Suggestion:
Would it make sense to avoid parsing the entire JSON into memory and instead load the needed snapshots and schemas lazily / on demand? (It would also be fine if that were a configurable option of catalog.load_table().)
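A minimal sketch of one variant of this idea, pruning historical entries down to the current snapshot and schema before validation (not PyIceberg API, just stdlib json; note this still performs a full json.loads first, so it bounds the size of the retained object rather than the peak parse memory — a streaming parser would be needed for the latter):

```python
import json

def prune_metadata(raw: str) -> dict:
    """Keep only the current snapshot and current schema from a
    table-metadata JSON document, discarding historical entries.
    Illustration only: real metadata has more cross-references
    (snapshot-log, refs, ...) that would also need handling."""
    meta = json.loads(raw)
    current_snapshot = meta.get("current-snapshot-id")
    current_schema = meta.get("current-schema-id")
    meta["snapshots"] = [s for s in meta.get("snapshots", [])
                         if s.get("snapshot-id") == current_snapshot]
    meta["schemas"] = [s for s in meta.get("schemas", [])
                       if s.get("schema-id") == current_schema]
    return meta

raw = json.dumps({
    "current-snapshot-id": 3,
    "current-schema-id": 1,
    "snapshots": [{"snapshot-id": i} for i in range(4)],
    "schemas": [{"schema-id": i} for i in range(2)],
})
pruned = prune_metadata(raw)
print(len(pruned["snapshots"]), len(pruned["schemas"]))  # 1 1
```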

Remark:
Obviously we could blame this on an unmaintained Iceberg table, but I think it would be good for the PyIceberg library to be robust against such scenarios, which is why I opened this issue.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
