Description
Apache Iceberg version
0.11.0 (latest release)
Please describe the bug 🐞
Apologies, this is a bit of a fuzzy one right now, but I thought I'd report it anyway.
Context:
We're using Iceberg with AWS Glue and AWS S3 as storage. In S3 there are, roughly speaking, 3 kinds of files (metadata, manifests, and data files). The first one read when loading a table via catalog.load_table() is the metadata file. The metadata file contains information on all current snapshots and schema versions of the table. pyiceberg seems to load these completely into memory.
Issue:
As we worked on the Iceberg table, a lot of snapshots were created over time, and with them a lot of schema versions. This caused the latest metadata file to grow to ~10MB gzip-compressed (or ~250MB of uncompressed JSON). When we load this table via catalog.load_table(), it consumes ~4GB of memory (total usage of the Python process, measured with memray). This is a lot, especially since we only need the latest snapshot and its schema version (which is probably true for most users).
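The general effect can be reproduced in miniature with the standard library alone: decoding a large JSON document materialises every snapshot as Python objects, whether they are needed or not. A minimal, self-contained sketch (the synthetic metadata below is made up and far smaller than our real file; field names follow the Iceberg table-metadata spec):

```python
import json
import tracemalloc

# Synthetic stand-in for a bloated metadata file: many snapshots,
# each carried fully in the JSON document (numbers here are made up).
metadata = {
    "current-snapshot-id": 9_999,
    "snapshots": [
        {
            "snapshot-id": i,
            "manifest-list": f"s3://bucket/metadata/ml-{i}.avro",
            "summary": {"operation": "append", "added-data-files": str(i)},
        }
        for i in range(10_000)
    ],
}
raw = json.dumps(metadata)

tracemalloc.start()
parsed = json.loads(raw)  # everything is materialised, needed or not
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"document: {len(raw) / 1024:.0f} KiB, "
      f"peak parse memory: {peak / 1024:.0f} KiB")
```

The ratio is much worse for pyiceberg because each snapshot and schema is additionally validated into a Pydantic model, not just a plain dict.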
Semi-Workaround:
One could try to expire some snapshots, e.g. via Spark's expire_snapshots procedure [https://iceberg.apache.org/docs/1.10.0/spark-procedures/#expire_snapshots], but it will not get rid of the old / unused schemas unless you also set clean_expired_metadata (which is only supported since 1.10.x, so relatively new).
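For reference, a sketch of that cleanup via Spark SQL (catalog, table name, and timestamp are placeholders; clean_expired_metadata requires Iceberg 1.10+):

```sql
CALL my_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 1,
  clean_expired_metadata => true
)
```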
(Preliminary) Root-Cause:
I believe the issue is that we leverage Pydantic's model_validate_json in
iceberg-python/pyiceberg/table/metadata.py
Line 663 in 44ce51a

return TableMetadataWrapper.model_validate_json(data).root

which parses the entire document eagerly and keeps the full TableMetadata object around.
Suggestion:
Would it make sense not to parse the full JSON into memory, and instead load the needed snapshots and schemas lazily / on demand? (It would also be fine if that were a configurable option of catalog.load_table().)
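As a rough illustration of the direction (not a proposed implementation), one could pre-filter the decoded document down to the current snapshot and schema before handing it to the Pydantic model. The helper below is hypothetical; the field names follow the Iceberg table-metadata spec:

```python
import json


def prune_metadata(raw: bytes) -> dict:
    """Hypothetical pre-filter: keep only the current snapshot and the
    current schema before full model validation."""
    meta = json.loads(raw)

    current_snapshot_id = meta.get("current-snapshot-id")
    meta["snapshots"] = [
        s for s in meta.get("snapshots", [])
        if s.get("snapshot-id") == current_snapshot_id
    ]

    current_schema_id = meta.get("current-schema-id")
    meta["schemas"] = [
        s for s in meta.get("schemas", [])
        if s.get("schema-id") == current_schema_id
    ]
    return meta
```

This still decodes the full JSON once, so a true lazy/streaming parser would save more, but it would at least avoid validating and retaining thousands of unused snapshot and schema models.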
Remark:
Obviously we could blame this on an unmaintained Iceberg table, but I think it would be good for the pyiceberg lib to be robust against such scenarios, which is why I opened this issue.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time