Description
Apache Iceberg version
0.11.0 (latest release)
Please describe the bug 🐞
Apologies, this is a bit of a fuzzy one right now, but I thought I'd report it anyway.
Context:
We're using Iceberg with AWS Glue and AWS S3 as storage. In S3 there are, roughly speaking, 3 kinds of files (metadata, manifests, and data files). The first one read when loading a table via catalog.load_table() is the metadata file. The metadata file contains information on all current snapshots and schema versions of the table. pyiceberg seems to load these completely into memory.
Issue:
As we worked on the Iceberg table, a lot of snapshots were created over time, and with them a lot of schema versions. This caused the latest metadata file to grow to ~10MB gzip-compressed (or ~250MB of uncompressed JSON). When we load this table via catalog.load_table(), it consumes ~4GB of memory (total usage of the Python process, measured with memray). This is a lot, especially since we only need the latest snapshot and its schema version (which is probably true for most users).
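The general effect can be reproduced in miniature with the standard library alone: decoding a large JSON document materialises every snapshot as Python objects, whether they are needed or not. A minimal, self-contained sketch (the synthetic metadata below is made up and far smaller than our real file; field names follow the Iceberg table-metadata spec):

```python
import json
import tracemalloc

# Synthetic stand-in for a bloated metadata file: many snapshots,
# each carried fully in the JSON document (numbers here are made up).
metadata = {
    "current-snapshot-id": 9_999,
    "snapshots": [
        {
            "snapshot-id": i,
            "manifest-list": f"s3://bucket/metadata/ml-{i}.avro",
            "summary": {"operation": "append", "added-data-files": str(i)},
        }
        for i in range(10_000)
    ],
}
raw = json.dumps(metadata)

tracemalloc.start()
parsed = json.loads(raw)  # everything is materialised, needed or not
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"document: {len(raw) / 1024:.0f} KiB, "
      f"peak parse memory: {peak / 1024:.0f} KiB")
```

The ratio is much worse for pyiceberg because each snapshot and schema is additionally validated into a Pydantic model, not just a plain dict.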
Semi-Workaround:
One could try to expire some snapshots, e.g. via Spark's expire_snapshots procedure [https://iceberg.apache.org/docs/1.10.0/spark-procedures/#expire_snapshots], but it will not get rid of the old / unused schemas unless you also set clean_expired_metadata (which is only supported since 1.10.x, so relatively new).
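For reference, a sketch of that cleanup via Spark SQL (catalog, table name, and timestamp are placeholders; clean_expired_metadata requires Iceberg 1.10+):

```sql
CALL my_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  retain_last => 1,
  clean_expired_metadata => true
)
```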
(Preliminary) Root-Cause:
I believe the issue is that we leverage Pydantic's model_validate_json in
iceberg-python/pyiceberg/table/metadata.py
Line 663 in 44ce51a

return TableMetadataWrapper.model_validate_json(data).root

which parses the entire document eagerly and keeps the full TableMetadata object around.
Suggestion:
Would it make sense not to parse the full JSON into memory, and instead load the needed snapshots and schemas lazily / on demand? (It would also be fine if that were a configurable option of catalog.load_table().)
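As a rough illustration of the direction (not a proposed implementation), one could pre-filter the decoded document down to the current snapshot and schema before handing it to the Pydantic model. The helper below is hypothetical; the field names follow the Iceberg table-metadata spec:

```python
import json


def prune_metadata(raw: bytes) -> dict:
    """Hypothetical pre-filter: keep only the current snapshot and the
    current schema before full model validation."""
    meta = json.loads(raw)

    current_snapshot_id = meta.get("current-snapshot-id")
    meta["snapshots"] = [
        s for s in meta.get("snapshots", [])
        if s.get("snapshot-id") == current_snapshot_id
    ]

    current_schema_id = meta.get("current-schema-id")
    meta["schemas"] = [
        s for s in meta.get("schemas", [])
        if s.get("schema-id") == current_schema_id
    ]
    return meta
```

This still decodes the full JSON once, so a true lazy/streaming parser would save more, but it would at least avoid validating and retaining thousands of unused snapshot and schema models.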
Remark:
Obviously we could blame this on an unmaintained Iceberg table, but I think it would be good for the pyiceberg lib to be robust against such scenarios, which is why I opened this issue.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time