From d4d2cf3f73bf5c19238d30e903965134f550a18c Mon Sep 17 00:00:00 2001
From: bubriks
Date: Thu, 11 Jun 2026 15:34:44 +0300
Subject: [PATCH 1/2] init

---
 docs/user_guides/fs/feature_group/create.md    | 11 ++++++-----
 docs/user_guides/fs/feature_view/batch-data.md |  2 +-
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index c6db36f3ef..cf76b23d7d 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -57,7 +57,7 @@ When using `time_travel_format="HUDI"` in a Python environment this behavior is
 
 ##### Primary key
 
-A primary key is required when using the default table format (Hudi or Delta) to store offline feature data.
+A primary key is required when using the default table format (Hudi, Delta, or Iceberg) to store offline feature data.
 When inserting data in a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group.
 If a row with the same primary key is found in the feature group, the row will be updated.
 If the primary key is not found, the row is appended to the feature group.
@@ -105,12 +105,13 @@ By using partitioning the system will write the feature data in different subdir
 ##### Table format
 
 When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter.
-The currently supported values are `"HUDI"`, `"DELTA"`, and `"NONE"` (which stores as Parquet without time travel support). The parameter defaults to `None`, which resolves to `"DELTA"` if the `deltalake` package is installed, or `"HUDI"` otherwise.
+The currently supported values are `"HUDI"`, `"DELTA"`, `"ICEBERG"`, and `"NONE"` (which stores as Parquet without time travel support).
+The parameter defaults to `"DELTA"`.
 
 ##### Data Source
 
 During the creation of a feature group, it is possible to define the `data_source` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster.
-Currently, [S3](../data_source/creation/s3.md) and [GCS](../data_source/creation/gcs.md) connectors with "DELTA" `time_travel_format` are supported.
+Currently, [S3](../data_source/creation/s3.md) and [GCS](../data_source/creation/gcs.md) connectors with `"DELTA"` or `"ICEBERG"` `time_travel_format` are supported.
 
 ##### Online Table Configuration
 
@@ -221,7 +222,7 @@ Four main considerations influence the write and the query performance:
 
 ##### Partitioning on a feature group level
 
-**Partitioning on the feature group level** allows Hopsworks and the table format (Hudi or Delta) to push down filters to the filesystem when reading from feature groups.
+**Partitioning on the feature group level** allows Hopsworks and the table format (Hudi, Delta, or Iceberg) to push down filters to the filesystem when reading from feature groups.
 In practice that means fewer directories need to be listed and fewer files need to be read, speeding up queries.
 For example, most commonly, filtering is done on the event time column of a feature group when generating training data or batches of data:
 
@@ -276,7 +277,7 @@ fg = feature_store.create_feature_group(...
 ##### Parquet file size within a feature group partition
 
 Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways in order to
-influence how the table format (Hudi or Delta) will **split the data between parquet files within the feature group partitions**.
+influence how the table format (Hudi, Delta, or Iceberg) will **split the data between parquet files within the feature group partitions**.
 The two things that influence the number of parquet files per partition are
 
 1. The number of feature group partitions written in a single insert
diff --git a/docs/user_guides/fs/feature_view/batch-data.md b/docs/user_guides/fs/feature_view/batch-data.md
index fb15079081..2f6db323b4 100644
--- a/docs/user_guides/fs/feature_view/batch-data.md
+++ b/docs/user_guides/fs/feature_view/batch-data.md
@@ -150,7 +150,7 @@ df = feature_view.get_batch_data(
 
 `key` selects which column the predicate is emitted against.
 `"PARTITION_KEY"` targets the Feature Group's partition column so the engine can prune partitions before reading files; the Feature Group must have a single DATE partition column.
-`"EVENT_TIME"` targets the Feature Group's `event_time` column and guarantees row-level correctness but offers only engine-dependent file pruning (Hudi or Delta column-stats indexing).
+`"EVENT_TIME"` targets the Feature Group's `event_time` column and guarantees row-level correctness but offers only engine-dependent file pruning (Hudi, Delta, or Iceberg column-stats indexing).
 `start` is required and emits a `>=` predicate.
 `end` is optional and emits a `<=` predicate when present.
 
From bebe3e5c57f4c9c1aa5e5679483d949ad587fbaa Mon Sep 17 00:00:00 2001
From: bubriks
Date: Fri, 12 Jun 2026 15:46:45 +0300
Subject: [PATCH 2/2] Address review feedback

Co-Authored-By: Claude Fable 5
---
 docs/user_guides/fs/data_source/creation/gcs.md | 2 +-
 docs/user_guides/fs/data_source/creation/s3.md  | 2 +-
 docs/user_guides/fs/feature_group/create.md     | 7 +++----
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/user_guides/fs/data_source/creation/gcs.md b/docs/user_guides/fs/data_source/creation/gcs.md
index eb2e377401..a02e7721e8 100644
--- a/docs/user_guides/fs/data_source/creation/gcs.md
+++ b/docs/user_guides/fs/data_source/creation/gcs.md
@@ -1,4 +1,4 @@
-# How-To set up a GCS Data Source
+# How-To set up a GCS Data Source { #data-source-gcs }
 
 ## Introduction
 
diff --git a/docs/user_guides/fs/data_source/creation/s3.md b/docs/user_guides/fs/data_source/creation/s3.md
index bb04cbeb60..8ead01da4b 100644
--- a/docs/user_guides/fs/data_source/creation/s3.md
+++ b/docs/user_guides/fs/data_source/creation/s3.md
@@ -1,4 +1,4 @@
-# How-To set up a S3 Data Source
+# How-To set up a S3 Data Source { #data-source-s3 }
 
 ## Introduction
 
diff --git a/docs/user_guides/fs/feature_group/create.md b/docs/user_guides/fs/feature_group/create.md
index cf76b23d7d..d05d967fc8 100644
--- a/docs/user_guides/fs/feature_group/create.md
+++ b/docs/user_guides/fs/feature_group/create.md
@@ -57,7 +57,7 @@ When using `time_travel_format="HUDI"` in a Python environment this behavior is
 
 ##### Primary key
 
-A primary key is required when using the default table format (Hudi, Delta, or Iceberg) to store offline feature data.
+A primary key is required when using a table format with time travel support (Hudi, Delta, or Iceberg) to store offline feature data.
 When inserting data in a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group.
 If a row with the same primary key is found in the feature group, the row will be updated.
 If the primary key is not found, the row is appended to the feature group.
@@ -111,7 +111,7 @@ The parameter defaults to `"DELTA"`.
 ##### Data Source
 
 During the creation of a feature group, it is possible to define the `data_source` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster.
-Currently, [S3](../data_source/creation/s3.md) and [GCS](../data_source/creation/gcs.md) connectors with `"DELTA"` or `"ICEBERG"` `time_travel_format` are supported.
+Currently, [S3][data-source-s3] and [GCS][data-source-gcs] connectors with `"DELTA"` or `"ICEBERG"` `time_travel_format` are supported.
 
 ##### Online Table Configuration
 
@@ -276,8 +276,7 @@ fg = feature_store.create_feature_group(...
 
 ##### Parquet file size within a feature group partition
 
-Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways in order to
-influence how the table format (Hudi, Delta, or Iceberg) will **split the data between parquet files within the feature group partitions**.
+Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways in order to influence how the table format (Hudi, Delta, or Iceberg) will **split the data between parquet files within the feature group partitions**.
 The two things that influence the number of parquet files per partition are
 
 1. The number of feature group partitions written in a single insert