Merged
2 changes: 1 addition & 1 deletion docs/user_guides/fs/data_source/creation/gcs.md
@@ -1,4 +1,4 @@
-# How-To set up a GCS Data Source
+# How-To set up a GCS Data Source { #data-source-gcs }

## Introduction

2 changes: 1 addition & 1 deletion docs/user_guides/fs/data_source/creation/s3.md
@@ -1,4 +1,4 @@
-# How-To set up a S3 Data Source
+# How-To set up a S3 Data Source { #data-source-s3 }

## Introduction

12 changes: 6 additions & 6 deletions docs/user_guides/fs/feature_group/create.md
@@ -57,7 +57,7 @@ When using `time_travel_format="HUDI"` in a Python environment this behavior is

##### Primary key

-A primary key is required when using the default table format (Hudi or Delta) to store offline feature data.
+A primary key is required when using a table format with time travel support (Hudi, Delta, or Iceberg) to store offline feature data.
When inserting data in a feature group on the offline feature store, the DataFrame you are writing is checked against the existing data in the feature group.
If a row with the same primary key is found in the feature group, the row will be updated.
If the primary key is not found, the row is appended to the feature group.
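
The update-or-append behavior described above can be illustrated with a minimal, library-free sketch. The `upsert` helper and the sample rows are hypothetical and only model the semantics; they are not how Hopsworks or the underlying table formats implement the merge.

```python
def upsert(existing, incoming, primary_key):
    """Merge incoming rows into existing rows, keyed by the primary key columns."""
    # Index existing rows by their primary key tuple (dicts keep insertion order).
    merged = {tuple(row[k] for k in primary_key): row for row in existing}
    for row in incoming:
        key = tuple(row[k] for k in primary_key)
        # Key already present: the stored row is replaced (update).
        # Key not present: the row is added at the end (append).
        merged[key] = row
    return list(merged.values())


# Hypothetical sample data: "id" plays the role of the primary key.
existing = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
incoming = [{"id": 2, "amount": 25.0}, {"id": 3, "amount": 30.0}]
result = upsert(existing, incoming, primary_key=["id"])
```

In this sketch the row with `id` 2 is overwritten and the row with `id` 3 is appended, which mirrors the behavior the documentation describes for inserts into the offline feature store.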
@@ -105,12 +105,13 @@ By using partitioning the system will write the feature data in different subdir
##### Table format

When you create a feature group, you can specify the table format you want to use to store the data in your feature group by setting the `time_travel_format` parameter.
-The currently supported values are `"HUDI"`, `"DELTA"`, and `"NONE"` (which stores as Parquet without time travel support). The parameter defaults to `None`, which resolves to `"DELTA"` if the `deltalake` package is installed, or `"HUDI"` otherwise.
+The currently supported values are `"HUDI"`, `"DELTA"`, `"ICEBERG"`, and `"NONE"` (which stores as Parquet without time travel support).
+The parameter defaults to `"DELTA"`.

##### Data Source

During the creation of a feature group, it is possible to define the `data_source` parameter, this allows for management of offline data in the desired table format outside the Hopsworks cluster.
-Currently, [S3](../data_source/creation/s3.md) and [GCS](../data_source/creation/gcs.md) connectors with "DELTA" `time_travel_format` are supported.
+Currently, [S3][data-source-s3] and [GCS][data-source-gcs] connectors with `"DELTA"` or `"ICEBERG"` `time_travel_format` are supported.

##### Online Table Configuration

@@ -221,7 +222,7 @@ Four main considerations influence the write and the query performance:

##### Partitioning on a feature group level

-**Partitioning on the feature group level** allows Hopsworks and the table format (Hudi or Delta) to push down filters to the filesystem when reading from feature groups.
+**Partitioning on the feature group level** allows Hopsworks and the table format (Hudi, Delta, or Iceberg) to push down filters to the filesystem when reading from feature groups.
In practice that means fewer directories need to be listed and fewer files need to be read, speeding up queries.
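
As a rough illustration of why a pushed-down filter means fewer directories are listed, the sketch below models partition directories as a plain dictionary. The `prune_partitions` helper, the `day=` directory layout, and the dates are assumptions made for this example; real table formats resolve partitions through their own metadata.

```python
from datetime import date


def prune_partitions(partitions, start, end):
    """Return only the partition directories whose date lies in [start, end]."""
    # Directories outside the range are never listed or read.
    return [path for day, path in sorted(partitions.items()) if start <= day <= end]


# Hypothetical layout: one subdirectory per value of a DATE partition column.
partitions = {
    date(2024, 1, 1): "fg/day=2024-01-01",
    date(2024, 1, 2): "fg/day=2024-01-02",
    date(2024, 1, 3): "fg/day=2024-01-03",
    date(2024, 1, 4): "fg/day=2024-01-04",
}
to_read = prune_partitions(partitions, date(2024, 1, 2), date(2024, 1, 3))
```

Only two of the four directories survive the filter here, so the reader touches half the data before any row-level predicate is evaluated.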

For example, most commonly, filtering is done on the event time column of a feature group when generating training data or batches of data:
@@ -275,8 +276,7 @@ fg = feature_store.create_feature_group(...

##### Parquet file size within a feature group partition

-Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways in order to
-influence how the table format (Hudi or Delta) will **split the data between parquet files within the feature group partitions**.
+Once you have decided on the feature group level partitioning and you start inserting data to the feature group, there are multiple ways in order to influence how the table format (Hudi, Delta, or Iceberg) will **split the data between parquet files within the feature group partitions**.
The two things that influence the number of parquet files per partition are

1. The number of feature group partitions written in a single insert
2 changes: 1 addition & 1 deletion docs/user_guides/fs/feature_view/batch-data.md
@@ -150,7 +150,7 @@ df = feature_view.get_batch_data(

`key` selects which column the predicate is emitted against.
`"PARTITION_KEY"` targets the Feature Group's partition column so the engine can prune partitions before reading files; the Feature Group must have a single DATE partition column.
-`"EVENT_TIME"` targets the Feature Group's `event_time` column and guarantees row-level correctness but offers only engine-dependent file pruning (Hudi or Delta column-stats indexing).
+`"EVENT_TIME"` targets the Feature Group's `event_time` column and guarantees row-level correctness but offers only engine-dependent file pruning (Hudi, Delta, or Iceberg column-stats indexing).

`start` is required and emits a `>=` predicate.
`end` is optional and emits a `<=` predicate when present.
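
The `start` and `end` rules above can be sketched as a small predicate builder. `build_time_predicate` is a hypothetical helper written for this illustration, and the SQL-like string it returns is an assumed rendering, not the query Hopsworks actually generates.

```python
from datetime import date
from typing import Optional


def build_time_predicate(column: str, start: date, end: Optional[date] = None) -> str:
    """Build an inclusive range predicate; `start` is required, `end` is optional."""
    # `start` always contributes a lower bound.
    clauses = [f"{column} >= '{start.isoformat()}'"]
    # `end` contributes an upper bound only when it is provided.
    if end is not None:
        clauses.append(f"{column} <= '{end.isoformat()}'")
    return " AND ".join(clauses)


open_ended = build_time_predicate("event_time", date(2024, 1, 1))
bounded = build_time_predicate("event_time", date(2024, 1, 1), date(2024, 1, 31))
```

Both bounds are inclusive in this sketch, matching the `>=` and `<=` operators named in the text.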