Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions docs/user_guides/fs/feature_group/on_demand_transformations.md
Original file line number Diff line number Diff line change
Expand Up @@ -270,3 +270,42 @@ On-demand transformation functions can also be accessed and executed as normal f
"on_demand_feature1"
](feature_vector["transaction_time"], datetime.now())
```

## Chaining On-Demand Transformations

On-demand transformations attached to the same feature group can be chained: one transformation's output column can serve as another transformation's input.
The execution order is resolved automatically, and the resulting DAG is visible from the feature group overview page in the Hopsworks UI.

!!! example "On-demand transformation that consumes an upstream output"
=== "Python"

```python
from hopsworks import udf


@udf(int, drop=["raw"])
def add_one(raw):
return raw + 1


@udf(int, drop=["col"])
def double(col):
return col * 2


fg = fs.create_feature_group(
name="chained_odt_fg",
version=1,
primary_key=["id"],
transformation_functions=[
add_one("raw").alias("raw_plus_one"),
double("raw_plus_one").alias("raw_plus_one_doubled"),
],
)
```

Columns consumed only by the chain can be dropped, as the raw input `raw` and the intermediate `raw_plus_one` are in the example, leaving `raw_plus_one_doubled` as the only stored output.
The full chain still executes during online serving, and dropped columns never become stored features.

An on-demand transformation's output column becomes a regular feature in the feature group, which a downstream feature view can consume and pass into a model-dependent transformation.
This is the implicit chaining path between on-demand and model-dependent transformations, with no additional setup on either side.
Original file line number Diff line number Diff line change
Expand Up @@ -175,3 +175,45 @@ To achieve this, set the `transform` parameter to False.
# Fetching untransformed batch data.
untransformed_batch_data = feature_view.get_batch_data(transform=False)
```

## Chaining Model-Dependent Transformations

A model-dependent transformation (MDT) can consume another MDT's output as its input.
The DAG is resolved automatically at execution time, so producers always run before consumers.

!!! example "Chaining two increments and a sum"
=== "Python"

```python
from hopsworks import udf


@udf(int)
def add_one(col):
return col + 1


@udf(int)
def add(a, b):
return a + b


fv = fs.create_feature_view(
name="chained_mdt_fv",
query=fg.select_all(),
transformation_functions=[
add_one("data1").alias("data1_plus_one"),
add_one("data2").alias("data2_plus_one"),
add("data1_plus_one", "data2_plus_one").alias("sum_plus_two"),
],
version=1,
)
```

### Statistics over chained transformations

Statistics-based transformations participate in chains like any other transformation.
A transformation that requires statistics on another transformation's output, such as a min-max scaler applied to an imputed column, is fit on that intermediate output rather than on the raw feature.
During training dataset creation the statistics are computed in dependency order on the train split, each transformation executes exactly once, and the fitted statistics are persisted so that online serving applies the same values.

See [Transformation Functions Performance Tuning][transformation-functions-performance-tuning] for `n_processes` semantics on chained DAGs.
137 changes: 110 additions & 27 deletions docs/user_guides/fs/transformation_functions.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@

# Transformation Functions

In AI systems, [transformation functions](https://www.hopsworks.ai/dictionary/transformation) transform data to create features, the inputs to machine learning models (in both training and inference).
Expand Down Expand Up @@ -43,23 +42,23 @@ The decorator accepts three parameters:
It can be a single Python type if the function returns one transformed feature, or a list of Python types if it returns multiple transformed features.
The supported Python types that be used with the `return_type` argument are provided in the table below:

| Supported Python Types |
|:----------------------------------:|
| str |
| int |
| float |
| bool |
| datetime.datetime |
| datetime.date |
| datetime.time |
| Supported Python Types |
| :--------------------: |
| str |
| int |
| float |
| bool |
| datetime.datetime |
| datetime.date |
| datetime.time |

- **`drop`** (optional): Identifies input arguments to exclude from the output after transformations are applied.
By default, all inputs are retained in the output.
Further details on this argument can be found [below](#dropping-input-features).

- **`mode`** (optional): Determines the execution mode of the transformation function.
The argument accepts three values: `default`, `python`, or `pandas`.
By default, the `mode` is set to `default`. Further details on this argument can be found [below](#specifying-execution-modes).
By default, the `mode` is set to `default`. Further details on this argument can be found [below](#specifying-execution-modes).

Hopsworks supports four types of transformation functions across all execution modes:

Expand All @@ -74,7 +73,7 @@ To create a one-to-one transformation function, the Hopsworks `@udf` decorator m
The transformation function should take one argument as input and return a Pandas Series.

!!! example "Creation of a one-to-one transformation function in Hopsworks."
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -90,7 +89,7 @@ The transformation function should take one argument as input and return a Panda
The creation of many-to-one transformation functions is similar to that of a one-to-one transformation function, the only difference being that the transformation function accepts multiple features as input.

!!! example "Creation of a many-to-one transformation function in Hopsworks."
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -107,7 +106,7 @@ To create a one-to-many transformation function, the Hopsworks `@udf` decorato
The return types provided to the decorator must match the types of each column in the returned Pandas DataFrame.

!!! example "Creation of a one-to-many transformation function in Hopsworks."
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -123,7 +122,7 @@ The return types provided to the decorator must match the types of each column i
The creation of a many-to-many transformation function is similar to that of a one-to-many transformation function, the only difference being that the transformation function accepts multiple features as input.

!!! example "Creation of a many-to-many transformation function in Hopsworks."
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -137,7 +136,7 @@ The creation of a many-to-many transformation function is similar to that of a o
### Specifying execution modes

The `mode` parameter of the `@udf` decorator can be used to specify the execution mode of the transformation function.
It accepts three possible values `default`, `python` and `pandas`. Each mode is explained in more detail below:
It accepts three possible values `default`, `python` and `pandas`. Each mode is explained in more detail below:

#### Default Mode

Expand All @@ -146,7 +145,7 @@ It serves as the default mode used when the `mode` parameter is not specified.
In this mode, the transformation function is executed as a Pandas UDF during training and in the batch inference pipeline, while it operates as a Python UDF during online inference.

!!! example "Creating a many to many transformations function using the default execution mode"
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -168,7 +167,7 @@ In this mode, the transformation function is executed as a Pandas UDF during tra
The transformation function can be configured to always execute as a Python UDF by setting the `mode` parameter of the `@udf` decorator to `python`.

!!! example "Creating a many to many transformation function as a Python UDF"
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -184,7 +183,7 @@ The transformation function can be configured to always execute as a Python UDF
The transformation function can be configured to always execute as a Pandas UDF by setting the `mode` parameter of the `@udf` decorator to `pandas`.

!!! example "Creating a many to many transformations function as a Pandas UDF"
=== "Python"
=== "Python"

```python
import pandas as pd
Expand All @@ -211,11 +210,11 @@ The transformation function can be configured to always execute as a Pandas UDF

### Dropping input features

The `drop` parameter of the `@udf` decorator is used to drop specific columns in the input DataFrame after transformation. If any argument of the transformation function is passed to the `drop` parameter, then the column mapped to the argument is dropped after the transformation functions are applied.
The `drop` parameter of the `@udf` decorator is used to drop specific columns in the input DataFrame after transformation. If any argument of the transformation function is passed to the `drop` parameter, then the column mapped to the argument is dropped after the transformation functions are applied.
In the example below, the columns mapped to the arguments `feature1` and `feature3` are dropped after the application of all transformation functions.

!!! example "Specify arguments to drop after transformation"
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -233,7 +232,7 @@ Each name must be uniques and should be at-most 63 characters long.
If no name is provided via the `alias` function, Hopsworks generates default output feature names when [on-demand](./feature_group/on_demand_transformations.md) or [model-dependent](./feature_view/model-dependent-transformations.md) transformation functions are created.

!!! example "Specifying output column names for transformation functions."
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand Down Expand Up @@ -266,7 +265,7 @@ These objects encapsulate statistics related to the argument as instances of the
Upon instantiation, instances of `FeatureTransformationStatistics` contain `None` values and are updated with the required statistics after the creation of a training dataset.

!!! example "Creation of a transformation function in Hopsworks that uses training dataset statistics"
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand Down Expand Up @@ -295,7 +294,7 @@ These variables contain common data used across transformation functions.
By including the context argument, you can pass the necessary data as a dictionary into the into the `context` argument of the transformation function during [training dataset creation](feature_view/training-data.md#passing-context-variables-to-transformation-functions) or [feature vector retrieval](feature_view/feature-vectors.md#passing-context-variables-to-transformation-functions) or [batch data retrieval](feature_view/batch-data.md#passing-context-variables-to-transformation-functions).

!!! example "Creation of a transformation function in Hopsworks that accepts context variables"
=== "Python"
=== "Python"

```python
from hopsworks import udf
Expand All @@ -312,7 +311,7 @@ To save a transformation function to the feature store, use the function `creat
The save function will throw an error if another transformation function with the same name and version is already saved in the feature store.

!!! example "Register transformation function `add_one` in the Hopsworks feature store"
=== "Python"
=== "Python"

```python
plus_one_meta = fs.create_transformation_function(
Expand All @@ -329,7 +328,7 @@ A specific transformation function can be retrieved using its `name` and `versio
If only the `name` is provided, then the version will default to 1.

!!! example "Retrieving transformation functions from the feature store"
=== "Python"
=== "Python"

```python
# get all transformation functions
Expand All @@ -344,4 +343,88 @@ If only the `name` is provided, then the version will default to 1.

## Using transformation functions

Transformation functions can be used by attaching it to a feature view to [create model-dependent transformations](./feature_view/model-dependent-transformations.md) or attached to feature groups to [create on-demand transformations](./feature_group/on_demand_transformations.md)
Transformation functions can be used by attaching it to a feature view to [create model-dependent transformations](./feature_view/model-dependent-transformations.md) or attached to feature groups to [create on-demand transformations](./feature_group/on_demand_transformations.md)

## Chained Transformation Functions

Transformation functions can be chained: the output column of one transformation function can serve as the input to another.
Hopsworks resolves the execution order automatically using a topological sort of the resulting DAG, so dependencies always run before their consumers.
Chaining works for both on-demand transformations attached to a feature group and model-dependent transformations attached to a feature view.

!!! example "Chained model-dependent transformations on a feature view"
=== "Python"

```python
from hopsworks import udf


@udf(int)
def add_one(col):
return col + 1


@udf(int)
def add(a, b):
return a + b


fv = fs.create_feature_view(
name="chained_mdts_fv",
query=fg.select_all(),
transformation_functions=[
add_one("data1").alias("data1_plus_one"),
add_one("data2").alias("data2_plus_one"),
add("data1_plus_one", "data2_plus_one").alias("sum_plus_two"),
],
version=1,
)
```

The same DAG drives offline training data generation and online feature vector retrieval, so chains apply uniformly across both paths.
Statistics-based transformations participate in chains too: a transformation that requires statistics on another transformation's output is fit on that intermediate output, as described in [model-dependent transformations][chaining-model-dependent-transformations].

Chaining also works across the two transformation types without additional setup: an on-demand transformation's output column becomes a feature in its feature group, which a feature view can consume and feed into a model-dependent transformation.

A configuration with no valid execution order is rejected: a duplicate output column or a cycle between transformation functions raises an error naming the offending functions, which can be fixed by renaming outputs with `.alias()`.

### Visualizing the execution DAG

The execution DAG is shown in the Hopsworks UI on the feature view and feature group overview pages under "Transformation execution DAG."
The same graph can be rendered from the SDK with `visualize_transformations()`, available on both feature views and feature groups.
It renders as a Mermaid flowchart in Jupyter and as text elsewhere.

!!! example "Visualizing transformation DAGs"
=== "Python"

```python
# Render both the model-dependent and on-demand DAGs.
fv.visualize_transformations()

# Render only the model-dependent DAG, top-to-bottom layout.
fv.visualize_transformations(kind="model_dependent", orient="TB")

# Render the on-demand DAG of a feature group.
fg.visualize_transformations()
```

### Transformation Functions Performance Tuning

Transformation functions execute sequentially unless the `n_processes` argument requests worker processes.
The argument is accepted by the feature view and feature group entry points that execute transformations, such as `get_feature_vector`, `get_feature_vectors`, `get_batch_data`, `training_data`, and `transform`.
Parallelism is strictly opt-in because whether the worker-pool overhead pays off depends on the cost of your transformation functions.

With more than one worker process, independent transformation functions in the DAG run concurrently, while a chained sequence always runs in dependency order. On the Spark engine `n_processes` is ignored because the whole DAG is pushed down to Spark, which distributes the work itself. For batch and offline calls such as `get_feature_vectors`, `get_batch_data`, and `training_data` with CPU-heavy functions benefit from `n_processes >= 2`, for vectorized Pandas UDFs on small inputs, sequential execution is at least as fast because the pool overhead dominates.

For online serving, spawning the worker pool during the first request would add the pool startup cost to that request's latency. Passing `n_processes` to `init_serving` or `init_batch_scoring` pre-spawns the pool at initialization time and makes that value the default for subsequent retrieval calls; an explicit `n_processes` on an individual call still takes precedence.

!!! example "Pre-spawning the worker pool for online serving"
=== "Python"

```python
fv.init_serving(training_dataset_version=1, n_processes=2)

# Served using the pool of two workers spawned at init time.
vector = fv.get_feature_vector(entry={"id": 1})
```

The worker pool start method defaults to `fork` on Linux and `spawn` on macOS and Windows. Set the `HOPSWORKS_TF_POOL_START_METHOD` environment variable to `fork`, `forkserver`, or `spawn` to override it.
Loading