[FSTORE-1938] Support chaining of Transformation Functions using a DAG #580
manu-sj wants to merge 4 commits
Pull request overview
Adds documentation for chaining Transformation Functions into a dependency graph (DAG) in the Hopsworks Feature Store docs, including how execution order is resolved, how to visualize the DAG, and how parallel execution behaves for independent branches.
Changes:
- Documented chaining semantics for Transformation Functions (ODT + MDT), including cycle/duplicate-output rejection behavior.
- Added guidance on visualizing the transformation execution DAG from UI and SDK.
- Added performance/parallelism tuning details via n_processes, including defaults and serving-time pool pre-spawn.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/user_guides/fs/transformation_functions.md | Introduces chained transformation DAG concept, DAG visualization, and performance tuning/parallelism behavior. |
| docs/user_guides/fs/feature_view/model-dependent-transformations.md | Adds a section describing chaining model-dependent transformations and links to performance tuning guidance. |
| docs/user_guides/fs/feature_group/on_demand_transformations.md | Adds a section describing chaining on-demand transformations and the cross-DAG path into feature views/MDTs. |
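
The overview mentions that independent branches of the DAG can run in parallel. As a rough illustration only, the sketch below groups the nodes of a small dependency graph into batches whose members do not depend on each other, using the standard-library `graphlib` module (Python 3.9+). The graph and the `ready_batches` helper are hypothetical; this is not the Hopsworks implementation and does not model `n_processes`.

```python
from graphlib import TopologicalSorter


def ready_batches(graph):
    """Group nodes into batches; nodes in one batch have no dependency on each other.

    graph maps each node to the set of nodes it depends on (hypothetical shape).
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies already done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Two independent increments feed one sum.
graph = {
    "total": {"a_plus_one", "b_plus_one"},
    "a_plus_one": set(),
    "b_plus_one": set(),
}
batches = ready_batches(graph)
```

Each batch is a candidate for parallel execution; whether running it in worker processes pays off depends on the per-call dispatch cost discussed in the commit messages.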
A model-dependent transformation can consume another MDT's output as its input.
The DAG is resolved automatically at execution time, so producers always run before consumers.

!!! example "Chaining two normalizers and a sum"
Renamed to "Chaining two increments and a sum" to match the add_one/add code. Fixed in efcea35.
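
For readers of this thread, a minimal plain-Python sketch of the "two increments and a sum" chain is given here. It assumes nothing about the Hopsworks API: the `chain` mapping and the `run_chain` helper are hypothetical, and only the `add_one` and `add` names come from the comment above. It shows producers running before consumers by ordering the columns with `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def add_one(x):
    return x + 1


def add(a, b):
    return a + b


# Hypothetical chain description: output column -> (function, input columns).
chain = {
    "total": (add, ["a_plus_one", "b_plus_one"]),
    "a_plus_one": (add_one, ["a"]),
    "b_plus_one": (add_one, ["b"]),
}


def run_chain(chain, row):
    # Each output column depends on the columns it reads.
    graph = {out: set(inputs) for out, (_, inputs) in chain.items()}
    values = dict(row)
    for column in TopologicalSorter(graph).static_order():
        if column in chain:  # raw inputs such as "a" have no producer
            fn, inputs = chain[column]
            values[column] = fn(*(values[name] for name in inputs))
    return values


result = run_chain(chain, {"a": 1, "b": 2})
```

The declaration order in `chain` is deliberately "wrong" (the sum comes first); the topological order still computes both increments before the sum.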
## Chaining Model-Dependent Transformations

A model-dependent transformation can consume another MDT's output as its input.
Defined on first use: "A model-dependent transformation (MDT) can consume another MDT's output". Fixed in efcea35.
Hopsworks resolves the execution order automatically using a topological sort of the resulting DAG, so dependencies always run before their consumers.
Chaining works for both on-demand transformations attached to a feature group and model-dependent transformations attached to a feature view.

!!! example "Chained MDTs on a feature view"
Spelled out: "Chained model-dependent transformations on a feature view". Fixed in efcea35.
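
The topological sort quoted in this thread, together with the cycle and duplicate-output rejection listed in the overview, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the library's code: the `(name, inputs, outputs)` tuple shape, the `resolve_order` helper, and the error messages are all hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def resolve_order(transformations):
    """Return transformation names in an order where producers precede consumers.

    transformations: list of (name, input_columns, output_columns) tuples.
    """
    producer = {}
    for name, _, outputs in transformations:
        for column in outputs:
            if column in producer:
                raise ValueError(
                    f"duplicate output column {column!r}: "
                    f"{producer[column]} and {name}"
                )
            producer[column] = name

    # A transformation depends on whichever transformation produces its inputs.
    graph = {
        name: {producer[column] for column in inputs if column in producer}
        for name, inputs, _ in transformations
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"cycle in transformation chain: {err.args[1]}") from err


order = resolve_order([
    ("clip", ["x_scaled"], ["x_clipped"]),
    ("scale", ["x"], ["x_scaled"]),
])
```

Here `order` places `scale` before `clip` even though `clip` was declared first. Two transformations writing the same column, or two transformations consuming each other's outputs, raise `ValueError` instead of producing an order.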
## Chaining On-Demand Transformations

On-demand transformations attached to the same feature group can be chained: one ODT's output column can serve as another ODT's input.
Defined on first use: "On-demand transformations (ODTs) attached to the same feature group". Fixed in efcea35.
An intermediate output consumed only by a downstream ODT can be dropped from the feature group; the full chain still executes during online serving, and the dropped column never becomes a stored feature.

An ODT's output column becomes a regular feature in the feature group, which a downstream feature view can consume and pass into a model-dependent transformation.
This is the implicit cross-DAG path between ODT and MDT chains: nothing extra to configure on either side.
Spelled out: "between on-demand and model-dependent transformation chains". Fixed in efcea35.
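
The dropped-column behavior quoted in this thread can be pictured with a small plain-Python sketch: the whole chain runs, and only afterwards are the raw input and the intermediate removed, leaving one stored output. The column names, the `steps` list, and the `apply_chain` helper are hypothetical and chosen for illustration; none of them is part of the Hopsworks API.

```python
def apply_chain(row, steps, dropped):
    """Run every step in order, then remove the dropped columns.

    steps: list of (output_column, function, input_columns), already in
    dependency order (hypothetical shape).
    """
    values = dict(row)
    for output, fn, inputs in steps:
        values[output] = fn(*(values[name] for name in inputs))
    # Dropped columns were still computed above; they are just not kept.
    return {name: value for name, value in values.items() if name not in dropped}


steps = [
    ("amount_usd", lambda cents: cents / 100, ["amount_cents"]),
    ("amount_bucket", lambda usd: "high" if usd >= 100 else "low", ["amount_usd"]),
]

stored = apply_chain(
    {"id": 7, "amount_cents": 12500},
    steps,
    dropped={"amount_cents", "amount_usd"},
)
```

The intermediate `amount_usd` exists only while the chain executes; `stored` keeps the key `id` and the final `amount_bucket`.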
…xecution DAG

https://hopsworks.atlassian.net/browse/FSTORE-1938

Document chaining of transformation functions across the user guides: how the output of one function feeds another, how the execution DAG resolves the order, how cycles and duplicate output columns are rejected, and how the DAG is rendered from the UI and from the SDK with visualize_transformations().

A Transformation Functions Performance Tuning subsection in the transformation functions guide covers the node-parallel execution model: the n_processes argument and its defaults per input shape, pool pre-spawning through init_serving and init_batch_scoring, Arrow shared-memory staging, and the HSFS_TF_POOL_START_METHOD override.

The model-dependent transformations guide notes that statistics for chained functions are fit in dependency order on the data each function sees. The on-demand transformations guide covers chains whose intermediate output is dropped from the feature group.

No migration entry is included since the changes are backwards compatible.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…xecution DAG

https://hopsworks.atlassian.net/browse/FSTORE-1938

Restructure the performance tuning section so it reads in order: what the n_processes argument is, how parallelism maps to the DAG, when it pays off, online serving specifics, implementation notes. The previous version stated the sequential default three times across the first three paragraphs and placed the practical guidance after the implementation internals.

Content changes: a call-shape distinction in the guidance (batch and offline calls benefit from worker processes, single feature vectors rarely do because the per-call dispatch cost usually exceeds the work), and a note that pre-spawning the pool removes the startup cost but not the per-call dispatch cost. Both reflect the measured behavior of the online batch chaining benchmark in the loadtest repository.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@hopsworks.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…xecution DAG

https://hopsworks.atlassian.net/browse/FSTORE-1938

Rework the chaining documentation for reading order on all three pages.

The hub page now flows what chaining is, example, uniform offline and online behavior, statistics over chains with a link to the model-dependent page, cross-type chaining, and invalid configurations last instead of interleaved.

The model-dependent page gives the statistics-over-chains behavior its own subsection instead of a single dangling sentence after the example, and states that statistics are fit on the train split, each transformation executes once, and the fitted values are persisted for serving.

The on-demand page leads with the example like the other pages, and the example now demonstrates the dropped-column claims it previously only stated: both the raw input and the intermediate are dropped, leaving one stored output.

Signed-off-by: Manu Sathyarajan Joseph <manu.joseph@hopsworks.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No description provided.