diff --git a/docs/language/reference/functions/format.md b/docs/language/reference/functions/format.md
new file mode 100644
index 0000000..568e349
--- /dev/null
+++ b/docs/language/reference/functions/format.md
@@ -0,0 +1,31 @@
+# Format Functions (Reference)
+
+Format helpers operate on scalar payloads that are already present in a relation. They do not read files, infer source
+schemas from external locations, or change relation cardinality.
+
+The current implemented slice is deterministic string hashing:
+
+| Function | Meaning |
+| --- | --- |
+| `md5(expr)` | Return the lowercase hexadecimal MD5 digest for one string expression. |
+| `sha224(expr)` | Return the lowercase hexadecimal SHA-224 digest for one string expression. |
+| `sha256(expr)` | Return the lowercase hexadecimal SHA-256 digest for one string expression. |
+| `sha384(expr)` | Return the lowercase hexadecimal SHA-384 digest for one string expression. |
+| `sha512(expr)` | Return the lowercase hexadecimal SHA-512 digest for one string expression. |
+| `sha2(expr, bit_length)` | Compatibility helper that rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths. |
+
+```incan
+from pub::inql.functions import col, md5, sha2
+
+projected = (
+    events
+    .with_column("user_hash", sha2(col("user_id"), 256))
+    .with_column("payload_md5", md5(col("payload")))
+)
+```
+
+Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. `sha2(...)` accepts `224`, `256`,
+`384`, and `512`; unsupported digest lengths are rejected by the helper rather than being passed through to a backend.
+
+JSON, CSV, URL, and dynamic-value predicate helpers remain future format-function slices until their schema arguments,
+option records, path validation rules, and dynamic value model are specified.
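Reviewer note (not part of the patch): the digest contract described in `format.md` above (UTF-8 bytes in, lowercase hexadecimal out, `sha2` dispatching on a supported bit length and rejecting anything else) can be sketched in plain Python with `hashlib`. This is only an illustration of the documented semantics, not the InQL implementation: the names `md5_hex` and `sha2_hex` are hypothetical, the sketch hashes concrete strings rather than building column expressions, and it does not model null handling or Substrait lowering.

```python
import hashlib

# Hypothetical lookup table mirroring the documented sha2 bit lengths.
_SHA2_BY_BITS = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def md5_hex(text: str) -> str:
    """Hash the UTF-8 bytes of `text` and return the lowercase hex MD5 digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha2_hex(text: str, bit_length: int) -> str:
    """Dispatch to the matching SHA-2 variant; reject unsupported bit lengths."""
    hasher = _SHA2_BY_BITS.get(bit_length)
    if hasher is None:
        # Mirrors the documented behavior: reject instead of passing through.
        raise ValueError("sha2 bit_length must be one of 224, 256, 384, or 512")
    return hasher(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # A hex digest is four bits per character, so the length is bit_length / 4.
    for bits in (224, 256, 384, 512):
        digest = sha2_hex("payload", bits)
        print(bits, len(digest), digest == digest.lower())
```

A sketch like this can be handy for producing expected digest values when writing session-level tests, assuming the backend hashes the same UTF-8 byte sequence.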
diff --git a/docs/language/reference/functions/index.md b/docs/language/reference/functions/index.md index adc070d..23ec6d0 100644 --- a/docs/language/reference/functions/index.md +++ b/docs/language/reference/functions/index.md @@ -10,10 +10,11 @@ Today the concrete shipped surfaces are documented here: - [Generator and table-valued functions](generators.md) - [Nested data functions](nested.md) - [Window functions](windows.md) +- [Format functions](format.md) The canonical scalar literal helper is `lit(...)`. Typed literal helpers construct the same scalar-expression representation. -The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions//.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, and windows. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature. +The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions//.incn` modules. 
The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, windows, and format helpers. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature. The registry is the source for non-derivable machine facts. Public helper declarations are the source for argument names, argument types, and return types. Docstrings remain human-facing explanation, examples, and parameter intent. The `registry-metadata` check validates the checked API metadata projections produced from public facade aliases, registry decorators, and decorated callable signatures. Runtime registry entries are lazy and process-local: they support helper execution and lowering for loaded helpers, while the complete public catalog comes from checked metadata. This matters for generated docs, diagnostics, Prism lowering, and backend capability checks as the catalog grows. 
@@ -37,6 +38,7 @@ The registered helper surface currently includes: | `array(...)`, `cardinality(...)`, `array_contains(...)`, `arrays_overlap(...)`, `array_position(...)`, `element_at(...)`, `array_sort(...)`, `array_distinct(...)`, `array_except(...)`, `array_intersect(...)`, `array_union(...)`, `array_join(...)`, `array_slice(...)`, `array_reverse(...)`, `array_flatten(...)`, `map_from_arrays(...)`, `map_extract(...)`, `map_contains_key(...)`, `map_keys(...)`, `map_values(...)`, `map_entries(...)`, `named_struct(...)` | scalar | registered nested scalar helpers backed by Substrait extension mappings; `map_contains_key(...)` lowers as a documented predicate rewrite | | `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)` | generator | relation-extension mappings consumed by `generate(...)`; positional forms use zero-based positions | | `window()`, `row_number()`, `rank()`, `dense_rank()` | window | `window()` builds structural window-spec metadata; ranking helpers lower through `ConsistentPartitionWindowRel` when placed with `with_window_column(...)` | +| `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, `sha2(...)` | scalar | registered format/hash helpers; concrete helpers lower through Substrait extension mappings, while `sha2(...)` rewrites to a supported concrete SHA-2 helper | | `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` | | `sum(...)`, `count()`, `count_expr(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions; `count_expr(...)` is a compatibility spelling for future `count(expr)` helper overloading | | `count_distinct(...)`, `count_if(...)` | aggregate | compatibility helpers that lower through aggregate modifiers over canonical `count` semantics | diff --git 
a/docs/release_notes/v0_1.md b/docs/release_notes/v0_1.md
index d337e4e..9be3264 100644
--- a/docs/release_notes/v0_1.md
+++ b/docs/release_notes/v0_1.md
@@ -18,6 +18,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable).
 - **Nested data functions:** RFC 020 adds registry-backed scalar helpers for array construction/access, cardinality, containment, overlap, sorting, set-like operations, joining, slicing, reversing, scalar array flattening, map construction/access, map key/value/entry extraction, map key containment, and named struct construction. These helpers lower through Substrait extension metadata and execute through the DataFusion-backed Session path without introducing generator semantics.
 - **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, and `posexplode_outer(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, and lower through the current Substrait extension-relation gap encoding.
 - **Window functions:** RFC 019 adds the first window-function planning slice with `window()` specs, `row_number()`, `rank()`, `dense_rank()`, and `with_window_column(...)`. Ranking windows require explicit ordering and lower through Substrait `ConsistentPartitionWindowRel`; backend execution support remains a separate adapter capability.
+- **Format functions:** RFC 022 adds the first deterministic hashing slice with `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, and `sha2(...)`. Hash helpers operate on UTF-8 string bytes, return lowercase hexadecimal strings, lower through registry-owned Substrait metadata, and execute through the DataFusion-backed Session path.
- **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation. - **Function extension policy:** RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics. - **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution. diff --git a/docs/rfcs/022_semi_structured_format_functions.md b/docs/rfcs/022_semi_structured_format_functions.md index 8df68b7..a07650d 100644 --- a/docs/rfcs/022_semi_structured_format_functions.md +++ b/docs/rfcs/022_semi_structured_format_functions.md @@ -1,6 +1,6 @@ # InQL RFC 022: Semi-structured and format functions -- **Status:** Draft +- **Status:** In Progress - **Created:** 2026-04-27 - **Author(s):** Danny Meijer (@dannymeijer) - **Related:** @@ -12,7 +12,7 @@ - InQL RFC 020 (nested data functions) - **Issue:** [InQL #39](https://github.com/dannys-code-corner/InQL/issues/39) - **RFC PR:** — -- **Written against:** Incan v0.2 +- **Written against:** Incan v0.3-era InQL - **Shipped in:** — ## Summary @@ -115,12 +115,41 @@ This RFC is additive. It should not change existing CSV ingestion behavior. - **Execution / interchange** — Prism and Substrait lowering must preserve parser options, hash encodings, and structured return values or diagnose unsupported functions. - **Documentation** — docs should distinguish scalar format functions from session read/write APIs. -## Unresolved questions +## Design Decisions + +### Resolved + +- The first implementation slice is deterministic hashing. 
JSON, CSV, URL, dynamic-value predicates, and structured parser helpers remain future slices because their schema arguments, option records, path validation, and dynamic value model are not settled here. +- Hash helpers in this slice operate on UTF-8 string bytes and return lowercase hexadecimal strings. +- Portable concrete hash helpers are `md5`, `sha224`, `sha256`, `sha384`, and `sha512`, each with an honest Substrait extension mapping and DataFusion-backed execution coverage. +- `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values. +- `sha1`, `crc32`, and `xxhash64` are not implemented in the first slice because no honest Substrait/DataFusion mapping was validated for this branch. + +### Remaining - Should `from_json` accept model types directly as schema arguments, or only explicit schema values? - Should invalid JSON path expressions be compile-time errors when literal and runtime errors otherwise? - What option-record shape should CSV and JSON scalar parsers use? -- Should hash functions return binary values or lowercase hexadecimal strings by default? +- Should future binary-oriented hash helpers return binary values, lowercase hexadecimal strings, or an explicit typed encoding wrapper? - Which variant-style type predicates are portable enough for InQL core, and which should stay in a Snowflake-compatibility extension? - +## Implementation Plan + +1. Add registry-backed hashing helpers under a logical function family. +2. Add stable Substrait extension anchors for concrete hash helpers. +3. Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping. +4. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete digest values. +5. Add user-facing format-function docs and release notes. +6. 
Leave parser, URL, and dynamic-value helpers for later RFC 022 slices once their remaining design questions are resolved.
+
+## Progress Checklist
+
+- [x] RFC 022 moved to In Progress with a first implementation slice and recorded design decisions.
+- [x] `md5`, `sha224`, `sha256`, `sha384`, `sha512`, and `sha2` helpers added under the function catalog.
+- [x] Concrete hash helpers registered with Substrait extension metadata.
+- [x] `sha2(...)` implemented as a literal-bit-length rewrite with invalid-input diagnostics.
+- [x] Focused helper, registry, Substrait lowering, and DataFusion-backed session tests added.
+- [x] User-facing format-function docs and release notes added.
+- [ ] JSON and CSV scalar parser helpers specified and implemented.
+- [ ] URL helper semantics specified and implemented.
+- [ ] Dynamic-value predicate semantics specified and implemented.
diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md
index 4ec010f..290c162 100644
--- a/docs/rfcs/README.md
+++ b/docs/rfcs/README.md
@@ -28,7 +28,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la
 | [019][rfc-019] | In Progress | Window functions | |
 | [020][rfc-020] | Draft | Nested data functions | |
 | [021][rfc-021] | In Progress | Generator and table-valued functions | |
-| [022][rfc-022] | Draft | Semi-structured and format functions | |
+| [022][rfc-022] | In Progress | Semi-structured and format functions | |
 | [023][rfc-023] | Draft | Approximate and sketch functions | |
 | [024][rfc-024] | Draft | Function extension policy | |
 
diff --git a/src/functions/hashing/md5.incn b/src/functions/hashing/md5.incn
new file mode 100644
index 0000000..6d4b1cc
--- /dev/null
+++ b/src/functions/hashing/md5.incn
@@ -0,0 +1,51 @@
+"""
+MD5 hash helper.
+
+`md5` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import MD5_FUNCTION_ANCHOR
+
+
+@function_registry.add("md5", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("md5", MD5_FUNCTION_ANCHOR),
+))
+pub def md5(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build an MD5 hexadecimal digest expression.
+
+    Examples:
+        user_digest = md5(col("user_id"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("md5", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_md5_builds_registered_application() -> None:
+        expr = md5(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "md5"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha2.incn b/src/functions/hashing/sha2.incn
new file mode 100644
index 0000000..154d424
--- /dev/null
+++ b/src/functions/hashing/sha2.incn
@@ -0,0 +1,76 @@
+"""
+SHA-2 compatibility helper.
+
+`sha2(expr, bits)` rewrites to the matching concrete SHA-2 helper for supported digest lengths.
+"""
+
+from rust::incan_stdlib::errors import raise_value_error
+from function_registry import (
+    FunctionClass,
+    FunctionDeterminism,
+    FunctionErrorBehavior,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    compatibility_alias_spec,
+    core_function_namespace,
+    rewrite_mapping,
+    v0_1,
+)
+from functions.hashing.sha224 import sha224
+from functions.hashing.sha256 import sha256
+from functions.hashing.sha384 import sha384
+from functions.hashing.sha512 import sha512
+from functions.registry import function_registry
+from projection_builders import ColumnExpr
+
+
+@function_registry.add("sha2", compatibility_alias_spec(
+    core_function_namespace(),
+    FunctionClass.Scalar,
+    ["sha224", "sha256", "sha384", "sha512"],
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionDeterminism.Deterministic,
+    FunctionNullBehavior.DependsOnInputs,
+    FunctionErrorBehavior.InvalidInputDiagnostic,
+    rewrite_mapping("sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths"),
+))
+pub def sha2(expr: ColumnExpr, bit_length: int) -> ColumnExpr:
+    """
+    Build a SHA-2 hexadecimal digest expression for a supported digest length.
+
+    Examples:
+        user_digest = sha2(col("user_id"), 256)
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+        bit_length: Supported digest size: 224, 256, 384, or 512.
+    """
+    if bit_length == 224:
+        return sha224(expr)
+    if bit_length == 256:
+        return sha256(expr)
+    if bit_length == 384:
+        return sha384(expr)
+    if bit_length == 512:
+        return sha512(expr)
+    return raise_value_error("sha2 bit_length must be one of 224, 256, 384, or 512")
+
+
+module tests:
+    from std.testing import assert_raises
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha2_rewrites_to_supported_sha2_helper() -> None:
+        expr = sha2(col("payload"), 256)
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha256"
+        assert column_expr_argument_count(expr) == 1
+    def _call_sha2_with_unsupported_length() -> None:
+        sha2(col("payload"), 1)
+    def test_sha2_rejects_unsupported_bit_length() -> None:
+        assert_raises[ValueError](_call_sha2_with_unsupported_length)
diff --git a/src/functions/hashing/sha224.incn b/src/functions/hashing/sha224.incn
new file mode 100644
index 0000000..4b209d1
--- /dev/null
+++ b/src/functions/hashing/sha224.incn
@@ -0,0 +1,51 @@
+"""
+SHA-224 hash helper.
+
+`sha224` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA224_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha224", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha224", SHA224_FUNCTION_ANCHOR),
+))
+pub def sha224(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-224 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha224(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha224", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha224_builds_registered_application() -> None:
+        expr = sha224(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha224"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha256.incn b/src/functions/hashing/sha256.incn
new file mode 100644
index 0000000..32d0963
--- /dev/null
+++ b/src/functions/hashing/sha256.incn
@@ -0,0 +1,51 @@
+"""
+SHA-256 hash helper.
+
+`sha256` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA256_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha256", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha256", SHA256_FUNCTION_ANCHOR),
+))
+pub def sha256(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-256 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha256(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha256", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha256_builds_registered_application() -> None:
+        expr = sha256(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha256"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha384.incn b/src/functions/hashing/sha384.incn
new file mode 100644
index 0000000..c7afab1
--- /dev/null
+++ b/src/functions/hashing/sha384.incn
@@ -0,0 +1,51 @@
+"""
+SHA-384 hash helper.
+
+`sha384` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA384_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha384", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha384", SHA384_FUNCTION_ANCHOR),
+))
+pub def sha384(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-384 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha384(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha384", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha384_builds_registered_application() -> None:
+        expr = sha384(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha384"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/hashing/sha512.incn b/src/functions/hashing/sha512.incn
new file mode 100644
index 0000000..193fe54
--- /dev/null
+++ b/src/functions/hashing/sha512.incn
@@ -0,0 +1,51 @@
+"""
+SHA-512 hash helper.
+
+`sha512` hashes a string expression and returns its lowercase hexadecimal digest.
+"""
+
+from function_registry import (
+    FunctionClass,
+    FunctionLifecycle,
+    FunctionNullBehavior,
+    deterministic_spec,
+    extension_mapping,
+    v0_1,
+)
+from functions.registry import function_registry, registered_application
+from projection_builders import ColumnExpr
+from substrait.function_extensions import SHA512_FUNCTION_ANCHOR
+
+
+@function_registry.add("sha512", deterministic_spec(
+    FunctionClass.Scalar,
+    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
+    FunctionNullBehavior.DependsOnInputs,
+    extension_mapping("sha512", SHA512_FUNCTION_ANCHOR),
+))
+pub def sha512(expr: ColumnExpr) -> ColumnExpr:
+    """
+    Build a SHA-512 hexadecimal digest expression.
+
+    Examples:
+        payload_digest = sha512(col("payload"))
+
+    Parameters:
+        expr: String expression whose UTF-8 bytes should be hashed.
+    """
+    return registered_application("sha512", [expr])
+
+
+module tests:
+    from projection_builders import (
+        ColumnExprKind,
+        col,
+        column_expr_argument_count,
+        column_expr_function_name,
+        column_expr_kind,
+    )
+    def test_sha512_builds_registered_application() -> None:
+        expr = sha512(col("payload"))
+        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
+        assert column_expr_function_name(expr) == "sha512"
+        assert column_expr_argument_count(expr) == 1
diff --git a/src/functions/mod.incn b/src/functions/mod.incn
index 1cfc03c..6652a8f 100644
--- a/src/functions/mod.incn
+++ b/src/functions/mod.incn
@@ -69,6 +69,12 @@ pub from functions.windows.window import window
 pub from functions.windows.row_number import row_number
 pub from functions.windows.rank import rank
 pub from functions.windows.dense_rank import dense_rank
+pub from functions.hashing.md5 import md5
+pub from functions.hashing.sha2 import sha2
+pub from functions.hashing.sha224 import sha224
+pub from functions.hashing.sha256 import sha256
+pub from functions.hashing.sha384 import sha384
+pub from functions.hashing.sha512 import sha512
 pub from functions.operators.add import add
 pub from functions.operators.and_ import and_
 pub from functions.operators.div import div
diff --git a/src/lib.incn b/src/lib.incn
index a707823..2b6670e 100644
--- a/src/lib.incn
+++ b/src/lib.incn
@@ -115,6 +115,12 @@ pub from functions.windows.window import window
 pub from functions.windows.row_number import row_number
 pub from functions.windows.rank import rank
 pub from functions.windows.dense_rank import dense_rank
+pub from functions.hashing.md5 import md5
+pub from functions.hashing.sha2 import sha2
+pub from functions.hashing.sha224 import sha224
+pub from functions.hashing.sha256 import sha256
+pub from functions.hashing.sha384 import sha384
+pub from functions.hashing.sha512 import sha512
 pub from functions.operators.add import add
 pub from functions.operators.and_ import and_
 pub from
functions.operators.div import div diff --git a/src/substrait/function_extensions.incn b/src/substrait/function_extensions.incn index 72e5d5f..fa9cfed 100644 --- a/src/substrait/function_extensions.incn +++ b/src/substrait/function_extensions.incn @@ -79,6 +79,11 @@ pub const ARRAY_FLATTEN_FUNCTION_ANCHOR: u32 = 51 pub const ROW_NUMBER_FUNCTION_ANCHOR: u32 = 52 pub const RANK_FUNCTION_ANCHOR: u32 = 53 pub const DENSE_RANK_FUNCTION_ANCHOR: u32 = 54 +pub const MD5_FUNCTION_ANCHOR: u32 = 55 +pub const SHA224_FUNCTION_ANCHOR: u32 = 56 +pub const SHA256_FUNCTION_ANCHOR: u32 = 57 +pub const SHA384_FUNCTION_ANCHOR: u32 = 58 +pub const SHA512_FUNCTION_ANCHOR: u32 = 59 const FUNCTION_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/functions.yaml" const EXPLODE_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/unnest.yaml#explode" const EXPLODE_OUTER_EXTENSION_URI: str = "https://inql.io/extensions/v0.1/unnest.yaml#explode_outer" diff --git a/tests/test_function_registry.incn b/tests/test_function_registry.incn index 23902fa..197c275 100644 --- a/tests/test_function_registry.incn +++ b/tests/test_function_registry.incn @@ -74,6 +74,7 @@ from functions import ( map_keys, map_values, max, + md5, min, modulo, mul, @@ -90,6 +91,11 @@ from functions import ( rank, round, row_number, + sha2, + sha224, + sha256, + sha384, + sha512, str_expr, str_lit, sub, @@ -161,6 +167,7 @@ from substrait.function_extensions import ( MAP_KEYS_FUNCTION_ANCHOR, MAP_VALUES_FUNCTION_ANCHOR, MAX_FUNCTION_ANCHOR, + MD5_FUNCTION_ANCHOR, MIN_FUNCTION_ANCHOR, MODULUS_FUNCTION_ANCHOR, MULTIPLY_FUNCTION_ANCHOR, @@ -173,6 +180,10 @@ from substrait.function_extensions import ( RANK_FUNCTION_ANCHOR, ROW_NUMBER_FUNCTION_ANCHOR, ROUND_FUNCTION_ANCHOR, + SHA224_FUNCTION_ANCHOR, + SHA256_FUNCTION_ANCHOR, + SHA384_FUNCTION_ANCHOR, + SHA512_FUNCTION_ANCHOR, SUBTRACT_FUNCTION_ANCHOR, SUM_FUNCTION_ANCHOR, explode_extension_uri, @@ -238,12 +249,12 @@ def _local_entry_by_namespace_and_name_or_fail( def 
_expected_registry_names() -> list[str]:
     """Return the expected registered public helper names."""
-    return ["col", "lit", "sum", "count", "count_expr", "count_distinct", "count_if", "avg", "min", "max", "int_expr", "float_expr", "str_expr", "bool_expr", "add", "mul", "int_lit", "str_lit", "bool_lit", "always_true", "always_false", "eq", "gt", "cast", "try_cast", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "is_not_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "case_when", "in_", "between", "asc", "desc", "asc_nulls_first", "asc_nulls_last", "desc_nulls_first", "desc_nulls_last", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_contains_key", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "explode", "explode_outer", "posexplode", "posexplode_outer", "window", "row_number", "rank", "dense_rank"]
+    return ["col", "lit", "sum", "count", "count_expr", "count_distinct", "count_if", "avg", "min", "max", "int_expr", "float_expr", "str_expr", "bool_expr", "add", "mul", "int_lit", "str_lit", "bool_lit", "always_true", "always_false", "eq", "gt", "cast", "try_cast", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "is_not_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "case_when", "in_", "between", "asc", "desc", "asc_nulls_first", "asc_nulls_last", "desc_nulls_first", "desc_nulls_last", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_contains_key", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "explode", "explode_outer", "posexplode", "posexplode_outer", "window", "row_number", "rank", "dense_rank", "sha224", "sha256", "sha384", "sha512", "sha2", "md5"]
 
 
 def _expected_substrait_mapped_names() -> list[str]:
     """Return helpers with concrete Substrait extension-function mappings."""
-    return ["sum", "count", "count_expr", "avg", "min", "max", "add", "mul", "eq", "gt", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "between", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "row_number", "rank", "dense_rank"]
+    return ["sum", "count", "count_expr", "avg", "min", "max", "add", "mul", "eq", "gt", "ne", "lt", "lte", "gte", "equal_null", "and_", "or_", "not_", "is_null", "is_not_null", "is_nan", "sub", "div", "mod", "neg", "coalesce", "nullif", "between", "abs", "ceil", "floor", "round", "array", "array_contains", "array_distinct", "array_except", "array_flatten", "array_intersect", "array_join", "array_position", "array_reverse", "array_slice", "array_sort", "array_union", "arrays_overlap", "cardinality", "element_at", "map_entries", "map_extract", "map_from_arrays", "map_keys", "map_values", "named_struct", "row_number", "rank", "dense_rank", "sha224", "sha256", "sha384", "sha512", "md5"]
 
 
 def _exercise_current_public_helpers() -> None:
@@ -336,6 +347,12 @@ def _exercise_current_public_helpers() -> None:
     row_number()
     rank()
     dense_rank()
+    sha224(status)
+    sha256(status)
+    sha384(status)
+    sha512(status)
+    sha2(status, 256)
+    md5(status)
     return
 
 
@@ -458,7 +475,7 @@ def test_function_registry__core_helpers_expose_portable_policy_metadata() -> No
     # -- Act / Assert --
     for entry in entries:
         assert entry.namespace == core_function_namespace(), f"{entry.function_ref} should live in the core function namespace"
-        if entry.canonical_name == "count_expr" or entry.canonical_name == "count_distinct" or entry.canonical_name == "count_if":
+        if entry.canonical_name == "count_expr" or entry.canonical_name == "count_distinct" or entry.canonical_name == "count_if" or entry.canonical_name == "sha2":
             assert entry.policy_category == FunctionPolicyCategory.CompatibilityAlias, f"{entry.canonical_name} should be marked as a compatibility helper"
             assert entry.alias_policy == FunctionAliasPolicy.OptInCompatibility, "compatibility helpers should be opt-in by policy"
             continue
@@ -638,6 +655,11 @@ def test_function_registry__substrait_extension_mappings_are_structured() -> Non
     _assert_extension_mapping("map_keys", "map_keys", MAP_KEYS_FUNCTION_ANCHOR)
     _assert_extension_mapping("map_values", "map_values", MAP_VALUES_FUNCTION_ANCHOR)
     _assert_extension_mapping("named_struct", "named_struct", NAMED_STRUCT_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha224", "sha224", SHA224_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha256", "sha256", SHA256_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha384", "sha384", SHA384_FUNCTION_ANCHOR)
+    _assert_extension_mapping("sha512", "sha512", SHA512_FUNCTION_ANCHOR)
+    _assert_extension_mapping("md5", "md5", MD5_FUNCTION_ANCHOR)
 
 
 def test_function_registry__generator_helpers_are_relation_extensions() -> None:
@@ -717,6 +739,10 @@ def test_function_registry__rewrite_mappings_identify_non_extension_helpers() ->
     assert always_false_entry.substrait.kind == SubstraitMappingKind.Rewrite, "always_false should lower as a literal rewrite"
     _assert_rewrite_mapping("is_not_nan", "not_(is_nan(expr))")
     _assert_rewrite_mapping("map_contains_key", "gt(cardinality(map_extract(map_expr, key)), int_expr(0))")
+    _assert_rewrite_mapping(
+        "sha2",
+        "sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths",
+    )
 
     assert always_true_entry.null_behavior == FunctionNullBehavior.Predicate, "predicate helpers should expose predicate null behavior"
     assert always_false_entry.null_behavior == FunctionNullBehavior.Predicate, "predicate helpers should expose predicate null behavior"
@@ -749,6 +775,7 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None:
         status,
         [str_lit("paid"), str_lit("open")],
     ), lt(amount, int_lit(10)), lte(amount, int_lit(10)), modulo(amount, lit(2)), round(amount)]
+    hash_exprs = [md5(status), sha2(status, 256), sha224(status), sha256(status), sha384(status), sha512(status)]
 
     # -- Assert --
     assert column_expr_kind(amount) == ColumnExprKind.Column, "col should still build a column reference"
@@ -770,5 +797,7 @@ def test_function_registry__public_helpers_preserve_existing_behavior() -> None:
     assert column_expr_kind(gt_expr) == ColumnExprKind.ScalarFunction, "gt should use the shared scalar function kind"
     for core_expr in core_exprs:
         assert column_expr_kind(core_expr) != ColumnExprKind.Column, "core scalar helpers should build scalar expressions"
+    for hash_expr in hash_exprs:
+        assert column_expr_kind(hash_expr) == ColumnExprKind.ScalarFunction, "hash helpers should build scalar expressions"
     assert column_expr_kind(always_true()) == ColumnExprKind.BoolLiteral, "always_true should still build a bool literal"
     assert column_expr_kind(always_false()) == ColumnExprKind.BoolLiteral, "always_false should still build a bool literal"
diff --git a/tests/test_hashing_functions.incn b/tests/test_hashing_functions.incn
new file mode 100644
index 0000000..2cfc4c5
--- /dev/null
+++ b/tests/test_hashing_functions.incn
@@ -0,0 +1,55 @@
+"""Test: RFC 022 hashing helper surface."""
+
+from std.testing import assert_raises
+from functions import col, md5, sha2, sha224, sha256, sha384, sha512
+from function_registry import function_ref_for
+from projection_builders import (
+    ColumnExpr,
+    ColumnExprKind,
+    column_expr_argument_count,
+    column_expr_function_name,
+    column_expr_function_ref,
+    column_expr_kind,
+)
+
+
+def _assert_hash_application(expr: ColumnExpr, expected_name: str) -> None:
+    """Assert one hashing helper builds a registry-backed scalar application."""
+    assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction, f"{expected_name} should use scalar application nodes"
+    assert column_expr_function_name(expr) == expected_name, f"{expected_name} should preserve its canonical name"
+    assert column_expr_function_ref(expr) == function_ref_for(expected_name), f"{expected_name} should preserve its function ref"
+    assert column_expr_argument_count(expr) == 1, f"{expected_name} should carry one string input expression"
+
+
+def _call_sha2_with_unsupported_length() -> None:
+    """Call sha2 with an unsupported digest length for ValueError assertions."""
+    sha2(col("payload"), 1)
+    return
+
+
+def test_hashing_functions__concrete_helpers_share_scalar_application_node() -> None:
+    # -- Arrange --
+    payload = col("payload")
+
+    # -- Act / Assert --
+    _assert_hash_application(md5(payload), "md5")
+    _assert_hash_application(sha224(payload), "sha224")
+    _assert_hash_application(sha256(payload), "sha256")
+    _assert_hash_application(sha384(payload), "sha384")
+    _assert_hash_application(sha512(payload), "sha512")
+
+
+def test_hashing_functions__sha2_rewrites_to_concrete_sha2_helpers() -> None:
+    # -- Arrange --
+    payload = col("payload")
+
+    # -- Act / Assert --
+    _assert_hash_application(sha2(payload, 224), "sha224")
+    _assert_hash_application(sha2(payload, 256), "sha256")
+    _assert_hash_application(sha2(payload, 384), "sha384")
+    _assert_hash_application(sha2(payload, 512), "sha512")
+
+
+def test_hashing_functions__sha2_rejects_unsupported_digest_lengths() -> None:
+    # -- Arrange / Act / Assert --
+    assert_raises[ValueError](_call_sha2_with_unsupported_length)
diff --git a/tests/test_session_projection.incn b/tests/test_session_projection.incn
index fb6207e..d3d2132 100644
--- a/tests/test_session_projection.incn
+++ b/tests/test_session_projection.incn
@@ -18,11 +18,16 @@ from functions import (
     floor,
     gt,
     lit,
+    md5,
     modulo,
     mul,
     neg,
     nullif,
     round,
+    sha2,
+    sha224,
+    sha384,
+    sha512,
     sub,
     try_cast,
     cardinality,
@@ -192,6 +197,40 @@ def test_session_projection__collect_executes_common_math_scalar_projection_func
     assert payload.contains("3"), "round projection should include round(10 / 4.0)"
 
 
+def test_session_projection__collect_executes_format_hashing_projection_functions() -> None:
+    """collect should execute the first RFC 022 hashing helpers through DataFusion."""
+    # -- Arrange --
+    mut session = Session.default()
+
+    # -- Act --
+    lazy: LazyFrame[AggregateOrder] = assert_is_ok(
+        session.read_csv("aggregate_orders", AGGREGATE_ORDERS_CSV_FIXTURE),
+        "aggregate orders fixture should load",
+    )
+    projected = lazy.with_column("md5_abc", md5(lit("abc"))).with_column("sha224_abc", sha224(lit("abc"))).with_column(
+        "sha2_256_abc",
+        sha2(lit("abc"), 256),
+    ).with_column("sha384_abc", sha384(lit("abc"))).with_column("sha512_abc", sha512(lit("abc")))
+    df = _collect_or_fail(session, projected)
+    payload = df.preview_text()
+    resolved = df.resolved_columns()
+
+    # -- Assert --
+    assert df.row_count() == 3, "hashing projections should preserve the input rows"
+    assert len(resolved) == 7, "projection should expose all appended hash outputs"
+    assert payload.contains("md5_abc"), "md5 projection should materialize its alias"
+    assert payload.contains("sha2_256_abc"), "sha2 compatibility projection should materialize its alias"
+    assert payload.contains("900150983cd24fb0d6963f7d28e17f72"), "md5 should return the lowercase hex digest for abc"
+    assert payload.contains("23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7"), "sha224 should return the lowercase hex digest for abc"
+    assert payload.contains("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"), "sha2(..., 256) should rewrite to sha256"
+    assert payload.contains(
+        "cb00753f45a35e8bb5a03d699ac65007272c32ab0eded1631a8b605a43ff5bed8086072ba1e7cc2358baeca134c825a7",
+    ), "sha384 should return the lowercase hex digest for abc"
+    assert payload.contains(
+        "ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f",
+    ), "sha512 should return the lowercase hex digest for abc"
+
+
 def test_session_projection__collect_executes_nested_scalar_projection_functions() -> None:
     """collect should execute RFC 020 nested scalar helpers through DataFusion."""
     # -- Arrange --
diff --git a/tests/test_substrait_plan.incn b/tests/test_substrait_plan.incn
index 8a44395..0ec722b 100644
--- a/tests/test_substrait_plan.incn
+++ b/tests/test_substrait_plan.incn
@@ -55,6 +55,7 @@ from functions import (
     map_keys,
     map_values,
     max,
+    md5,
     min,
     modulo,
     mul,
@@ -67,6 +68,11 @@ from functions import (
     rank,
     round,
     row_number,
+    sha2,
+    sha224,
+    sha256,
+    sha384,
+    sha512,
     sub,
     sum,
     try_cast,
@@ -435,6 +441,12 @@ def test_plan__core_scalar_extension_mappings_lower_to_substrait() -> None:
     _assert_scalar_expr_lowers(ceil(div(col("amount"), lit(4.0))))
     _assert_scalar_expr_lowers(floor(div(col("amount"), lit(4.0))))
     _assert_scalar_expr_lowers(round(div(col("amount"), lit(4.0))))
+    _assert_scalar_expr_lowers(md5(col("status")))
+    _assert_scalar_expr_lowers(sha224(col("status")))
+    _assert_scalar_expr_lowers(sha256(col("status")))
+    _assert_scalar_expr_lowers(sha384(col("status")))
+    _assert_scalar_expr_lowers(sha512(col("status")))
+    _assert_scalar_expr_lowers(sha2(col("status"), 256))
 
 
 def test_plan__nested_scalar_extension_mappings_lower_to_substrait() -> None: