proposals/0015-variant-type.md
16 additions & 16 deletions
Original file line number
Diff line number
Diff line change
@@ -8,13 +8,13 @@
8
8
9
9
Vortex currently requires a strict schema, but real-world data is often only semi-structured and deeply hierarchical. Logs, traces and user-generated data often take the form of many sparse fields.
10
10
11
-
This proposal introduces a new type - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that can compress well while being available for efficient analysis in a columnar format.
11
+
This proposal introduces a new dtype, `Variant`, which can capture data with a row-level schema while storing it in a columnar form that compresses well and remains available for efficient analysis.
12
12
13
13
## Design
14
14
15
-
We'll start with a rough description of the variant type, as many different systems define in different ways (see the [Prior Art] section at the bottom of the page).
15
+
We'll start with a rough description of the variant type, as different systems define it in different ways (see the [Prior Art](#prior-art) section at the bottom of the page).
16
16
17
-
The variant can be commonly described as the following rust type:
17
+
The variant type can commonly be described as the following Rust type:
18
18
19
19
```rust
20
20
enum Variant {
@@ -24,9 +24,15 @@ enum Variant {
24
24
}
25
25
```
26
26
27
-
Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. In addition to this "catch all" column, most system include the concept of "shredding", extracting a key with a specific type out of this column, and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible.
27
+
Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema at all.
28
28
29
-
I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type.
29
+
Variant data is usually stored in two parts: values that aren't accessed often are kept in a system-specific binary encoding, alongside some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
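As a rough illustration of shredding, the sketch below pulls a single key out of a column of variant rows into a dense, typed column and keeps the remainder in untyped form. It is not taken from Vortex or any other system: `Value`, `Shredded` and `shred_i64` are hypothetical names, `Value` is a simplified stand-in rather than the `Variant` enum above, and the policy (shred one top-level key as `i64`) is assumed to be pre-determined.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified variant value (an assumption for this sketch,
/// not the proposal's `Variant` definition).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Result of shredding one key out of a column of variant rows.
#[derive(Debug, PartialEq)]
struct Shredded {
    /// Dense typed column: `Some(v)` where the key held an integer.
    typed: Vec<Option<i64>>,
    /// What is left of each row, still in the untyped form.
    residual: Vec<Value>,
}

/// Extracts `key` as an `i64` column. Rows where the key is missing, has a
/// different type, or where the row is not an object yield `None`, and the
/// row is left untouched in `residual`.
fn shred_i64(rows: Vec<Value>, key: &str) -> Shredded {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Object(mut fields) => {
                // Only move the field out when it has the expected type.
                let hit = match fields.get(key) {
                    Some(Value::Int(v)) => Some(*v),
                    _ => None,
                };
                if hit.is_some() {
                    fields.remove(key);
                }
                typed.push(hit);
                residual.push(Value::Object(fields));
            }
            other => {
                typed.push(None);
                residual.push(other);
            }
        }
    }
    Shredded { typed, residual }
}
```

A value of an unexpected type stays in the residual rather than being coerced, so the typed column can be scanned like an ordinary nullable `i64` column without losing any of the original data.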
30
+
31
+
### Arrow representation
32
+
33
+
Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types.
34
+
35
+
Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`.
30
36
31
37
### Nullability
32
38
@@ -45,25 +51,19 @@ I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD
45
51
46
52
Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type.
47
53
48
-
### Stats and pushdown
49
-
50
-
Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us to not only use the same type of pushdown we have currently have, but for row ranges where specific key might exist but with an unexpected type, we could skip it completely.
51
-
52
-
### Arrow representation
53
-
54
-
Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types.
55
-
56
-
Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`.
57
-
58
54
### Scalar
59
55
60
56
While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above.
61
57
62
58
Just like when extracting child arrays, Variants need to support an additional expression, `get_variant_scalar(idx, path, dtype)`, that will indicate the desired dtype.
63
59
60
+
### Stats and pushdown
61
+
62
+
Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us not only to use the same type of pushdown we currently support, but also to completely skip row ranges where a specific key might exist but with an unexpected type.
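The pruning decision described here can be sketched for a typed predicate such as `value > threshold` on a shredded `i64` child. `ChunkStats` and `can_skip_gt` are hypothetical names, not Vortex APIs. The sketch assumes that every `i64` value for the key in the row range lives in the shredded child, and that a typed extraction yields null for rows where the key is missing or has another type.

```rust
/// Hypothetical per-row-range statistics for one shredded `i64` child
/// (an assumption for this sketch, not a Vortex type).
#[derive(Debug, Clone, Copy)]
struct ChunkStats {
    /// Rows in the range where the key was present with the shredded type.
    typed_count: usize,
    /// Min and max over those rows, if the statistic was collected.
    min_max: Option<(i64, i64)>,
}

/// Returns true when no row in the range can satisfy `value > threshold`.
fn can_skip_gt(stats: &ChunkStats, threshold: i64) -> bool {
    // Key absent or only present with an unexpected type: the typed
    // expression is null for every row, so the range cannot match.
    if stats.typed_count == 0 {
        return true;
    }
    match stats.min_max {
        // Ordinary min/max pruning on the shredded child.
        Some((_, max)) => max <= threshold,
        // Statistic missing: be conservative and scan the range.
        None => false,
    }
}
```

The first branch is the variant-specific case: a range whose values for the key are all of another type is skipped without reading any data, while the second branch is the same min/max pruning that applies to a regular column.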
63
+
64
64
### Path to usefulness
65
65
66
-
A key component of making variants useable will be making sure the experience of writing and using them, without forcing them to go through complex builders or serialization (unless they require it).
66
+
A key component of making variants usable will be making sure the experience of writing and using them is as straightforward as possible, without forcing users to go through complex builders or serialization (unless they require it).