Commit e444968

more editing
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
1 parent 6591115 commit e444968

1 file changed

Lines changed: 16 additions & 16 deletions

proposals/0015-variant-type.md

@@ -8,13 +8,13 @@
Vortex currently requires a strict schema, but real-world data is often only semi-structured and deeply hierarchical. Logs, traces and user-generated data often take the form of many sparse fields.

-This proposal introduces a new type - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that can compress well while being available for efficient analysis in a columnar format.
+This proposal introduces a new dtype - `Variant`, which can capture data with row-level schema, while storing it in a columnar form that can compress well while being available for efficient analysis in a columnar format.

## Design

-We'll start with a rough description of the variant type, as many different systems define in different ways (see the [Prior Art] section at the bottom of the page).
+We'll start with a rough description of the variant type, as many different systems define it in different ways (see the [Prior Art](#prior-art) section at the bottom of the page).

-The variant can be commonly described as the following rust type:
+The variant type can be commonly described as the following Rust type:

```rust
enum Variant {
@@ -24,9 +24,15 @@ enum Variant {
}
```

-Different systems have different variations of this idea, but at its core its a type that can hold nested data with either a flexible or no schema. In addition to this "catch all" column, most system include the concept of "shredding", extracting a key with a specific type out of this column, and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible.
+Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible or no schema.

-I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type.
+Variant types are usually stored in two ways - values that aren't accessed often in some system-specific binary encoding, and some number of "shredded" columns, where a specific key is extracted from the variant and stored in a dense format with a specific type, allowing for much more performant access. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible. Shredding policies differ by system, and can be pre-determined or inferred from the data itself or from usage patterns.
+
+### Arrow representation
+
+Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types.
+
+Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`.

### Nullability

@@ -45,25 +51,19 @@ I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD

Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type.

-### Stats and pushdown
-
-Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us to not only use the same type of pushdown we have currently have, but for row ranges where specific key might exist but with an unexpected type, we could skip it completely.
-
-### Arrow representation
-
-Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types.
-
-Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desired, as Vortex DTypes include more information than a plain `arrow::DataType`.
-
### Scalar

While there has been talk for a long time of converting the Vortex scalar system from an enum to length 1 arrays, I do believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above.

Just like when extracting child arrays, Variants need to support an additional expression, `get_variant_scalar(idx, path, dtype)` that will indicate the desired dtype.

+### Stats and pushdown
+
+Statistics will only be collected for shredded children of the variant array. As all variant expressions are typed, this will allow us to not only use the same type of pushdown we currently support, but for row ranges where a specific key might exist but with an unexpected type, we could skip it completely.
+
### Path to usefulness

-A key component of making variants useable will be making sure the experience of writing and using them, without forcing them to go through complex builders or serialization (unless they require it).
+A key component of making variants useable will be making sure the experience of writing and using them is as straightforward as possible, without forcing them to go through complex builders or serialization (unless they require it).

I can see multiple things we can do:

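For readers of this diff, the shredding and typed-pushdown ideas in the proposal text can be illustrated with a small standalone sketch. Nothing here is Vortex API: `Value`, `ShreddedI64`, `shred_i64`, `min_max` and `can_skip_gt` are hypothetical names made up for illustration, the `Value` enum is only a stand-in for the proposal's `Variant` enum (whose body is elided by the hunk boundary above), and only a single top-level key with an `i64` type is handled.

```rust
use std::collections::BTreeMap;

/// Hypothetical row-level value; a stand-in for the proposal's `Variant` enum.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Result of shredding one key as `i64` (illustrative layout, not a Vortex encoding).
struct ShreddedI64 {
    /// One slot per row; `None` where the key is absent or is not an integer.
    typed: Vec<Option<i64>>,
    /// The original rows, minus any value that was moved into `typed`.
    residual: Vec<Value>,
}

/// Extract every top-level integer stored under `key` into a dense typed column.
fn shred_i64(rows: &[Value], key: &str) -> ShreddedI64 {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Object(fields) => match fields.get(key) {
                Some(Value::Int(v)) => {
                    // Move the value out of the catch-all representation.
                    let mut rest = fields.clone();
                    rest.remove(key);
                    typed.push(Some(*v));
                    residual.push(Value::Object(rest));
                }
                _ => {
                    // Key missing, or present with a different type: leave the row as-is.
                    typed.push(None);
                    residual.push(row.clone());
                }
            },
            _ => {
                typed.push(None);
                residual.push(row.clone());
            }
        }
    }
    ShreddedI64 { typed, residual }
}

/// Min/max statistics over the shredded column; `None` if no row holds a typed value.
fn min_max(col: &[Option<i64>]) -> Option<(i64, i64)> {
    col.iter().flatten().fold(None, |acc, &v| match acc {
        None => Some((v, v)),
        Some((lo, hi)) => Some((lo.min(v), hi.max(v))),
    })
}

/// Decide whether a row range can be skipped for the typed predicate `key > threshold`.
/// This relies on `shred_i64` having extracted every integer stored under the key.
fn can_skip_gt(stats: Option<(i64, i64)>, threshold: i64) -> bool {
    match stats {
        // The key never appears as an i64 in this range, so no row can match.
        None => true,
        Some((_, hi)) => hi <= threshold,
    }
}
```

In this sketch, a row range whose shredded column is entirely `None` (the key is missing or always has some other type) is skipped outright, which mirrors the proposal's point that typed expressions allow skipping ranges where the key exists only with an unexpected type.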