The "current" pathways serialization format (what I've been calling "abomination") is a custom format that requires serializer/deserializer modification to extend.
Issue #25 perhaps requires extension of the format. Rather than extend the format, this issue proposes replacing the custom format with a standard serialization format.
Some options to consider (also a paper):
| Serialization format | Human readable | Appendable | Multi-language impls | Standardised | Extensible | Compact |
| --- | --- | --- | --- | --- | --- | --- |
| Abomination | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| YAML | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| JSON | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| SQLite | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ |
| JSONlines | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| CBOR | ❌ | ❓ | ✅ | ✅ | ❓ | ✅ |
| Protobuffers | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Flatbuffers | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Avro | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Thrift | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Cap'n'proto | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Twine | ❓ | ❓ | ❓ | ❓ | ❓ | |
| Preserves | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| UBJSON | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Postcard | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |

- Human readable
  - Plain text encoding.
- Appendable
  - I don't need to know about more than the single row I'm appending to the file in order to append (e.g. JSON is not appendable because of array delimiters).
- Multi-language impls
  - There are off-the-shelf serializers/deserializers for the format in Python and at least 1 other language.
- Standardised
  - The format is documented in an internet standard from IEEE, W3C, etc.
- Extensible
  - When a field is added, old software can still work with data serialized with the new field.
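To make the "appendable" criterion concrete, here is a minimal sketch using JSON Lines and only the Python standard library. The file name and the `id`/`name` fields are made up for illustration and are not the real pathways schema. The point is that appending a row never needs to read or rewrite earlier rows, which a single top-level JSON array cannot offer.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line; earlier rows are never read."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_records(path):
    """Parse each non-empty line as an independent JSON document."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical file and fields, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "pathways.jsonl")
append_record(path, {"id": "P1", "name": "glycolysis"})
append_record(path, {"id": "P2", "name": "TCA cycle"})
records = read_records(path)
```
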
Why?
I think there are a few good reasons to consider this change.
- Current format requires specialised knowledge to understand the data format itself (not just domain knowledge of metabolomics). Using a more common format means that someone receiving the data can use an off-the-shelf parser and be confident that it works.
- eval()-ing Python is slow and dangerous. literal_eval() is safer but still much slower than parsing, say, YAML. A standard format also opens up the possibility of using the data in non-Python languages. I'm already doing this a little in the command-line client app I'll give you, which is written in Rust. Parsing literal Python values is painful but possible outside of Python, and using a more common format makes this much easier.
- Extending the data you want to store becomes much easier. You don't have to make fundamental adjustments to the format to add a citation field. In YAML you would just add an optional key to each object in the list that has a citation. In SQLite you'd add an extra nullable column. Both of these options take less space than a 3-byte empty list for each entry in your custom format (~35k for a single 12000-entry file).
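The extensibility point can be sketched as follows, assuming one JSON object per row. The field names (`id`, `name`, `citation`) and the two reader functions are hypothetical and only stand in for "old" and "new" software. A reader written before the citation field existed keeps working on new rows because it only looks up the keys it knows, and rows without a citation simply omit the key.

```python
import json


def load_pathway_v1(line):
    """Reader written before 'citation' existed: it ignores unknown keys."""
    obj = json.loads(line)
    return obj["id"], obj["name"]


def load_pathway_v2(line):
    """Newer reader: 'citation' is optional, so rows without it yield None."""
    obj = json.loads(line)
    return obj["id"], obj["name"], obj.get("citation")


# Hypothetical rows: one serialized before the field was added, one after.
old_row = '{"id": "P1", "name": "glycolysis"}'
new_row = '{"id": "P2", "name": "TCA cycle", "citation": "hypothetical-ref-1"}'

old_reader_on_new_row = load_pathway_v1(new_row)
new_reader_on_old_row = load_pathway_v2(old_row)
```

The same idea carries over to YAML (an optional key) or SQLite (a nullable column); JSON is used here only because its parser ships with Python.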