This new interface method is needed:
Create chunks from an input file, compress them, and write a header.
Or in short: Create a zchunk file.
The same rules as in zchunk/zchunk#4 should apply.
I'd suggest a lower bound of 64 bytes and an upper bound of 1 Mbyte [for uncompressed chunks].
There are options ZCK_CHUNK_MIN and ZCK_CHUNK_MAX that can be set, but the defaults are 1 byte for the minimum and 10 MB for the maximum.
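To illustrate how such bounds would interact, here is a minimal sketch that clamps a proposed chunk size into the suggested window (the constants use the 64 B / 1 MiB proposal from above, not zchunk's actual 1 B / 10 MB defaults):

```python
# Suggested bounds from the discussion above; zchunk's real
# defaults (1 byte / 10 MB) are much looser.
ZCK_CHUNK_MIN = 64              # bytes
ZCK_CHUNK_MAX = 1024 * 1024     # 1 MiB

def clamp_chunk_size(size: int) -> int:
    """Force a proposed chunk size into the [min, max] window."""
    return max(ZCK_CHUNK_MIN, min(size, ZCK_CHUNK_MAX))
```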
Things which need to be implemented:
Basic design goals
About ZCK_CHUNK_MIN / ZCK_CHUNK_MAX
About ZCK_CHUNK_MIN: The HTTP(/2) headers alone take up some bytes. Connection overhead plus request and response headers can easily reach 1024 bytes, i.e. 1 KiB. That said, I'd set ZCK_CHUNK_MIN to at least 5 KiB, better 10 KiB, because it makes no sense to spend ~1 KiB of connection overhead to receive 1 byte of usable data.
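To make that overhead argument concrete, here is a back-of-the-envelope calculation (the ~1 KiB per-request overhead is the rough estimate from above, not a measured value):

```python
# Rough per-request HTTP overhead (connection + request + response
# headers), as estimated above; the real value varies by server/protocol.
OVERHEAD = 1024  # bytes

def efficiency(chunk_size: int) -> float:
    """Fraction of transferred bytes that are usable chunk data."""
    return chunk_size / (chunk_size + OVERHEAD)

for size in (1, 64, 1024, 5 * 1024, 10 * 1024):
    print(f"{size:>6} B chunk -> {efficiency(size):6.1%} usable data")
```

With a 1-byte chunk barely 0.1% of the transfer is payload, while a 10 KiB chunk is already above 90%, which is why a minimum in the 5 to 10 KiB range pays off.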
I think ZCK_CHUNK_MAX depends on the file format and the target audience. If you are downloading ISOs (say you have ubuntu-18.04.iso and want to download ubuntu-18.04.2, both about 1.9 GB), you might want a different ZCK_CHUNK_MAX than for a 6 to 40 MiB repository metadata file. But: the bigger the chunk, the less likely it is that two chunks produce the same hash, especially for binary formats.
Default properties file / fileformats.chunking.yml
We could create a file for predefined file formats. Something like this: https://gist.github.com/bmhm/18b57655e0c0c8a5a38d6cdf487866e4
This file could easily be extended for other file formats.
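The gist is not reproduced here, but a file of per-format chunking defaults could look roughly like this (every key and value below is a hypothetical illustration, not the actual contents of the gist):

```yaml
# Hypothetical per-file-format chunking defaults (illustrative only).
formats:
  - name: repo-metadata
    extensions: [".xml", ".yaml"]
    chunk_min: 10KiB   # per the ZCK_CHUNK_MIN discussion above
    chunk_max: 1MiB
  - name: iso-image
    extensions: [".iso"]
    chunk_min: 64KiB
    chunk_max: 16MiB
```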
Maybe we need a 2-pass algorithm, which would first chunk the input as usual (e.g. with buzhash), and then, if a chunk is > avg * 4 or > ZCK_CHUNK_MAX, split it; if a chunk is < avg / 4 or < ZCK_CHUNK_MIN, merge it with the next one.
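A minimal sketch of that second pass, assuming the first pass has already produced a list of chunk lengths (the bounds and the avg-based thresholds come from the description above; this is illustrative, not zchunk's actual implementation):

```python
# Hypothetical bounds; zchunk's real defaults differ (see above).
ZCK_CHUNK_MIN = 10 * 1024        # 10 KiB
ZCK_CHUNK_MAX = 1024 * 1024      # 1 MiB

def second_pass(chunks: list[int], avg: int) -> list[int]:
    """Split oversized chunks and merge undersized ones.

    `chunks` is a list of chunk lengths produced by a first pass
    (e.g. a buzhash rolling-hash chunker); returns adjusted lengths.
    """
    result = []
    pending = 0  # bytes carried over from a too-small chunk
    for size in chunks:
        size += pending
        pending = 0
        # Split chunks that are > avg*4 or > ZCK_CHUNK_MAX.
        while size > avg * 4 or size > ZCK_CHUNK_MAX:
            cut = min(avg * 4, ZCK_CHUNK_MAX)
            result.append(cut)
            size -= cut
        # Merge chunks that are < avg/4 or < ZCK_CHUNK_MIN into the next.
        if size < avg // 4 or size < ZCK_CHUNK_MIN:
            pending = size
        else:
            result.append(size)
    if pending:  # trailing remainder becomes (part of) the last chunk
        if result:
            result[-1] += pending
        else:
            result.append(pending)
    return result
```

Note that no bytes are lost: the output lengths always sum to the input lengths, and undersized trailing data is folded into the final chunk.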