This new interface method is needed:
Create chunks from an input file, compress them, and write a header.
Or in short: Create a zchunk file.
The same rules as in zchunk/zchunk#4 should apply.
I'd suggest a lower bound of 64 bytes and an upper bound of 1 Mbyte [for uncompressed chunks].
There are options ZCK_CHUNK_MIN and ZCK_CHUNK_MAX that can be set, but the defaults are 1 byte for the minimum and 10 MB for the maximum.
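To illustrate how such bounds would interact, here is a minimal sketch that clamps a proposed chunk size into the suggested window (the constants use the 64 B / 1 MiB proposal from above, not zchunk's actual 1 B / 10 MB defaults):

```python
# Suggested bounds from the discussion above; zchunk's real
# defaults (1 byte / 10 MB) are much looser.
ZCK_CHUNK_MIN = 64              # bytes
ZCK_CHUNK_MAX = 1024 * 1024     # 1 MiB

def clamp_chunk_size(size: int) -> int:
    """Force a proposed chunk size into the [min, max] window."""
    return max(ZCK_CHUNK_MIN, min(size, ZCK_CHUNK_MAX))
```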
Things which need to be implemented:
Basic design goals
About ZCK_CHUNK_MIN / ZCK_CHUNK_MAX
About ZCK_CHUNK_MIN: The HTTP(/2) headers alone take up some bytes. Connection overhead plus request and response headers can easily reach 1024 bytes, i.e. 1 KiB. That said, I'd set ZCK_CHUNK_MIN to at least 5 KiB, better 10 KiB, because it makes no sense to spend ~1 KiB of connection overhead to receive 1 byte of usable data.
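To make that overhead argument concrete, here is a back-of-the-envelope calculation (the ~1 KiB per-request overhead is the rough estimate from above, not a measured value):

```python
# Rough per-request HTTP overhead (connection + request + response
# headers), as estimated above; the real value varies by server/protocol.
OVERHEAD = 1024  # bytes

def efficiency(chunk_size: int) -> float:
    """Fraction of transferred bytes that are usable chunk data."""
    return chunk_size / (chunk_size + OVERHEAD)

for size in (1, 64, 1024, 5 * 1024, 10 * 1024):
    print(f"{size:>6} B chunk -> {efficiency(size):6.1%} usable data")
```

With a 1-byte chunk barely 0.1% of the transfer is payload, while a 10 KiB chunk is already above 90%, which is why a minimum in the 5 to 10 KiB range pays off.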
I think ZCK_CHUNK_MAX depends on the file format and the target audience. If you are downloading ISOs (say you have ubuntu-18.04.iso and want to download ubuntu-18.04.2, both about 1.9 GB), you might want a different ZCK_CHUNK_MAX than for a 6 to 40 MiB repository metadata file. But: the bigger the chunk, the less likely it is that two chunks produce the same hash, especially for binary formats.
Default properties file / fileformats.chunking.yml
We could create a file for predefined file formats. Something like this: https://gist.github.com/bmhm/18b57655e0c0c8a5a38d6cdf487866e4
This file could easily be extended for other file formats.
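The gist is not reproduced here, but a file of per-format chunking defaults could look roughly like this (every key and value below is a hypothetical illustration, not the actual contents of the gist):

```yaml
# Hypothetical per-file-format chunking defaults (illustrative only).
formats:
  - name: repo-metadata
    extensions: [".xml", ".yaml"]
    chunk_min: 10KiB   # per the ZCK_CHUNK_MIN discussion above
    chunk_max: 1MiB
  - name: iso-image
    extensions: [".iso"]
    chunk_min: 64KiB
    chunk_max: 16MiB
```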
Maybe we need a 2-pass algorithm, which would first chunk the input as usual (e.g. with buzhash), and then, if a chunk is > avg * 4 or > ZCK_CHUNK_MAX, split it; if a chunk is < avg / 4 or < ZCK_CHUNK_MIN, merge it with the next one.
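A minimal sketch of that second pass, assuming the first pass has already produced a list of chunk lengths (the bounds and the avg-based thresholds come from the description above; this is illustrative, not zchunk's actual implementation):

```python
# Hypothetical bounds; zchunk's real defaults differ (see above).
ZCK_CHUNK_MIN = 10 * 1024        # 10 KiB
ZCK_CHUNK_MAX = 1024 * 1024      # 1 MiB

def second_pass(chunks: list[int], avg: int) -> list[int]:
    """Split oversized chunks and merge undersized ones.

    `chunks` is a list of chunk lengths produced by a first pass
    (e.g. a buzhash rolling-hash chunker); returns adjusted lengths.
    """
    result = []
    pending = 0  # bytes carried over from a too-small chunk
    for size in chunks:
        size += pending
        pending = 0
        # Split chunks that are > avg*4 or > ZCK_CHUNK_MAX.
        while size > avg * 4 or size > ZCK_CHUNK_MAX:
            cut = min(avg * 4, ZCK_CHUNK_MAX)
            result.append(cut)
            size -= cut
        # Merge chunks that are < avg/4 or < ZCK_CHUNK_MIN into the next.
        if size < avg // 4 or size < ZCK_CHUNK_MIN:
            pending = size
        else:
            result.append(size)
    if pending:  # trailing remainder becomes (part of) the last chunk
        if result:
            result[-1] += pending
        else:
            result.append(pending)
    return result
```

Note that no bytes are lost: the output lengths always sum to the input lengths, and undersized trailing data is folded into the final chunk.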