4 changes: 4 additions & 0 deletions docs/benchmarking.md
@@ -23,6 +23,10 @@ The manual GitHub Actions benchmark report workflow runs the same benchmark suite
- side-by-side comparison statistics for response throughput, training mean time, response allocation, estimated resident model memory, and WikiText2 perplexity
- a paper-alignment audit for the canonical BitNet model, so the report shows the implemented architecture guarantees alongside repository-local training, perplexity, zero-shot fixture, and checkpoint round-trip coverage

The repository now vendors the full pre-tokenized WikiText-2 corpus under `src/BitNetSharp.Core/Data/WikiText2/`. The benchmark perplexity comparison loads the repository-local `wiki.valid.tokens` validation split line-for-line, including the blank separator rows from the original tokenized corpus.
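As a minimal sketch of what line-for-line loading means here (`load_tokenized_split` is a hypothetical helper for illustration, not part of the repository; the actual benchmark code is C#, and UTF-8 encoding is assumed):

```python
from pathlib import Path

def load_tokenized_split(path: Path) -> list[str]:
    # Hypothetical helper: read a WikiText-2 token file line-for-line.
    # splitlines() keeps interior blank rows, which act as article
    # separators in the original tokenized corpus.
    return path.read_text(encoding="utf-8").splitlines()
```

Blank separator rows survive as empty strings, so an evaluator sees the same line structure as the vendored file.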

If you want to refresh the committed benchmark corpora from a local clone, run `scripts/process_full_corpora.py` with your full TinyLlama source plus either `--wikitext-source-dir` or `--download-wikitext`. The script rewrites `src/BitNetSharp.Core/Data/WikiText2/wiki.*.tokens` in place and emits normalized TinyLlama train/validation/test JSONL files under `src/BitNetSharp.Core/Data/TinyLlama/` for local review and commit.

## Run the built-in comparison benchmark

```bash
12 changes: 12 additions & 0 deletions docs/usage.md
@@ -68,6 +68,18 @@ dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen --dom

This command reads optional seed examples, merges the built-in pattern prompts with the repository template, and writes JSONL output for downstream local fine-tuning or evaluation. Optional flags include `--task-type`, `--constraint`, `--constraints`, `--output-schema`, `--template`, `--candidate-count`, `--min-quality`, `--max-tokens`, and `--lora`. The emitted JSONL includes both the core generator fields (`seedInstruction`, `variation`, `generatorModel`, `tags`) and the merged prompt metadata (`prompt`, `taskType`, `qualityScore`, `generationTimestamp`, `groundingContext`). See the [DataGen guide](datagen-guide.md) for accepted seed aliases and the merged output schema.
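For illustration, the optional flags can be appended to the `datagen` command shown above. A hypothetical combination (flag names come from the list above; the values are illustrative, not documented defaults, and any required base arguments are omitted for brevity):

```bash
dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen \
  --task-type summarization \
  --candidate-count 4 \
  --min-quality 0.8
```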

## Refresh full TinyLlama and WikiText-2 corpora

```bash
python scripts/process_full_corpora.py \
--tinyllama-source /absolute/path/to/tinyllama.jsonl \
--wikitext-source-dir /absolute/path/to/wikitext-2 \
--commit \
--commit-message "Vendor full TinyLlama and WikiText-2 corpora"
```

The script writes normalized TinyLlama train/validation/test JSONL files under `src/BitNetSharp.Core/Data/TinyLlama/` and refreshes the vendored WikiText-2 token files under `src/BitNetSharp.Core/Data/WikiText2/`, preserving the blank separator rows from the tokenized WikiText-2 corpus. It can also optionally stage and commit the updated data files from your local clone. If you do not already have local WikiText-2 files, pass `--download-wikitext` to pull the tokenized `wiki.train.tokens`, `wiki.valid.tokens`, and `wiki.test.tokens` files from the default public source before writing them into the repository.
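For example, using only the flags documented above, a refresh that downloads WikiText-2 instead of reading a local copy might look like (the TinyLlama path is a placeholder):

```bash
python scripts/process_full_corpora.py \
  --tinyllama-source /absolute/path/to/tinyllama.jsonl \
  --download-wikitext
```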

## Train the traditional comparison model

```bash