4 changes: 4 additions & 0 deletions docs/benchmarking.md
@@ -23,6 +23,10 @@ The manual GitHub Actions benchmark report workflow runs the same benchmark suite
- side-by-side comparison statistics for response throughput, training mean time, response allocation, estimated resident model memory, and WikiText2 perplexity
- a paper-alignment audit for the canonical BitNet model, so the report shows the implemented architecture guarantees alongside repository-local training, perplexity, zero-shot fixture, and checkpoint round-trip coverage

The repository now vendors the full pre-tokenized WikiText-2 corpus under `src/BitNetSharp.Core/Data/WikiText2/`. The benchmark perplexity comparison loads the repository-local `wiki.valid.tokens` validation split line-for-line, including the blank separator rows from the original tokenized corpus.
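As a minimal sketch of what line-for-line loading means here (`load_tokenized_split` is a hypothetical helper for illustration, not part of the repository; the actual benchmark code is C#, and UTF-8 encoding is assumed):

```python
from pathlib import Path

def load_tokenized_split(path: Path) -> list[str]:
    # Hypothetical helper: read a WikiText-2 token file line-for-line.
    # splitlines() keeps interior blank rows, which act as article
    # separators in the original tokenized corpus.
    return path.read_text(encoding="utf-8").splitlines()
```

Blank separator rows survive as empty strings, so an evaluator sees the same line structure as the vendored file.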

If you want to refresh the committed benchmark corpora from a local clone, run `scripts/process_full_corpora.py` with your full TinyLlama source plus either `--wikitext-source-dir` or `--download-wikitext`. The script rewrites `src/BitNetSharp.Core/Data/WikiText2/wiki.*.tokens` in place and emits normalized TinyLlama train/validation/test JSONL files under `src/BitNetSharp.Core/Data/TinyLlama/` for local review and commit.

## Run the built-in comparison benchmark

```bash
12 changes: 12 additions & 0 deletions docs/usage.md
@@ -68,6 +68,18 @@ dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen --dom

This command reads optional seed examples, merges the built-in pattern prompts with the repository template, and writes JSONL output for downstream local fine-tuning or evaluation. Optional flags include `--task-type`, `--constraint`, `--constraints`, `--output-schema`, `--template`, `--candidate-count`, `--min-quality`, `--max-tokens`, and `--lora`. The emitted JSONL includes both the core generator fields (`seedInstruction`, `variation`, `generatorModel`, `tags`) and the merged prompt metadata (`prompt`, `taskType`, `qualityScore`, `generationTimestamp`, `groundingContext`). See the [DataGen guide](datagen-guide.md) for accepted seed aliases and the merged output schema.
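For illustration, the optional flags can be appended to the `datagen` command shown above. A hypothetical combination (flag names come from the list above; the values are illustrative, not documented defaults, and any required base arguments are omitted for brevity):

```bash
dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen \
  --task-type summarization \
  --candidate-count 4 \
  --min-quality 0.8
```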

## Refresh full TinyLlama and WikiText-2 corpora

```bash
python scripts/process_full_corpora.py \
--tinyllama-source /absolute/path/to/tinyllama.jsonl \
--wikitext-source-dir /absolute/path/to/wikitext-2 \
--commit \
--commit-message "Vendor full TinyLlama and WikiText-2 corpora"
```

The script writes normalized TinyLlama train/validation/test JSONL files under `src/BitNetSharp.Core/Data/TinyLlama/` and refreshes the vendored WikiText-2 token files under `src/BitNetSharp.Core/Data/WikiText2/`, preserving the blank separator rows from the tokenized WikiText-2 corpus. It can also optionally stage and commit the updated data files from your local clone. If you do not already have local WikiText-2 files, pass `--download-wikitext` to pull the tokenized `wiki.train.tokens`, `wiki.valid.tokens`, and `wiki.test.tokens` files from the default public source before writing them into the repository.
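For example, using only the flags documented above, a refresh that downloads WikiText-2 instead of reading a local copy might look like (the TinyLlama path is a placeholder):

```bash
python scripts/process_full_corpora.py \
  --tinyllama-source /absolute/path/to/tinyllama.jsonl \
  --download-wikitext
```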

## Train the traditional comparison model

```bash