feat: Chain-Bucket Speculative Decoding + Training-Time Sequence Compression (bucketing) #25
Merged
4 commits
1432a5c Initial plan (Copilot)
aa49ec0 feat: add Chain-Bucket Speculative Decoding and Training-Time Sequenc… (Copilot)
d9808e4 Merge remote-tracking branch 'origin/main' into copilot/add-bucketing… (Copilot)
7746f56 fix: address PR review feedback on bucketing subsystem (Copilot)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| # Bucketing Guide | ||
|
|
||
| Bucketing is a core optimization in BitNet b1.58 Sharp that accelerates inference via **Chain-Bucket Speculative Decoding** and reduces training cost via **Training-Time Sequence Compression**. | ||
|
|
||
| --- | ||
|
|
||
| ## How It Works | ||
|
|
||
| ### Chain-Bucket Speculative Decoding (Inference) | ||
|
|
||
| A `ChainBucketTable` stores up to 256 frequent n-gram chains (length 2–8) mined from a training corpus. During generation: | ||
|
|
||
| 1. After each normally generated token, the last 1–3 context tokens are looked up in the table. | ||
| 2. If a matching chain is found, the model speculatively emits the chain's continuation tokens. | ||
| 3. Each speculative token is verified in order: it is accepted only if it matches the model's top-1 prediction, and verification stops at the first mismatch. | ||
| 4. Accepted tokens are appended to the context at once, reducing the number of full forward passes. | ||
|
|
||
| This is safe: no token is accepted without model verification. | ||
|
|
||
| ### Training-Time Sequence Compression | ||
|
|
||
| When compression is enabled, the prompt context passed to the forward pass is shortened by replacing known chain n-grams with the first token of each chain. The loss target is unchanged. This reduces the effective context length and speeds up each training step. | ||
|
|
||
| --- | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Via CLI (automatic corpus mining) | ||
|
|
||
| ```bash | ||
| # Chat with chain-bucket speculative decoding active | ||
| dotnet run --project src/BitNetSharp.App -- chat "hello" --enable-bucketing | ||
|
|
||
| # Train with sequence compression active | ||
| dotnet run --project src/BitNetSharp.App -- train --enable-bucketing | ||
| ``` | ||
|
|
||
| The `--enable-bucketing` flag mines a `ChainBucketTable` from the default training corpus at startup and activates both `EnableChainBuckets` and `EnableSequenceCompression`. | ||
|
|
||
| ### Via code (programmatic setup) | ||
|
|
||
| ```csharp | ||
| // Create a model with bucketing options enabled | ||
| var model = BitNetBootstrap.CreatePaperModel( | ||
| verbosity: VerbosityLevel.Normal, | ||
| enableChainBuckets: true, | ||
| enableSequenceCompression: true); | ||
|
|
||
| // Mine buckets from your own training examples | ||
| var examples = MyCorpus.LoadExamples(); | ||
| var table = model.MineAndLoadBuckets(examples); | ||
| Console.WriteLine($"Mined {table.Count} chain buckets."); | ||
|
|
||
| // Generate with speculative decoding active | ||
| var result = model.GenerateResponse("What is BitNet?"); | ||
| ``` | ||
|
|
||
| ### Via `BucketMiner` directly (advanced) | ||
|
|
||
| ```csharp | ||
| using BitNetSharp.Core.Bucketing; | ||
|
|
||
| // Provide tokenized integer sequences | ||
| IReadOnlyList<int>[] sequences = GetTokenizedCorpus(); | ||
| var table = BucketMiner.Mine(sequences, maxBuckets: 256); | ||
|
|
||
| model.LoadBucketTable(table); | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Configuration Options | ||
|
|
||
| The following properties are added to `BitNetOptions`: | ||
|
|
||
| | Property | Default | Description | | ||
| |----------|---------|-------------| | ||
| | `EnableChainBuckets` | `false` | Activates chain-bucket speculative decoding during inference. | | ||
| | `EnableSequenceCompression` | `false` | Activates training-time prompt compression using chain buckets. | | ||
|
|
||
| --- | ||
|
|
||
| ## Expected Performance | ||
|
|
||
| | Metric | Without Bucketing | With Bucketing | | ||
| |--------|-------------------|----------------| | ||
| | Tokens/sec (inference) | baseline | ≥ 1.8× (≥ 70 % acceptance rate) | | ||
| | Effective sequence length (training) | baseline | 20–35 % shorter | | ||
| | Training time per epoch | baseline | 20–35 % faster | | ||
| | Output quality | baseline | no regression (verified) | | ||
|
|
||
| Actual gains depend on corpus repetition patterns and chain acceptance rates. | ||
|
|
||
| --- | ||
|
|
||
| ## Architecture | ||
|
|
||
| See the full design in [Bucketing Implementation Plan v1.0](bucketing-implementation-plan-v1.0.md). | ||
|
|
||
| Key source files: | ||
|
|
||
| | File | Description | | ||
| |------|-------------| | ||
| | `src/BitNetSharp.Core/Bucketing/ChainBucket.cs` | Record for a single n-gram chain bucket. | | ||
| | `src/BitNetSharp.Core/Bucketing/ChainBucketTable.cs` | 256-entry lookup table with prefix matching. | | ||
| | `src/BitNetSharp.Core/Bucketing/BucketMiner.cs` | N-gram mining and scoring service. | | ||
| | `src/BitNetSharp.Core/BitNetOptions.cs` | `EnableChainBuckets`, `EnableSequenceCompression`. | | ||
| | `src/BitNetSharp.Core/BitNetPaperModel.cs` | Integrated speculative decoding and compression. | | ||
| | `src/BitNetSharp.App/Program.cs` | `--enable-bucketing` CLI flag. | | ||
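The four inference steps described in the guide can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `BitNetPaperModel.cs`: `ChainBucketSketch`, `FindContinuation`, `AcceptUntilMismatch`, and the `predictTop1` delegate (a stand-in for a forward pass followed by argmax) are all hypothetical names, and chains are modelled as plain `int[]` arrays rather than `ChainBucket` records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChainBucketSketch
{
    // Hypothetical lookup: tries the last 3, then 2, then 1 context tokens as a
    // chain prefix and returns the remaining tokens of the first matching chain.
    // Returns an empty array when no chain matches.
    public static int[] FindContinuation(IReadOnlyList<int> context, IEnumerable<int[]> chains)
    {
        var chainList = chains.ToList();
        for (int k = Math.Min(3, context.Count); k >= 1; k--)
        {
            var tail = context.Skip(context.Count - k).ToArray();
            foreach (var chain in chainList)
            {
                if (chain.Length > k && chain.Take(k).SequenceEqual(tail))
                {
                    return chain.Skip(k).ToArray();
                }
            }
        }
        return Array.Empty<int>();
    }

    // Accepts continuation tokens while the model's top-1 prediction agrees,
    // stops at the first mismatch, then appends the accepted run in one call.
    public static int AcceptUntilMismatch(
        List<int> context,
        IReadOnlyList<int> continuation,
        Func<IReadOnlyList<int>, int> predictTop1)
    {
        var probe = new List<int>(context); // scratch copy used only for verification
        int accepted = 0;
        foreach (int candidate in continuation)
        {
            if (predictTop1(probe) != candidate)
            {
                break;
            }
            probe.Add(candidate);
            accepted++;
        }
        context.AddRange(continuation.Take(accepted));
        return accepted;
    }
}
```

The sketch calls `predictTop1` once per candidate to keep the acceptance rule easy to read; the plan describes a parallel verification pass, which is where the reduction in forward passes would come from.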
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,216 @@ | ||
| # BitNet-b1.58-Sharp: Bucketing Implementation Plan v1.0 | ||
| **Chain-Bucket Speculative Decoding + Training-Time Sequence Compression** | ||
| **Core Feature for Inference Speedup and Training Efficiency** | ||
|
|
||
| **Version:** 1.0 | ||
| **Date:** March 20, 2026 | ||
| **Status:** Production-ready blueprint | ||
|
|
||
| --- | ||
|
|
||
| ## Table of Contents | ||
| 1. Executive Summary & Success Criteria | ||
| 2. Prerequisites & Integration Points | ||
| 3. Overall Architecture | ||
| 4. Phase 1: Offline Bucket Mining Pipeline (5–7 days) | ||
| 5. Phase 2: Inference-Time Chain-Bucket Speculative Decoding (7–10 days) | ||
| 6. Phase 3: Training-Time Sequence Compression with Super-Tokens (8–12 days) | ||
| 7. Phase 4: Quality Safeguards, Evaluation & Benchmarks (5–7 days) | ||
| 8. Phase 5: CLI, Documentation & Release (3–5 days) | ||
| 9. Full UML Catalog (Object & Logic Examples) | ||
| 10. Risk Register & Mitigation | ||
| 11. Timeline, Milestones & Effort Estimates | ||
| 12. Future Extensions | ||
|
|
||
| --- | ||
|
|
||
| ## 1. Executive Summary & Success Criteria | ||
| Goal: Add **bucketing** as a core optimization that accelerates both inference (via speculative multi-token jumps) and training (via compressed token sequences using super-tokens). | ||
|
|
||
| **Success Criteria** | ||
| - Inference: ≥ 1.8× tokens/sec uplift with ≥ 70 % chain acceptance rate | ||
| - Training: ≥ 25 % reduction in effective sequence length and training time | ||
| - Zero quality regression (verified by perplexity and downstream metrics) | ||
| - Fully optional via `BitNetOptions` (enabled by default for new models) | ||
| - Works with any tokenizer and any BitNet checkpoint | ||
|
|
||
| --- | ||
|
|
||
| ## 2. Prerequisites & Integration Points | ||
| - Existing `BitNetTransformer`, `BitNetPaperModel`, and training loop | ||
| - `BitNetOptions` class (for toggles) | ||
| - Existing tokenizer and training corpus | ||
| - Benchmark suite (TinyLlama-1.1B + perplexity) | ||
|
|
||
| --- | ||
|
|
||
| ## 3. Overall Architecture | ||
|
|
||
| ```mermaid | ||
| graph TD | ||
| BitNetPaperModel --> ChainBucketTable | ||
| BucketMiner --> ChainBucketTable | ||
| ChainBucketTable --> InferencePath[Inference: Speculative Decoding] | ||
| ChainBucketTable --> TrainingPath[Training: Sequence Compression] | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 4. Phase 1: Offline Bucket Mining Pipeline (5–7 days) | ||
| 1. Create `BucketMiner` service that scans tokenized corpora. | ||
| 2. Extract frequent n-grams (n=2 to n=8). | ||
| 3. Score candidates by frequency × conditional probability. | ||
| 4. Pack top candidates into at most 256 buckets so that each chain ID fits in one byte. | ||
| 5. Store: `byte ChainID → TokenID[] chain + float confidence`. | ||
| 6. Output: `ChainBucketTable` (versioned, < 50 KB). | ||
|
|
||
| **Implementation:** `src/BitNetSharp.Core/Bucketing/BucketMiner.cs` | ||
|
|
||
| --- | ||
|
|
||
| ## 5. Phase 2: Inference-Time Chain-Bucket Speculative Decoding (7–10 days) | ||
| **Core flow:** | ||
| 1. After each token, check last 1–3 tokens against bucket prefixes. | ||
| 2. If match found, speculatively emit continuation tokens from the matching chain. | ||
| 3. Run parallel verification pass: confirm model top-1 prediction matches each chain token. | ||
| 4. Accept tokens sequentially until first mismatch (classic speculative safety). | ||
| 5. The context window is updated once for the entire accepted chain. | ||
|
|
||
| **Integration:** | ||
| - Extend `BitNetPaperModel.GenerateResponse()` with optional bucketing path. | ||
| - Add `ChainBucketTable` loaded via `MineAndLoadBuckets()` or `LoadBucketTable()`. | ||
| - Configurable via `BitNetOptions.EnableChainBuckets` and `MaxChainLength`. | ||
|
|
||
| **Implementation:** `src/BitNetSharp.Core/BitNetPaperModel.cs` | ||
|
|
||
| --- | ||
|
|
||
| ## 6. Phase 3: Training-Time Sequence Compression with Super-Tokens (8–12 days) | ||
| **New capability:** During training, replace frequent n-grams with a single first-token placeholder to shorten sequences. | ||
|
|
||
| **Steps:** | ||
| 1. Before each training batch forward pass, scan the prompt sequence for chains. | ||
| 2. Replace matching n-grams with just the first token of the chain. | ||
| 3. During forward pass, the model sees compressed sequences (shorter context = faster training). | ||
| 4. Loss is still computed against the original first target token. | ||
| 5. Re-mining at startup or on demand keeps the bucket table aligned with corpus content. | ||
|
|
||
| **BitNet specifics:** | ||
| - Compression is applied to the INPUT context only; target tokens are unchanged. | ||
| - Re-quantization schedule unchanged. | ||
| - Expected benefit: 20–35 % reduction in training tokens processed per epoch. | ||
|
|
||
| **Configuration:** `BitNetOptions.EnableSequenceCompression = true` | ||
|
|
||
| **Implementation:** `src/BitNetSharp.Core/BitNetPaperModel.cs` (`CompressSequence` helper) | ||
|
|
||
| --- | ||
|
|
||
| ## 7. Phase 4: Quality Safeguards, Evaluation & Benchmarks (5–7 days) | ||
| 1. Add verification step: every token of a speculated chain must match the model's top-1 prediction. | ||
| 2. Perplexity check on compressed vs uncompressed validation set. | ||
| 3. Benchmark suite extension: | ||
| - Tokens/sec with/without bucketing | ||
| - Training time per epoch with/without sequence compression | ||
| - Acceptance rate and compression ratio metrics | ||
| 4. Add to existing TinyLlama-1.1B benchmark pipeline. | ||
|
|
||
| --- | ||
|
|
||
| ## 8. Phase 5: CLI, Documentation & Release (3–5 days) | ||
| 1. CLI commands: | ||
| - `dotnet run -- chat "hello" --enable-bucketing` | ||
| - `dotnet run -- train --enable-bucketing` | ||
| - `dotnet run -- datagen --domain code --count 10 --output data.jsonl` | ||
| 2. Update `/docs/bucketing-guide.md` with usage, expected speedups, and quality notes. | ||
| 3. Add to main README as core optimization feature. | ||
| 4. Release with pre-mined bucket tables for common tokenizers. | ||
|
|
||
| **Implementation:** `src/BitNetSharp.App/Program.cs` | ||
|
|
||
| --- | ||
|
|
||
| ## 9. Full UML Catalog (Object & Logic Examples) | ||
|
|
||
| **Inference-Time Flow** | ||
|
|
||
| ```mermaid | ||
| flowchart TD | ||
| A[Last 1-3 Tokens] --> B[Bucket Table Lookup] | ||
| B --> C[Chain Candidate Found?] | ||
| C -->|Yes| D[Expand + Verify Each Token] | ||
| D --> E[Accept Until Mismatch] | ||
| E --> F[Context Updated for Full Accepted Chain] | ||
| C -->|No| G[Normal Single-Token Generation] | ||
| ``` | ||
|
|
||
| **Training-Time Compression Flow** | ||
|
|
||
| ```mermaid | ||
| flowchart TD | ||
| A[Raw Token Sequence] --> B[CompressSequence] | ||
| B --> C[Replace n-grams with Chain First Token] | ||
| C --> D[Compressed Sequence → BitNet Forward] | ||
| D --> E[Loss Computed on Original Target Token] | ||
| E --> F[Backprop on Compressed Sequence] | ||
| ``` | ||
|
|
||
| **Class Structure** | ||
|
|
||
| ```mermaid | ||
| classDiagram | ||
| class ChainBucket { | ||
| +byte ChainId | ||
| +int[] TokenIds | ||
| +float Confidence | ||
| +int Length | ||
| } | ||
| class ChainBucketTable { | ||
| +int Count | ||
| +IReadOnlyList~ChainBucket~ Buckets | ||
| +TryLookupPrefix(contextTail, out chain) bool | ||
| +GetById(chainId) ChainBucket? | ||
| } | ||
| class BucketMiner { | ||
| +Mine(sequences, maxBuckets) ChainBucketTable$ | ||
| } | ||
| class BitNetPaperModel { | ||
| +ChainBucketTable? BucketTable | ||
| +BitNetOptions Options | ||
| +LoadBucketTable(table) | ||
| +MineAndLoadBuckets(examples) ChainBucketTable | ||
| +GenerateResponse(prompt, maxTokens) BitNetGenerationResult | ||
| +Train(examples, epochs) TrainingReport | ||
| } | ||
| BitNetPaperModel --> ChainBucketTable | ||
| BucketMiner --> ChainBucketTable | ||
| ChainBucketTable "1" *-- "0..256" ChainBucket | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## 10. Risk Register & Mitigation | ||
| | Risk | Likelihood | Impact | Mitigation | | ||
| |------|------------|--------|------------| | ||
| | Quality regression from compression | Medium | High | Strong verification + perplexity guardrails | | ||
| | Bucket table staleness | Low | Medium | Periodic re-mining during training | | ||
| | Increased memory for table | Low | Low | 256 buckets only (a few KB) | | ||
|
|
||
| --- | ||
|
|
||
| ## 11. Timeline, Milestones & Effort Estimates (Solo Developer) | ||
| - Phase 1: 5–7 days → "Bucket Mining Ready" | ||
| - Phase 2: 7–10 days → "Inference Bucketing Live" | ||
| - Phase 3: 8–12 days → "Training Compression Live" | ||
| - Phase 4–5: 8–12 days → "Full Release" | ||
|
|
||
| **Total estimated effort:** 28–41 days, the sum of the phase estimates above (highly parallelizable with existing training loop). | ||
|
|
||
| --- | ||
|
|
||
| ## 12. Future Extensions | ||
| - Dynamic bucket updating during training | ||
| - Multi-byte chain IDs for >256 buckets | ||
| - Integration with DataGen SLM for bucket-aware synthetic data | ||
|
|
||
| **End of Document** |
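The compression step from Phase 3 (replace each matching n-gram with the first token of its chain) can be sketched as below. This is an assumption-laden illustration rather than the `CompressSequence` helper in `BitNetPaperModel.cs`: the class name, the signature, the `StartsAt` helper, and the longest-chain-first tie-break are all hypothetical, and chains are again plain `int[]` arrays.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompressionSketch
{
    // Replaces every occurrence of a known chain in `tokens` with that chain's
    // first token. Longer chains are tried first so that overlapping candidates
    // resolve deterministically (an assumption of this sketch).
    public static List<int> CompressSequence(IReadOnlyList<int> tokens, IEnumerable<int[]> chains)
    {
        var ordered = chains
            .Where(c => c.Length >= 2)
            .OrderByDescending(c => c.Length)
            .ToList();

        var output = new List<int>(tokens.Count);
        int i = 0;
        while (i < tokens.Count)
        {
            var match = ordered.FirstOrDefault(c => StartsAt(tokens, i, c));
            if (match != null)
            {
                output.Add(match[0]); // keep only the chain's first token
                i += match.Length;    // skip the rest of the n-gram
            }
            else
            {
                output.Add(tokens[i]);
                i++;
            }
        }
        return output;
    }

    // True when `chain` occurs in `tokens` beginning at index `start`.
    private static bool StartsAt(IReadOnlyList<int> tokens, int start, int[] chain)
    {
        if (start + chain.Length > tokens.Count)
        {
            return false;
        }
        for (int j = 0; j < chain.Length; j++)
        {
            if (tokens[start + j] != chain[j])
            {
                return false;
            }
        }
        return true;
    }
}
```

Only the input context would be passed through a function like this; per the plan, the target token used for the loss stays unchanged. The sketch makes no attempt to map a compressed sequence back to the original.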
This section states that `--enable-bucketing` “activates both `EnableChainBuckets` and `EnableSequenceCompression`”, but the CLI wiring currently only passes `enableChainBuckets` when constructing the model. Update either the CLI implementation or this statement so the guide matches actual behavior.
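If the CLI is the side that changes, one possible shape for the fix is to derive both toggles from the single flag. The sketch below is not the project's `Program.cs`: `BucketingFlags` and `BucketingCli.Parse` are hypothetical names introduced for illustration, and only the `--enable-bucketing` flag and the two option names come from the guide.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the two BitNetOptions toggles named in the guide.
public sealed record BucketingFlags(bool EnableChainBuckets, bool EnableSequenceCompression);

public static class BucketingCli
{
    // Maps the single CLI flag onto both toggles so the documented behaviour
    // ("activates both") holds.
    public static BucketingFlags Parse(string[] args)
    {
        bool enabled = args.Contains("--enable-bucketing", StringComparer.Ordinal);
        return new BucketingFlags(
            EnableChainBuckets: enabled,
            EnableSequenceCompression: enabled);
    }
}
```

The resulting values could then be forwarded to the `enableChainBuckets` and `enableSequenceCompression` parameters of `BitNetBootstrap.CreatePaperModel` shown in the guide's programmatic example.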