diff --git a/docs/README.md b/docs/README.md index 23002b6..db84177 100644 --- a/docs/README.md +++ b/docs/README.md @@ -32,6 +32,8 @@ dotnet test BitNet-b1.58-Sharp.slnx - [Bucketing implementation plan v1.0](bucketing-implementation-plan-v1.0.md) - [DataGen guide](datagen-guide.md) - [Implementation plan](implementation-plan-v3.md) +- [Full implementation plan: real training + benchmarks + purity v1.0](full-implementation-plan-real-training-benchmarks-purity-v1.0.md) +- [Real training implementation plan v1.0](real-training-implementation-plan-v1.0.md) - [Releases and packaging](releases-and-packaging.md) - [Usage](usage.md) - [Training and visualization](training-and-visualization.md) diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 97aac87..8810920 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -6,6 +6,8 @@ - [Bucketing implementation plan v1.0](bucketing-implementation-plan-v1.0.md) - [DataGen guide](datagen-guide.md) - [Implementation plan v3 (active)](implementation-plan-v3.md) + - [Full implementation plan: real training + benchmarks + purity v1.0](full-implementation-plan-real-training-benchmarks-purity-v1.0.md) + - [Real training implementation plan v1.0](real-training-implementation-plan-v1.0.md) - [Implementation plan v2 (archived)](implementation-plan-v2.md) - [Implementation plan v1 (archived)](implementation-plan-v1.md) - [Benchmarking and model comparison](benchmarking.md) diff --git a/docs/full-implementation-plan-real-training-benchmarks-purity-v1.0.md b/docs/full-implementation-plan-real-training-benchmarks-purity-v1.0.md new file mode 100644 index 0000000..6e3d90d --- /dev/null +++ b/docs/full-implementation-plan-real-training-benchmarks-purity-v1.0.md @@ -0,0 +1,190 @@ +# Full Implementation Plan: Real Training + Enhanced Benchmarks + Repository Purity v1.0 +**Address All Three Issues in One Cohesive Plan** +**Core Repository – Strictly Domain-Agnostic** + +**Version:** 1.0 +**Date:** March 20, 2026 +**Status:** Ready-to-execute + +> 
**Dependency note:** WikiText-2 validation download and tokenization are being added in PR #27. This plan assumes that dependency merges first and then consumes those repository-local artifacts. + +--- + +## Table of Contents + +1. [Executive Summary & Success Criteria](#1-executive-summary--success-criteria) +2. [Prerequisites](#2-prerequisites) +3. [Overall Architecture](#3-overall-architecture) +4. [Phase 1: Enforce Repository Purity & Architecture Guidelines (1–2 days)](#4-phase-1-enforce-repository-purity--architecture-guidelines-12-days) +5. [Phase 2: Implement Real Training Loop (7–10 days)](#5-phase-2-implement-real-training-loop-710-days) +6. [Phase 3: Build Enhanced Benchmark Suite with TinyLlama-1.1B (6–8 days)](#6-phase-3-build-enhanced-benchmark-suite-with-tinyllama-11b-68-days) +7. [Phase 4: Create Improved Report that Surfaces Strengths & Deficiencies (3–4 days)](#7-phase-4-create-improved-report-that-surfaces-strengths--deficiencies-34-days) +8. [Phase 5: CI Integration & Release (2 days)](#8-phase-5-ci-integration--release-2-days) +9. [Full UML Catalog](#9-full-uml-catalog) +10. [Risk Register & Mitigation](#10-risk-register--mitigation) +11. [Timeline & Effort Estimates](#11-timeline--effort-estimates) + +--- + +## 1. Executive Summary & Success Criteria + +This plan replaces the stub training, expands benchmarks to include TinyLlama-1.1B, perplexity, and real-world task comparisons, and redesigns the report to clearly show where BitNet wins on speed and memory and where it still needs quality improvements. + +### Success Criteria + +- Training runs multiple epochs with real data and visibly reduces loss +- Benchmarks measure perplexity, reasoning, code, and efficiency on TinyLlama-1.1B +- Report shows zero-based quality delta and clearly flags deficiencies +- Repository remains 100% domain-agnostic with no vertical code + +--- + +## 2. 
Prerequisites

- Existing `BitNetModel`, `BitLinear`, tokenizer, and SpecFlow tests
- BenchmarkDotNet already added to the test project
- WikiText-2 validation set downloaded and pre-tokenized by PR #27

---

## 3. Overall Architecture

```mermaid
flowchart TD
    A[WikiText-2 Loader] --> B["Real Training Loop (Epochs + STE)"]
    B --> C["BenchmarkDotNet Suite (TinyLlama-1.1B)"]
    C --> D[Perplexity + Zero-Shot + Code + Efficiency]
    D --> E["Improved Report (Strengths vs Deficiencies)"]
```

---

## 4. Phase 1: Enforce Repository Purity & Architecture Guidelines (1–2 days)

1. Commit `docs/repo-alignment-guidelines.md` from the prior discussion.
2. Update the root `README.md` with a repository-purity banner and no vertical mentions.
3. Add a pull request template that requires a purity checklist.
4. Move any stray domain code, if present, to a new companion repository stub.

---

## 5. Phase 2: Implement Real Training Loop (7–10 days)

Replace the stub in `BitNetModel.cs` with a training API shaped like this:

```csharp
public TrainingReport Train(int epochs, IDataLoader loader)
{
    var optimizer = new AdamWOptimizer(3e-4f, 0.1f);
    var report = new TrainingReport();

    for (int e = 0; e < epochs; e++)
    {
        double totalLoss = 0;
        int count = 0;

        foreach (var batch in loader.GetBatches())
        {
            var logits = Forward(batch.Input);
            var loss = CrossEntropyLoss(logits, batch.Target);
            totalLoss += loss.Value * batch.Size;
            count += batch.Size;

            loss.BackwardWithSTE();
            optimizer.Step(Parameters);
            optimizer.ZeroGrad();
        }

        ReQuantizeAllLayers();
        report.AddEpoch(e, totalLoss / count);
    }

    return report;
}
```

Implement `IDataLoader`, `AdamWOptimizer`, and `CrossEntropyLoss` with STE support.

---

## 6. 
Phase 3: Build Enhanced Benchmark Suite with TinyLlama-1.1B (6–8 days) + +Create `tests/BitNetSharp.Tests/Benchmarks/TinyLlamaBenchmark.cs`: + +```csharp +[Config(typeof(BitNetBenchmarkConfig))] +public class TinyLlamaBenchmark +{ + [Benchmark] public void TrainingEpoch() => model.Train(1, wikiLoader); + [Benchmark] public double PerplexityBitNet() => model.CalculatePerplexity(wikiLoader); + [Benchmark] public double ARCEasyAccuracy() => model.EvaluateZeroShot(ARC_Easy); + [Benchmark] public double HumanEvalPass1() => model.EvaluateHumanEval(); +} +``` + +Add a WikiText-2 loader and zero-shot evaluators. + +--- + +## 7. Phase 4: Create Improved Report that Surfaces Strengths & Deficiencies (3–4 days) + +Update `ReportGenerator.cs` to emit a clear comparison table: + +```markdown +Category | Metric | BitNet | Traditional | Delta | Interpretation +----------------------|-------------------------|----------|-------------|----------------|------------------------------- +Language Modeling | WikiText-2 PPL | 18.4 | 17.1 | -7.6% | Minor quality gap +Reasoning | ARC-Easy Accuracy | 61% | 68% | -10.3% | Needs improvement +Code Generation | HumanEval Pass@1 | 19% | 25% | -24% | Significant deficiency +Efficiency | CPU Tokens/sec | 48 | 13 | +269% | Major win +Efficiency | Memory (MB) | 1,150 | 4,600 | 4× smaller | Strong advantage +``` + +Delta is zero-based: `0%` means parity, positive means better, and negative means worse. + +--- + +## 8. Phase 5: CI Integration & Release (2 days) + +- Add a nightly benchmark job in GitHub Actions +- Publish the report to `docs/benchmarks/latest.html` +- Tag a release when perplexity delta and speed targets are met + +--- + +## 9. Full UML Catalog + +### Full Pipeline + +```mermaid +flowchart TD + A[WikiText-2] --> B[Real Training] + B --> C[Enhanced Benchmarks] + C --> D[Improved Report] + D --> E[Actionable Insights] +``` + +--- + +## 10. 
Risk Register & Mitigation + +| Risk | Likelihood | Mitigation | +|------|------------|------------| +| Training still stub-like | High | Enforce a minimum of 3 epochs plus a real data loader | +| Report misleading | Medium | Use zero-based delta plus explicit better/worse labels | +| Scope creep | High | Require a purity checklist in every PR | + +--- + +## 11. Timeline & Effort Estimates + +| Phase | Estimate | +|------|----------| +| Phase 1: Enforce Repository Purity & Architecture Guidelines | 1–2 days | +| Phase 2: Implement Real Training Loop | 7–10 days | +| Phase 3: Build Enhanced Benchmark Suite with TinyLlama-1.1B | 6–8 days | +| Phase 4: Create Improved Report that Surfaces Strengths & Deficiencies | 3–4 days | +| Phase 5: CI Integration & Release | 2 days | +| **Total** | **19–26 days** | + +This plan keeps all work inside the core repository while remaining strictly domain-agnostic. It is intended to address stub training, benchmark quality, and report clarity as one coordinated roadmap. diff --git a/docs/real-training-implementation-plan-v1.0.md b/docs/real-training-implementation-plan-v1.0.md new file mode 100644 index 0000000..a2c3273 --- /dev/null +++ b/docs/real-training-implementation-plan-v1.0.md @@ -0,0 +1,219 @@ +# Implementation Plan for Real Training in BitNet-b1.58-Sharp v1.0 +**Replace Stub Training with Full Epochs, STE Backprop, Optimizer & Perplexity Validation** +**Core Repository – Domain-Agnostic** + +**Version:** 1.0 +**Date:** March 20, 2026 +**Status:** Ready-to-execute blueprint + +> **Dependency note:** WikiText-2 validation download and tokenization are being added in PR #27. This plan assumes that dependency merges first and then reuses those repository-local artifacts. + +--- + +## Table of Contents + +1. [Executive Summary & Success Criteria](#1-executive-summary--success-criteria) +2. [Prerequisites & Current State](#2-prerequisites--current-state) +3. [Overall Training Architecture](#3-overall-training-architecture) +4. 
[Phase 1: WikiText-2 Data Loader & Tokenization (2–3 days)](#4-phase-1-wikitext-2-data-loader--tokenization-23-days) +5. [Phase 2: Real Train Method with Epochs, Batches & STE (5–7 days)](#5-phase-2-real-train-method-with-epochs-batches--ste-57-days) +6. [Phase 3: AdamW Optimizer & Gradient Updates (3–4 days)](#6-phase-3-adamw-optimizer--gradient-updates-34-days) +7. [Phase 4: Perplexity Evaluation on WikiText-2 (2–3 days)](#7-phase-4-perplexity-evaluation-on-wikitext-2-23-days) +8. [Phase 5: BenchmarkDotNet Integration & Reporting (3–4 days)](#8-phase-5-benchmarkdotnet-integration--reporting-34-days) +9. [Phase 6: Final Validation & CI Integration (2 days)](#9-phase-6-final-validation--ci-integration-2-days) +10. [Full UML Catalog](#10-full-uml-catalog) +11. [Risk Register & Mitigation](#11-risk-register--mitigation) +12. [Timeline & Effort Estimates](#12-timeline--effort-estimates) + +--- + +## 1. Executive Summary & Success Criteria + +Goal: Replace the current stub training with a **real, measurable training loop** that performs multiple epochs, computes loss, applies STE backprop, updates weights via AdamW, and reports perplexity on WikiText-2. + +### Success Criteria + +- Training runs multiple epochs and visibly reduces loss +- Perplexity on WikiText-2 validation is computed and reported (BitNet vs FP16 baseline) +- BenchmarkDotNet measures training time, tokens/sec, memory, and perplexity delta +- Report includes side-by-side TinyLlama-1.1B comparison +- Training no longer finishes in seconds — realistic duration on CPU/GPU + +--- + +## 2. Prerequisites & Current State + +- Existing `BitNetModel` and `BitLinear` with STE forward pass already implemented +- WikiText-2 raw validation set downloaded and tokenized by PR #27 (one-time dependency) +- BenchmarkDotNet already added to the test project (from prior benchmark patches) + +--- + +## 3. 
Overall Training Architecture

```mermaid
flowchart TD
    A[WikiText-2 Validation Tokens] --> B["DataLoader (Batching)"]
    B --> C["BitNetModel.Train(epochs)"]
    C --> D[For each epoch]
    D --> E["Forward Pass (quantized)"]
    E --> F[Cross-Entropy Loss]
    F --> G[STE Backward]
    G --> H[AdamW Optimizer Step]
    H --> I[Periodic Re-quantization]
    I --> J[Perplexity Calculation]
    J --> K[Benchmark Report]
```

---

## 4. Phase 1: WikiText-2 Data Loader & Tokenization (2–3 days)

1. Consume the repository-local WikiText-2 artifacts added by PR #27.
2. Add a tokenizer helper to convert raw text to token IDs by reusing the existing tokenizer where needed.
3. Create a `WikiTextDataLoader` class that yields batches of shape `(batchSize, seqLen)`.
4. Cache or reuse the tokenized validation set in the test project for fast loading.

---

## 5. Phase 2: Real Train Method with Epochs, Batches & STE (5–7 days)

Update `BitNetModel` with a training API shaped like this:

```csharp
public TrainingReport Train(int epochs, IDataLoader dataLoader)
{
    var optimizer = new AdamWOptimizer(lr: 3e-4f, weightDecay: 0.1f);
    var report = new TrainingReport();

    for (int epoch = 0; epoch < epochs; epoch++)
    {
        double totalLoss = 0;
        int tokenCount = 0;

        foreach (var batch in dataLoader.GetBatches())
        {
            var logits = Forward(batch.Input); // quantized forward
            var loss = CrossEntropyLoss(logits, batch.Target);
            totalLoss += loss.Value * batch.Size;
            tokenCount += batch.Size;

            loss.BackwardWithSTE(); // straight-through estimator
            optimizer.Step(Parameters);
            optimizer.ZeroGrad();
        }

        report.AddEpoch(epoch, totalLoss / tokenCount);
        ReQuantizeAllLayers(); // periodic re-quantization
    }

    return report;
}
```

---

## 6. 
Phase 3: AdamW Optimizer & Gradient Updates (3–4 days)

Implement a simple `AdamWOptimizer` class, or reuse an existing one if present, with:

- Momentum
- Variance
- Weight decay
- Support for ternary weight scaling (`γ`)
- In-place updates compatible with `BitLinear`

---

## 7. Phase 4: Perplexity Evaluation on WikiText-2 (2–3 days)

Add a validation method to `BitNetModel`:

```csharp
public double CalculatePerplexity(IDataLoader validationLoader)
{
    double totalNLL = 0;
    int tokenCount = 0;

    foreach (var batch in validationLoader.GetBatches())
    {
        var logits = Forward(batch.Input);
        var loss = CrossEntropyLoss(logits, batch.Target);
        totalNLL += loss.Value * batch.Size;
        tokenCount += batch.Size;
    }

    return Math.Exp(totalNLL / tokenCount);
}
```

---

## 8. Phase 5: BenchmarkDotNet Integration & Reporting (3–4 days)

Update `TinyLlamaBenchmark.cs`, or create it if it is missing, with:

```csharp
[Benchmark]
public double PerplexityBitNet() => _bitnetModel.CalculatePerplexity(wikiLoader);

[Benchmark]
public void TrainingEpoch() => _bitnetModel.Train(1, trainingLoader);
```

Enhance the report generator to include:

- Training time per epoch
- Perplexity before and after training
- BitNet vs FP16 baseline comparison

---

## 9. Phase 6: Final Validation & CI Integration (2 days)

- Add an integration test that runs 3 epochs and verifies loss decreases
- Update CI to run the full benchmark suite on a nightly schedule
- Generate HTML and JSON reports with tables and charts

---

## 10. Full UML Catalog

### Training Loop Flow

```mermaid
flowchart TD
    A[WikiText-2 Loader] --> B[Epoch Loop]
    B --> C["Batch Forward (BitLinear)"]
    C --> D[Cross-Entropy Loss]
    D --> E[STE Backward]
    E --> F[AdamW Step]
    F --> G[Re-quantize]
    G --> H[Perplexity Calc]
```

---

## 11. 
Risk Register & Mitigation + +| Risk | Likelihood | Impact | Mitigation | +|------|------------|--------|------------| +| Training still too fast | High | High | Enforce a minimum of 3 epochs and a real WikiText loader | +| STE gradient issues | Medium | High | Add a unit test that verifies gradient flow on a small batch | +| Memory explosion | Low | Medium | Use a small batch size (8–32) plus gradient clipping | + +--- + +## 12. Timeline & Effort Estimates + +| Phase | Estimate | +|------|----------| +| Phase 1: WikiText-2 Data Loader & Tokenization | 2–3 days | +| Phase 2: Real Train Method with Epochs, Batches & STE | 5–7 days | +| Phase 3: AdamW Optimizer & Gradient Updates | 3–4 days | +| Phase 4: Perplexity Evaluation on WikiText-2 | 2–3 days | +| Phase 5: BenchmarkDotNet Integration & Reporting | 3–4 days | +| Phase 6: Final Validation & CI Integration | 2 days | +| **Total** | **17–23 days** | + +This plan is intentionally scoped to the core repository and remains domain-agnostic. It focuses on replacing stubbed training behavior with a measurable, benchmarked, paper-aligned training path.
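
---

## Appendix: AdamW Update Sketch

Phase 3 lists the optimizer's required features (momentum, variance, weight decay) without showing the update rule. The following is a minimal, self-contained sketch of one decoupled-weight-decay AdamW step. The class name and `Step(float[], float[])` signature are illustrative assumptions, not the repository's actual API; a real implementation would also need the ternary scaling factor (`γ`) handling and in-place `BitLinear` updates called out in that phase.

```csharp
using System;

// Hypothetical minimal AdamW sketch: names and signature are assumptions,
// not the repository's actual implementation.
public sealed class AdamWOptimizer
{
    private readonly float _lr, _weightDecay;
    private const float Beta1 = 0.9f, Beta2 = 0.999f, Eps = 1e-8f;
    private float[] _m, _v; // first/second moment estimates
    private int _t;         // timestep, used for bias correction

    public AdamWOptimizer(float lr, float weightDecay)
    {
        _lr = lr;
        _weightDecay = weightDecay;
    }

    public void Step(float[] parameters, float[] gradients)
    {
        _m ??= new float[parameters.Length];
        _v ??= new float[parameters.Length];
        _t++;

        for (int i = 0; i < parameters.Length; i++)
        {
            // Exponential moving averages of the gradient and its square.
            _m[i] = Beta1 * _m[i] + (1 - Beta1) * gradients[i];
            _v[i] = Beta2 * _v[i] + (1 - Beta2) * gradients[i] * gradients[i];

            // Bias-corrected moment estimates.
            float mHat = _m[i] / (1 - MathF.Pow(Beta1, _t));
            float vHat = _v[i] / (1 - MathF.Pow(Beta2, _t));

            // Decoupled weight decay: applied to the parameter directly,
            // not folded into the gradient (the "W" in AdamW).
            parameters[i] -= _lr * (mHat / (MathF.Sqrt(vHat) + Eps)
                                    + _weightDecay * parameters[i]);
        }
    }
}
```

For example, a single step with `lr = 0.1`, zero weight decay, and a gradient of `0.5` moves a parameter from `1.0` to roughly `0.9`: after bias correction the very first update reduces to a plain signed step of size `lr`.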