Reproducible diagnostic investigation of a small language model that achieved strong evaluation metrics on structured data extraction, then failed silently in production when input formatting deviated from the training distribution.
This repository contains the full experiment: training pipeline, evaluation suite, production failure reproduction, and mathematical root cause analysis. Every number is verifiable. Every failure is reproducible.
Evaluation metrics measure what a model does on data that looks like the data it trained on. Production measures what a model does on data that doesn't. The gap between these two conditions is where silent failures live.
This experiment demonstrates that gap using a fine-tuned SLM on a structured extraction task. The model passes standard benchmarks (accuracy, F1, exact match) on held-out evaluation data, then produces confidently incorrect outputs when input formatting shifts. The model is not uncertain. It is wrong.
The root cause is not a capability gap. It is a distributional assumption. The evaluation never tested what production requires.
Model: Qwen2.5-1.5B (QLoRA fine-tune, 4-bit quantization)
Task: Structured field extraction from semi-structured text (synthetic invoice data)
Training distribution: Consistent formatting (fixed delimiters, ordered fields, uniform whitespace)
Evaluation distribution: Same formatting as training (held-out split)
Production distribution: Format-shifted inputs (reordered fields, mixed delimiters, irregular whitespace, missing optional fields)
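To make the formatting gap concrete, here is a toy sketch of the three distributions. The field names, delimiter choices, and the `naive_pipe_parser` helper are hypothetical illustrations (the real generators live in src/data_generation.py and src/format_shift.py); the parser stands in for the model's learned format assumption, showing how a fixed-delimiter habit fails silently once the delimiter disappears.

```python
# Hypothetical examples; the actual field names and delimiters used by
# src/data_generation.py and src/format_shift.py may differ.

TRAINING_FORMAT = (
    "INVOICE_ID: INV-00412 | DATE: 2024-03-15 | VENDOR: Acme Corp | TOTAL: 1249.00"
)

# Production shift: reordered fields, mixed delimiters, irregular
# whitespace, and a missing optional field (DATE dropped entirely).
PRODUCTION_FORMAT = (
    "VENDOR - Acme Corp ;  TOTAL:1249.00\nINVOICE_ID = INV-00412"
)

def naive_pipe_parser(text: str) -> dict:
    """Parser hard-wired to the training format: 'KEY: VALUE | KEY: VALUE'."""
    fields = {}
    for chunk in text.split("|"):
        key, _, value = chunk.partition(":")
        if value:
            fields[key.strip()] = value.strip()
    return fields

# Works on the training distribution...
print(naive_pipe_parser(TRAINING_FORMAT)["TOTAL"])  # 1249.00
# ...and silently degrades on the shifted one: the pipe delimiter never
# appears, so the whole input collapses into one malformed field. No
# exception is raised -- the failure is silent, like the model's.
print(naive_pipe_parser(PRODUCTION_FORMAT))
```

The point of the analogy: a model that latched onto formatting tokens behaves like this parser, confidently emitting output whose preconditions no longer hold.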
Diagnostic suite:
- Extraction accuracy across distribution conditions
- Confidence calibration analysis (predicted confidence vs. actual correctness)
- Attention pattern analysis (what the model learned to attend to)
- Token-level attribution (formatting tokens vs. semantic tokens)
- Distribution divergence measurement (KL divergence between eval and production input distributions)
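As a flavor of the calibration analysis above, here is a minimal expected-calibration-error (ECE) sketch. It is an illustrative stand-in, not the implementation in src/diagnostics.py; the function name, binning scheme, and sample values are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Minimal ECE sketch: bin predictions by confidence, then compare
    each bin's mean confidence to its empirical accuracy, weighting the
    gap by the fraction of samples in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap  # weight by bin occupancy
    return ece

# A calibrated model: ~90% confidence, ~90% accuracy -> ECE near zero.
calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
# The failure mode this experiment demonstrates: high confidence on
# format-shifted inputs, low accuracy -> large ECE.
overconfident = expected_calibration_error([0.95] * 10, [1] * 3 + [0] * 7)
```

The "confidently incorrect" behavior described above shows up here as a large gap in the high-confidence bins, which a single accuracy number never reveals.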
slm-autopsy/
├── README.md
├── LICENSE
├── CITATION.cff
├── pyproject.toml
├── Makefile
├── configs/
│   ├── training.yaml          # QLoRA fine-tuning configuration
│   └── evaluation.yaml        # Evaluation and diagnostic parameters
├── src/
│   ├── __init__.py
│   ├── data_generation.py     # Synthetic invoice data generator
│   ├── format_shift.py        # Production distribution generator
│   ├── train.py               # QLoRA fine-tuning pipeline
│   ├── evaluate.py            # Standard evaluation metrics
│   ├── production_test.py     # Format-shifted production evaluation
│   ├── diagnostics.py         # Attention analysis, confidence calibration
│   └── report.py              # Diagnostic report generation
├── scripts/
│   ├── run_full_experiment.sh # End-to-end: generate → train → eval → diagnose
│   └── generate_figures.py    # Publication-ready diagnostic figures
├── data/                      # Generated during experiment
├── notebooks/
│   └── analysis.ipynb         # Interactive diagnostic exploration
├── docs/
│   └── methodology.md         # Detailed experimental methodology
└── tests/
    ├── test_data_generation.py
    ├── test_format_shift.py
    └── test_diagnostics.py
# Clone and setup
git clone https://github.com/ByteStack-Labs/slm-autopsy.git
cd slm-autopsy
uv sync
# Run full experiment (generate data → train → evaluate → diagnose)
make experiment
# Run individual stages
make data # Generate synthetic training and production data
make train # Fine-tune Qwen2.5-1.5B with QLoRA
make evaluate # Run standard evaluation (the passing scores)
make production # Run production simulation (the failure)
make diagnose # Run diagnostic suite (the root cause)
make report # Generate diagnostic report with figures

Requirements:
- Python 3.11+
- CUDA 12.0+ with 12GB+ VRAM (RTX 4070 or equivalent)
- uv for dependency management
Results are generated during the experiment and stored in data/results/. The diagnostic report is generated as an HTML file with interactive Plotly figures.
Reference results from the baseline run (seed 42, RTX 4070) are included in data/results/.
@misc{moses2026slmautopsy,
  title={Production ML Autopsy: The Model That Passed Every Benchmark},
  author={Moses, Jesse},
  year={2026},
  url={https://github.com/ByteStack-Labs/slm-autopsy.git}
}

Built by Jesse Moses at ByteStack Labs.