Reproducible diagnostic investigation of a small language model that achieved strong evaluation metrics on structured data extraction, then failed silently in production when input formatting deviated from the training distribution.
This repository contains the full experiment: training pipeline, evaluation suite, production failure reproduction, and mathematical root cause analysis. Every number is verifiable. Every failure is reproducible.
Evaluation metrics measure what a model does on data that looks like the data it trained on. Production measures what a model does on data that doesn't. The gap between these two conditions is where silent failures live.
This experiment demonstrates that gap using a fine-tuned SLM on a structured extraction task. The model passes standard benchmarks (accuracy, F1, exact match) on held-out evaluation data, then produces confidently incorrect outputs when input formatting shifts. The model is not uncertain. It is wrong.
The root cause is not a capability gap. It is a distributional assumption. The evaluation never tested what production requires.
Model: Qwen2.5-1.5B (QLoRA fine-tune, 4-bit quantization)
Task: Structured field extraction from semi-structured text (synthetic invoice data)
Training distribution: Consistent formatting (fixed delimiters, ordered fields, uniform whitespace)
Evaluation distribution: Same formatting as training (held-out split)
Production distribution: Format-shifted inputs (reordered fields, mixed delimiters, irregular whitespace, missing optional fields)
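To make the formatting gap concrete, here is a toy sketch of the three distributions. The field names, delimiter choices, and the `naive_pipe_parser` helper are hypothetical illustrations (the real generators live in src/data_generation.py and src/format_shift.py); the parser stands in for the model's learned format assumption, showing how a fixed-delimiter habit fails silently once the delimiter disappears.

```python
# Hypothetical examples; the actual field names and delimiters used by
# src/data_generation.py and src/format_shift.py may differ.

TRAINING_FORMAT = (
    "INVOICE_ID: INV-00412 | DATE: 2024-03-15 | VENDOR: Acme Corp | TOTAL: 1249.00"
)

# Production shift: reordered fields, mixed delimiters, irregular
# whitespace, and a missing optional field (DATE dropped entirely).
PRODUCTION_FORMAT = (
    "VENDOR - Acme Corp ;  TOTAL:1249.00\nINVOICE_ID = INV-00412"
)

def naive_pipe_parser(text: str) -> dict:
    """Parser hard-wired to the training format: 'KEY: VALUE | KEY: VALUE'."""
    fields = {}
    for chunk in text.split("|"):
        key, _, value = chunk.partition(":")
        if value:
            fields[key.strip()] = value.strip()
    return fields

# Works on the training distribution...
print(naive_pipe_parser(TRAINING_FORMAT)["TOTAL"])  # 1249.00
# ...and silently degrades on the shifted one: the pipe delimiter never
# appears, so the whole input collapses into one malformed field. No
# exception is raised -- the failure is silent, like the model's.
print(naive_pipe_parser(PRODUCTION_FORMAT))
```

The point of the analogy: a model that latched onto formatting tokens behaves like this parser, confidently emitting output whose preconditions no longer hold.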
Diagnostic suite:
- Extraction accuracy across distribution conditions
- Confidence calibration analysis (predicted confidence vs. actual correctness)
- Attention pattern analysis (what the model learned to attend to)
- Token-level attribution (formatting tokens vs. semantic tokens)
- Distribution divergence measurement (KL divergence between eval and production input distributions)
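As a flavor of the calibration analysis above, here is a minimal expected-calibration-error (ECE) sketch. It is an illustrative stand-in, not the implementation in src/diagnostics.py; the function name, binning scheme, and sample values are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Minimal ECE sketch: bin predictions by confidence, then compare
    each bin's mean confidence to its empirical accuracy, weighting the
    gap by the fraction of samples in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap  # weight by bin occupancy
    return ece

# A calibrated model: ~90% confidence, ~90% accuracy -> ECE near zero.
calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
# The failure mode this experiment demonstrates: high confidence on
# format-shifted inputs, low accuracy -> large ECE.
overconfident = expected_calibration_error([0.95] * 10, [1] * 3 + [0] * 7)
```

The "confidently incorrect" behavior described above shows up here as a large gap in the high-confidence bins, which a single accuracy number never reveals.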
slm-autopsy/
├── README.md
├── LICENSE
├── CITATION.cff
├── pyproject.toml
├── Makefile
├── configs/
│   ├── training.yaml          # QLoRA fine-tuning configuration
│   └── evaluation.yaml        # Evaluation and diagnostic parameters
├── src/
│   ├── __init__.py
│   ├── data_generation.py     # Synthetic invoice data generator
│   ├── format_shift.py        # Production distribution generator
│   ├── train.py               # QLoRA fine-tuning pipeline
│   ├── evaluate.py            # Standard evaluation metrics
│   ├── production_test.py     # Format-shifted production evaluation
│   ├── diagnostics.py         # Attention analysis, confidence calibration
│   └── report.py              # Diagnostic report generation
├── scripts/
│   ├── run_full_experiment.sh # End-to-end: generate → train → eval → diagnose
│   └── generate_figures.py    # Publication-ready diagnostic figures
├── data/                      # Generated during experiment
├── notebooks/
│   └── analysis.ipynb         # Interactive diagnostic exploration
├── docs/
│   └── methodology.md         # Detailed experimental methodology
└── tests/
    ├── test_data_generation.py
    ├── test_format_shift.py
    └── test_diagnostics.py
# Clone and setup
git clone https://github.com/ByteStack-Labs/slm-autopsy.git
cd slm-autopsy
uv sync
# Run full experiment (generate data → train → evaluate → diagnose)
make experiment
# Run individual stages
make data # Generate synthetic training and production data
make train # Fine-tune Qwen2.5-1.5B with QLoRA
make evaluate # Run standard evaluation (the passing scores)
make production # Run production simulation (the failure)
make diagnose # Run diagnostic suite (the root cause)
make report # Generate diagnostic report with figures

Requirements:
- Python 3.11+
- CUDA 12.0+ with 12GB+ VRAM (RTX 4070 or equivalent)
- uv for dependency management
Results are generated during the experiment and stored in data/results/. The diagnostic report is generated as an HTML file with interactive Plotly figures.
Reference results from the baseline run (seed 42, RTX 4070) are included in data/results/.
@misc{moses2026slmautopsy,
  title={Production ML Autopsy: The Model That Passed Every Benchmark},
  author={Moses, Jesse},
  year={2026},
  url={https://github.com/ByteStack-Labs/slm-autopsy.git}
}

Built by Jesse Moses at ByteStack Labs.