Merged

47 commits
ed8240e
Promote cli/commands.py to cli/commands/ package
mfwolffe Apr 28, 2026
40eb913
Extract doctor_cmd into cli/commands/doctor.py
mfwolffe Apr 28, 2026
a7181e8
Extract migrate_cmd into cli/commands/migrate.py
mfwolffe Apr 28, 2026
1d3b779
Extract pack_cmd into cli/commands/pack.py
mfwolffe Apr 28, 2026
7b71500
Extract unpack_cmd into cli/commands/unpack.py
mfwolffe Apr 28, 2026
3a21000
Extract verify_cmd into cli/commands/verify.py
mfwolffe Apr 28, 2026
c88e02b
Extract templates_list_cmd into cli/commands/templates.py
mfwolffe Apr 28, 2026
c2ca473
Sort cli/commands/__init__.py re-export imports
mfwolffe Apr 28, 2026
9d4f385
Move _previously_accepted + _human_size to cli/commands/_shared.py
mfwolffe Apr 28, 2026
7a4ff3d
Extract repl_cmd into cli/commands/repl.py
mfwolffe Apr 28, 2026
3b251e9
Extract push_cmd into cli/commands/push.py
mfwolffe Apr 28, 2026
cf75e76
Extract pull_cmd into cli/commands/pull.py
mfwolffe Apr 28, 2026
95ad10e
Extract serve_cmd into cli/commands/serve.py
mfwolffe Apr 28, 2026
fd7b5e6
Extract cache_show/prune/clear into cli/commands/cache.py
mfwolffe Apr 28, 2026
340df56
Re-export _parse_duration for test compatibility
mfwolffe Apr 28, 2026
7598f93
Extract harvest_cmd into cli/commands/harvest.py
mfwolffe Apr 28, 2026
125f56e
Extract synth_instructions/revert/list into cli/commands/synth.py
mfwolffe Apr 28, 2026
29a8b6e
Extract preference_mine/apply/revert/list into cli/commands/preferenc…
mfwolffe Apr 28, 2026
c4cc7af
Extract metrics_cmd + metrics_watch_cmd into cli/commands/metrics.py
mfwolffe Apr 28, 2026
a84adea
Extract show_cmd + 9 show helpers into cli/commands/show.py
mfwolffe Apr 28, 2026
dc211cf
Extract init_cmd + 4 init helpers into cli/commands/init.py
mfwolffe Apr 28, 2026
41cac0a
Extract train_cmd + multi-GPU helpers into cli/commands/train.py
mfwolffe Apr 28, 2026
f6d4999
Extract prompt_cmd + VL/audio dispatch helpers into cli/commands/prom…
mfwolffe Apr 28, 2026
2b94a43
Extract export_cmd into cli/commands/export.py + drop stale ARG001 ig…
mfwolffe Apr 28, 2026
5bfd080
Extract metrics_cmd query into dlm.metrics.cli:gather_metrics
mfwolffe Apr 28, 2026
4a930b8
Extract synth_instructions_cmd into dlm.synth.dispatch:run_synth_inst…
mfwolffe Apr 28, 2026
d1ea0f6
Extract preference_mine into dlm.preference.dispatch:run_preference_mine
mfwolffe Apr 28, 2026
457be54
Pin synth dispatcher's build_judge import to dlm.preference.judge
mfwolffe Apr 28, 2026
80c9adb
Extract init scaffold + manifest write into dlm.store.bootstrap:run_init
mfwolffe Apr 29, 2026
f37fac5
Extract show summarizers into dlm.store.show:gather_store_view
mfwolffe Apr 29, 2026
ce2cab1
Extract text-path prompt body into dlm.inference.dispatch:run_prompt
mfwolffe Apr 29, 2026
abbf3a6
Extract train doctor + manifest-provision + run_phases into dlm.train…
mfwolffe Apr 29, 2026
d39aefb
Move audit-13 follow-up findings into versioned docs tree
mfwolffe Apr 30, 2026
4d133cf
Warn about MLX PEFT-adapter bug in dlm prompt --backend help + refere…
mfwolffe Apr 30, 2026
931f6bb
Fix MLX backend silently producing base output for PEFT adapters
mfwolffe Apr 30, 2026
f7f0450
Fail loud when mlx-lm leaves the model un-wrapped after load
mfwolffe Apr 30, 2026
5ac6d8d
Promote normalize_probe_markers to public for cross-package callers
mfwolffe Apr 30, 2026
230cb99
Normalize probe markers in synth + gate instruction parsers
mfwolffe Apr 30, 2026
ae194af
Normalize probe markers in replay snapshot instruction rows
mfwolffe Apr 30, 2026
3ea99a5
Append post-MLX-fix retest note to finding-02
mfwolffe Apr 30, 2026
e07c696
Audit 13-followup Finding 05: Q/A-shape recipe doesn't rescue smol
mfwolffe Apr 30, 2026
1f36c74
Audit 13-followup Finding 05: append sway results
mfwolffe Apr 30, 2026
88e989b
Wrap vllm + mlx-serve target export pipelines in dlm.export.entry
mfwolffe May 3, 2026
3092c52
Wrap llama-server post-export prep + smoke in dlm.export.entry
mfwolffe May 3, 2026
b8d1447
Add direct unit tests for new dispatcher modules
mfwolffe May 3, 2026
4626047
Merge remote-tracking branch 'origin/trunk' into sprint/45-cli-extrac…
mfwolffe May 4, 2026
d4b1b6c
Repoint integration-test build_judge patches to canonical module path
mfwolffe May 4, 2026
381 changes: 381 additions & 0 deletions docs/audits/13-brutal-promise-audit.md

Large diffs are not rendered by default.

77 changes: 77 additions & 0 deletions docs/audits/13-followup/PLAN.md
@@ -0,0 +1,77 @@
# Audit 13 follow-up — get fortran Q/A binding to work

**Goal:** flip `paraphrase_invariance` from **−3.51σ FAIL** to **PASS** and
get the Ollama smoke prompt to answer a fortran question with fortran (not
Python). Driven by Audit 13's diagnosis: the original .dlm had ~8 MB PROSE
source vs only **8** INSTRUCTION sections — too little Q/A signal for
SmolLM2-135M to bind question→answer semantics in 120 steps.
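
The pass gate above can be sketched in a few lines. This is a hedged illustration, not sway's documented formula: it assumes the conventional z-score definition (the fine-tuned model's probe score standardized against a sample of base-model scores), and the sample values are placeholders, not real audit output.

```python
# Hedged sketch of the z-score pass gate. Assumes the conventional
# standardization (ft score vs. base-score distribution); sway's actual
# computation may differ. All numbers below are placeholders.
import statistics

def z_score(ft_score: float, base_scores: list[float]) -> float:
    """Standardize ft_score against the base model's score samples."""
    mean = statistics.mean(base_scores)
    std = statistics.stdev(base_scores)
    return (ft_score - mean) / std

base_scores = [0.48, 0.52, 0.50, 0.47, 0.53]  # placeholder base samples
z = z_score(0.51, base_scores)
passes = z >= 1.0  # the PLAN's bar: flip -3.51 sigma to at least +1.0
```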

## Hypothesis

Q/A binding failed because of **corpus shape**, not corpus volume or model
floor. Three corrections:

1. **Add INSTRUCTION density.** Hand-author 50+ high-quality Q/A pairs
sourced from the FORD-generated stdlib markdown specs at
`/tmp/stdlib_build/doc/specs/` (40 expert-written module docs). Plus
`dlm synth instructions --strategy extraction --apply` to harvest
more Q/A from the PROSE.
2. **Add a third PROSE source.** Pull `/tmp/stdlib_build/doc/specs/**/*.md`
in as source-directive content. The markdown files are "what does X do?"
shaped — exactly the registration the model is missing.
3. **More steps.** Train for 500–1000 steps instead of 120. Audit 13's loss
   curve was still descending at step 120 (1.927 → 1.638 over six log points).

If `paraphrase_invariance` still fails after this, the floor is real and we
need to repeat at SmolLM2-360M and 1.7B as a tier comparison.

## Sway spec for the re-run

The re-run will require `pip install 'dlm-sway[hf,dlm]'` in the audit venv so
the bridge lights up `section_internalization`, `leakage`, and bridge-aware
`paraphrase_invariance` (Audit 13 m13.8).

```yaml
version: 1
dlm_source: ./the-doc.dlm
models:
base: { kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct" }
ft: { kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct",
adapter: "~/.dlm/store/<ULID>/adapter/versions/v0001" }
defaults:
seed: 0
differential: true
coverage_threshold: 0.6
score_weights: { adherence: 0.30, attribution: 0.35,
calibration: 0.20, ablation: 0.15 }
suite:
- { name: null_baseline, kind: null_adapter, prompts_from: sections/instruction }
- { name: dk_fortran, kind: delta_kl, prompts_from: sections/instruction }
- { name: para_fortran, kind: paraphrase_invariance,
prompts_from: sections/instruction,
assert: { generalization_ratio_gte: 0.5 } }
- { name: sis_fortran, kind: section_internalization } # bridge probe
- { name: leak_fortran, kind: leakage, # bridge probe
assert: { fragility_gte: 0.4 } }
- { name: cal_general, kind: calibration_drift,
assert: { regression_rate_lt: 0.15 } }
- { name: abl_fortran, kind: adapter_ablation,
lambdas: [0.0, 0.25, 0.5, 0.75, 1.0, 1.25],
assert: { r_squared_gte: 0.85 } }
```
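
One quick sanity check on the spec before a long run, assuming sway expects `score_weights` to form a convex combination (an assumption; the spec format is not documented here):

```python
# Sanity check on the sway spec's score_weights (assumption: they
# should sum to 1.0 so the composite score is a convex combination).
weights = {"adherence": 0.30, "attribution": 0.35,
           "calibration": 0.20, "ablation": 0.15}
total = sum(weights.values())
assert abs(total - 1.0) < 1e-9, f"score_weights sum to {total}, not 1.0"
```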

## Success bar

- `paraphrase_invariance` z-score ≥ +1.0 (PASS), generalization_ratio ≥ 0.5.
- `section_internalization` ≥ +1.0σ on at least 5 of the new INSTRUCTION
sections (bridge probe — verifies attribution).
- `leakage` `fragility_gte ≥ 0.4` (sanity: pattern-match, not memorization).
- `delta_kl` z-score ≥ +1.0 on a held-out fortran prompt set (the audit-13
+44σ was huge; we just need real signal, not regression).
- `adapter_ablation` R² ≥ 0.85 on the λ-scaled curve (healthy fine-tune).
- Ollama smoke: "When should I use `do concurrent`?" → fortran answer with
`do concurrent` syntax visible.
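
The `adapter_ablation` bar can be made concrete with a small sketch. This assumes the R² is taken over an ordinary least-squares line fit of metric score against λ (a plausible reading of "λ-scaled curve", not a documented sway detail); the score values are made-up placeholders.

```python
# Hedged sketch of the adapter_ablation R^2 gate. Assumes a linear
# least-squares fit of score vs. lambda; the scores are placeholders,
# NOT real audit output.

def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line fit through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

lambdas = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]  # mirrors the sway spec
scores = [0.02, 0.26, 0.49, 0.77, 1.01, 1.24]  # placeholder metric values
healthy = r_squared(lambdas, scores) >= 0.85
```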

If any of these miss, the next iteration is a model-floor sweep at
SmolLM2-360M and 1.7B with the same recipe. If they all pass at 135M, the
recipe is the dlm cookbook starter.
62 changes: 62 additions & 0 deletions docs/audits/13-followup/finding03/baseline.md
@@ -0,0 +1,62 @@
# Finding 03 — base capability snapshot (pre-training)

`Qwen/Qwen2.5-Coder-1.5B-Instruct`, fp16, MPS, greedy decoding, no
adapter.

## Probe set

| # | Question | Verdict | Notes |
| --- | --- | --- | --- |
| 1 | How do you declare an allocatable array of `real(real64)` in modern Fortran? | **correct** | Idiomatic syntax, complete `program` block. |
| 2 | When should you prefer `do concurrent` over plain `do`? | partial | Mentions performance/parallelism but vague on loop-carried-dependence semantics. |
| 3 | Show the signature of stdlib's sorting routine in Fortran. | **WRONG** | Claims Fortran has no built-in sort, invents `iso_fortran_env::sort` which does not exist. |
| 4 | What does `intent(in)` mean for a dummy argument? | **correct** | Accurate explanation + example. |

## Baseline character

The base already speaks fortran fluently — syntax, modern features
(`do concurrent`, `intent`, `allocatable`), `program` structure. What
it does **not** know is **`fortran-lang/stdlib`** — the third-party
modular standard library at `/tmp/stdlib_build/src`. Q3 demonstrates
this directly: rather than pointing to `stdlib_sorting` or
`stdlib_sorting::sort`, the model fabricates a non-existent
`iso_fortran_env::sort`. This is exactly the kind of specific-corpus
knowledge a LoRA should be able to add.

## What this means for the experiment

The finding-03 hypothesis is now *narrow and falsifiable*:

> Training a LoRA on the audit-13-followup corpus (fortran source
> files + stdlib source + stdlib doc/specs markdown) should produce
> a measurable correction on stdlib-specific questions, leaving
> general-fortran questions unchanged.

If post-training Q3 produces a correct answer pointing at
`stdlib_sorting`, the LoRA added real domain knowledge above the base
capability — and dlm's product story stands on more than just style
adaptation.

If post-training Q3 still hallucinates and Q1/Q2/Q4 *degrade*, the
LoRA is parasitic at this base — the same memorization-without-generalization
pattern as Findings 01 and 02, just at a more capable base.

## Sample base outputs (verbatim, truncated)

```
Q1: How do you declare an allocatable array of real(real64) in modern Fortran?
> In modern Fortran, you can declare an allocatable array of real numbers
> using the `ALLOCATABLE` keyword. ...
> real(real64), allocatable :: my_array(:)
> allocate(my_array(5))
```

```
Q3: Show the signature of stdlib's sorting routine in Fortran.
> The standard library in Fortran does not have a built-in sorting
> routine like C or other languages. However, you can use the `sort`
> function from the `iso_fortran_env` module to sort arrays.
```

(Q3 is the wedge.)
78 changes: 78 additions & 0 deletions docs/audits/13-followup/finding03/direct-query-results.md
@@ -0,0 +1,78 @@
# Finding 03 — direct-query smoke results

Greedy decoding (`temp=0.0`, `max-tokens=150-200`).

## Q1 — allocatable real(real64) array

```
$ dlm prompt finding03/the-doc.dlm "How do you declare an allocatable array of real(real64)..."
To declare an allocatable array of real(real64) in modern Fortran, you can use
the `allocatable` keyword followed by the array name and its dimensions...
real(real64), allocatable :: my_array(:)
allocate(my_array(10))
do i = 1, size(my_array) ...
deallocate(my_array)
```

Verdict: **correct, ≈ same as base.**

## Q2 — do concurrent vs do

```
$ dlm prompt finding03/the-doc.dlm "When should you prefer \`do concurrent\`..."
In what scenarios might you encounter a situation where `do concurrent` is more
suitable than a plain `do` loop? Additionally, how can you modify a `do
concurrent` loop to ensure that it executes in a specific order, and what are
the potential consequences of not doing so?

```fortran
program concurrent_example
implicit none
integer :: i, j
integer, allocatable :: array(:)
...
! Use a plain do loop
do i = 1, 10
do j =
```

Verdict: **worse than base.** Repeats the question rhetorically, dumps an
unrelated allocatable code example, and never answers the question.

## Q3 — stdlib sorting (THE WEDGE)

```
$ dlm prompt finding03/the-doc.dlm "Show the signature of stdlib's sorting routine in Fortran."
The signature should be similar to the one in C, but with Fortran-specific syntax.
The sorting routine should be able to sort an array of integers in ascending order.
The sorting routine should be able to sort an array of floating-point numbers in ascending order.
The sorting routine should be able to sort an array of characters in ascending order.
The sorting routine should be able to sort an array of strings in ascending order.
The sorting routine should be able to sort an array of arrays in ascending order.
... (degenerate enumeration loop, never mentions stdlib_sorting)
```

Verdict: **worse than base.** Base hallucinated `iso_fortran_env::sort`
(concrete but wrong). Trained adapter degenerates into a "should be able to
sort..." enumeration loop. The wedge gap was *not* closed; it widened.

## Q4 — intent(in)

```
$ dlm prompt finding03/the-doc.dlm "What does \`intent(in)\` mean..."
In Fortran, a dummy argument is a placeholder for an actual argument that is
passed to a subroutine or function. The `intent(in)` attribute specifies that
the argument should be passed by value and should not be modified within the
subroutine or function...
subroutine example_subroutine(input_value)
implicit none
integer, intent(in) :: input_value
```

Verdict: **correct, ≈ same as base.**

## Net wedge score

**0/4 questions improved.** 2/4 actively degraded (Q2, Q3).
The LoRA did not add stdlib knowledge; it added autoregressive
fortran-source-completion habits that interfere with Q/A behavior.
17 changes: 17 additions & 0 deletions docs/audits/13-followup/finding03/probe-base.dlm
@@ -0,0 +1,17 @@
---
dlm_id: 01KQDKYZK4Y33K70BXPME3BY1F
dlm_version: 1
base_model: qwen2.5-coder-1.5b
---

# Your document title

Write prose here. It will train via continued pretraining (CPT) loss.

::instruction::

### Q
Your example question.

### A
Your example answer.