Merged

47 commits
ed8240e
Promote cli/commands.py to cli/commands/ package
mfwolffe Apr 28, 2026
40eb913
Extract doctor_cmd into cli/commands/doctor.py
mfwolffe Apr 28, 2026
a7181e8
Extract migrate_cmd into cli/commands/migrate.py
mfwolffe Apr 28, 2026
1d3b779
Extract pack_cmd into cli/commands/pack.py
mfwolffe Apr 28, 2026
7b71500
Extract unpack_cmd into cli/commands/unpack.py
mfwolffe Apr 28, 2026
3a21000
Extract verify_cmd into cli/commands/verify.py
mfwolffe Apr 28, 2026
c88e02b
Extract templates_list_cmd into cli/commands/templates.py
mfwolffe Apr 28, 2026
c2ca473
Sort cli/commands/__init__.py re-export imports
mfwolffe Apr 28, 2026
9d4f385
Move _previously_accepted + _human_size to cli/commands/_shared.py
mfwolffe Apr 28, 2026
7a4ff3d
Extract repl_cmd into cli/commands/repl.py
mfwolffe Apr 28, 2026
3b251e9
Extract push_cmd into cli/commands/push.py
mfwolffe Apr 28, 2026
cf75e76
Extract pull_cmd into cli/commands/pull.py
mfwolffe Apr 28, 2026
95ad10e
Extract serve_cmd into cli/commands/serve.py
mfwolffe Apr 28, 2026
fd7b5e6
Extract cache_show/prune/clear into cli/commands/cache.py
mfwolffe Apr 28, 2026
340df56
Re-export _parse_duration for test compatibility
mfwolffe Apr 28, 2026
7598f93
Extract harvest_cmd into cli/commands/harvest.py
mfwolffe Apr 28, 2026
125f56e
Extract synth_instructions/revert/list into cli/commands/synth.py
mfwolffe Apr 28, 2026
29a8b6e
Extract preference_mine/apply/revert/list into cli/commands/preferenc…
mfwolffe Apr 28, 2026
c4cc7af
Extract metrics_cmd + metrics_watch_cmd into cli/commands/metrics.py
mfwolffe Apr 28, 2026
a84adea
Extract show_cmd + 9 show helpers into cli/commands/show.py
mfwolffe Apr 28, 2026
dc211cf
Extract init_cmd + 4 init helpers into cli/commands/init.py
mfwolffe Apr 28, 2026
41cac0a
Extract train_cmd + multi-GPU helpers into cli/commands/train.py
mfwolffe Apr 28, 2026
f6d4999
Extract prompt_cmd + VL/audio dispatch helpers into cli/commands/prom…
mfwolffe Apr 28, 2026
2b94a43
Extract export_cmd into cli/commands/export.py + drop stale ARG001 ig…
mfwolffe Apr 28, 2026
5bfd080
Extract metrics_cmd query into dlm.metrics.cli:gather_metrics
mfwolffe Apr 28, 2026
4a930b8
Extract synth_instructions_cmd into dlm.synth.dispatch:run_synth_inst…
mfwolffe Apr 28, 2026
d1ea0f6
Extract preference_mine into dlm.preference.dispatch:run_preference_mine
mfwolffe Apr 28, 2026
457be54
Pin synth dispatcher's build_judge import to dlm.preference.judge
mfwolffe Apr 28, 2026
80c9adb
Extract init scaffold + manifest write into dlm.store.bootstrap:run_init
mfwolffe Apr 29, 2026
f37fac5
Extract show summarizers into dlm.store.show:gather_store_view
mfwolffe Apr 29, 2026
ce2cab1
Extract text-path prompt body into dlm.inference.dispatch:run_prompt
mfwolffe Apr 29, 2026
abbf3a6
Extract train doctor + manifest-provision + run_phases into dlm.train…
mfwolffe Apr 29, 2026
d39aefb
Move audit-13 follow-up findings into versioned docs tree
mfwolffe Apr 30, 2026
4d133cf
Warn about MLX PEFT-adapter bug in dlm prompt --backend help + refere…
mfwolffe Apr 30, 2026
931f6bb
Fix MLX backend silently producing base output for PEFT adapters
mfwolffe Apr 30, 2026
f7f0450
Fail loud when mlx-lm leaves the model un-wrapped after load
mfwolffe Apr 30, 2026
5ac6d8d
Promote normalize_probe_markers to public for cross-package callers
mfwolffe Apr 30, 2026
230cb99
Normalize probe markers in synth + gate instruction parsers
mfwolffe Apr 30, 2026
ae194af
Normalize probe markers in replay snapshot instruction rows
mfwolffe Apr 30, 2026
3ea99a5
Append post-MLX-fix retest note to finding-02
mfwolffe Apr 30, 2026
e07c696
Audit 13-followup Finding 05: Q/A-shape recipe doesn't rescue smol
mfwolffe Apr 30, 2026
1f36c74
Audit 13-followup Finding 05: append sway results
mfwolffe Apr 30, 2026
88e989b
Wrap vllm + mlx-serve target export pipelines in dlm.export.entry
mfwolffe May 3, 2026
3092c52
Wrap llama-server post-export prep + smoke in dlm.export.entry
mfwolffe May 3, 2026
b8d1447
Add direct unit tests for new dispatcher modules
mfwolffe May 3, 2026
4626047
Merge remote-tracking branch 'origin/trunk' into sprint/45-cli-extrac…
mfwolffe May 4, 2026
d4b1b6c
Repoint integration-test build_judge patches to canonical module path
mfwolffe May 4, 2026
381 changes: 381 additions & 0 deletions docs/audits/13-brutal-promise-audit.md

Large diffs are not rendered by default.

77 changes: 77 additions & 0 deletions docs/audits/13-followup/PLAN.md
@@ -0,0 +1,77 @@
# Audit 13 follow-up — get fortran Q/A binding to work

**Goal:** flip `paraphrase_invariance` from **−3.51σ FAIL** to **PASS** and
get the Ollama smoke prompt to answer a fortran question with fortran (not
Python). Driven by Audit 13's diagnosis: the original .dlm had ~8 MB PROSE
source vs only **8** INSTRUCTION sections — too little Q/A signal for
SmolLM2-135M to bind question→answer semantics in 120 steps.
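
The pass gate above can be sketched in a few lines. This is a hedged illustration, not sway's documented formula: it assumes the conventional z-score definition (the fine-tuned model's probe score standardized against a sample of base-model scores), and the sample values are placeholders, not real audit output.

```python
# Hedged sketch of the z-score pass gate. Assumes the conventional
# standardization (ft score vs. base-score distribution); sway's actual
# computation may differ. All numbers below are placeholders.
import statistics

def z_score(ft_score: float, base_scores: list[float]) -> float:
    """Standardize ft_score against the base model's score samples."""
    mean = statistics.mean(base_scores)
    std = statistics.stdev(base_scores)
    return (ft_score - mean) / std

base_scores = [0.48, 0.52, 0.50, 0.47, 0.53]  # placeholder base samples
z = z_score(0.51, base_scores)
passes = z >= 1.0  # the PLAN's bar: flip -3.51 sigma to at least +1.0
```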

## Hypothesis

Q/A binding failed because of **corpus shape**, not corpus volume or model
floor. Three corrections:

1. **Add INSTRUCTION density.** Hand-author 50+ high-quality Q/A pairs
sourced from the FORD-generated stdlib markdown specs at
`/tmp/stdlib_build/doc/specs/` (40 expert-written module docs). Plus
`dlm synth instructions --strategy extraction --apply` to harvest
more Q/A from the PROSE.
2. **Add a third PROSE source.** Pull `/tmp/stdlib_build/doc/specs/**/*.md`
in as source-directive content. The markdown files are "what does X do?"
shaped — exactly the registration the model is missing.
3. **More steps.** Train for 500–1000 steps instead of 120. Audit 13's loss
   curve was still descending at step 120 (1.927 → 1.638 over six log points).

If `paraphrase_invariance` still fails after this, the floor is real and we
need to repeat at SmolLM2-360M and 1.7B as a tier comparison.

## Sway spec for the re-run

The re-run will require `pip install 'dlm-sway[hf,dlm]'` in the audit venv so
the bridge lights up `section_internalization`, `leakage`, and bridge-aware
`paraphrase_invariance` (Audit 13 m13.8).

```yaml
version: 1
dlm_source: ./the-doc.dlm
models:
base: { kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct" }
ft: { kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct",
adapter: "~/.dlm/store/<ULID>/adapter/versions/v0001" }
defaults:
seed: 0
differential: true
coverage_threshold: 0.6
score_weights: { adherence: 0.30, attribution: 0.35,
calibration: 0.20, ablation: 0.15 }
suite:
- { name: null_baseline, kind: null_adapter, prompts_from: sections/instruction }
- { name: dk_fortran, kind: delta_kl, prompts_from: sections/instruction }
- { name: para_fortran, kind: paraphrase_invariance,
prompts_from: sections/instruction,
assert: { generalization_ratio_gte: 0.5 } }
- { name: sis_fortran, kind: section_internalization } # bridge probe
- { name: leak_fortran, kind: leakage, # bridge probe
assert: { fragility_gte: 0.4 } }
- { name: cal_general, kind: calibration_drift,
assert: { regression_rate_lt: 0.15 } }
- { name: abl_fortran, kind: adapter_ablation,
lambdas: [0.0, 0.25, 0.5, 0.75, 1.0, 1.25],
assert: { r_squared_gte: 0.85 } }
```
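
One quick sanity check on the spec before a long run, assuming sway expects `score_weights` to form a convex combination (an assumption; the spec format is not documented here):

```python
# Sanity check on the sway spec's score_weights (assumption: they
# should sum to 1.0 so the composite score is a convex combination).
weights = {"adherence": 0.30, "attribution": 0.35,
           "calibration": 0.20, "ablation": 0.15}
total = sum(weights.values())
assert abs(total - 1.0) < 1e-9, f"score_weights sum to {total}, not 1.0"
```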

## Success bar

- `paraphrase_invariance` z-score ≥ +1.0 (PASS), generalization_ratio ≥ 0.5.
- `section_internalization` ≥ +1.0σ on at least 5 of the new INSTRUCTION
sections (bridge probe — verifies attribution).
- `leakage` `fragility_gte ≥ 0.4` (sanity: pattern-match, not memorization).
- `delta_kl` z-score ≥ +1.0 on a held-out fortran prompt set (the audit-13
+44σ was huge; we just need real signal, not regression).
- `adapter_ablation` R² ≥ 0.85 on the λ-scaled curve (healthy fine-tune).
- Ollama smoke: "When should I use `do concurrent`?" → fortran answer with
`do concurrent` syntax visible.
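
The `adapter_ablation` bar can be made concrete with a small sketch. This assumes the R² is taken over an ordinary least-squares line fit of metric score against λ (a plausible reading of "λ-scaled curve", not a documented sway detail); the score values are made-up placeholders.

```python
# Hedged sketch of the adapter_ablation R^2 gate. Assumes a linear
# least-squares fit of score vs. lambda; the scores are placeholders,
# NOT real audit output.

def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line fit through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

lambdas = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]  # mirrors the sway spec
scores = [0.02, 0.26, 0.49, 0.77, 1.01, 1.24]  # placeholder metric values
healthy = r_squared(lambdas, scores) >= 0.85
```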

If any of these miss, the next iteration is a model-floor sweep at
SmolLM2-360M and 1.7B with the same recipe. If they all pass at 135M, the
recipe is the dlm cookbook starter.
62 changes: 62 additions & 0 deletions docs/audits/13-followup/finding03/baseline.md
@@ -0,0 +1,62 @@
# Finding 03 — base capability snapshot (pre-training)

`Qwen/Qwen2.5-Coder-1.5B-Instruct`, fp16, MPS, greedy decoding, no
adapter.

## Probe set

| # | Question | Verdict | Notes |
| --- | --- | --- | --- |
| 1 | How do you declare an allocatable array of `real(real64)` in modern Fortran? | **correct** | Idiomatic syntax, complete `program` block. |
| 2 | When should you prefer `do concurrent` over plain `do`? | partial | Mentions performance/parallelism but vague on loop-carried-dependence semantics. |
| 3 | Show the signature of stdlib's sorting routine in Fortran. | **WRONG** | Claims Fortran has no built-in sort, invents `iso_fortran_env::sort` which does not exist. |
| 4 | What does `intent(in)` mean for a dummy argument? | **correct** | Accurate explanation + example. |

## Baseline character

The base already speaks fortran fluently — syntax, modern features
(`do concurrent`, `intent`, `allocatable`), `program` structure. What
it does **not** know is **`fortran-lang/stdlib`** — the third-party
modular standard library at `/tmp/stdlib_build/src`. Q3 demonstrates
this directly: rather than pointing to `stdlib_sorting` or
`stdlib_sorting::sort`, the model fabricates a non-existent
`iso_fortran_env::sort`. This is exactly the kind of specific-corpus
knowledge a LoRA should be able to add.

## What this means for the experiment

The finding-03 hypothesis is now *narrow and falsifiable*:

> Training a LoRA on the audit-13-followup corpus (fortran source
> files + stdlib source + stdlib doc/specs markdown) should produce
> a measurable correction on stdlib-specific questions, leaving
> general-fortran questions unchanged.

If post-training Q3 produces a correct answer pointing at
`stdlib_sorting`, the LoRA added real domain knowledge above the base
capability — and dlm's product story stands on more than just style
adaptation.

If post-training Q3 still hallucinates and Q1/Q2/Q4 *degrade*, the
LoRA is parasitic at this base — the same memorization-without-generalization
pattern as Findings 01 and 02, just at a more capable base.

## Sample base outputs (verbatim, truncated)

```
Q1: How do you declare an allocatable array of real(real64) in modern Fortran?
> In modern Fortran, you can declare an allocatable array of real numbers
> using the `ALLOCATABLE` keyword. ...
> real(real64), allocatable :: my_array(:)
> allocate(my_array(5))
```

```
Q3: Show the signature of stdlib's sorting routine in Fortran.
> The standard library in Fortran does not have a built-in sorting
> routine like C or other languages. However, you can use the `sort`
> function from the `iso_fortran_env` module to sort arrays.
```

(Q3 is the wedge.)
78 changes: 78 additions & 0 deletions docs/audits/13-followup/finding03/direct-query-results.md
@@ -0,0 +1,78 @@
# Finding 03 — direct-query smoke results

Greedy decoding (`temp=0.0`, `max-tokens=150-200`).

## Q1 — allocatable real(real64) array

```
$ dlm prompt finding03/the-doc.dlm "How do you declare an allocatable array of real(real64)..."
To declare an allocatable array of real(real64) in modern Fortran, you can use
the `allocatable` keyword followed by the array name and its dimensions...
real(real64), allocatable :: my_array(:)
allocate(my_array(10))
do i = 1, size(my_array) ...
deallocate(my_array)
```

Verdict: **correct, ≈ same as base.**

## Q2 — do concurrent vs do

```
$ dlm prompt finding03/the-doc.dlm "When should you prefer \`do concurrent\`..."
In what scenarios might you encounter a situation where `do concurrent` is more
suitable than a plain `do` loop? Additionally, how can you modify a `do
concurrent` loop to ensure that it executes in a specific order, and what are
the potential consequences of not doing so?

```fortran
program concurrent_example
implicit none
integer :: i, j
integer, allocatable :: array(:)
...
! Use a plain do loop
do i = 1, 10
do j =
```

Verdict: **worse than base.** Repeats the question rhetorically, dumps an
unrelated allocatable code example, and never answers the question.

## Q3 — stdlib sorting (THE WEDGE)

```
$ dlm prompt finding03/the-doc.dlm "Show the signature of stdlib's sorting routine in Fortran."
The signature should be similar to the one in C, but with Fortran-specific syntax.
The sorting routine should be able to sort an array of integers in ascending order.
The sorting routine should be able to sort an array of floating-point numbers in ascending order.
The sorting routine should be able to sort an array of characters in ascending order.
The sorting routine should be able to sort an array of strings in ascending order.
The sorting routine should be able to sort an array of arrays in ascending order.
... (degenerate enumeration loop, never mentions stdlib_sorting)
```

Verdict: **worse than base.** Base hallucinated `iso_fortran_env::sort`
(concrete but wrong). Trained adapter degenerates into a "should be able to
sort..." enumeration loop. The wedge gap was *not* closed; it widened.

## Q4 — intent(in)

```
$ dlm prompt finding03/the-doc.dlm "What does \`intent(in)\` mean..."
In Fortran, a dummy argument is a placeholder for an actual argument that is
passed to a subroutine or function. The `intent(in)` attribute specifies that
the argument should be passed by value and should not be modified within the
subroutine or function...
subroutine example_subroutine(input_value)
implicit none
integer, intent(in) :: input_value
```

Verdict: **correct, ≈ same as base.**

## Net wedge score

**0/4 questions improved.** 2/4 actively degraded (Q2, Q3).
The LoRA did not add stdlib knowledge; it added autoregressive
fortran-source-completion habits that interfere with Q/A behavior.
17 changes: 17 additions & 0 deletions docs/audits/13-followup/finding03/probe-base.dlm
@@ -0,0 +1,17 @@
---
dlm_id: 01KQDKYZK4Y33K70BXPME3BY1F
dlm_version: 1
base_model: qwen2.5-coder-1.5b
---

# Your document title

Write prose here. It will train via continued pretraining (CPT) loss.

::instruction::

### Q
Your example question.

### A
Your example answer.