JuliaControl · baggepinnen · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/benchmark_freqresp/Project.toml b/benchmark_freqresp/Project.toml
@@ -0,0 +1,6 @@
+[deps]
+BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
+ControlSystemsBase = "aaaaaaaa-a6ca-5380-bf3e-84a91bcd477e"
+LoopVectorization = "bdcacae8-1622-11e9-2a5c-532679323890"
+OhMyThreads = "67456a42-1dca-4109-a031-0a68de7e3ad5"
+Polyester = "f517fe37-dbe3-4b94-8317-1923a5111588"
diff --git a/benchmark_freqresp/README.md b/benchmark_freqresp/README.md
@@ -0,0 +1,161 @@
+# `freqresp!` optimization benchmark
+
+Standalone benchmark of strategies for `freqresp!` on the target use case:
+a **48-state SISO continuous state-space system over 150 frequencies**.
+
+Nothing in the package source is modified — all candidate implementations live in
+`variants.jl`. The package's own `freqresp!` is the correctness oracle.
+
+## Run
+
+```bash
+julia --project=. -e 'using Pkg; Pkg.develop(path="../lib/ControlSystemsBase"); \
+  Pkg.add(["BenchmarkTools","LoopVectorization","Polyester","OhMyThreads"]); \
+  Pkg.instantiate(); Pkg.precompile()'
+
+JULIA_EXCLUSIVE=1 julia -t auto --project=. benchmark.jl   # threads fixed at startup
+```
+
+## Variants
+
+| ID  | Strategy                                                              |
+|-----|----------------------------------------------------------------------|
+| V0  | `ControlSystemsBase.freqresp!` (oracle, full call incl. factorization) |
+| V1  | serial hand-rolled loop, original `ldiv2!` (loop-only)               |
+| V2  | `Threads.@spawn` over contiguous frequency chunks                    |
+| V2b | `Polyester.@batch` over chunks                                       |
+| V2c | `OhMyThreads.@tasks` with `TaskLocalValue` buffers                   |
+| V3  | serial loop, `@turbo` `ldiv2!` (real/imag split)                     |
+| V4  | `@spawn` + `@turbo` `ldiv2!`                                          |
+| V4b | `Polyester.@batch` + `@turbo` `ldiv2!`                               |
+
+All variants operate on pre-factored matrices (`A=F.H`, `C=complex.(sys.C*Q)`,
+`B=Q\sys.B`, `D`) so the "loop-only" timing isolates the per-frequency strategy from the
+one-time Hessenberg factorization. All pass the `isapprox(·, ·; rtol=1e-10)` correctness
+gate against the package baseline.
+
+## The `@turbo` / complex caveat
+
+LoopVectorization's `@turbo` does **not** support `Complex`. A bare `@turbo` on the
+complex `ldiv2!` loops will not compile, and a naive `reinterpret`-to-real is *silently
+wrong* (complex multiply has re/im cross terms). `ldiv2_turbo!` therefore carries the
+state as separate real/imag `Float64` arrays and writes out the complex arithmetic by
+hand, so the two hot `j = 1:k-2` loops (X update, u update) become valid real `@turbo`
+loops. Result matches the baseline to ~1e-14.
+
+## Results (this machine: 24 logical cores)
+
+Loop-only time, µs (lower is better):
+
+| Variant            | -t 4  | -t 8  | -t 24 |
+|--------------------|-------|-------|-------|
+| V1 serial          | 545   | 566   | 565   |
+| V3 turbo           | 360   | 372   | 372   |
+| V2 threaded        | 147   |  90   | 106   |
+| V2b polyester      | 203   | 108   |  95   |
+| V2c omt            | 154   |  96   | 127   |
+| V4 threaded+turbo  | **104** | **78** | 101 |
+| V4b polyester+turbo| 142   |  80   | **93** |
+
+Full-call baseline (V0, factorization included): ~625–655 µs.
+
+## Findings
+
+1. **The per-frequency loop, not the factorization, dominates.** The serial loop is
+   ~565 µs while the one-time Hessenberg factorization is only ~85 µs (full call ≈ loop +
+   factorization ≈ 650 µs). So optimizing the loop is worthwhile.
+
+2. **`@turbo` is a solid, free win: ~1.5×** (565 → ~370 µs serial), more than expected for
+   such short loops, with no threads and negligible accuracy loss. It does require the
+   manual real/imag split.
+
+3. **Threading gives ~6–7×** but scales sub-linearly: at this size **8 threads beats 24**
+   (150 freqs / 24 threads ≈ 6 freqs each — task overhead and oversubscription dominate).
+   The sweet spot here is ~8 threads.
+
+4. **Best overall is threading + `@turbo`** (~78 µs at 8 threads, a **~7.2×** speedup over
+   serial). `@spawn`+`@turbo` (V4) is consistently near-best across thread counts;
+   `Polyester`+`@turbo` (V4b) edges ahead only at 24 threads.
+
+5. **Scheduler choice is secondary.** `@spawn`, `Polyester`, and `OhMyThreads` are within
+   ~1.4× of each other; `@spawn` is the most robust across thread counts and adds no
+   dependency.
+
+## Reactant.jl and the modal reformulation
+
+Reactant.jl traces Julia into MLIR/XLA and wants vectorized array programs. The custom
+Givens-based `ldiv2!` (scalar indexing, data-dependent control flow, `givensAlgorithm`)
+does **not** trace into efficient XLA. To use Reactant you must first rewrite the
+computation as batched array ops — and the natural rewrite is the **modal / pole-residue**
+form: eigendecompose `A = V Λ V⁻¹` once, then
+
+```
+R(ω) = D + Σⱼ resⱼ / (iω − λⱼ),   resⱼ = (C·V)ⱼ · (V⁻¹·B)ⱼ
+```
+
+Every frequency is then a pure broadcast + reduction over an `nx × nfreq` grid — no
+per-frequency linear solve. Benchmarked in plain Julia (variants M1/M2/M3):
+
+| Variant            | µs (8 threads) |
+|--------------------|----------------|
+| M1 modal serial    | 42             |
+| M2 modal broadcast | 59             |
+| M3 modal threaded  | **10**         |
+
+So the modal form alone is **13×** (serial) to **54×** (threaded) faster than the current
+serial Hessenberg loop, and beats the best Hessenberg micro-optimization (threaded+`@turbo`,
+~73 µs) by 7×. Accuracy matched the baseline to ~1e-14 here (`cond(V)=147`).
+
+**Verdict on Reactant for this problem:** not worth it. The data is tiny (48×150 ≈ 7200
+complex numbers) and the modal form already runs in ~10 µs on CPU — below the per-call XLA
+dispatch / GPU kernel-launch latency (tens of µs). Reactant would add a heavy dependency
+and long compile times for no gain at this size. It would only pay off at much larger scale
+(thousands of states, very many frequencies, or batches of systems) or on GPU — and there
+the same modal array program is what you would feed it.
+
+### When the modal form degrades or fails
+
+The modal form's accuracy is governed by `cond(V)` (the conditioning of the eigenvector
+matrix). Checked against a 256-bit `BigFloat` resolvent reference:
+
+| System                              | cond(V) | Hessenberg relerr | modal relerr |
+|-------------------------------------|---------|-------------------|--------------|
+| random (near-normal), nx=20         | 2e1     | 2e-15             | 8e-15 ✓      |
+| 8 repeated poles at −1              | 1.5e110 | 1e-15             | **1.0** ✗    |
+| 12 repeated poles at −1             | 7.6e172 | 2e-15             | **1.0** ✗    |
+| clustered eigenvalues (non-normal)  | 1e20    | 8e-16             | **8.7e2** ✗  |
+| lightly damped, eval across pole    | 1e0     | 6e-15             | 2e-14 ✓      |
+
+Failure conditions, in order of severity:
+
+1. **Defective (non-diagonalizable) `A`** — a repeated eigenvalue with a non-trivial Jordan
+   block. `A = VΛV⁻¹` does not exist (`V` singular), and the simple-pole form cannot
+   represent the response (needs `rⱼ/(s−λ)ᵏ` terms). Total failure (relerr ≈ 1).
+2. **Near-defective / strongly non-normal `A`** — diagonalizable but `cond(V)` huge.
+   Forming `V⁻¹B` and the residue sum loses ≈ `log₁₀(cond(V))` digits with catastrophic
+   cancellation; error scales with `cond(V)`.
+3. **Clustered eigenvalues** — nearly-parallel eigenvectors → ill-conditioned `V` (a milder
+   case 2).
+
+These are common in control (repeated integrators, cascaded identical lags,
+Butterworth/Bessel clustered poles). Note that evaluating *near a pole* (`iω ≈ λⱼ`) is **not**
+a modal-specific weakness — the response genuinely blows up there for every method; with a
+well-conditioned `V` the modal form is as accurate as Hessenberg across the resonance.
+
+**Why the Hessenberg solve is immune:** it computes `(iωI − A)⁻¹B` via a backward-stable
+linear solve whose accuracy depends on `cond(iωI − A)` (benign except genuinely at a pole),
+**not** on `cond(V)`, and it never diagonalizes so it handles defective `A` fine. That
+robustness is why the package defaults to it.
+
+**Caveat / rule of thumb:** the modal fast path is safe only when `A` is near-normal with
+simple, well-separated poles. Guard it by checking `cond(V)` — beyond ~`1/√eps ≈ 1e8` half
+the digits are gone, and it should fall back to the Hessenberg solve. It is an excellent
+opt-in fast path for well-conditioned systems, not a safe drop-in default.
+
+## Recommendation
+
+For the package, the lowest-risk improvement is dropping in the real/imag-split `@turbo`
+`ldiv2!` (≈1.5× for free, no new threading semantics). If the loop is a known hotspot,
+add an opt-in threaded path (`@spawn` over chunks with per-task buffers) — combined with
+`@turbo` it reaches ~7× — but keep the default serial, since threading helps only for
+many frequencies and benefits from a bounded thread count.
diff --git a/benchmark_freqresp/benchmark.jl b/benchmark_freqresp/benchmark.jl
@@ -0,0 +1,109 @@
+# Benchmark freqresp! strategies for a 48-state SISO state-space system over 150 frequencies.
+#
+# Launch with multiple threads to exercise the threaded variants:
+#   julia -t auto --project=. benchmark.jl
+#
+# Variants (see variants.jl):
+#   V0 baseline        ControlSystemsBase.freqresp!         (correctness oracle, full call)
+#   V1 serial          hand-rolled transcription, loop-only
+#   V2 threaded        Threads.@spawn over chunks
+#   V2b polyester      Polyester.@batch over chunks
+#   V2c omt            OhMyThreads task-local buffers
+#   V3 turbo           @turbo (real/imag split) ldiv2, serial
+#   V4 threaded+turbo  @spawn + @turbo ldiv2
+
+using ControlSystemsBase, BenchmarkTools, LinearAlgebra, Printf
+import LoopVectorization
+
+include("variants.jl")
+
+const NTH = Threads.nthreads()
+@info "Benchmark configuration" threads=NTH
+NTH == 1 && @warn "Running with a single thread — threaded variant timings are meaningless. Relaunch with `julia -t auto`."
+
+# --- Target system & frequencies ---
+G = ssrand(1, 1, 48)                       # 48-state SISO, continuous
+w = exp10.(LinRange(-2, 4, 150))           # 150 log-spaced frequencies
+ny, nu = size(G)
+const T = ComplexF64
+make_R() = Array{T,3}(undef, ny, nu, length(w))
+
+# --- Pre-factored matrices for the loop-only variants ---
+F = hessenberg(G.A)
+Q = Matrix(F.Q)
+A = F.H
+C = complex.(G.C * Q)
+B = Q \ G.B
+D = G.D
+
+# Modal / pole-residue precompute (one-time eigendecomposition) — the Reactant-friendly form
+E = eigen(G.A)
+λ = ComplexF64.(E.values)                  # nx eigenvalues
+V = ComplexF64.(E.vectors)
+ctil = vec(complex.(G.C) * V)              # C·V               (nx,)
+btil = V \ ComplexF64.(G.B[:, 1])          # V⁻¹·B             (nx,)
+res = ctil .* btil                         # residues          (nx,)
+condV = cond(V)                            # conditioning of the modal transform
+
+# A wrapped "full call" variant that does the factorization inside (apples-to-apples w/ V0)
+function freqresp_serial_full!(R, sys, w)
+    F = hessenberg(sys.A); Q = Matrix(F.Q)
+    freqresp_serial!(R, F.H, complex.(sys.C*Q), Q\sys.B, sys.D, w)
+end
+
+# ----------------------------------------------------------------------------
+# Correctness: every variant must match the package baseline.
+# ----------------------------------------------------------------------------
+Rref = make_R(); ControlSystemsBase.freqresp!(Rref, G, w)
+
+loop_variants = [
+    ("V1 serial",          (R)->freqresp_serial!(R, A, C, B, D, w)),
+    ("V2 threaded",        (R)->freqresp_threaded!(R, A, C, B, D, w)),
+    ("V2b polyester",      (R)->freqresp_polyester!(R, A, C, B, D, w)),
+    ("V2c omt",            (R)->freqresp_omt!(R, A, C, B, D, w)),
+    ("V3 turbo",           (R)->freqresp_turbo!(R, A, C, B, D, w)),
+    ("V3t tturbo",         (R)->freqresp_tturbo!(R, A, C, B, D, w)),
+    ("V4 threaded+turbo",  (R)->freqresp_threaded_turbo!(R, A, C, B, D, w)),
+    ("V4b polyester+turbo",(R)->freqresp_polyester_turbo!(R, A, C, B, D, w)),
+    ("M1 modal",           (R)->freqresp_modal!(R, λ, res, D, w)),
+    ("M2 modal broadcast", (R)->freqresp_modal_broadcast!(R, λ, res, D, w)),
+    ("M3 modal threaded",  (R)->freqresp_modal_threaded!(R, λ, res, D, w)),
+]
+@info "Modal transform conditioning" condV
+
+println("\n=== Correctness (rtol=1e-10 vs ControlSystemsBase.freqresp!) ===")
+for (name, f) in loop_variants
+    R = make_R(); f(R)
+    relerr = maximum(abs, R .- Rref) / maximum(abs, Rref)
+    ok = relerr < 1e-8
+    @printf("  %-20s %s   relerr=%.3e\n", name, ok ? "OK   " : "LOOSE", relerr)
+end
+
+# ----------------------------------------------------------------------------
+# Timing
+# ----------------------------------------------------------------------------
+Rb = make_R()
+
+println("\n=== Loop-only timing (pre-factored matrices passed in) ===")
+results = Tuple{String,Float64}[]
+for (name, f) in loop_variants
+    t = @belapsed $f($Rb)
+    push!(results, (name, t))
+    @printf("  %-20s %8.2f µs\n", name, t*1e6)
+end
+
+println("\n=== Full-call timing (factorization included, as real freqresp usage) ===")
+tb = @belapsed ControlSystemsBase.freqresp!($Rb, $G, $w)
+@printf("  %-20s %8.2f µs\n", "V0 baseline", tb*1e6)
+tf = @belapsed freqresp_serial_full!($Rb, $G, $w)
+@printf("  %-20s %8.2f µs\n", "V1 serial (full)", tf*1e6)
+
+# --- Summary table sorted by loop-only time ---
+println("\n=== Summary (loop-only, fastest first) ===")
+sort!(results, by = x -> x[2])
+fastest = results[1][2]
+for (name, t) in results
+    @printf("  %-20s %8.2f µs   (%.2fx vs fastest)\n", name, t*1e6, t/fastest)
+end
+@printf("\n  Full call baseline (V0): %.2f µs  — factorization dominates at this size.\n", tb*1e6)
+println("  Threads used: $NTH")
diff --git a/benchmark_freqresp/modal_accuracy.jl b/benchmark_freqresp/modal_accuracy.jl
@@ -0,0 +1,45 @@
+using ControlSystemsBase, LinearAlgebra, Printf
+
+# Modal / pole-residue evaluation (Float64), returns response + cond(V)
+function modal_eval(A,B,C,D,w)
+    E = eigen(A)
+    λ = ComplexF64.(E.values); V = ComplexF64.(E.vectors)
+    ctil = vec(complex.(C)*V); btil = V \ ComplexF64.(B[:,1])
+    res = ctil .* btil
+    R = ComplexF64[ D[1,1] + sum(res ./ (complex(0.0,ω) .- λ)) for ω in w ]
+    R, cond(V)
+end
+
+# High-precision reference: resolvent in Complex{BigFloat}
+function ref_eval(A,B,C,D,w)
+    setprecision(256) do
+        Ab=Complex{BigFloat}.(A); Bb=Complex{BigFloat}.(B[:,1])
+        Cb=Complex{BigFloat}.(C); d=Complex{BigFloat}(D[1,1])
+        [ (Cb*((complex(BigFloat(0),BigFloat(ω))*I - Ab)\Bb))[1] + d for ω in w ]
+    end
+end
+
+relerr(R, Rref) = maximum(abs.(ComplexF64.(R) .- ComplexF64.(Rref))) / maximum(abs, ComplexF64.(Rref))
+
+function report(name, sys, w)
+    A,B,C,D = sys.A, sys.B, sys.C, sys.D
+    Rref = ref_eval(A,B,C,D,w)
+    Rh   = vec(freqresp(sys, w))                 # package Hessenberg path
+    Rm, κ = modal_eval(A,B,C,D,w)
+    @printf("%-32s nx=%2d  cond(V)=%.2e   Hessenberg relerr=%.2e   modal relerr=%.2e\n",
+            name, sys.nx, κ, relerr(Rh,Rref), relerr(Rm,Rref))
+end
+
+w = exp10.(LinRange(-2, 2, 60))
+
+report("random (normal-ish)", ssrand(1,1,20), w)
+report("8 repeated poles at -1", ss(tf(1.0,[1.0,1.0]))^8, w)
+report("12 repeated poles at -1", ss(tf(1.0,[1.0,1.0]))^12, w)
+# clustered (non-defective but near-defective) eigenvalues via upper-triangular A
+let n=20
+    A = triu(0.1 .* randn(n,n)); A[diagind(A)] .= -1.0 .- 0.001 .* (1:n)  # tightly clustered evals
+    B = randn(n,1); C = randn(1,n); D = zeros(1,1)
+    report("clustered eigenvalues (triu)", ss(A,B,C,D), w)
+end
+# lightly damped pole near imaginary axis, evaluate across it
+report("lightly damped (zeta=1e-4)", ss(tf(1.0,[1.0, 2e-4, 1.0])), exp10.(LinRange(-1,1,200)))