Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions benchmark_freqresp/Project.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
ControlSystemsBase = "aaaaaaaa-a6ca-5380-bf3e-84a91bcd477e"
LoopVectorization = "bdcacae8-1622-11e9-2a5c-532679323890"
OhMyThreads = "67456a42-1dca-4109-a031-0a68de7e3ad5"
Polyester = "f517fe37-dbe3-4b94-8317-1923a5111588"
161 changes: 161 additions & 0 deletions benchmark_freqresp/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,161 @@
# `freqresp!` optimization benchmark

Standalone benchmark of strategies for `freqresp!` on the target use case:
a **48-state SISO continuous state-space system over 150 frequencies**.

Nothing in the package source is modified — all candidate implementations live in
`variants.jl`. The package's own `freqresp!` is the correctness oracle.

## Run

```bash
julia --project=. -e 'using Pkg; Pkg.develop(path="../lib/ControlSystemsBase"); \
Pkg.add(["BenchmarkTools","LoopVectorization","Polyester","OhMyThreads"]); \
Pkg.instantiate(); Pkg.precompile()'

JULIA_EXCLUSIVE=1 julia -t auto --project=. benchmark.jl # threads fixed at startup
```

## Variants

| ID | Strategy |
|-----|----------------------------------------------------------------------|
| V0 | `ControlSystemsBase.freqresp!` (oracle, full call incl. factorization) |
| V1 | serial hand-rolled loop, original `ldiv2!` (loop-only) |
| V2 | `Threads.@spawn` over contiguous frequency chunks |
| V2b | `Polyester.@batch` over chunks |
| V2c | `OhMyThreads.@tasks` with `TaskLocalValue` buffers |
| V3 | serial loop, `@turbo` `ldiv2!` (real/imag split) |
| V4 | `@spawn` + `@turbo` `ldiv2!` |
| V4b | `Polyester.@batch` + `@turbo` `ldiv2!` |

All variants operate on pre-factored matrices (`A=F.H`, `C=complex.(sys.C*Q)`,
`B=Q\sys.B`, `D`) so the "loop-only" timing isolates the per-frequency strategy from the
one-time Hessenberg factorization. All pass the `isapprox(·, ·; rtol=1e-10)` correctness
gate against the package baseline.

## The `@turbo` / complex caveat

LoopVectorization's `@turbo` does **not** support `Complex`. A bare `@turbo` on the
complex `ldiv2!` loops will not compile, and a naive `reinterpret`-to-real is *silently
wrong* (complex multiply has re/im cross terms). `ldiv2_turbo!` therefore carries the
state as separate real/imag `Float64` arrays and writes out the complex arithmetic by
hand, so the two hot `j = 1:k-2` loops (X update, u update) become valid real `@turbo`
loops. Result matches the baseline to ~1e-14.

## Results (this machine: 24 logical cores)

Loop-only time, µs (lower is better):

| Variant | -t 4 | -t 8 | -t 24 |
|--------------------|-------|-------|-------|
| V1 serial | 545 | 566 | 565 |
| V3 turbo | 360 | 372 | 372 |
| V2 threaded | 147 | 90 | 106 |
| V2b polyester | 203 | 108 | 95 |
| V2c omt | 154 | 96 | 127 |
| V4 threaded+turbo | **104** | **78** | 101 |
| V4b polyester+turbo| 142 | 80 | **93** |

Full-call baseline (V0, factorization included): ~625–655 µs.

## Findings

1. **The per-frequency loop, not the factorization, dominates.** The serial loop is
~565 µs while the one-time Hessenberg factorization is only ~85 µs (full call ≈ loop +
factorization ≈ 650 µs). So optimizing the loop is worthwhile.

2. **`@turbo` is a solid, free win: ~1.5×** (565 → ~370 µs serial), more than expected for
such short loops, with no threads and negligible accuracy loss. It does require the
manual real/imag split.

3. **Threading gives ~6–7×** but scales sub-linearly: at this size **8 threads beats 24**
(150 freqs / 24 threads ≈ 6 freqs each — task overhead and oversubscription dominate).
The sweet spot here is ~8 threads.

4. **Best overall is threading + `@turbo`** (~78 µs at 8 threads, a **~7.2×** speedup over
serial). `@spawn`+`@turbo` (V4) is consistently near-best across thread counts;
`Polyester`+`@turbo` (V4b) edges ahead only at 24 threads.

5. **Scheduler choice is secondary.** `@spawn`, `Polyester`, and `OhMyThreads` are within
~1.4× of each other; `@spawn` is the most robust across thread counts and adds no
dependency.

## Reactant.jl and the modal reformulation

Reactant.jl traces Julia into MLIR/XLA and wants vectorized array programs. The custom
Givens-based `ldiv2!` (scalar indexing, data-dependent control flow, `givensAlgorithm`)
does **not** trace into efficient XLA. To use Reactant you must first rewrite the
computation as batched array ops — and the natural rewrite is the **modal / pole-residue**
form: eigendecompose `A = V Λ V⁻¹` once, then

```
R(ω) = D + Σⱼ resⱼ / (iω − λⱼ), resⱼ = (C·V)ⱼ · (V⁻¹·B)ⱼ
```

Every frequency is then a pure broadcast + reduction over an `nx × nfreq` grid — no
per-frequency linear solve. Benchmarked in plain Julia (variants M1/M2/M3):

| Variant | µs (8 threads) |
|--------------------|----------------|
| M1 modal serial | 42 |
| M2 modal broadcast | 59 |
| M3 modal threaded | **10** |

So the modal form alone is **13×** (serial) to **54×** (threaded) faster than the current
serial Hessenberg loop, and beats the best Hessenberg micro-optimization (threaded+`@turbo`,
~73 µs) by 7×. Accuracy matched the baseline to ~1e-14 here (`cond(V)=147`).

**Verdict on Reactant for this problem:** not worth it. The data is tiny (48×150 ≈ 7200
complex numbers) and the modal form already runs in ~10 µs on CPU — below the per-call XLA
dispatch / GPU kernel-launch latency (tens of µs). Reactant would add a heavy dependency
and long compile times for no gain at this size. It would only pay off at much larger scale
(thousands of states, very many frequencies, or batches of systems) or on GPU — and there
the same modal array program is what you would feed it.

### When the modal form degrades or fails

The modal form's accuracy is governed by `cond(V)` (the conditioning of the eigenvector
matrix). Checked against a 256-bit `BigFloat` resolvent reference:

| System | cond(V) | Hessenberg relerr | modal relerr |
|-------------------------------------|---------|-------------------|--------------|
| random (near-normal), nx=20 | 2e1 | 2e-15 | 8e-15 ✓ |
| 8 repeated poles at −1 | 1.5e110 | 1e-15 | **1.0** ✗ |
| 12 repeated poles at −1 | 7.6e172 | 2e-15 | **1.0** ✗ |
| clustered eigenvalues (non-normal) | 1e20 | 8e-16 | **8.7e2** ✗ |
| lightly damped, eval across pole | 1e0 | 6e-15 | 2e-14 ✓ |

Failure conditions, in order of severity:

1. **Defective (non-diagonalizable) `A`** — a repeated eigenvalue with a non-trivial Jordan
block. `A = VΛV⁻¹` does not exist (`V` singular), and the simple-pole form cannot
represent the response (needs `rⱼ/(s−λ)ᵏ` terms). Total failure (relerr ≈ 1).
2. **Near-defective / strongly non-normal `A`** — diagonalizable but `cond(V)` huge.
Forming `V⁻¹B` and the residue sum loses ≈ `log₁₀(cond(V))` digits with catastrophic
cancellation; error scales with `cond(V)`.
3. **Clustered eigenvalues** — nearly-parallel eigenvectors → ill-conditioned `V` (a milder
case 2).

These are common in control (repeated integrators, cascaded identical lags,
Butterworth/Bessel clustered poles). Note that evaluating *near a pole* (`iω ≈ λⱼ`) is **not**
a modal-specific weakness — the response genuinely blows up there for every method; with a
well-conditioned `V` the modal form is as accurate as Hessenberg across the resonance.

**Why the Hessenberg solve is immune:** it computes `(iωI − A)⁻¹B` via a backward-stable
linear solve whose accuracy depends on `cond(iωI − A)` (benign except genuinely at a pole),
**not** on `cond(V)`, and it never diagonalizes so it handles defective `A` fine. That
robustness is why the package defaults to it.

**Caveat / rule of thumb:** the modal fast path is safe only when `A` is near-normal with
simple, well-separated poles. Guard it by checking `cond(V)` — beyond ~`1/√eps ≈ 1e8` half
the digits are gone, and it should fall back to the Hessenberg solve. It is an excellent
opt-in fast path for well-conditioned systems, not a safe drop-in default.

## Recommendation

For the package, the lowest-risk improvement is dropping in the real/imag-split `@turbo`
`ldiv2!` (≈1.5× for free, no new threading semantics). If the loop is a known hotspot,
add an opt-in threaded path (`@spawn` over chunks with per-task buffers) — combined with
`@turbo` it reaches ~7× — but keep the default serial, since threading helps only for
many frequencies and benefits from a bounded thread count.
109 changes: 109 additions & 0 deletions benchmark_freqresp/benchmark.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Benchmark freqresp! strategies for a 48-state SISO state-space system over 150 frequencies.
#
# Launch with multiple threads to exercise the threaded variants:
# julia -t auto --project=. benchmark.jl
#
# Variants (see variants.jl):
# V0 baseline ControlSystemsBase.freqresp! (correctness oracle, full call)
# V1 serial hand-rolled transcription, loop-only
# V2 threaded Threads.@spawn over chunks
# V2b polyester Polyester.@batch over chunks
# V2c omt OhMyThreads task-local buffers
# V3 turbo @turbo (real/imag split) ldiv2, serial
# V4 threaded+turbo @spawn + @turbo ldiv2

using ControlSystemsBase, BenchmarkTools, LinearAlgebra, Printf
import LoopVectorization

include("variants.jl")

const NTH = Threads.nthreads()
@info "Benchmark configuration" threads=NTH
NTH == 1 && @warn "Running with a single thread — threaded variant timings are meaningless. Relaunch with `julia -t auto`."

# --- Target system & frequencies ---
G = ssrand(1, 1, 48) # 48-state SISO, continuous
w = exp10.(LinRange(-2, 4, 150)) # 150 log-spaced frequencies
ny, nu = size(G)
const T = ComplexF64
make_R() = Array{T,3}(undef, ny, nu, length(w))

# --- Pre-factored matrices for the loop-only variants ---
F = hessenberg(G.A)
Q = Matrix(F.Q)
A = F.H
C = complex.(G.C * Q)
B = Q \ G.B
D = G.D

# Modal / pole-residue precompute (one-time eigendecomposition) — the Reactant-friendly form
E = eigen(G.A)
λ = ComplexF64.(E.values) # nx eigenvalues
V = ComplexF64.(E.vectors)
ctil = vec(complex.(G.C) * V) # C·V (nx,)
btil = V \ ComplexF64.(G.B[:, 1]) # V⁻¹·B (nx,)
res = ctil .* btil # residues (nx,)
condV = cond(V) # conditioning of the modal transform

# A wrapped "full call" variant that does the factorization inside (apples-to-apples w/ V0)
function freqresp_serial_full!(R, sys, w)
F = hessenberg(sys.A); Q = Matrix(F.Q)
freqresp_serial!(R, F.H, complex.(sys.C*Q), Q\sys.B, sys.D, w)
end

# ----------------------------------------------------------------------------
# Correctness: every variant must match the package baseline.
# ----------------------------------------------------------------------------
Rref = make_R(); ControlSystemsBase.freqresp!(Rref, G, w)

loop_variants = [
("V1 serial", (R)->freqresp_serial!(R, A, C, B, D, w)),
("V2 threaded", (R)->freqresp_threaded!(R, A, C, B, D, w)),
("V2b polyester", (R)->freqresp_polyester!(R, A, C, B, D, w)),
("V2c omt", (R)->freqresp_omt!(R, A, C, B, D, w)),
("V3 turbo", (R)->freqresp_turbo!(R, A, C, B, D, w)),
("V3t tturbo", (R)->freqresp_tturbo!(R, A, C, B, D, w)),
("V4 threaded+turbo", (R)->freqresp_threaded_turbo!(R, A, C, B, D, w)),
("V4b polyester+turbo",(R)->freqresp_polyester_turbo!(R, A, C, B, D, w)),
("M1 modal", (R)->freqresp_modal!(R, λ, res, D, w)),
("M2 modal broadcast", (R)->freqresp_modal_broadcast!(R, λ, res, D, w)),
("M3 modal threaded", (R)->freqresp_modal_threaded!(R, λ, res, D, w)),
]
@info "Modal transform conditioning" condV

println("\n=== Correctness (rtol=1e-10 vs ControlSystemsBase.freqresp!) ===")
for (name, f) in loop_variants
R = make_R(); f(R)
relerr = maximum(abs, R .- Rref) / maximum(abs, Rref)
ok = relerr < 1e-8
@printf(" %-20s %s relerr=%.3e\n", name, ok ? "OK " : "LOOSE", relerr)
end

# ----------------------------------------------------------------------------
# Timing
# ----------------------------------------------------------------------------
Rb = make_R()

println("\n=== Loop-only timing (pre-factored matrices passed in) ===")
results = Tuple{String,Float64}[]
for (name, f) in loop_variants
t = @belapsed $f($Rb)
push!(results, (name, t))
@printf(" %-20s %8.2f µs\n", name, t*1e6)
end

println("\n=== Full-call timing (factorization included, as real freqresp usage) ===")
tb = @belapsed ControlSystemsBase.freqresp!($Rb, $G, $w)
@printf(" %-20s %8.2f µs\n", "V0 baseline", tb*1e6)
tf = @belapsed freqresp_serial_full!($Rb, $G, $w)
@printf(" %-20s %8.2f µs\n", "V1 serial (full)", tf*1e6)

# --- Summary table sorted by loop-only time ---
println("\n=== Summary (loop-only, fastest first) ===")
sort!(results, by = x -> x[2])
fastest = results[1][2]
for (name, t) in results
@printf(" %-20s %8.2f µs (%.2fx vs fastest)\n", name, t*1e6, t/fastest)
end
@printf("\n Full call baseline (V0): %.2f µs — factorization dominates at this size.\n", tb*1e6)
println(" Threads used: $NTH")
45 changes: 45 additions & 0 deletions benchmark_freqresp/modal_accuracy.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
using ControlSystemsBase, LinearAlgebra, Printf

# Modal / pole-residue evaluation (Float64), returns response + cond(V)
function modal_eval(A,B,C,D,w)
E = eigen(A)
λ = ComplexF64.(E.values); V = ComplexF64.(E.vectors)
ctil = vec(complex.(C)*V); btil = V \ ComplexF64.(B[:,1])
res = ctil .* btil
R = ComplexF64[ D[1,1] + sum(res ./ (complex(0.0,ω) .- λ)) for ω in w ]
R, cond(V)
end

# High-precision reference: resolvent in Complex{BigFloat}
function ref_eval(A,B,C,D,w)
setprecision(256) do
Ab=Complex{BigFloat}.(A); Bb=Complex{BigFloat}.(B[:,1])
Cb=Complex{BigFloat}.(C); d=Complex{BigFloat}(D[1,1])
[ (Cb*((complex(BigFloat(0),BigFloat(ω))*I - Ab)\Bb))[1] + d for ω in w ]
end
end

relerr(R, Rref) = maximum(abs.(ComplexF64.(R) .- ComplexF64.(Rref))) / maximum(abs, ComplexF64.(Rref))

function report(name, sys, w)
A,B,C,D = sys.A, sys.B, sys.C, sys.D
Rref = ref_eval(A,B,C,D,w)
Rh = vec(freqresp(sys, w)) # package Hessenberg path
Rm, κ = modal_eval(A,B,C,D,w)
@printf("%-32s nx=%2d cond(V)=%.2e Hessenberg relerr=%.2e modal relerr=%.2e\n",
name, sys.nx, κ, relerr(Rh,Rref), relerr(Rm,Rref))
end

w = exp10.(LinRange(-2, 2, 60))

report("random (normal-ish)", ssrand(1,1,20), w)
report("8 repeated poles at -1", ss(tf(1.0,[1.0,1.0]))^8, w)
report("12 repeated poles at -1", ss(tf(1.0,[1.0,1.0]))^12, w)
# clustered (non-defective but near-defective) eigenvalues via upper-triangular A
let n=20
A = triu(0.1 .* randn(n,n)); A[diagind(A)] .= -1.0 .- 0.001 .* (1:n) # tightly clustered evals
B = randn(n,1); C = randn(1,n); D = zeros(1,1)
report("clustered eigenvalues (triu)", ss(A,B,C,D), w)
end
# lightly damped pole near imaginary axis, evaluate across it
report("lightly damped (zeta=1e-4)", ss(tf(1.0,[1.0, 2e-4, 1.0])), exp10.(LinRange(-1,1,200)))
Loading
Loading