Benchmarking freqresp! optimization strategies
Benchmark of strategies for maximizing performance of freqresp! on a representative use case: a 48-state SISO continuous state-space system evaluated over 150 frequencies. A self-contained benchmark environment was used; the package's own freqresp! serves as the correctness oracle and no package source was modified for the comparison.
Variants
| ID |
Strategy |
| V0 |
ControlSystemsBase.freqresp! (oracle, full call incl. factorization) |
| V1 |
serial hand-rolled loop, original ldiv2! (loop-only) |
| V2 |
Threads.@spawn over contiguous frequency chunks |
| V2b |
Polyester.@batch over chunks |
| V2c |
OhMyThreads.@tasks with TaskLocalValue buffers |
| V3 |
serial loop, @turbo ldiv2! (real/imag split) |
| V4 |
@spawn + @turbo ldiv2! |
| V4b |
Polyester.@batch + @turbo ldiv2! |
All variants operate on pre-factored matrices (A=F.H, C=complex.(sys.C*Q), B=Q\sys.B, D) so the "loop-only" timing isolates the per-frequency strategy from the one-time Hessenberg factorization. All pass isapprox(·, ·; rtol=1e-10) against the package baseline.
Results (24 logical cores)
Loop-only time, µs (lower is better):
| Variant |
-t 4 |
-t 8 |
-t 24 |
| V1 serial |
545 |
566 |
565 |
| V3 turbo |
360 |
372 |
372 |
| V2 threaded |
147 |
90 |
106 |
| V2b polyester |
203 |
108 |
95 |
| V2c omt |
154 |
96 |
127 |
| V4 threaded+turbo |
104 |
78 |
101 |
| V4b polyester+turbo |
142 |
80 |
93 |
Full-call baseline (V0, factorization included): ~625–655 µs.
Findings
-
The per-frequency loop, not the factorization, dominates. The serial loop is ~565 µs while the one-time Hessenberg factorization is only ~85 µs (full call ≈ loop + factorization ≈ 650 µs). Optimizing the loop is therefore worthwhile.
-
@turbo is a solid ~1.5× win (565 → ~370 µs serial), more than expected for such short loops, with no threads and negligible accuracy loss (~1e-14 vs baseline).
-
Threading gives ~6–7× but scales sub-linearly: at this size 8 threads beats 24 (150 freqs / 24 threads ≈ 6 freqs each — task overhead and oversubscription dominate). Sweet spot ~8 threads.
-
Best overall is threading + @turbo (~78 µs at 8 threads, a ~7.2× speedup over serial). @spawn+@turbo (V4) is consistently near-best across thread counts; Polyester+@turbo (V4b) edges ahead only at 24 threads.
-
Scheduler choice is secondary. @spawn, Polyester, and OhMyThreads are within ~1.4× of each other; @spawn is the most robust across thread counts and adds no dependency.
The @turbo / complex caveat
LoopVectorization's @turbo does not support Complex. A bare @turbo on the complex ldiv2! loops will not compile, and a naive reinterpret-to-real is silently wrong (complex multiply has re/im cross terms). The @turbo variant therefore carries the state as separate real/imag Float64 arrays and writes out the complex arithmetic by hand, so the two hot j = 1:k-2 loops (X update, u update) become valid real @turbo loops. Result matches the baseline to ~1e-14.
Recommendation
The lowest-risk improvement is dropping in the real/imag-split @turbo ldiv2! (≈1.5× for free, no new threading semantics). If the loop is a known hotspot, add an opt-in threaded path (@spawn over chunks with per-task buffers) — combined with @turbo it reaches ~7× — but keep the default serial, since threading helps only for many frequencies and benefits from a bounded thread count.
Caveats
Single-machine numbers (24 cores, JULIA_EXCLUSIVE=1); the threading sweet spot shifts with core count, and absolute times vary a few % with the random ssrand system.
Benchmarking
freqresp!optimization strategiesBenchmark of strategies for maximizing performance of
freqresp!on a representative use case: a 48-state SISO continuous state-space system evaluated over 150 frequencies. A self-contained benchmark environment was used; the package's ownfreqresp!serves as the correctness oracle and no package source was modified for the comparison.Variants
ControlSystemsBase.freqresp!(oracle, full call incl. factorization)ldiv2!(loop-only)Threads.@spawnover contiguous frequency chunksPolyester.@batchover chunksOhMyThreads.@taskswithTaskLocalValuebuffers@turboldiv2!(real/imag split)@spawn+@turboldiv2!Polyester.@batch+@turboldiv2!All variants operate on pre-factored matrices (
A=F.H,C=complex.(sys.C*Q),B=Q\sys.B,D) so the "loop-only" timing isolates the per-frequency strategy from the one-time Hessenberg factorization. All passisapprox(·, ·; rtol=1e-10)against the package baseline.Results (24 logical cores)
Loop-only time, µs (lower is better):
Full-call baseline (V0, factorization included): ~625–655 µs.
Findings
The per-frequency loop, not the factorization, dominates. The serial loop is ~565 µs while the one-time Hessenberg factorization is only ~85 µs (full call ≈ loop + factorization ≈ 650 µs). Optimizing the loop is therefore worthwhile.
@turbois a solid ~1.5× win (565 → ~370 µs serial), more than expected for such short loops, with no threads and negligible accuracy loss (~1e-14 vs baseline).Threading gives ~6–7× but scales sub-linearly: at this size 8 threads beats 24 (150 freqs / 24 threads ≈ 6 freqs each — task overhead and oversubscription dominate). Sweet spot ~8 threads.
Best overall is threading +
@turbo(~78 µs at 8 threads, a ~7.2× speedup over serial).@spawn+@turbo(V4) is consistently near-best across thread counts;Polyester+@turbo(V4b) edges ahead only at 24 threads.Scheduler choice is secondary.
@spawn,Polyester, andOhMyThreadsare within ~1.4× of each other;@spawnis the most robust across thread counts and adds no dependency.The
@turbo/ complex caveatLoopVectorization's
@turbodoes not supportComplex. A bare@turboon the complexldiv2!loops will not compile, and a naivereinterpret-to-real is silently wrong (complex multiply has re/im cross terms). The@turbovariant therefore carries the state as separate real/imagFloat64arrays and writes out the complex arithmetic by hand, so the two hotj = 1:k-2loops (X update, u update) become valid real@turboloops. Result matches the baseline to ~1e-14.Recommendation
The lowest-risk improvement is dropping in the real/imag-split
@turboldiv2!(≈1.5× for free, no new threading semantics). If the loop is a known hotspot, add an opt-in threaded path (@spawnover chunks with per-task buffers) — combined with@turboit reaches ~7× — but keep the default serial, since threading helps only for many frequencies and benefits from a bounded thread count.Caveats
Single-machine numbers (24 cores,
JULIA_EXCLUSIVE=1); the threading sweet spot shifts with core count, and absolute times vary a few % with the randomssrandsystem.