Skip to content

Benchmark: freqresp! optimization strategies (@turbo, threading) for SISO 48-state x 150 freqs #1062

@baggepinnen

Description

@baggepinnen

Benchmarking freqresp! optimization strategies

Benchmark of strategies for maximizing performance of freqresp! on a representative use case: a 48-state SISO continuous state-space system evaluated over 150 frequencies. A self-contained benchmark environment was used; the package's own freqresp! serves as the correctness oracle and no package source was modified for the comparison.

Variants

ID Strategy
V0 ControlSystemsBase.freqresp! (oracle, full call incl. factorization)
V1 serial hand-rolled loop, original ldiv2! (loop-only)
V2 Threads.@spawn over contiguous frequency chunks
V2b Polyester.@batch over chunks
V2c OhMyThreads.@tasks with TaskLocalValue buffers
V3 serial loop, @turbo ldiv2! (real/imag split)
V4 @spawn + @turbo ldiv2!
V4b Polyester.@batch + @turbo ldiv2!

All variants operate on pre-factored matrices (A=F.H, C=complex.(sys.C*Q), B=Q\sys.B, D) so the "loop-only" timing isolates the per-frequency strategy from the one-time Hessenberg factorization. All pass isapprox(·, ·; rtol=1e-10) against the package baseline.

Results (24 logical cores)

Loop-only time, µs (lower is better):

Variant -t 4 -t 8 -t 24
V1 serial 545 566 565
V3 turbo 360 372 372
V2 threaded 147 90 106
V2b polyester 203 108 95
V2c omt 154 96 127
V4 threaded+turbo 104 78 101
V4b polyester+turbo 142 80 93

Full-call baseline (V0, factorization included): ~625–655 µs.

Findings

  1. The per-frequency loop, not the factorization, dominates. The serial loop is ~565 µs while the one-time Hessenberg factorization is only ~85 µs (full call ≈ loop + factorization ≈ 650 µs). Optimizing the loop is therefore worthwhile.

  2. @turbo is a solid ~1.5× win (565 → ~370 µs serial), more than expected for such short loops, with no threads and negligible accuracy loss (~1e-14 vs baseline).

  3. Threading gives ~6–7× but scales sub-linearly: at this size 8 threads beats 24 (150 freqs / 24 threads ≈ 6 freqs each — task overhead and oversubscription dominate). Sweet spot ~8 threads.

  4. Best overall is threading + @turbo (~78 µs at 8 threads, a ~7.2× speedup over serial). @spawn+@turbo (V4) is consistently near-best across thread counts; Polyester+@turbo (V4b) edges ahead only at 24 threads.

  5. Scheduler choice is secondary. @spawn, Polyester, and OhMyThreads are within ~1.4× of each other; @spawn is the most robust across thread counts and adds no dependency.

The @turbo / complex caveat

LoopVectorization's @turbo does not support Complex. A bare @turbo on the complex ldiv2! loops will not compile, and a naive reinterpret-to-real is silently wrong (complex multiply has re/im cross terms). The @turbo variant therefore carries the state as separate real/imag Float64 arrays and writes out the complex arithmetic by hand, so the two hot j = 1:k-2 loops (X update, u update) become valid real @turbo loops. Result matches the baseline to ~1e-14.

Recommendation

The lowest-risk improvement is dropping in the real/imag-split @turbo ldiv2! (≈1.5× for free, no new threading semantics). If the loop is a known hotspot, add an opt-in threaded path (@spawn over chunks with per-task buffers) — combined with @turbo it reaches ~7× — but keep the default serial, since threading helps only for many frequencies and benefits from a bounded thread count.

Caveats

Single-machine numbers (24 cores, JULIA_EXCLUSIVE=1); the threading sweet spot shifts with core count, and absolute times vary a few % with the random ssrand system.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions