Benchmark: freqresp! optimization strategies (@turbo, threading) for SISO 48-state x 150 freqs

## Benchmarking `freqresp!` optimization strategies

Benchmark of strategies for maximizing performance of `freqresp!` on a representative use case: a **48-state SISO continuous state-space system evaluated over 150 frequencies**. A self-contained benchmark environment was used; the package's own `freqresp!` serves as the correctness oracle and no package source was modified for the comparison.

### Variants

| ID  | Strategy |
|-----|----------|
| V0  | `ControlSystemsBase.freqresp!` (oracle, full call incl. factorization) |
| V1  | serial hand-rolled loop, original `ldiv2!` (loop-only) |
| V2  | `Threads.@spawn` over contiguous frequency chunks |
| V2b | `Polyester.@batch` over chunks |
| V2c | `OhMyThreads.@tasks` with `TaskLocalValue` buffers |
| V3  | serial loop, `@turbo` `ldiv2!` (real/imag split) |
| V4  | `@spawn` + `@turbo` `ldiv2!` |
| V4b | `Polyester.@batch` + `@turbo` `ldiv2!` |

All variants operate on pre-factored matrices (`A=F.H`, `C=complex.(sys.C*Q)`, `B=Q\sys.B`, `D`) so the "loop-only" timing isolates the per-frequency strategy from the one-time Hessenberg factorization. All pass `isapprox(·, ·; rtol=1e-10)` against the package baseline.

### Results (24 logical cores)

Loop-only time, µs (lower is better):

| Variant            | -t 4  | -t 8  | -t 24 |
|--------------------|-------|-------|-------|
| V1 serial          | 545   | 566   | 565   |
| V3 turbo           | 360   | 372   | 372   |
| V2 threaded        | 147   |  90   | 106   |
| V2b polyester      | 203   | 108   |  95   |
| V2c omt            | 154   |  96   | 127   |
| V4 threaded+turbo  | **104** | **78** | 101 |
| V4b polyester+turbo| 142   |  80   | **93** |

Full-call baseline (V0, factorization included): ~625–655 µs.

### Findings

1. **The per-frequency loop, not the factorization, dominates.** The serial loop is ~565 µs while the one-time Hessenberg factorization is only ~85 µs (full call ≈ loop + factorization ≈ 650 µs). Optimizing the loop is therefore worthwhile.

2. **`@turbo` is a solid ~1.5× win** (565 → ~370 µs serial), more than expected for such short loops, with no threads and negligible accuracy loss (~1e-14 vs baseline).

3. **Threading gives ~6–7×** but scales sub-linearly: at this size **8 threads beats 24** (150 freqs / 24 threads ≈ 6 freqs each — task overhead and oversubscription dominate). Sweet spot ~8 threads.

4. **Best overall is threading + `@turbo`** (~78 µs at 8 threads, a **~7.2×** speedup over serial). `@spawn`+`@turbo` (V4) is consistently near-best across thread counts; `Polyester`+`@turbo` (V4b) edges ahead only at 24 threads.

5. **Scheduler choice is secondary.** `@spawn`, `Polyester`, and `OhMyThreads` are within ~1.4× of each other; `@spawn` is the most robust across thread counts and adds no dependency.

### The `@turbo` / complex caveat

LoopVectorization's `@turbo` does **not** support `Complex`. A bare `@turbo` on the complex `ldiv2!` loops will not compile, and a naive `reinterpret`-to-real is *silently wrong* (complex multiply has re/im cross terms). The `@turbo` variant therefore carries the state as separate real/imag `Float64` arrays and writes out the complex arithmetic by hand, so the two hot `j = 1:k-2` loops (X update, u update) become valid real `@turbo` loops. Result matches the baseline to ~1e-14.

### Recommendation

The lowest-risk improvement is dropping in the real/imag-split `@turbo` `ldiv2!` (≈1.5× for free, no new threading semantics). If the loop is a known hotspot, add an opt-in threaded path (`@spawn` over chunks with per-task buffers) — combined with `@turbo` it reaches ~7× — but keep the default serial, since threading helps only for many frequencies and benefits from a bounded thread count.

### Caveats

Single-machine numbers (24 cores, `JULIA_EXCLUSIVE=1`); the threading sweet spot shifts with core count, and absolute times vary a few % with the random `ssrand` system.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark: freqresp! optimization strategies (@turbo, threading) for SISO 48-state x 150 freqs #1062

Benchmarking `freqresp!` optimization strategies

Variants

Results (24 logical cores)

Findings

The `@turbo` / complex caveat

Recommendation

Caveats

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ID	Strategy
V0	`ControlSystemsBase.freqresp!` (oracle, full call incl. factorization)
V1	serial hand-rolled loop, original `ldiv2!` (loop-only)
V2	`Threads.@spawn` over contiguous frequency chunks
V2b	`Polyester.@batch` over chunks
V2c	`OhMyThreads.@tasks` with `TaskLocalValue` buffers
V3	serial loop, `@turbo` `ldiv2!` (real/imag split)
V4	`@spawn` + `@turbo` `ldiv2!`
V4b	`Polyester.@batch` + `@turbo` `ldiv2!`

Variant	-t 4	-t 8	-t 24
V1 serial	545	566	565
V3 turbo	360	372	372
V2 threaded	147	90	106
V2b polyester	203	108	95
V2c omt	154	96	127
V4 threaded+turbo	104	78	101
V4b polyester+turbo	142	80	93

Benchmark: freqresp! optimization strategies (@turbo, threading) for SISO 48-state x 150 freqs #1062

Description

Benchmarking freqresp! optimization strategies

Variants

Results (24 logical cores)

Findings

The @turbo / complex caveat

Recommendation

Caveats

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Benchmarking `freqresp!` optimization strategies

The `@turbo` / complex caveat