Skip to content

feat(fuzz): add egfx_surface_state state-machine oracle#1337

Open
Greg Lamberson (glamberson) wants to merge 3 commits into
Devolutions:masterfrom
lamco-admin:feat/egfx-surface-state-fuzz
Open

feat(fuzz): add egfx_surface_state state-machine oracle#1337
Greg Lamberson (glamberson) wants to merge 3 commits into
Devolutions:masterfrom
lamco-admin:feat/egfx-surface-state-fuzz

Conversation

@glamberson
Copy link
Copy Markdown
Contributor

Summary

  • Implements target 6 of egfx: extend fuzz coverage to OpenH264-wrapper, ZGFX, multi-frame state, and encoder round-trip #1316 (egfx fuzz-coverage umbrella): the egfx_surface_state state-machine oracle. Sibling of egfx_multi_frame (PR feat(fuzz): add egfx_multi_frame oracle and target #1336) with two distinct properties.
  • Narrowed PDU set: a SurfaceLifecyclePdu enum that contains only the seven state-affecting variants (ResetGraphics, CreateSurface, DeleteSurface, MapSurfaceToOutput, StartFrame, EndFrame, FrameAcknowledge). Each iteration concentrates fuzz pressure on the surface and frame state machine rather than spreading across all 23 GfxPdu variants.
  • Parallel state-machine model: an ExpectedState struct mirrors the client's documented per-surface fields (id, width, height, pixel_format, is_mapped, output_origin_x/y) plus the total_frames_decoded counter. For each successfully-encoded PDU the model is updated to mirror the client's handle_* transitions; after dispatch the client's observable state via the public get_surface and total_frames_decoded getters is asserted against the model.

What this catches

A different bug class from egfx_multi_frame's panic / sanitizer oracle: logic bugs in state transitions that produce wrong but non-crashing observable state. Examples:

  • Client retains a surface after DeleteSurface (model removed it, client did not).
  • Client clobbers is_mapped on a surface that did not receive MapSurfaceToOutput.
  • Client fails to clear all surfaces on ResetGraphics.
  • Client increments total_frames_decoded on a PDU other than EndFrame.

The model encodes the implementation's current documented behavior (e.g., CreateSurface with width == 0 or height == 0 is silently skipped per handle_create_surface, total_frames_decoded is incremented unconditionally by handle_end_frame even when no matching StartFrame preceded it). The oracle catches drift from that behavior. Spec-compliance verification is a separate concern.

What this does NOT catch

State transitions that exist only in private fields without a public getter (current_frame_id, frames_queued). Wiring observable access to those fields is a separate egfx-public-API change, out of scope here.

Dependencies

This PR is stacked on PR #1336 (egfx_multi_frame, target 5) which is stacked on PR #1334 (Arbitrary cascade). CI on this branch reds until both ancestors merge or this branch rebases onto master post-merge. That is the correct mechanical signal of the dependency per the no-exclude-then-follow-up rule.

Validation

  • cargo xtask ci clean locally on 1.89.0: fmt, lints, tests, typos, locks all green.
  • check_egfx_surface_state regression-replay test passes against the seed-empty corpus entry.
  • 15-minute rigorous libFuzzer + ASan fuzz on the final SHA: 2,770,751 iterations in 901 seconds (~3,075 exec/s sustained), peak RSS 102 MB, zero panics, zero sanitizer reports, zero state-model assertion failures. Corpus grew to ~1,840 entries.

Notes

  • The narrowed SurfaceLifecyclePdu enum lives oracle-private inside the function (not exported from ironrdp-egfx), keeping the egfx public surface clean per the fallback shape from the design discussion on egfx: extend fuzz coverage to OpenH264-wrapper, ZGFX, multi-frame state, and encoder round-trip #1316.
  • The oracle uses as _ imports for Arbitrary and DvcProcessor per clippy's unused_trait_names lint, matching the egfx_multi_frame style.
  • The #[expect(clippy::panic, reason = ...)] and # Panics doc section match the precedent from bulk_round_trip for assertion-based oracles.

With this PR the #1316 umbrella has its initial six target slots filled (1 merged, 2+2b+3+5+6 open).

Refs #1316. Depends on #1334 and #1336.

…types

Extends the Arbitrary cascade from PR Devolutions#1272 (ironrdp-pdu) and PR Devolutions#1284
(extension to pdu types in ironrdp-pdu's egfx-related modules) to the
ironrdp-egfx crate's own public type surface.

Adds the arbitrary feature to ironrdp-egfx and adds the standard
#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))] gate
to every public PDU type in crates/ironrdp-egfx/src/pdu/:

- common.rs: Point, Color, PixelFormat
- avc.rs: Avc420BitmapStream, Avc444BitmapStream, Avc420Region
- cmd.rs: GfxPdu plus all 23 variant types, RawCapabilitySet,
  CapabilitySet, CapabilityVersion, Codec1Type, Codec2Type, QueueDepth,
  CacheEntryMetadata, and the six CapabilitiesV*Flags bitflags structs.

Three types use a manual Arbitrary impl rather than the derive because
their fields have implicit wire-bit-width constraints the derive cannot
express. derive(Arbitrary) on these types generates values in the full
underlying type range, which the encoder then panics on when packing
into the narrower wire range. Manual impls mask each field to its
spec-defined width:

- Timestamp (cmd.rs): milliseconds and hours into 10 bits each,
  seconds and minutes into 6 bits each. The encoder packs all four
  into a single u32 via set_bits.
- QuantQuality (avc.rs): quantization_parameter into 6 bits. Encoder
  packs into a u8 via set_bits(0..6).
- Encoding bitflag (avc.rs): masked to 2 bits. The bitflag's
  `const _ = !0` clause otherwise accepts any u8 value, but the
  encoder for Avc444BitmapStream packs encoding.bits() into bits
  30..32 of the stream-info field.

The arbitrary feature cascades through ironrdp-pdu/arbitrary so types
in egfx's variant payloads (ExclusiveRectangle, etc.) pick up their
existing Arbitrary impls. bitflags/arbitrary is activated so the
CapabilitiesV*Flags bitflag types get Arbitrary via the bitflags
crate's own derive support.

Adds the ironrdp-egfx/arbitrary case to the xtask feature matrix
(xtask/src/features.rs), mirroring the ironrdp-pdu/arbitrary entry,
so CI verifies the feature builds cleanly.

Default build (no arbitrary feature) is unchanged. Existing tests
pass. cargo xtask check features --case ironrdp-egfx/arbitrary clean.

This is the prerequisite cascade for the egfx_multi_frame fuzz target
(target 5 under Devolutions#1316), which will use Arbitrary-derived Vec<GfxPdu>
to drive a single decoder + surface-cache pair across each fuzz
iteration. The target itself follows in a separate PR. Per the Devolutions#1122
guidance ("either implement Arbitrary manually and reject invalid
values, or generate a raw value and then sanitize it"), the three
manual impls here follow the sanitize-via-mask shape so generated
values always round-trip through Encode.

Refs Devolutions#1316.
Implements target 5 of Devolutions#1316 (egfx fuzz-coverage umbrella): a
multi-frame oracle that drives a sequence of Arbitrary-derived GfxPdus
through a single GraphicsPipelineClient + surface-cache pair across
each fuzz iteration, exposing cross-PDU state corruption that
single-shot fuzzers cannot reach.

Harness shape (per the design note posted as issuecomment-4559459227
on Devolutions#1316). The oracle:

1. Constructs one GraphicsPipelineClient (no-op handler, no H.264
   decoder) at iteration start. The surface cache, state machine, and
   ZGFX decompressor history are owned by this single instance.
2. Initialises the channel by calling DvcProcessor::start so the
   client moves into its post-start state.
3. Uses arbitrary::Unstructured to derive Vec<GfxPdu> from the raw
   fuzz input. Each GfxPdu variant is Arbitrary via the cascade in
   PR Devolutions#1334.
4. For each PDU: encodes back to wire bytes via encode_vec, wraps in a
   single uncompressed ZGFX segment via wrap_uncompressed, and feeds
   the wire bytes to the client's public DvcProcessor::process entry
   point. Errors and panics propagate to libFuzzer naturally.

The harness deliberately routes through the public process() entry
rather than the private handle_pdu so the dispatch path exercises the
same ZGFX-decompress + GfxPdu-decode + per-variant-dispatch + state
machine + surface cache flow that production traffic takes. The
uncompressed ZGFX wrapper bypasses the ZGFX decoder layer (already
covered by egfx_zgfx_decompress in a separate PR) and concentrates
fuzz pressure on the dispatch + state machine.

What this catches: panics or sanitizer reports along the dispatch +
state machine path under adversarially-ordered PDU sequences;
inconsistent surface-cache state under attacker-controlled
CreateSurface / DeleteSurface / Map* orderings; corrupted frame-id
state from interleaved StartFrame / EndFrame / FrameAcknowledge
sequences; ZGFX-wrapper integration bugs separate from the standalone
ZGFX coverage.

What this does NOT catch: cross-frame H.264 decoder state corruption.
The client is constructed with h264_decoder: None, so H264-bearing
PDUs (WireToSurface1 with AVC codecs) skip the H.264 decoder. The
standalone egfx_avc420_decode and egfx_avc444_decode targets cover
the H.264 wrapper. Wiring a real (or mock) H.264 decoder into this
harness can be a follow-up if frame-to-frame H.264 state coverage
surfaces as a gap.

Bug discovery during development. The first smoke fuzz on the
in-progress harness surfaced a bit_field::set_bits panic in
Timestamp::encode where derive(Arbitrary) generated u8/u16 values
exceeding the encoder's 6-bit / 10-bit pack widths. The fix landed
in PR Devolutions#1334 (the cascade prerequisite) as manual Arbitrary impls
for Timestamp, QuantQuality, and the Encoding bitflag, mapping each
field to its wire-allowed range. This is the "sanitize via mask"
shape from Devolutions#1122's body.

Validation. 15-minute rigorous libFuzzer + ASan smoke fuzz on the
final SHA: 3,659,157 iterations in 901 seconds (~4,061 exec/s
sustained), peak RSS 96 MB, zero panics, zero sanitizer reports,
zero OOMs. Corpus grew to ~4,558 entries via libFuzzer's coverage
discovery. The regression-replay test check_egfx_multi_frame passes
against the seed-empty corpus entry.

Depends on PR Devolutions#1334 (Arbitrary cascade across ironrdp-egfx public
PDU types). CI on this PR's branch reds until Devolutions#1334 merges or this
PR's branch rebases onto master post-merge. That's the correct
mechanical signal of the dependency per the no-exclude-then-follow-up
rule.

Refs Devolutions#1316.
Implements target 6 of Devolutions#1316 (egfx fuzz-coverage umbrella): the
egfx_surface_state oracle. Sibling of egfx_multi_frame (target 5, PR
Devolutions#1336) with two distinct properties:

1. The PDU stream is narrowed to a SurfaceLifecyclePdu enum that
   contains only the seven state-affecting variants: ResetGraphics,
   CreateSurface, DeleteSurface, MapSurfaceToOutput, StartFrame,
   EndFrame, FrameAcknowledge. Each iteration concentrates fuzz
   pressure on the surface and frame state machine rather than
   spreading across all 23 GfxPdu variants.

2. The oracle maintains a parallel ExpectedState model alongside the
   client. For each successfully-encoded PDU the model is updated to
   mirror the client's documented state transition (CreateSurface
   inserts a surface unless width or height is zero; DeleteSurface
   removes; MapSurfaceToOutput updates is_mapped and origin;
   ResetGraphics clears all surfaces; EndFrame increments the running
   frame counter). After dispatch the client's observable state via
   the public get_surface and total_frames_decoded getters is asserted
   against the model.

This catches a different bug class from egfx_multi_frame's panic /
sanitizer oracle: logic bugs in state transitions that produce wrong
but non-crashing observable state. Examples covered: client retains
a surface after DeleteSurface, client clobbers is_mapped on a surface
that did not receive MapSurfaceToOutput, client fails to clear all
surfaces on ResetGraphics, client increments total_frames_decoded on
a PDU other than EndFrame.

The model encodes the implementation's current documented behavior
(e.g., zero-dimension CreateSurface is silently skipped per
handle_create_surface), so the oracle catches drift from that
behavior. Spec-compliance verification is a separate concern that
would require a parallel spec-only model.

What this does NOT catch: state transitions that exist only in
private fields without a public getter (current_frame_id,
frames_queued). Exposing observable access to those fields is a
separate egfx-public-API change, out of scope here.

Validation. 15-minute rigorous libFuzzer + ASan smoke fuzz on the
final SHA: 2,770,751 iterations in 901 seconds (~3,075 exec/s
sustained), peak RSS 102 MB, zero panics, zero sanitizer reports,
zero state-model assertion failures. Corpus grew to ~1,840 entries
via libFuzzer's coverage discovery. The check_egfx_surface_state
regression-replay test passes against the seed-empty corpus entry.

Stack dependency. This PR is stacked on PR Devolutions#1336 (egfx_multi_frame,
target 5) which is stacked on PR Devolutions#1334 (Arbitrary cascade). CI on
this branch will red until both ancestors merge or this branch
rebases onto master post-merge.

With this PR the Devolutions#1316 umbrella has its initial six target slots
filled (1 merged, 2+2b+3+5+6 open).

Refs Devolutions#1316. Depends on Devolutions#1334 and Devolutions#1336.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant