Conversation
There was a problem hiding this comment.
Pull request overview
Adds CCITTFaxDecode support to pdf-object by introducing a pure-Rust CCITT (Group 3/4) decoder and wiring it into stream filter decoding so CCITT-compressed streams can be decompressed via StreamObject::data().
Changes:
- Added
ccittmodule implementing CCITTFaxDecode decoding (Group 3 1D, Group 3 2D, Group 4/MMR) with unit tests. - Wired
Filter::CCITTFaxDecodeintoStreamObject::data()and added a/DecodeParmsextraction helper. - Implemented
Filter::decode_ccitt_fax(.., params)as a thin wrapper over the new decoder.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| crates/pdf-object/src/stream.rs | Adds CCITTFaxDecode handling in StreamObject::data() and parses /DecodeParms per filter index. |
| crates/pdf-object/src/lib.rs | Exposes the new ccitt module publicly. |
| crates/pdf-object/src/filter.rs | Implements CCITTFaxDecode decode entrypoint and documents its behavior. |
| crates/pdf-object/src/ccitt.rs | New CCITT decoder implementation + parameter parsing + unit tests. |
Comments suppressed due to low confidence (1)
crates/pdf-object/src/stream.rs:99
StreamObject::data()documents that it errors on unsupported filter chains, but the default match arm prints to stdout and returns partially decoded data. This can mask failures and produce incorrect output for callers. Consider returning anObjectErrorforFilter::Unsupported(_)/ unknown filters instead ofprintln!+break.
_ => {
println!(
"Unsupported filter in data_with_remaining_filter: {:?}",
filter
);
break;
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1dfe18b to
56fd2a7
Compare
Add `pub mod ccitt` with a pure-Rust CCITTFaxDecode decoder supporting all three encoding modes: Group 3 1D (K=0), Group 3 2D (K>0), and Group 4 / MMR (K<0). Wire it into `StreamObject::data()` with a `ccitt_params_for_filter()` helper that reads `/DecodeParms` for both single-filter and multi-filter streams. The implementation embeds Huffman tables from ITU-T T.4 Appendix A (WHITE/BLACK terminating and make-up codes, common extended make-up codes) and a T.6 mode-code decoder (Pass, Horizontal, Vertical ±0–3). `CCITTFaxParams` is parsed directly from PDF dictionary entries without requiring an `ObjectResolver`. Includes 14 unit tests covering params defaults, bit reader, Huffman run-length decoding, EOL handling, 1D rows (all-white, all-black, mixed), 2D Group 4 rows (V0 / Horizontal modes, all-white, row limit), and row packing with both polarities. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f001490 to
68f1211
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| loop { | ||
| if rows > 0 && decoded_rows >= rows { | ||
| break; | ||
| } | ||
|
|
||
| // Skip any EOL marker preceding the row (no-op when none present). | ||
| skip_eol(&mut reader); | ||
|
|
||
| if reader.exhausted() { | ||
| break; | ||
| } |
There was a problem hiding this comment.
CCITTFaxParams::end_of_block is never consulted during decoding. For Group 3/4 streams with rows = 0 (the default), this will treat the RTC/EOFB terminator as image data and can emit a spurious final row (often all-white) or stop mid-row depending on bit patterns. Please add explicit RTC (Group 3) / EOFB (Group 4) detection when end_of_block is true and stop decoding before appending another row; also add unit tests that cover termination behavior.
| // Skip any EOL marker preceding the row (no-op when none present). | ||
| skip_eol(&mut reader); | ||
|
|
There was a problem hiding this comment.
skip_eol(&mut reader) is called unconditionally before every row, even when params.end_of_line is false (and for Group 4, which normally has no EOLs). This can mis-synchronize decoding if a data bit sequence happens to match the EOL pattern. Consider only consuming a leading EOL when the encoding mode/params require it (e.g., end_of_line for Group 3 1D, mandatory row EOL for Group 3 2D) and avoid scanning for EOLs in Group 4 unless you are explicitly checking for EOFB.
| /// Extract [`CCITTFaxParams`][crate::ccitt::CCITTFaxParams] for the filter at | ||
| /// `filter_idx` from the stream's `/DecodeParms` dictionary entry. | ||
| /// | ||
| /// Per PDF spec §7.3.8.2, `/DecodeParms` is either a single dictionary (when | ||
| /// there is one filter) or an array of dictionaries (one per filter). Values | ||
| /// are always inline objects, so no object resolver is needed. | ||
| fn ccitt_params_for_filter(dict: &Dictionary, filter_idx: usize) -> crate::ccitt::CCITTFaxParams { | ||
| match dict.get("DecodeParms") { | ||
| Some(ObjectVariant::Dictionary(d)) => crate::ccitt::CCITTFaxParams::from_dictionary(d), | ||
| Some(ObjectVariant::Array(arr)) => arr |
There was a problem hiding this comment.
The helper assumes /DecodeParms values are always inline objects (and ignores ObjectVariant::Reference). In PDF, /DecodeParms can legally be an indirect reference, so this will silently fall back to defaults and potentially decode CCITT data with the wrong parameters (K/Columns/BlackIs1/etc.). Either resolve references here (which likely requires threading an ObjectResolver into StreamObject::data) or detect references and return a clear DecompressionError/update the doc comment to state indirect /DecodeParms is unsupported.
| /// Decodes CCITTFaxDecode (Group 3 / Group 4 fax) compressed stream data. | ||
| /// | ||
| /// # Parameters | ||
| /// | ||
| /// - `stream_data`: The compressed byte stream to decode. | ||
| /// - `params`: Decode parameters parsed from the stream's `/DecodeParms` dictionary. | ||
| /// | ||
| /// # Returns | ||
| /// | ||
| /// The decompressed image data as a packed MSB-first 1-bit-per-pixel `Vec<u8>`. | ||
| /// | ||
| /// # Errors | ||
| /// | ||
| /// Returns [`ObjectError::DecompressionError`] if the stream is truncated or | ||
| /// contains an invalid bit pattern. | ||
| pub fn decode_ccitt_fax( | ||
| stream_data: &[u8], | ||
| params: &crate::ccitt::CCITTFaxParams, | ||
| ) -> Result<Vec<u8>, ObjectError> { | ||
| crate::ccitt::decode(stream_data, params) |
There was a problem hiding this comment.
The decode_ccitt_fax docstring says it returns DecompressionError on truncated/invalid bit patterns, but crate::ccitt::decode() currently only errors for columns == 0 and otherwise decodes best-effort (including on truncation). Please align the documentation with the actual behavior (or change the decoder to return errors as documented).
| /// Whether a block terminator (EOFB / RTC) is present. Default: `true`. | ||
| pub end_of_block: bool, | ||
| /// If `true`, black = 1 and white = 0. Default: `false` (white = 1). | ||
| pub black_is1: bool, | ||
| /// Tolerated number of damaged rows before returning an error. Default: `0`. | ||
| pub damaged_rows_before_error: u32, |
There was a problem hiding this comment.
damaged_rows_before_error is parsed from /DecodeParms but never used by the decoder. This is misleading (callers may assume it has an effect). Either implement damaged-row tracking/erroring per the PDF spec, or remove the field / mark it as currently ignored in the docs.
| // ─── Tests ─────────────────────────────────────────────────────────────────── | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| use super::*; | ||
|
|
||
| fn params_1d(columns: u32, rows: u32) -> CCITTFaxParams { | ||
| CCITTFaxParams { | ||
| k: 0, | ||
| columns, | ||
| rows, | ||
| end_of_line: false, | ||
| encoded_byte_align: false, | ||
| end_of_block: true, | ||
| black_is1: false, | ||
| damaged_rows_before_error: 0, | ||
| } | ||
| } | ||
|
|
||
| fn params_g4(columns: u32, rows: u32) -> CCITTFaxParams { | ||
| CCITTFaxParams { | ||
| k: -1, | ||
| columns, | ||
| rows, | ||
| end_of_line: false, | ||
| encoded_byte_align: false, | ||
| end_of_block: true, | ||
| black_is1: false, | ||
| damaged_rows_before_error: 0, | ||
| } | ||
| } |
There was a problem hiding this comment.
PR description says the unit tests cover EOL handling, mixed 1D rows, and Group 4 2D modes (V0/Horizontal, row limit), but the tests in this file currently only cover fill_bits/find_bit basics, a few simple 1D cases, and a Group 4 empty-stream case (no 2D mode coverage). Either expand the tests to match the stated coverage (especially around 2D decoding and end-of-block handling) or update the PR description to reflect the current test scope.
Add
pub mod ccittwith a pure-Rust CCITTFaxDecode decoder supporting all three encoding modes: Group 3 1D (K=0), Group 3 2D (K>0), and Group 4 / MMR (K<0). Wire it intoStreamObject::data()with accitt_params_for_filter()helper that reads/DecodeParmsfor both single-filter and multi-filter streams.The implementation embeds Huffman tables from ITU-T T.4 Appendix A (WHITE/BLACK terminating and make-up codes, common extended make-up codes) and a T.6 mode-code decoder (Pass, Horizontal, Vertical ±0–3).
CCITTFaxParamsis parsed directly from PDF dictionary entries without requiring anObjectResolver.Includes 14 unit tests covering params defaults, bit reader, Huffman run-length decoding, EOL handling, 1D rows (all-white, all-black, mixed), 2D Group 4 rows (V0 / Horizontal modes, all-white, row limit), and row packing with both polarities.