Skip to content

pdf-object: implement CCITTFaxDecode filter#121

Open
Velli20 wants to merge 1 commit intomainfrom
ccitt-fax-filter
Open

pdf-object: implement CCITTFaxDecode filter#121
Velli20 wants to merge 1 commit intomainfrom
ccitt-fax-filter

Conversation

@Velli20
Copy link
Copy Markdown
Owner

@Velli20 Velli20 commented Mar 28, 2026

Add pub mod ccitt with a pure-Rust CCITTFaxDecode decoder supporting all three encoding modes: Group 3 1D (K=0), Group 3 2D (K>0), and Group 4 / MMR (K<0). Wire it into StreamObject::data() with a ccitt_params_for_filter() helper that reads /DecodeParms for both single-filter and multi-filter streams.

The implementation embeds Huffman tables from ITU-T T.4 Appendix A (WHITE/BLACK terminating and make-up codes, common extended make-up codes) and a T.6 mode-code decoder (Pass, Horizontal, Vertical ±0–3). CCITTFaxParams is parsed directly from PDF dictionary entries without requiring an ObjectResolver.

Includes 14 unit tests covering params defaults, bit reader, Huffman run-length decoding, EOL handling, 1D rows (all-white, all-black, mixed), 2D Group 4 rows (V0 / Horizontal modes, all-white, row limit), and row packing with both polarities.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CCITTFaxDecode support to pdf-object by introducing a pure-Rust CCITT (Group 3/4) decoder and wiring it into stream filter decoding so CCITT-compressed streams can be decompressed via StreamObject::data().

Changes:

  • Added ccitt module implementing CCITTFaxDecode decoding (Group 3 1D, Group 3 2D, Group 4/MMR) with unit tests.
  • Wired Filter::CCITTFaxDecode into StreamObject::data() and added a /DecodeParms extraction helper.
  • Implemented Filter::decode_ccitt_fax(.., params) as a thin wrapper over the new decoder.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
crates/pdf-object/src/stream.rs Adds CCITTFaxDecode handling in StreamObject::data() and parses /DecodeParms per filter index.
crates/pdf-object/src/lib.rs Exposes the new ccitt module publicly.
crates/pdf-object/src/filter.rs Implements CCITTFaxDecode decode entrypoint and documents its behavior.
crates/pdf-object/src/ccitt.rs New CCITT decoder implementation + parameter parsing + unit tests.
Comments suppressed due to low confidence (1)

crates/pdf-object/src/stream.rs:99

  • StreamObject::data() documents that it errors on unsupported filter chains, but the default match arm prints to stdout and returns partially decoded data. This can mask failures and produce incorrect output for callers. Consider returning an ObjectError for Filter::Unsupported(_) / unknown filters instead of println! + break.
                _ => {
                    println!(
                        "Unsupported filter in data_with_remaining_filter: {:?}",
                        filter
                    );
                    break;
                }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add `pub mod ccitt` with a pure-Rust CCITTFaxDecode decoder supporting
all three encoding modes: Group 3 1D (K=0), Group 3 2D (K>0), and
Group 4 / MMR (K<0). Wire it into `StreamObject::data()` with a
`ccitt_params_for_filter()` helper that reads `/DecodeParms` for both
single-filter and multi-filter streams.

The implementation embeds Huffman tables from ITU-T T.4 Appendix A
(WHITE/BLACK terminating and make-up codes, common extended make-up
codes) and a T.6 mode-code decoder (Pass, Horizontal, Vertical ±0–3).
`CCITTFaxParams` is parsed directly from PDF dictionary entries without
requiring an `ObjectResolver`.

Includes 14 unit tests covering params defaults, bit reader, Huffman
run-length decoding, EOL handling, 1D rows (all-white, all-black,
mixed), 2D Group 4 rows (V0 / Horizontal modes, all-white, row limit),
and row packing with both polarities.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +723 to +733
loop {
if rows > 0 && decoded_rows >= rows {
break;
}

// Skip any EOL marker preceding the row (no-op when none present).
skip_eol(&mut reader);

if reader.exhausted() {
break;
}
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CCITTFaxParams::end_of_block is never consulted during decoding. For Group 3/4 streams with rows = 0 (the default), this will treat the RTC/EOFB terminator as image data and can emit a spurious final row (often all-white) or stop mid-row depending on bit patterns. Please add explicit RTC (Group 3) / EOFB (Group 4) detection when end_of_block is true and stop decoding before appending another row; also add unit tests that cover termination behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +728 to +730
// Skip any EOL marker preceding the row (no-op when none present).
skip_eol(&mut reader);

Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip_eol(&mut reader) is called unconditionally before every row, even when params.end_of_line is false (and for Group 4, which normally has no EOLs). This can mis-synchronize decoding if a data bit sequence happens to match the EOL pattern. Consider only consuming a leading EOL when the encoding mode/params require it (e.g., end_of_line for Group 3 1D, mandatory row EOL for Group 3 2D) and avoid scanning for EOLs in Group 4 unless you are explicitly checking for EOFB.

Copilot uses AI. Check for mistakes.
Comment on lines +106 to +115
/// Extract [`CCITTFaxParams`][crate::ccitt::CCITTFaxParams] for the filter at
/// `filter_idx` from the stream's `/DecodeParms` dictionary entry.
///
/// Per PDF spec §7.3.8.2, `/DecodeParms` is either a single dictionary (when
/// there is one filter) or an array of dictionaries (one per filter). Values
/// are always inline objects, so no object resolver is needed.
fn ccitt_params_for_filter(dict: &Dictionary, filter_idx: usize) -> crate::ccitt::CCITTFaxParams {
match dict.get("DecodeParms") {
Some(ObjectVariant::Dictionary(d)) => crate::ccitt::CCITTFaxParams::from_dictionary(d),
Some(ObjectVariant::Array(arr)) => arr
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper assumes /DecodeParms values are always inline objects (and ignores ObjectVariant::Reference). In PDF, /DecodeParms can legally be an indirect reference, so this will silently fall back to defaults and potentially decode CCITT data with the wrong parameters (K/Columns/BlackIs1/etc.). Either resolve references here (which likely requires threading an ObjectResolver into StreamObject::data) or detect references and return a clear DecompressionError/update the doc comment to state indirect /DecodeParms is unsupported.

Copilot uses AI. Check for mistakes.
Comment on lines +192 to +211
/// Decodes CCITTFaxDecode (Group 3 / Group 4 fax) compressed stream data.
///
/// # Parameters
///
/// - `stream_data`: The compressed byte stream to decode.
/// - `params`: Decode parameters parsed from the stream's `/DecodeParms` dictionary.
///
/// # Returns
///
/// The decompressed image data as a packed MSB-first 1-bit-per-pixel `Vec<u8>`.
///
/// # Errors
///
/// Returns [`ObjectError::DecompressionError`] if the stream is truncated or
/// contains an invalid bit pattern.
pub fn decode_ccitt_fax(
stream_data: &[u8],
params: &crate::ccitt::CCITTFaxParams,
) -> Result<Vec<u8>, ObjectError> {
crate::ccitt::decode(stream_data, params)
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decode_ccitt_fax docstring says it returns DecompressionError on truncated/invalid bit patterns, but crate::ccitt::decode() currently only errors for columns == 0 and otherwise decodes best-effort (including on truncation). Please align the documentation with the actual behavior (or change the decoder to return errors as documented).

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +38
/// Whether a block terminator (EOFB / RTC) is present. Default: `true`.
pub end_of_block: bool,
/// If `true`, black = 1 and white = 0. Default: `false` (white = 1).
pub black_is1: bool,
/// Tolerated number of damaged rows before returning an error. Default: `0`.
pub damaged_rows_before_error: u32,
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

damaged_rows_before_error is parsed from /DecodeParms but never used by the decoder. This is misleading (callers may assume it has an effect). Either implement damaged-row tracking/erroring per the PDF spec, or remove the field / mark it as currently ignored in the docs.

Copilot uses AI. Check for mistakes.
Comment on lines +787 to +817
// ─── Tests ───────────────────────────────────────────────────────────────────

#[cfg(test)]
mod tests {
use super::*;

fn params_1d(columns: u32, rows: u32) -> CCITTFaxParams {
CCITTFaxParams {
k: 0,
columns,
rows,
end_of_line: false,
encoded_byte_align: false,
end_of_block: true,
black_is1: false,
damaged_rows_before_error: 0,
}
}

fn params_g4(columns: u32, rows: u32) -> CCITTFaxParams {
CCITTFaxParams {
k: -1,
columns,
rows,
end_of_line: false,
encoded_byte_align: false,
end_of_block: true,
black_is1: false,
damaged_rows_before_error: 0,
}
}
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says the unit tests cover EOL handling, mixed 1D rows, and Group 4 2D modes (V0/Horizontal, row limit), but the tests in this file currently only cover fill_bits/find_bit basics, a few simple 1D cases, and a Group 4 empty-stream case (no 2D mode coverage). Either expand the tests to match the stated coverage (especially around 2D decoding and end-of-block handling) or update the PR description to reflect the current test scope.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants