Skip to content

MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue#50025

Open
jmestwa-coder wants to merge 1 commit into
apache:mainfrom
jmestwa-coder:stat-value-short-bound
Open

MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue#50025
jmestwa-coder wants to merge 1 commit into
apache:mainfrom
jmestwa-coder:stat-value-short-bound

Conversation

@jmestwa-coder
Copy link
Copy Markdown

FormatStatValue in cpp/src/parquet/types.cc memcpys sizeof(physical_type) bytes from a string_view that comes straight out of the Thrift statistics block, so a crafted file with min_value or max_value shorter than the column's width (for example, a zero-byte stat on an INT96 column) reads past the buffer. The check belongs in this callee since callers in printer.cc just forward whatever the metadata holds. New ASSERT_THROW coverage in types_test.cc exercises the short-value paths for BOOLEAN, INT32, INT64, FLOAT, DOUBLE, INT96, Decimal and Float16.

Comment thread cpp/src/parquet/types.cc
break;
case Type::BYTE_ARRAY:
case Type::FIXED_LEN_BYTE_ARRAY:
if (logical_type != nullptr && logical_type->is_float16()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not dealing with the length of FIXED_LEN_BYTE_ARRAY?

Comment thread cpp/src/parquet/types.cc
case Type::INT96:
required = 3 * sizeof(int32_t);
break;
case Type::BYTE_ARRAY:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to put BYTE_ARRAY and UNDEFINED together since they are skipped. In case you don't know, BYTE_ARRAY cannot have float16 logical type.

Comment thread cpp/src/parquet/types.cc
default:
break;
}
if (val.size() < required) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant to complicate FormatStatValue by introducing this check. FormatStatValue is usually called in the printer when we want to print some information of a parquet file for debugging purpuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants