Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 37 additions & 0 deletions cpp/src/parquet/types.cc
Original file line number Diff line number Diff line change
Expand Up @@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {

std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
const std::shared_ptr<const LogicalType>& logical_type) {
// Statistics blobs come from the file's Thrift-encoded metadata and may have
// arbitrary length under a malicious writer. Reject any value that is shorter
// than what the physical (and, for FLOAT16, logical) type requires so the
// memcpy/byte loads below cannot run past the buffer.
size_t required = 0;
switch (parquet_type) {
case Type::BOOLEAN:
required = sizeof(bool);
break;
case Type::INT32:
case Type::FLOAT:
required = 4;
break;
case Type::INT64:
case Type::DOUBLE:
required = 8;
break;
case Type::INT96:
required = 3 * sizeof(int32_t);
break;
case Type::BYTE_ARRAY:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to put BYTE_ARRAY and UNDEFINED together since they are skipped. In case you don't know, BYTE_ARRAY cannot have float16 logical type.

case Type::FIXED_LEN_BYTE_ARRAY:
if (logical_type != nullptr && logical_type->is_float16()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not dealing with the length of FIXED_LEN_BYTE_ARRAY?

required = 2;
}
break;
case Type::UNDEFINED:
default:
break;
}
if (val.size() < required) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant to complicate FormatStatValue by introducing this check. FormatStatValue is usually called in the printer when we want to print some information of a parquet file for debugging purpuse.

std::stringstream ss;
ss << "Statistics value of " << val.size() << " bytes is too small for "
<< TypeToString(parquet_type);
throw ParquetException(ss.str());
}

std::stringstream result;
const char* bytes = val.data();
switch (parquet_type) {
Expand Down
23 changes: 23 additions & 0 deletions cpp/src/parquet/types_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
#include <string>

#include "arrow/util/endian.h"
#include "parquet/exception.h"
#include "parquet/types.h"

namespace parquet {
Expand Down Expand Up @@ -179,6 +180,28 @@ TEST(TypePrinter, StatisticsTypes) {
FormatStatValue(Type::FIXED_LEN_BYTE_ARRAY, smin, LogicalType::Float16()));
}

TEST(TypePrinter, StatisticsTypesShortValue) {
// A maliciously crafted Parquet file may set a statistics min/max shorter
// than the column's physical type. FormatStatValue must reject it instead of
// memcpy'ing past the buffer.
ASSERT_THROW(FormatStatValue(Type::BOOLEAN, std::string()), ParquetException);
ASSERT_THROW(FormatStatValue(Type::INT32, std::string("abc")), ParquetException);
ASSERT_THROW(FormatStatValue(Type::INT64, std::string("abcdefg")), ParquetException);
ASSERT_THROW(FormatStatValue(Type::FLOAT, std::string()), ParquetException);
ASSERT_THROW(FormatStatValue(Type::DOUBLE, std::string("1234567")), ParquetException);
ASSERT_THROW(FormatStatValue(Type::INT96, std::string("12345678901")),
ParquetException);
ASSERT_THROW(FormatStatValue(Type::INT32, std::string("abc"),
LogicalType::Decimal(6, 2)),
ParquetException);
ASSERT_THROW(FormatStatValue(Type::INT64, std::string("abcdefg"),
LogicalType::Decimal(18, 4)),
ParquetException);
ASSERT_THROW(FormatStatValue(Type::FIXED_LEN_BYTE_ARRAY, std::string("a"),
LogicalType::Float16()),
ParquetException);
}

TEST(TestInt96Timestamp, Decoding) {
auto check = [](int32_t julian_day, uint64_t nanoseconds) {
#if ARROW_LITTLE_ENDIAN
Expand Down