
[BuildsOnTop Of Julien and Vinoo's work] [WIP] ALP pseudo decimal encoding#3444

Draft
prtkgaur wants to merge 24 commits into apache:master from prtkgaur:gh540-alp-pseudoDecimal-encoding

Conversation

@prtkgaur


Add the foundational constants class for ALP encoding:
- 7-byte page header matching C++ AlpHeader wire format
- Float/double power-of-ten lookup tables
- Sampler constants (vector sizes, combination limits)
- Encoding limits and magic numbers for fast rounding
- Per-vector metadata sizes (AlpInfo, ForInfo)
- integerPow10() helper and validateVectorSize()

Add the core ALP encoding/decoding primitives:
- Per-value float/double encode and decode
- Fast rounding via magic number trick (matching C++)
- Exception detection (NaN, Inf, -0.0, round-trip failure)
- Bit width calculation for int and long
- Bit packed size calculation
- Best (exponent, factor) parameter search (full and preset-based)
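
The "magic number" fast-rounding trick named above can be sketched in isolation. Adding and subtracting 2^51 + 2^52 forces a double's fraction bits to hold the rounded integer (round-half-to-even); whether the PR uses exactly this constant is an assumption here, and the class name is illustrative:

```java
public class AlpFastRound {
    // 2^51 + 2^52: adding this to a double in roughly the +/-2^51 range forces
    // the FPU to discard all fraction bits (rounding half-to-even), so the
    // rounded integer is recovered by subtracting the constant again.
    static final double MAGIC_DOUBLE = 6755399441055744.0; // 2^51 + 2^52

    static long fastRound(double d) {
        return (long) ((d + MAGIC_DOUBLE) - MAGIC_DOUBLE);
    }
}
```

This avoids a call to Math.round in the hot encode loop; note the half-to-even behavior differs from Math.round's half-up.
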

Implements the core ALP compression pipeline for individual vectors:
encode floats/doubles to integers, apply Frame of Reference, bit-pack.
Decompression reverses this with exception patching.

Includes FloatCompressedVector/DoubleCompressedVector with store/load
serialization matching the C++ wire format, and AlpEncodingPreset for
preset-based encoding parameter selection.
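
The Frame of Reference and bit-width steps of the pipeline above can be sketched as follows; the method names are illustrative, not the PR's actual classes:

```java
public class AlpForSketch {
    // Frame of Reference: subtract the vector minimum so all deltas are
    // non-negative and fit in the fewest bits possible.
    static long[] frameOfReference(long[] encoded) {
        long min = Long.MAX_VALUE;
        for (long v : encoded) min = Math.min(min, v);
        long[] deltas = new long[encoded.length];
        for (int i = 0; i < encoded.length; i++) deltas[i] = encoded[i] - min;
        return deltas;
    }

    // Smallest bit width that can represent every (non-negative) delta.
    static int bitWidth(long[] deltas) {
        long max = 0;
        for (long d : deltas) max = Math.max(max, d);
        return 64 - Long.numberOfLeadingZeros(max | 1);
    }
}
```
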

Implements FloatSampler and DoubleSampler that collect representative
samples from input data and generate AlpEncodingPreset with the best
(exponent, factor) combinations. Matches C++ AlpSampler behavior:
equidistant vector sampling, compressed-size estimation, and top-k
combination selection.
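
The equidistant sampling step can be sketched as picking a fixed number of evenly spaced vector indices; the stride arithmetic below is a hypothetical illustration, not the AlpSampler's exact formula:

```java
public class AlpSamplerSketch {
    // Choose up to maxSamples vector indices spread evenly across the input,
    // so the sample reflects data from the whole column chunk.
    static int[] equidistantIndices(int numVectors, int maxSamples) {
        int n = Math.min(numVectors, maxSamples);
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (int) ((long) i * numVectors / n);
        }
        return idx;
    }
}
```
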

Public API for encoding/decoding full ALP pages. Layout:
[Header(7B)][Offsets...][Vector0][Vector1]...
Header: [compression_mode(1B)][integer_encoding(1B)][log_vector_size(1B)][num_elements(4B)]
Supports sampling presets, max size estimation, and multi-vector pages.
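
The 7-byte header above can be serialized with a ByteBuffer. Byte order is assumed little-endian here to match a typical C++ wire format (an assumption, not stated in the PR text), and the class name is illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AlpHeaderSketch {
    static final int HEADER_SIZE = 7;

    // [compression_mode(1B)][integer_encoding(1B)][log_vector_size(1B)][num_elements(4B)]
    static byte[] writeHeader(int mode, int intEncoding, int logVectorSize, int numElements) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) mode).put((byte) intEncoding).put((byte) logVectorSize);
        buf.putInt(numElements);
        return buf.array();
    }

    static int readNumElements(byte[] header) {
        return ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt(3);
    }
}
```
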

Extends ValuesWriter with FloatAlpValuesWriter and DoubleAlpValuesWriter.
Buffers values into 1024-element vectors, compresses each when full,
and assembles the ALP page format on getBytes(). Sampling is done on
the first batch of values to generate the encoding preset.

Uses Encoding.PLAIN as placeholder until ALP is added to parquet-format.

Implements AlpValuesReader (abstract), AlpValuesReaderForFloat, and
AlpValuesReaderForDouble. Uses lazy per-vector decoding: initFromPage
reads only the header and offset array, vectors are decoded on first
access. skip() is O(1) with no decoding.
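
The lazy scheme described above amounts to: keep the offsets from initFromPage, decode a vector only when a value inside it is first requested, and let skip() just move a cursor. A minimal sketch with illustrative names and a stand-in decode:

```java
public class LazyVectorReaderSketch {
    private final int vectorSize;
    private final double[][] cache; // decoded vectors, filled on demand
    private int position = 0;
    int decodeCount = 0;            // exposed only so the sketch is testable

    LazyVectorReaderSketch(int vectorSize, int numVectors) {
        this.vectorSize = vectorSize;
        this.cache = new double[numVectors][];
    }

    // Stand-in for real ALP vector decompression from the page buffer.
    private double[] decodeVector(int v) {
        decodeCount++;
        double[] out = new double[vectorSize];
        for (int i = 0; i < vectorSize; i++) out[i] = v * vectorSize + i;
        return out;
    }

    double readDouble() {
        int v = position / vectorSize;
        if (cache[v] == null) cache[v] = decodeVector(v); // decode on first access
        return cache[v][position++ % vectorSize];
    }

    void skip(int n) {
        position += n; // O(1): no decoding happens here
    }
}
```

Skipped-over vectors are never decoded, which is what makes skip() O(1).
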

Add AlpCrossImplTest with 7 test cases that decode C++ reference blobs
and verify bit-identical output. Reference blobs were generated by the
C++ Arrow ALP implementation via generate_reference_blobs.cc.

Fix encode/decode math to use two-step multiplication matching C++:
- Encode: value * 10^exponent * 10^(-factor)
- Decode: encoded * 10^factor * 10^(-exponent)

The previous single-operation approach (value / (10^e / 10^f)) produced
1-ULP differences due to different intermediate floating-point rounding.
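
The order matters because each floating-point multiply rounds once; folding the powers into a single divisor rounds at a different intermediate value and can shift the result by one ULP. A sketch of the corrected two-step order (the pow10 helpers stand in for the PR's lookup tables):

```java
public class AlpTwoStepSketch {
    // Stand-ins for the PR's DOUBLE_POW10 / DOUBLE_POW10_NEGATIVE tables.
    static double pow10(int i) { return Math.pow(10, i); }
    static double pow10Neg(int i) { return Math.pow(10, -i); }

    // Encode: value * 10^exponent * 10^(-factor), two separate multiplies.
    static long encode(double value, int exponent, int factor) {
        return Math.round(value * pow10(exponent) * pow10Neg(factor));
    }

    // Decode: encoded * 10^factor * 10^(-exponent), mirroring the C++ order.
    static double decode(long encoded, int exponent, int factor) {
        return encoded * pow10(factor) * pow10Neg(exponent);
    }
}
```
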

Benchmark measuring ALP encode and decode throughput across 4 data
patterns (decimal, integer, constant, mixed with specials) for both
float and double types. Reports compression ratios at startup.

Uses carrotsearch JUnit Benchmarks framework matching existing encoding
benchmarks in parquet-column (delta, deltalengthbytearray).
Add ALP entry to org.apache.parquet.column.Encoding with getValuesReader
returning AlpValuesReaderForFloat/Double. Update AlpValuesWriter.getEncoding()
to return Encoding.ALP instead of the PLAIN placeholder.

…files

Reads ALP-encoded parquet files (spotify1: 9 columns x 15000 rows, arade:
4 columns x 15000 rows) generated by the C++ Arrow implementation and
verifies all double values match the expected CSV data bit-exactly.

This validates the full end-to-end ALP read path through the parquet-java
reader stack: file metadata parsing, ALP encoding detection, and
AlpValuesReaderForDouble decoding.

Note: requires parquet-format Encoding.ALP(10) which is not yet upstream
in the parquet-format Thrift spec. Build with -Dmdep.skip=true when
rebuilding parquet-format-structures to preserve the local ALP addition.

… tests

Wire ALP encoding into Java's writer pipeline so ParquetWriter can produce
ALP-encoded parquet files. This enables bidirectional interop testing: Java
writes ALP parquet that C++ reads, complementing the existing C++ writes /
Java reads direction.

- Add alpEnabled ColumnProperty to ParquetProperties with isAlpEnabled()
- Update DefaultV2ValuesWriterFactory to select ALP writers for float/double
- Add withAlpEncoding() builder methods to ParquetWriter
- Add GenerateAlpParquet utility to produce test files from CSV data
- Add Java interop tests reading back Java-generated ALP parquet files

Add float32 (FLOAT) coverage to the ALP encoding interop tests:
- Generator: generateAlpParquetFloat() writes float32 ALP parquet from
  float expect CSVs using PrimitiveTypeName.FLOAT schema
- Tests: readExpectedCsvFloat() and 4 new test methods using
  Float.floatToIntBits() for bit-exact verification of C++ and Java
  generated float parquets for spotify1 and arade datasets
- Add *.csv to RAT license check exclusions
- Add float32 test resources (expect CSVs, C++ and Java parquets)
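
Float.floatToIntBits gives the bit-exact comparison that ordinary == cannot: it distinguishes -0.0f from 0.0f and maps every NaN to one canonical pattern. A minimal helper in the style the tests use:

```java
public class BitExactFloatCheck {
    // True only when both floats have identical (canonicalized) bit patterns.
    static boolean bitExact(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }
}
```
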

Add two benchmarks for measuring ALP performance:
- AlpCodecThroughput: Codec-level encode/decode throughput in MB/s
  with 1M values across decimal/integer/mixed datasets for both
  float and double types. Directly comparable to C++ encoding_alp_benchmark.
- AlpDecompressionThroughput: Full Parquet pipeline read throughput
  measuring end-to-end decompression on real parquet files.

Also add Hadoop CRC files for Java-generated test parquet resources.

Writer pipeline integration:
- Use CapacityByteArrayOutputStream for encoded vector storage instead
  of List<byte[]>, integrating with Parquet's memory management
- Use BytesInput.concat() for zero-copy page assembly
- Accept (initialCapacity, pageSize, ByteBufferAllocator) constructor
  params; factory now passes pipeline properties

Reader memory efficiency:
- Allocate decoded buffer once in initFromPage() and reuse across all
  vector decodes, eliminating per-vector float[]/double[] allocations
- Improves decode throughput 5-24% across all datasets

Reader validation:
- Validate logVectorSize bounds (MIN_LOG to MAX_LOG)
- Validate non-negative element count
- Validate skip(n) bounds
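
The validation rules listed above reduce to a few bounds checks on untrusted page bytes. A sketch with assumed bounds (the MIN_LOG/MAX_LOG values here are illustrative, not the PR's actual constants):

```java
public class AlpReaderValidationSketch {
    // Illustrative bounds; the PR's real MIN_LOG/MAX_LOG constants may differ.
    static final int MIN_LOG = 6;   // 64-element vectors
    static final int MAX_LOG = 12;  // 4096-element vectors

    static void validateHeader(int logVectorSize, int numElements) {
        if (logVectorSize < MIN_LOG || logVectorSize > MAX_LOG) {
            throw new IllegalArgumentException("logVectorSize out of range: " + logVectorSize);
        }
        if (numElements < 0) {
            throw new IllegalArgumentException("negative element count: " + numElements);
        }
    }

    static void validateSkip(int n, int position, int numElements) {
        if (n < 0 || position + n > numElements) {
            throw new IllegalArgumentException("skip(" + n + ") out of bounds");
        }
    }
}
```

Rejecting bad headers up front means a corrupt page fails fast instead of driving a huge allocation or an out-of-bounds decode.
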

Reuse encoded long[] buffer across double vector decompression calls
instead of allocating new long[1024] per vector. Buffer is allocated
once in initFromPage and passed through to decompressDoubleVector.

Float decode is intentionally left unchanged — JVM escape analysis
produces better code when int[] is allocated fresh per vector call.

Process 32 values at a time in unpackLongs instead of 8, reducing
the number of method calls by 4x during double vector decompression.
Only applied to the long (double) path; int (float) unpacking is
unchanged as JIT produces better code with the original 8-value path.
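
The per-value extraction that unpackLongs batches looks roughly like the generic sketch below (the back-to-back little-endian bit layout is an assumption, and the name is illustrative):

```java
public class BitUnpackSketch {
    // Extract out.length values of bitWidth bits each from packed[],
    // values laid out back-to-back starting at bit 0 of packed[0].
    static void unpack(long[] packed, int bitWidth, long[] out) {
        long mask = bitWidth == 64 ? -1L : (1L << bitWidth) - 1;
        for (int i = 0; i < out.length; i++) {
            long bitPos = (long) i * bitWidth;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = packed[word] >>> shift;
            if (shift + bitWidth > 64) {          // value straddles two longs
                value |= packed[word + 1] << (64 - shift);
            }
            out[i] = value & mask;
        }
    }
}
```
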

Hoist DOUBLE_POW10[factor] and DOUBLE_POW10_NEGATIVE[exponent] lookups
out of the decode loop and inline the multiplication directly, avoiding
per-element method call overhead and enabling better JIT vectorization.
Only applied to the double path; float decode keeps the decodeFloat()
method call as JIT produces better code for the smaller method.
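
Hoisting turns two array lookups plus a method call per element into two register-resident constants; a sketch of the resulting loop shape (Math.pow stands in for the table lookups):

```java
public class AlpDecodeLoopSketch {
    // Decode a whole vector with the 10^factor and 10^(-exponent) multipliers
    // hoisted out of the loop, so the body is two multiplies per element.
    static void decodeVector(long[] encoded, int exponent, int factor, double[] out) {
        final double powFactor = Math.pow(10, factor);         // hoisted lookup
        final double powNegExponent = Math.pow(10, -exponent); // hoisted lookup
        for (int i = 0; i < encoded.length; i++) {
            out[i] = encoded[i] * powFactor * powNegExponent;
        }
    }
}
```
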

Replace synthetic random data with actual Spotify audio features columns
(valence, danceability, energy, etc.) from alp_spotify1_expect.csv for
direct comparability with the C++ encoding_alp_benchmark. The 15K source
rows are tiled to 1M values for stable measurement.

Load Spotify CSV files from parquet-hadoop/src/test/resources/ instead
of duplicating them into parquet-column/src/test/resources/. The
benchmark now resolves the CSV directory relative to the project root.

Compress the 4 expect CSV files with gzip (5.3 MB -> 1.4 MB) and update
all readers (TestInteropAlpEncoding, GenerateAlpParquet, AlpCodecThroughput)
to decompress via GZIPInputStream. Also regenerate the C++ and Java float
parquet test files which had stale/invalid page headers, and remove the
Hadoop CRC files that were causing checksum errors.
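
Reading the gzipped CSVs is essentially a one-line change per reader: wrap the resource stream in GZIPInputStream. A self-contained round-trip sketch (the class name is illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCsvSketch {
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors how the updated readers wrap a .csv.gz resource stream.
    static String gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) bos.write(buf, 0, n);
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
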

Replace synthetic random data with real Spotify audio features
columns (15K rows from alp_spotify1_expect.csv.gz, tiled 6x to
90K values with integer offset per copy for stable throughput
measurement without giving ZSTD unfair cross-copy repetition).

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from c738d2e to 48d5e73 Compare March 11, 2026 04:12
@prtkgaur prtkgaur changed the title [WIP][Gh540] ALP pseudo decimal encoding [BuildsOnTop Of Julien and Vinoo's work] [WIP] ALP pseudo decimal encoding Mar 11, 2026