
[BuildsOnTop Of Julien and Vinoo's work] [WIP] ALP pseudo decimal encoding#3444

Draft
prtkgaur wants to merge 24 commits into apache:master from prtkgaur:gh540-alp-pseudoDecimal-encoding

Conversation

@prtkgaur


Add the foundational constants class for ALP encoding:
- 7-byte page header matching C++ AlpHeader wire format
- Float/double power-of-ten lookup tables
- Sampler constants (vector sizes, combination limits)
- Encoding limits and magic numbers for fast rounding
- Per-vector metadata sizes (AlpInfo, ForInfo)
- integerPow10() helper and validateVectorSize()

Add the core ALP encoding/decoding primitives:
- Per-value float/double encode and decode
- Fast rounding via magic number trick (matching C++)
- Exception detection (NaN, Inf, -0.0, round-trip failure)
- Bit width calculation for int and long
- Bit packed size calculation
- Best (exponent, factor) parameter search (full and preset-based)
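
The "magic number" fast-rounding trick named above can be sketched in isolation. Adding and subtracting 2^51 + 2^52 forces a double's fraction bits to hold the rounded integer (round-half-to-even); whether the PR uses exactly this constant is an assumption here, and the class name is illustrative:

```java
public class AlpFastRound {
    // 2^51 + 2^52: adding this to a double in roughly the +/-2^51 range forces
    // the FPU to discard all fraction bits (rounding half-to-even), so the
    // rounded integer is recovered by subtracting the constant again.
    static final double MAGIC_DOUBLE = 6755399441055744.0; // 2^51 + 2^52

    static long fastRound(double d) {
        return (long) ((d + MAGIC_DOUBLE) - MAGIC_DOUBLE);
    }
}
```

This avoids a call to Math.round in the hot encode loop; note the half-to-even behavior differs from Math.round's half-up.
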

Implements the core ALP compression pipeline for individual vectors:
encode floats/doubles to integers, apply Frame of Reference, bit-pack.
Decompression reverses this with exception patching.

Includes FloatCompressedVector/DoubleCompressedVector with store/load
serialization matching the C++ wire format, and AlpEncodingPreset for
preset-based encoding parameter selection.
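
The Frame of Reference and bit-width steps of the pipeline above can be sketched as follows; the method names are illustrative, not the PR's actual classes:

```java
public class AlpForSketch {
    // Frame of Reference: subtract the vector minimum so all deltas are
    // non-negative and fit in the fewest bits possible.
    static long[] frameOfReference(long[] encoded) {
        long min = Long.MAX_VALUE;
        for (long v : encoded) min = Math.min(min, v);
        long[] deltas = new long[encoded.length];
        for (int i = 0; i < encoded.length; i++) deltas[i] = encoded[i] - min;
        return deltas;
    }

    // Smallest bit width that can represent every (non-negative) delta.
    static int bitWidth(long[] deltas) {
        long max = 0;
        for (long d : deltas) max = Math.max(max, d);
        return 64 - Long.numberOfLeadingZeros(max | 1);
    }
}
```
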

Implements FloatSampler and DoubleSampler that collect representative
samples from input data and generate AlpEncodingPreset with the best
(exponent, factor) combinations. Matches C++ AlpSampler behavior:
equidistant vector sampling, compressed-size estimation, and top-k
combination selection.
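
The equidistant sampling step can be sketched as picking a fixed number of evenly spaced vector indices; the stride arithmetic below is a hypothetical illustration, not the AlpSampler's exact formula:

```java
public class AlpSamplerSketch {
    // Choose up to maxSamples vector indices spread evenly across the input,
    // so the sample reflects data from the whole column chunk.
    static int[] equidistantIndices(int numVectors, int maxSamples) {
        int n = Math.min(numVectors, maxSamples);
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (int) ((long) i * numVectors / n);
        }
        return idx;
    }
}
```
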

Public API for encoding/decoding full ALP pages. Layout:
[Header(7B)][Offsets...][Vector0][Vector1]...
Header: [compression_mode(1B)][integer_encoding(1B)][log_vector_size(1B)][num_elements(4B)]
Supports sampling presets, max size estimation, and multi-vector pages.
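
The 7-byte header above can be serialized with a ByteBuffer. Byte order is assumed little-endian here to match a typical C++ wire format (an assumption, not stated in the PR text), and the class name is illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AlpHeaderSketch {
    static final int HEADER_SIZE = 7;

    // [compression_mode(1B)][integer_encoding(1B)][log_vector_size(1B)][num_elements(4B)]
    static byte[] writeHeader(int mode, int intEncoding, int logVectorSize, int numElements) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) mode).put((byte) intEncoding).put((byte) logVectorSize);
        buf.putInt(numElements);
        return buf.array();
    }

    static int readNumElements(byte[] header) {
        return ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt(3);
    }
}
```
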

Extends ValuesWriter with FloatAlpValuesWriter and DoubleAlpValuesWriter.
Buffers values into 1024-element vectors, compresses each when full,
and assembles the ALP page format on getBytes(). Sampling is done on
the first batch of values to generate the encoding preset.

Uses Encoding.PLAIN as placeholder until ALP is added to parquet-format.

Implements AlpValuesReader (abstract), AlpValuesReaderForFloat, and
AlpValuesReaderForDouble. Uses lazy per-vector decoding: initFromPage
reads only the header and offset array, vectors are decoded on first
access. skip() is O(1) with no decoding.
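
The lazy scheme described above amounts to: keep the offsets from initFromPage, decode a vector only when a value inside it is first requested, and let skip() just move a cursor. A minimal sketch with illustrative names and a stand-in decode:

```java
public class LazyVectorReaderSketch {
    private final int vectorSize;
    private final double[][] cache; // decoded vectors, filled on demand
    private int position = 0;
    int decodeCount = 0;            // exposed only so the sketch is testable

    LazyVectorReaderSketch(int vectorSize, int numVectors) {
        this.vectorSize = vectorSize;
        this.cache = new double[numVectors][];
    }

    // Stand-in for real ALP vector decompression from the page buffer.
    private double[] decodeVector(int v) {
        decodeCount++;
        double[] out = new double[vectorSize];
        for (int i = 0; i < vectorSize; i++) out[i] = v * vectorSize + i;
        return out;
    }

    double readDouble() {
        int v = position / vectorSize;
        if (cache[v] == null) cache[v] = decodeVector(v); // decode on first access
        return cache[v][position++ % vectorSize];
    }

    void skip(int n) {
        position += n; // O(1): no decoding happens here
    }
}
```

Skipped-over vectors are never decoded, which is what makes skip() O(1).
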

Add AlpCrossImplTest with 7 test cases that decode C++ reference blobs
and verify bit-identical output. Reference blobs were generated by the
C++ Arrow ALP implementation via generate_reference_blobs.cc.

Fix encode/decode math to use two-step multiplication matching C++:
- Encode: value * 10^exponent * 10^(-factor)
- Decode: encoded * 10^factor * 10^(-exponent)

The previous single-operation approach (value / (10^e / 10^f)) produced
1-ULP differences due to different intermediate floating-point rounding.
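
The order matters because each floating-point multiply rounds once; folding the powers into a single divisor rounds at a different intermediate value and can shift the result by one ULP. A sketch of the corrected two-step order (the pow10 helpers stand in for the PR's lookup tables):

```java
public class AlpTwoStepSketch {
    // Stand-ins for the PR's DOUBLE_POW10 / DOUBLE_POW10_NEGATIVE tables.
    static double pow10(int i) { return Math.pow(10, i); }
    static double pow10Neg(int i) { return Math.pow(10, -i); }

    // Encode: value * 10^exponent * 10^(-factor), two separate multiplies.
    static long encode(double value, int exponent, int factor) {
        return Math.round(value * pow10(exponent) * pow10Neg(factor));
    }

    // Decode: encoded * 10^factor * 10^(-exponent), mirroring the C++ order.
    static double decode(long encoded, int exponent, int factor) {
        return encoded * pow10(factor) * pow10Neg(exponent);
    }
}
```
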

Benchmark measuring ALP encode and decode throughput across 4 data
patterns (decimal, integer, constant, mixed with specials) for both
float and double types. Reports compression ratios at startup.

Uses carrotsearch JUnit Benchmarks framework matching existing encoding
benchmarks in parquet-column (delta, deltalengthbytearray).
Add ALP entry to org.apache.parquet.column.Encoding with getValuesReader
returning AlpValuesReaderForFloat/Double. Update AlpValuesWriter.getEncoding()
to return Encoding.ALP instead of the PLAIN placeholder.

…files

Reads ALP-encoded parquet files (spotify1: 9 columns x 15000 rows, arade:
4 columns x 15000 rows) generated by the C++ Arrow implementation and
verifies all double values match the expected CSV data bit-exactly.

This validates the full end-to-end ALP read path through the parquet-java
reader stack: file metadata parsing, ALP encoding detection, and
AlpValuesReaderForDouble decoding.

Note: requires parquet-format Encoding.ALP(10) which is not yet upstream
in the parquet-format Thrift spec. Build with -Dmdep.skip=true when
rebuilding parquet-format-structures to preserve the local ALP addition.

… tests

Wire ALP encoding into Java's writer pipeline so ParquetWriter can produce
ALP-encoded parquet files. This enables bidirectional interop testing: Java
writes ALP parquet that C++ reads, complementing the existing C++ writes /
Java reads direction.

- Add alpEnabled ColumnProperty to ParquetProperties with isAlpEnabled()
- Update DefaultV2ValuesWriterFactory to select ALP writers for float/double
- Add withAlpEncoding() builder methods to ParquetWriter
- Add GenerateAlpParquet utility to produce test files from CSV data
- Add Java interop tests reading back Java-generated ALP parquet files

Add float32 (FLOAT) coverage to the ALP encoding interop tests:
- Generator: generateAlpParquetFloat() writes float32 ALP parquet from
  float expect CSVs using PrimitiveTypeName.FLOAT schema
- Tests: readExpectedCsvFloat() and 4 new test methods using
  Float.floatToIntBits() for bit-exact verification of C++ and Java
  generated float parquets for spotify1 and arade datasets
- Add *.csv to RAT license check exclusions
- Add float32 test resources (expect CSVs, C++ and Java parquets)
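
Float.floatToIntBits gives the bit-exact comparison that ordinary == cannot: it distinguishes -0.0f from 0.0f and maps every NaN to one canonical pattern. A minimal helper in the style the tests use:

```java
public class BitExactFloatCheck {
    // True only when both floats have identical (canonicalized) bit patterns.
    static boolean bitExact(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }
}
```
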

Add two benchmarks for measuring ALP performance:
- AlpCodecThroughput: Codec-level encode/decode throughput in MB/s
  with 1M values across decimal/integer/mixed datasets for both
  float and double types. Directly comparable to C++ encoding_alp_benchmark.
- AlpDecompressionThroughput: Full Parquet pipeline read throughput
  measuring end-to-end decompression on real parquet files.

Also add Hadoop CRC files for Java-generated test parquet resources.

Writer pipeline integration:
- Use CapacityByteArrayOutputStream for encoded vector storage instead
  of List<byte[]>, integrating with Parquet's memory management
- Use BytesInput.concat() for zero-copy page assembly
- Accept (initialCapacity, pageSize, ByteBufferAllocator) constructor
  params; factory now passes pipeline properties

Reader memory efficiency:
- Allocate decoded buffer once in initFromPage() and reuse across all
  vector decodes, eliminating per-vector float[]/double[] allocations
- Improves decode throughput 5-24% across all datasets

Reader validation:
- Validate logVectorSize bounds (MIN_LOG to MAX_LOG)
- Validate non-negative element count
- Validate skip(n) bounds
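
The validation rules listed above reduce to a few bounds checks on untrusted page bytes. A sketch with assumed bounds (the MIN_LOG/MAX_LOG values here are illustrative, not the PR's actual constants):

```java
public class AlpReaderValidationSketch {
    // Illustrative bounds; the PR's real MIN_LOG/MAX_LOG constants may differ.
    static final int MIN_LOG = 6;   // 64-element vectors
    static final int MAX_LOG = 12;  // 4096-element vectors

    static void validateHeader(int logVectorSize, int numElements) {
        if (logVectorSize < MIN_LOG || logVectorSize > MAX_LOG) {
            throw new IllegalArgumentException("logVectorSize out of range: " + logVectorSize);
        }
        if (numElements < 0) {
            throw new IllegalArgumentException("negative element count: " + numElements);
        }
    }

    static void validateSkip(int n, int position, int numElements) {
        if (n < 0 || position + n > numElements) {
            throw new IllegalArgumentException("skip(" + n + ") out of bounds");
        }
    }
}
```

Rejecting bad headers up front means a corrupt page fails fast instead of driving a huge allocation or an out-of-bounds decode.
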

Reuse encoded long[] buffer across double vector decompression calls
instead of allocating new long[1024] per vector. Buffer is allocated
once in initFromPage and passed through to decompressDoubleVector.

Float decode is intentionally left unchanged — JVM escape analysis
produces better code when int[] is allocated fresh per vector call.

Process 32 values at a time in unpackLongs instead of 8, reducing
the number of method calls by 4x during double vector decompression.
Only applied to the long (double) path; int (float) unpacking is
unchanged as JIT produces better code with the original 8-value path.
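
The per-value extraction that unpackLongs batches looks roughly like the generic sketch below (the back-to-back little-endian bit layout is an assumption, and the name is illustrative):

```java
public class BitUnpackSketch {
    // Extract out.length values of bitWidth bits each from packed[],
    // values laid out back-to-back starting at bit 0 of packed[0].
    static void unpack(long[] packed, int bitWidth, long[] out) {
        long mask = bitWidth == 64 ? -1L : (1L << bitWidth) - 1;
        for (int i = 0; i < out.length; i++) {
            long bitPos = (long) i * bitWidth;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = packed[word] >>> shift;
            if (shift + bitWidth > 64) {          // value straddles two longs
                value |= packed[word + 1] << (64 - shift);
            }
            out[i] = value & mask;
        }
    }
}
```
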

Hoist DOUBLE_POW10[factor] and DOUBLE_POW10_NEGATIVE[exponent] lookups
out of the decode loop and inline the multiplication directly, avoiding
per-element method call overhead and enabling better JIT vectorization.
Only applied to the double path; float decode keeps the decodeFloat()
method call as JIT produces better code for the smaller method.
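
Hoisting turns two array lookups plus a method call per element into two register-resident constants; a sketch of the resulting loop shape (Math.pow stands in for the table lookups):

```java
public class AlpDecodeLoopSketch {
    // Decode a whole vector with the 10^factor and 10^(-exponent) multipliers
    // hoisted out of the loop, so the body is two multiplies per element.
    static void decodeVector(long[] encoded, int exponent, int factor, double[] out) {
        final double powFactor = Math.pow(10, factor);         // hoisted lookup
        final double powNegExponent = Math.pow(10, -exponent); // hoisted lookup
        for (int i = 0; i < encoded.length; i++) {
            out[i] = encoded[i] * powFactor * powNegExponent;
        }
    }
}
```
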

Replace synthetic random data with actual Spotify audio features columns
(valence, danceability, energy, etc.) from alp_spotify1_expect.csv for
direct comparability with the C++ encoding_alp_benchmark. The 15K source
rows are tiled to 1M values for stable measurement.

Load Spotify CSV files from parquet-hadoop/src/test/resources/ instead
of duplicating them into parquet-column/src/test/resources/. The
benchmark now resolves the CSV directory relative to the project root.

Compress the 4 expect CSV files with gzip (5.3 MB -> 1.4 MB) and update
all readers (TestInteropAlpEncoding, GenerateAlpParquet, AlpCodecThroughput)
to decompress via GZIPInputStream. Also regenerate the C++ and Java float
parquet test files which had stale/invalid page headers, and remove the
Hadoop CRC files that were causing checksum errors.
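
Reading the gzipped CSVs is essentially a one-line change per reader: wrap the resource stream in GZIPInputStream. A self-contained round-trip sketch (the class name is illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCsvSketch {
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors how the updated readers wrap a .csv.gz resource stream.
    static String gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) bos.write(buf, 0, n);
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
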

Replace synthetic random data with real Spotify audio features
columns (15K rows from alp_spotify1_expect.csv.gz, tiled 6x to
90K values with integer offset per copy for stable throughput
measurement without giving ZSTD unfair cross-copy repetition).

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch from c738d2e to 48d5e73 Compare March 11, 2026 04:12
@prtkgaur prtkgaur changed the title [WIP][Gh540] ALP pseudo decimal encoding [BuildsOnTop Of Julien and Vinoo's work] [WIP] ALP pseudo decimal encoding Mar 11, 2026