Skip to content

perf: cache offsetBufferAddress in CometPlainVector for variable-width vectors#4364

Open
0lai0 wants to merge 5 commits into
apache:mainfrom
0lai0:fix-4280-CacheoffsetBufferAdress
Open

perf: cache offsetBufferAddress in CometPlainVector for variable-width vectors#4364
0lai0 wants to merge 5 commits into
apache:mainfrom
0lai0:fix-4280-CacheoffsetBufferAdress

Conversation

@0lai0
Copy link
Copy Markdown
Contributor

@0lai0 0lai0 commented May 19, 2026

Which issue does this PR close?

Closes #4280

Rationale for this change

CometPlainVector already caches the value buffer address, but variable-width reads still fetched the offset buffer address on every getUTF8String and getBinary call. Caching the offset buffer address avoids repeated Arrow buffer lookups on hot per-row paths.

What changes are included in this PR?

  • Cache offsetBufferAddress in CometPlainVector for variable-width vectors
  • Use the cached offset buffer address in getUTF8String and getBinary
  • Add targeted JUnit coverage for variable-width string and binary reads

How are these changes tested?

./mvnw test-compile surefire:test \
  -pl spark \
  -Dtest=org.apache.comet.vector.TestCometPlainVector,org.apache.comet.parquet.TestColumnReader \
  -DfailIfNoTests=false \
  -Dscalastyle.skip=true

@mbutrovich mbutrovich changed the title feat: enhance CometPlainVector to support variable width vectors perf: cache offsetBufferAddress in CometPlainVector for variable-width vectors May 19, 2026
@mbutrovich mbutrovich self-requested a review May 19, 2026 13:43
Comment thread spark/src/test/java/org/apache/comet/vector/TestCometPlainVector.java Outdated
return UTF8String.fromString(convertToUuid(result).toString());
}
} else {
throw new RuntimeException("Unsupported UTF8 vector type: " + valueVector.getName());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to write a test for this error branch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, Thanks for review. I'll add this in next commit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this and I don't think this branch is reachable through the real construction path.

LargeVarCharVector is rejected in the CometDecodedVector constructor via fromArrowType(LargeUtf8), before getUTF8String can be called.
Supported string vectors use VarCharVector, so they take the offsetBufferAddress != -1 branch. So I don't think we should add a test for this branch.

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @0lai0. This seems reasonable to me. Do you have any microbenchmark results you can share?

@0lai0
Copy link
Copy Markdown
Contributor Author

0lai0 commented May 21, 2026

I added a small focused microbenchmark for the VarCharVector / getUTF8String path.

On my local arm64 macOS machine with OpenJDK 17.0.18, using 1M rows, 3 warmup iterations, and 5 measured rounds:

Case Total time Time per row
BEFORE: fetch offsetBufferAddress on every call 4,603,250 ns 4.60 ns/row
AFTER: cached offsetBufferAddress 456,167 ns 0.46 ns/row

This shows about a 10x improvement on this tight loop. The offset buffer address is stable for the lifetime of the vector, so caching it is safe.

This is a simple timing benchmark rather than full JMH, but it directly measures the targeted path.
The benefit is most visible for queries that scan large string columns row-by-row, such as Parquet reads with string filters.

Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, thanks for picking up this issue @0lai0!

return UTF8String.fromString(convertToUuid(result).toString());
}
} else {
throw new RuntimeException("Unsupported UTF8 vector type: " + valueVector.getName());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this branch reachable in practice? NullVector is already filtered by isNullAt because valueBufferAddress == -1, so I am trying to think which ValueVector subclass would actually land here. If it is truly unreachable, IllegalStateException reads as more idiomatic than RuntimeException.

public class CometPlainVector extends CometDecodedVector {
private final long valueBufferAddress;
private final long offsetBufferAddress;
private final boolean isBaseFixedWidthVector;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a private final boolean isVariableWidthVector alongside isBaseFixedWidthVector instead of using offsetBufferAddress != -1 as the dispatch sentinel. The -1 pattern matches the existing valueBufferAddress precedent, so this is purely taste.

Alternatively, instead of an ever growing list of booleans, can we do an enum (maybe that's a bigger refactor and out of scope)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CometPlainVector: cache offsetBufferAddress for variable-width vectors

3 participants