
Conversation

@andygrove
Member

Summary

This PR introduces an experimental lightweight cost-based optimizer (CBO) that estimates whether a Comet query plan will be faster than a Spark plan, falling back to Spark when Comet execution is estimated to be slower.

⚠️ EXPERIMENTAL: This feature is disabled by default and should be considered experimental. The cost model parameters are initial estimates and should be tuned with real-world benchmarks before being used in production.

Key Features

  • Heuristic-based cost model with configurable weights for different operator types:
    • Scan, Filter, Project, Aggregate, Join, Sort
  • Configurable speedup factors for each Comet operator type
  • Transition penalty for columnar↔row conversions
  • Cardinality estimation using Spark's logical plan statistics, with fallbacks (see the sketch after this list)
  • CBO analysis in EXPLAIN output when enabled
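
A minimal sketch of the cardinality lookup, assuming the estimator reads the row count from the logical plan linked to each physical operator; the object name and the fallback value below are placeholders for illustration, and the actual logic lives in CometCostEstimator.scala.

```scala
import org.apache.spark.sql.execution.SparkPlan

// Illustrative only; the real estimation lives in CometCostEstimator.scala.
// Spark keeps a link from each physical operator back to its logical plan,
// whose statistics may carry a row-count estimate. Fall back to a fixed
// placeholder when no estimate is available.
object CardinalitySketch {
  private val FallbackRowCount = 1000000L // hypothetical fallback, not the PR's value

  def estimateRowCount(op: SparkPlan): Long =
    op.logicalLink                   // Option[LogicalPlan]
      .flatMap(_.stats.rowCount)     // Option[BigInt] from logical plan statistics
      .map(_.toLong)
      .getOrElse(FallbackRowCount)
}
```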

Configuration Options

Config                              Default   Description
spark.comet.cbo.enabled             false     Enable/disable CBO
spark.comet.cbo.speedupThreshold    1.0       Minimum estimated speedup required to use Comet
spark.comet.cbo.explain.enabled     false     Log CBO decision details

Additional internal configs for tuning weights and speedup factors are available (see CometConf.scala).

How It Works

  1. After CometExecRule transforms operators to Comet equivalents, CBO analyzes the plan
  2. Collects plan statistics: counts of Comet operators, Spark operators, and columnar↔row transitions
  3. Estimates costs for both Spark-only and Comet execution
  4. Calculates estimated speedup = sparkCost / cometCost
  5. If speedup < threshold, falls back to the original Spark plan (see the sketch below)
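
A minimal sketch of steps 2–5, assuming a flat per-operator weight and a single assumed speedup factor; the weight, the factor, and the class-name test for Comet operators are placeholders, and transition penalties and cardinalities are omitted for brevity. The actual estimator and rule integration live in CometCostEstimator.scala and CometExecRule.scala.

```scala
import org.apache.spark.sql.execution.SparkPlan

// Illustrative decision flow only, not the PR's implementation.
object CboDecisionSketch {
  private val OperatorWeight = 1.0       // hypothetical per-operator weight
  private val AssumedCometSpeedup = 2.0  // hypothetical per-operator Comet speedup factor

  // Treat any operator whose class name starts with "Comet" as a Comet operator.
  private def isCometOperator(op: SparkPlan): Boolean =
    op.getClass.getSimpleName.startsWith("Comet")

  // Sum a cost contribution for every operator; Comet operators are assumed cheaper.
  private def estimateCost(plan: SparkPlan): Double =
    plan.collect {
      case op if isCometOperator(op) => OperatorWeight / AssumedCometSpeedup
      case _                         => OperatorWeight
    }.sum

  // Step 5: keep the Comet plan only if the estimated speedup clears the threshold.
  def choosePlan(sparkPlan: SparkPlan, cometPlan: SparkPlan, threshold: Double): SparkPlan = {
    val estimatedSpeedup = estimateCost(sparkPlan) / estimateCost(cometPlan)
    if (estimatedSpeedup < threshold) sparkPlan else cometPlan
  }
}
```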

Limitations

  • CBO only affects operator conversion (filter, project, aggregate, join, sort)
  • Scan conversion is NOT affected; it is handled separately by CometScanRule
  • Cost model parameters are estimates and need tuning with benchmarks

Example Usage

// Enable CBO with default threshold
spark.conf.set("spark.comet.cbo.enabled", "true")

// Or with custom threshold (use Comet only if 1.5x faster)
spark.conf.set("spark.comet.cbo.speedupThreshold", "1.5")

// Enable debug logging
spark.conf.set("spark.comet.cbo.explain.enabled", "true")

Files Changed

  • New: CometCostEstimator.scala - Core cost estimation logic
  • New: CometCBOSuite.scala - Unit tests
  • Modified: CometConf.scala - Configuration options
  • Modified: CometExecRule.scala - CBO integration
  • Modified: ExtendedExplainInfo.scala - CBO info in EXPLAIN

Test plan

  • New unit tests in CometCBOSuite (13 tests)
  • Existing CometExecRuleSuite tests pass
  • Manual testing with TPC-H/TPC-DS benchmarks (future work)
  • Tune default parameters based on benchmark results (future work)

🤖 Generated with Claude Code

…execution

This PR introduces an experimental lightweight cost-based optimizer that
estimates whether a Comet query plan will be faster than a Spark plan,
falling back to Spark when Comet execution is estimated to be slower.

**Key Features:**
- Heuristic-based cost model with configurable weights for different
  operator types (scan, filter, project, aggregate, join, sort)
- Configurable speedup factors for each Comet operator type
- Transition penalty for columnar<->row conversions
- Cardinality estimation using Spark's logical plan statistics
- CBO analysis included in EXPLAIN output when enabled

**Configuration:**
- `spark.comet.cbo.enabled` (default: false) - Enable/disable CBO
- `spark.comet.cbo.speedupThreshold` (default: 1.0) - Minimum speedup required
- `spark.comet.cbo.explain.enabled` (default: false) - Log CBO decisions

**Important Notes:**
- This is an EXPERIMENTAL feature, disabled by default
- CBO only affects operator conversion (filter, project, aggregate, etc.),
  not scan conversion which is handled by CometScanRule
- Default parameters are initial estimates and should be tuned with benchmarks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove changed the title from "[EXPERIMENTAL] Add cost-based optimizer (CBO) for Comet vs Spark execution" to "feat: [EXPERIMENTAL] Add cost-based optimizer (CBO) for Comet vs Spark execution" on Jan 19, 2026
Enhance the CBO cost model to consider the cost of individual expressions
in projections and filters, rather than using a fixed cost per operator.

Key changes:
- Add DEFAULT_EXPR_COSTS map with cost multipliers for common expressions
  (e.g., AttributeReference=0.1 since Comet just clones arrays)
- Add dynamic config override via spark.comet.cbo.exprCost.<ExpressionName>
- Update cost calculation to sum expression costs in filters/projects
- Add CometCBOSuite to CI workflows (Linux and macOS)
- Add 6 new tests for expression-based costing

Expression cost multipliers:
- < 1.0 means Comet is faster for this expression
- > 1.0 means Spark is faster for this expression
- 1.0 means they are equivalent

Example config override:
  spark.conf.set("spark.comet.cbo.exprCost.MyCustomExpr", "1.5")

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
