
Conversation

@andygrove
Member

Summary

This PR introduces an experimental lightweight cost-based optimizer (CBO) that estimates whether a Comet query plan will be faster than a Spark plan, falling back to Spark when Comet execution is estimated to be slower.

⚠️ EXPERIMENTAL: This feature is disabled by default and should be considered experimental. The cost model parameters are initial estimates and should be tuned with real-world benchmarks before being used in production.

Key Features

  • Heuristic-based cost model with configurable weights for different operator types:
    • Scan, Filter, Project, Aggregate, Join, Sort
  • Configurable speedup factors for each Comet operator type
  • Transition penalty for columnar↔row conversions
  • Cardinality estimation using Spark's logical plan statistics, with fallbacks (see the sketch after this list)
  • CBO analysis in EXPLAIN output when enabled
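
A minimal sketch of the cardinality lookup, assuming the estimator reads the row count from the logical plan linked to each physical operator; the object name and the fallback value below are placeholders for illustration, and the actual logic lives in CometCostEstimator.scala.

```scala
import org.apache.spark.sql.execution.SparkPlan

// Illustrative only; the real estimation lives in CometCostEstimator.scala.
// Spark keeps a link from each physical operator back to its logical plan,
// whose statistics may carry a row-count estimate. Fall back to a fixed
// placeholder when no estimate is available.
object CardinalitySketch {
  private val FallbackRowCount = 1000000L // hypothetical fallback, not the PR's value

  def estimateRowCount(op: SparkPlan): Long =
    op.logicalLink                   // Option[LogicalPlan]
      .flatMap(_.stats.rowCount)     // Option[BigInt] from logical plan statistics
      .map(_.toLong)
      .getOrElse(FallbackRowCount)
}
```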

Configuration Options

Config                              Default   Description
spark.comet.cbo.enabled             false     Enable/disable CBO
spark.comet.cbo.speedupThreshold    1.0       Minimum estimated speedup required to use Comet
spark.comet.cbo.explain.enabled     false     Log CBO decision details

Additional internal configs for tuning weights and speedup factors are available (see CometConf.scala).

How It Works

  1. After CometExecRule transforms operators to Comet equivalents, CBO analyzes the plan
  2. Collects plan statistics: counts of Comet operators, Spark operators, and columnar↔row transitions
  3. Estimates costs for both Spark-only and Comet execution
  4. Calculates estimated speedup = sparkCost / cometCost
  5. If speedup < threshold, falls back to the original Spark plan (see the sketch below)
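
A minimal sketch of steps 2–5, assuming a flat per-operator weight and a single assumed speedup factor; the weight, the factor, and the class-name test for Comet operators are placeholders, and transition penalties and cardinalities are omitted for brevity. The actual estimator and rule integration live in CometCostEstimator.scala and CometExecRule.scala.

```scala
import org.apache.spark.sql.execution.SparkPlan

// Illustrative decision flow only, not the PR's implementation.
object CboDecisionSketch {
  private val OperatorWeight = 1.0       // hypothetical per-operator weight
  private val AssumedCometSpeedup = 2.0  // hypothetical per-operator Comet speedup factor

  // Treat any operator whose class name starts with "Comet" as a Comet operator.
  private def isCometOperator(op: SparkPlan): Boolean =
    op.getClass.getSimpleName.startsWith("Comet")

  // Sum a cost contribution for every operator; Comet operators are assumed cheaper.
  private def estimateCost(plan: SparkPlan): Double =
    plan.collect {
      case op if isCometOperator(op) => OperatorWeight / AssumedCometSpeedup
      case _                         => OperatorWeight
    }.sum

  // Step 5: keep the Comet plan only if the estimated speedup clears the threshold.
  def choosePlan(sparkPlan: SparkPlan, cometPlan: SparkPlan, threshold: Double): SparkPlan = {
    val estimatedSpeedup = estimateCost(sparkPlan) / estimateCost(cometPlan)
    if (estimatedSpeedup < threshold) sparkPlan else cometPlan
  }
}
```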

Limitations

  • CBO only affects operator conversion (filter, project, aggregate, join, sort)
  • Scan conversion is NOT affected; it is handled separately by CometScanRule
  • Cost model parameters are estimates and need tuning with benchmarks

Example Usage

// Enable CBO with default threshold
spark.conf.set("spark.comet.cbo.enabled", "true")

// Or with custom threshold (use Comet only if 1.5x faster)
spark.conf.set("spark.comet.cbo.speedupThreshold", "1.5")

// Enable debug logging
spark.conf.set("spark.comet.cbo.explain.enabled", "true")

Files Changed

  • New: CometCostEstimator.scala - Core cost estimation logic
  • New: CometCBOSuite.scala - Unit tests
  • Modified: CometConf.scala - Configuration options
  • Modified: CometExecRule.scala - CBO integration
  • Modified: ExtendedExplainInfo.scala - CBO info in EXPLAIN

Test plan

  • New unit tests in CometCBOSuite (13 tests)
  • Existing CometExecRuleSuite tests pass
  • Manual testing with TPC-H/TPC-DS benchmarks (future work)
  • Tune default parameters based on benchmark results (future work)

🤖 Generated with Claude Code

…execution

This PR introduces an experimental lightweight cost-based optimizer that
estimates whether a Comet query plan will be faster than a Spark plan,
falling back to Spark when Comet execution is estimated to be slower.

**Key Features:**
- Heuristic-based cost model with configurable weights for different
  operator types (scan, filter, project, aggregate, join, sort)
- Configurable speedup factors for each Comet operator type
- Transition penalty for columnar<->row conversions
- Cardinality estimation using Spark's logical plan statistics
- CBO analysis included in EXPLAIN output when enabled

**Configuration:**
- `spark.comet.cbo.enabled` (default: false) - Enable/disable CBO
- `spark.comet.cbo.speedupThreshold` (default: 1.0) - Minimum speedup required
- `spark.comet.cbo.explain.enabled` (default: false) - Log CBO decisions

**Important Notes:**
- This is an EXPERIMENTAL feature, disabled by default
- CBO only affects operator conversion (filter, project, aggregate, etc.),
  not scan conversion which is handled by CometScanRule
- Default parameters are initial estimates and should be tuned with benchmarks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove changed the title from "[EXPERIMENTAL] Add cost-based optimizer (CBO) for Comet vs Spark execution" to "feat: [EXPERIMENTAL] Add cost-based optimizer (CBO) for Comet vs Spark execution" on Jan 19, 2026
Enhance the CBO cost model to consider the cost of individual expressions
in projections and filters, rather than using a fixed cost per operator.

Key changes:
- Add DEFAULT_EXPR_COSTS map with cost multipliers for common expressions
  (e.g., AttributeReference=0.1 since Comet just clones arrays)
- Add dynamic config override via spark.comet.cbo.exprCost.<ExpressionName>
- Update cost calculation to sum expression costs in filters/projects
- Add CometCBOSuite to CI workflows (Linux and macOS)
- Add 6 new tests for expression-based costing

Expression cost multipliers:
- < 1.0 means Comet is faster for this expression
- > 1.0 means Spark is faster for this expression
- 1.0 means they are equivalent

Example config override:
  spark.conf.set("spark.comet.cbo.exprCost.MyCustomExpr", "1.5")

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
