Added the material for XGBoost optimization #30

Open
bbhattar wants to merge 3 commits into intel:main from bbhattar:xgboost
@bbhattar

Added the materials for XGBoost optimization. Please review and give me your feedback.

Comment thread software/xgboost/README.md Outdated

## Introduction

[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using Intel® oneAPI Data Analytics Library (oneDAL) via its Python interface, `daal4py`.
Collaborator

You have a references section, but worth making this first occurrence a link "Intel® oneAPI Data Analytics Library (oneDAL)"
to https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html or github page whichever you think is a better one.

Note that this product was transferred to the UXL Foundation, so it no longer has Intel in the name.

Also if you want to put a link, please use this one instead which is up to date:
http://uxlfoundation.github.io/oneDAL/

Comment thread software/xgboost/README.md Outdated

**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2

**Hardware:** Intel® Xeon® Platinum 8592+ (Emerald Rapids), 2S/64C/128T per socket, HT On, 503 GB DDR5, single NUMA node
Collaborator

2 sockets, 64 cores/socket, 256 threads - like you had it in line 186 - less ambiguous

Comment thread software/xgboost/README.md Outdated
- **Thread scaling is sub-linear** — using 4x the cores in a single process yields only **2.1x** throughput, because cross-socket memory coherency traffic limits scaling.
- **The tradeoff is latency**: thread scaling achieves **lower per-request latency** (1,230 us at 128 cores) because all cores collaborate on each prediction. Process scaling maintains a fixed latency (~2,000 us per worker, 32 cores each) but delivers **higher aggregate throughput**.

#### Hyperthreading Hurts Performance
Collaborator

Suggested change
- #### Hyperthreading Hurts Performance
+ #### Hyper-Threading Hurts Performance

Comment thread software/xgboost/README.md Outdated
|:--------|:---------------|
| Data Format | Use NumPy contiguous arrays (`np.ascontiguousarray()`) as input for best performance |
| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
| Batch Size | oneDAL excels at all batch sizes, with the largest advantage at batch size = 1 (online inference) |
Collaborator

Consider changing to -
oneDAL performs well across batch sizes, with the largest advantage at batch size = 1

Comment thread software/xgboost/README.md Outdated
# model = trained XGBoost, LightGBM, or CatBoost model
# X_test = numpy float32 test array

# Convert the model (one line, works for all three frameworks)
Collaborator

Clear, but maybe name them explicitly once:

“works for XGBoost, LightGBM, and CatBoost”

Comment thread software/xgboost/README.md Outdated

#### Hyperthreading Hurts Performance

daal4py's AVX-512 vectorized tree traversal is backend-bound — whether the bottleneck is core execution units or memory bandwidth, adding hyperthreads increases resource contention on the shared physical core, harming performance.
Collaborator

Link backend-bound to this: https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html
“Backend-bound” is understandable for performance folks, but for a broader README audience it would be good to explain what it means.
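
One way to act on the Hyper-Threading point in the excerpt above is to size the worker pool by physical cores rather than logical CPUs. The sketch below is only an illustration: it assumes a Linux host that exposes `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`, and the helper names (`parse_cpu_list`, `one_cpu_per_core`, `read_sibling_lists`) are hypothetical, not part of daal4py or of this PR.

```python
import glob


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0-3,64" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of SMT siblings."""
    groups = {tuple(parse_cpu_list(s)) for s in sibling_lists}
    return sorted(group[0] for group in groups if group)


def read_sibling_lists():
    """Read sibling lists from sysfs (assumed Linux layout); empty elsewhere."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists


if __name__ == "__main__":
    physical = one_cpu_per_core(read_sibling_lists())
    print(f"{len(physical)} physical cores detected")
```

The resulting count is a candidate thread count for inference; how it is passed to the library is left out here because it depends on the deployment.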

Comment thread software/xgboost/README.md Outdated
| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
| Batch Size | oneDAL excels at all batch sizes, with the largest advantage at batch size = 1 (online inference) |
| NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access |
| daal4py Version | Use the latest version for CatBoost support, missing values support, and performance improvements |
Collaborator

Do you know the minimum version that supports these?
Since 'latest' changes over time, it's usually better to specify the minimum required version so the README stays accurate in the future.
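
The NUMA row in the excerpt above recommends pinning each process to a single NUMA node. A minimal sketch of that idea, assuming Linux (`os.sched_setaffinity` and the sysfs file `/sys/devices/system/node/node<N>/cpulist`); the function names are hypothetical. It restricts CPU placement only and does not bind memory, which a tool such as `numactl` can also do.

```python
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0-3,64" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def node_cpus(node, sysfs_root="/sys/devices/system/node"):
    """Return the CPUs of one NUMA node, or an empty set if it cannot be read."""
    path = os.path.join(sysfs_root, f"node{node}", "cpulist")
    try:
        with open(path) as handle:
            return set(parse_cpu_list(handle.read()))
    except OSError:
        return set()


def pin_to_node(node):
    """Restrict the current process to one NUMA node; return the CPU set in effect."""
    allowed = os.sched_getaffinity(0)
    target = node_cpus(node) & allowed
    if not target:
        return allowed  # node unknown or not usable: leave affinity unchanged
    os.sched_setaffinity(0, target)
    return target


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):  # Linux only
        print(sorted(pin_to_node(0)))
```

Calling `pin_to_node` at the start of each worker process, with a different node per worker, is one way to keep a worker's threads on a single socket.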


[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using Intel® oneAPI Data Analytics Library (oneDAL) via its Python interface, `daal4py`.

By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.
Collaborator

Up to 36x faster, but performance table below doesn't show that.
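
The mismatch raised here can be checked mechanically. The sketch below uses only the two table rows quoted elsewhere in this thread (PLAsTiCC and Airline); the column headers are not visible in those excerpts, so each row's three speedup values are kept as an unlabeled tuple, and the full README table may contain rows not shown here. The helper names are hypothetical.

```python
HEADLINE_SPEEDUP = 36.0  # "up to 36x faster inference" in the README excerpt

# Rows exactly as quoted in the review thread; column headers are not shown there.
VISIBLE_ROWS = {
    "PLAsTiCC": ("2.81x", "6.50x", "1.11x"),
    "Airline": ("1.73x", "3.55x", "10.01x"),
}


def max_speedup(rows):
    """Largest speedup factor found in any cell of the given rows."""
    return max(float(cell.rstrip("x")) for cells in rows.values() for cell in cells)


def headline_supported(headline, rows, tolerance=0.05):
    """True if some table entry reaches the headline within a relative tolerance."""
    return max_speedup(rows) >= headline * (1 - tolerance)


if __name__ == "__main__":
    print(max_speedup(VISIBLE_ROWS))                           # 10.01
    print(headline_supported(HEADLINE_SPEEDUP, VISIBLE_ROWS))  # False
```

On the quoted rows alone, the largest entry is 10.01x, so the 36x headline is not backed by what is visible in the thread.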

@rsiyer-intel rsiyer-intel requested a review from adgubrud May 18, 2026 19:29
@rsiyer-intel
Collaborator

Since the latest changes still have perf data, it cannot be approved till we get perf claim pre-requisites fulfilled.

Install `daal4py` from PyPI:

```bash
pip install daal4py
```

daal4py isn't updated anymore. scikit-learn-intelex should be used

Or from conda-forge:

```bash
conda install -c conda-forge daal4py --override-channels
```

same as R45

@david-cortes-intel left a comment

General comment: this guide says 'xgboost', but it is limited to predictions/inference, while a similar guide could also be done for training, covering details like threading, hyperparameters to try, and similar.


[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.

By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.

There will be new numbers coming up soon. Note that some of the optimizations in daal4py have been upstreamed to xgboost, so the speed up will not be as large.

CC @razdoburdin


## Introduction

[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.

Note that daal4py does not cover all classes of possible XGBoost models. The guide could also suggest other frameworks, such as ONNX.

## Prerequisites

- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)
- Python 3.9 or higher
@david-cortes-intel May 26, 2026

Current versions of the scikit-learn-intelex package (which provides the daal4py module) require python>=3.10. I would reduce this to just 'recent Python version' or 'Python version supported by scikit-learn-intelex' (could link here: https://github.com/intel/scikit-learn-intelex)


## Prerequisites

- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)

I don't think AVX512 is used by daal4py GBT inference (CC @razdoburdin).

It does


- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)
- Python 3.9 or higher
- XGBoost installed (`xgboost` package)

Suggested change
- - XGBoost installed (`xgboost` package)
+ - XGBoost installed (`xgboost` python package from PyPI, or `py-xgboost` from conda-forge)

Install `daal4py` from PyPI:

```bash
pip install daal4py
```

This is outdated. Module daal4py is provided through package scikit-learn-intelex:
https://uxlfoundation.github.io/scikit-learn-intelex/latest/about_daal4py.html

import daal4py as d4p

# Using the lower-level API for more control
daal_model = d4p.get_gbt_model_from_xgboost(clf.get_booster())

Please do not suggest that users do this. The higher-level interface described in the documentation should provide everything needed:
https://uxlfoundation.github.io/scikit-learn-intelex/latest/model_builders.html
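
A minimal sketch of the higher-level interface referred to here, assuming `scikit-learn-intelex` (which provides the `daal4py` module) and NumPy are installed. `daal4py.mb.convert_model` is the model-builder entry point described on the linked page; the wrapper name `convert_and_predict` is hypothetical. The imports sit inside the function only so that the snippet can be loaded on a machine without those packages.

```python
def convert_and_predict(model, X):
    """Convert a trained XGBoost, LightGBM, or CatBoost model and run inference."""
    import numpy as np
    import daal4py as d4p

    # float32, C-contiguous input, as the README's recommendations table suggests
    X = np.ascontiguousarray(X, dtype=np.float32)

    # Single conversion call; the same call is used for all three frameworks
    daal_model = d4p.mb.convert_model(model)
    return daal_model.predict(X)


# Example usage (assumes a model trained elsewhere):
#   clf = xgboost.XGBClassifier().fit(X_train, y_train)
#   y_pred = convert_and_predict(clf, X_test)
```

As noted elsewhere in this thread, daal4py does not cover every class of XGBoost model, so the conversion step can fail for some models and callers may want to fall back to the original model's own `predict`.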

| PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x |
| Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x |

**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2

These are rather outdated versions.

| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
| Batch Size | oneDAL performs well across batch sizes, with the largest advantage at batch size = 1 (online inference) |
| NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access |
| daal4py Version | Use daal4py 2023.2 or newer (required for missing values support). Each release includes additional optimizations and bug fixes, so the latest version is recommended |

Please try to avoid hard-coding versions that will become outdated in guides like these.
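
One way to keep a guide accurate without hard-coding versions is to report what is actually installed at run time. A small sketch using only the standard library; the package list is an assumption based on the software named in this thread, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed from the software mentioned in this thread
PACKAGES = ("scikit-learn-intelex", "daal4py", "xgboost", "lightgbm", "catboost", "numpy")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Printing this alongside any benchmark output records the environment without the README having to name a specific release.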


#### Memory Allocator

Alternative memory allocators such as jemalloc or tcmalloc can sometimes improve performance over the default glibc malloc. It is recommended to test with these enabled to see if either provides a benefit for your workload:
@david-cortes-intel May 26, 2026

This would not be used by the libraries involved in this guide (daal4py, numpy).

| PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x |
| Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x |

**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2

Software versions are outdated by several years.
Consider updating the measurements with current versions.
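
When the measurements are redone, a simple harness helps keep the before and after numbers comparable. This is a generic sketch, not the benchmark used in the PR: `median_latency_us` and `speedup` are hypothetical helpers, and `predict_fn` stands for any callable, such as a model's `predict` method.

```python
import statistics
import time


def median_latency_us(predict_fn, X, warmup=3, repeats=20):
    """Median wall-clock time of predict_fn(X) in microseconds."""
    for _ in range(warmup):  # warm caches and any lazy initialization
        predict_fn(X)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(X)
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def speedup(baseline_us, optimized_us):
    """How many times faster the optimized path is than the baseline."""
    return baseline_us / optimized_us
```

Timing the original model and the converted model on the same input with the same `warmup` and `repeats` values, then passing both medians to `speedup`, gives a figure that can be regenerated whenever package versions change.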

@razdoburdin left a comment

Please update the installation instructions and consider switching to current versions of the software.


[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.

By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.
Contributor

If this is about conversion to oneDAL, it applies not only to XGBoost but to LightGBM and CatBoost as well, so if this section is about conversions, it should probably be reframed.

As for XGBoost itself, contributions to upstream XGBoost are in progress.

5 participants