Added the material for XGBoost optimization by bbhattar · Pull Request #30 · intel/optimization-zone

bbhattar · 2026-05-12T23:47:26Z

Added the materials for XGBoost optimization. Please review and give me your feedback.

rsiyer-intel · 2026-05-18T18:48:35Z

+
+## Introduction
+
+[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using Intel® oneAPI Data Analytics Library (oneDAL) via its Python interface, `daal4py`.


You have a references section, but it is worth making this first occurrence, "Intel® oneAPI Data Analytics Library (oneDAL)", a link
to https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html or to the GitHub page, whichever you think is better.

Note that this product was transferred to the UXL Foundation, so it no longer has Intel in the name.

Also, if you want to add a link, please use this one instead, which is up to date:
http://uxlfoundation.github.io/oneDAL/

rsiyer-intel · 2026-05-18T18:51:23Z

+
+**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2
+
+**Hardware:** Intel® Xeon® Platinum 8592+ (Emerald Rapids), 2S/64C/128T per socket, HT On, 503 GB DDR5, single NUMA node


Write this as "2 sockets, 64 cores/socket, 256 threads", as you had it in line 186; it is less ambiguous.
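The arithmetic behind the suggested wording, as a small sketch. It assumes two hardware threads per core with HT on, which is what the "128T per socket" figure in the hunk implies; the variable names are illustrative only.

```python
# Topology taken from the hardware line: 2 sockets, 64 cores per socket, HT on.
sockets = 2
cores_per_socket = 64
threads_per_core = 2  # assumption: Hyper-Threading exposes 2 threads per core

threads_per_socket = cores_per_socket * threads_per_core
total_cores = sockets * cores_per_socket
total_threads = sockets * threads_per_socket

print(f"{sockets} sockets, {cores_per_socket} cores/socket, {total_threads} threads")
```

Spelling out all three numbers avoids the question of whether "128T" is per socket or per system.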

rsiyer-intel · 2026-05-18T18:55:46Z

+- **Thread scaling is sub-linear** — using 4x the cores in a single process yields only **2.1x** throughput, because cross-socket memory coherency traffic limits scaling.
+- **The tradeoff is latency**: thread scaling achieves **lower per-request latency** (1,230 us at 128 cores) because all cores collaborate on each prediction. Process scaling maintains a fixed latency (~2,000 us per worker, 32 cores each) but delivers **higher aggregate throughput**.
+
+#### Hyperthreading Hurts Performance


Suggested change:

-#### Hyperthreading Hurts Performance
+#### Hyper-Threading Hurts Performance

rsiyer-intel · 2026-05-18T18:58:34Z

+|:--------|:---------------|
+| Data Format | Use NumPy contiguous arrays (`np.ascontiguousarray()`) as input for best performance |
+| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
+| Batch Size | oneDAL excels at all batch sizes, with the largest advantage at batch size = 1 (online inference) |


Consider changing this to:
"oneDAL performs well across batch sizes, with the largest advantage at batch size = 1"
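For the Data Format and Data Type rows in the hunk above, a minimal sketch of preparing input the way the table recommends. It assumes NumPy is installed; `prepare_features` is a hypothetical helper name, not part of daal4py.

```python
import numpy as np

def prepare_features(X):
    """Return X as a C-contiguous float32 array, copying only when needed."""
    return np.ascontiguousarray(X, dtype=np.float32)

# A transposed view is a typical source of non-contiguous float64 input.
X = np.arange(12, dtype=np.float64).reshape(3, 4).T
X_ready = prepare_features(X)

print(X.flags["C_CONTIGUOUS"], X.dtype)              # layout and dtype before
print(X_ready.flags["C_CONTIGUOUS"], X_ready.dtype)  # layout and dtype after
```

Doing the conversion once, before the prediction loop, keeps the cast and copy out of the per-request path.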

rsiyer-intel · 2026-05-18T18:59:45Z

+# model = trained XGBoost, LightGBM, or CatBoost model
+# X_test = numpy float32 test array
+
+# Convert the model (one line, works for all three frameworks)


Clear, but maybe name them explicitly once:

“works for XGBoost, LightGBM, and CatBoost”

rsiyer-intel · 2026-05-18T19:04:32Z

+
+#### Hyperthreading Hurts Performance
+
+daal4py's AVX-512 vectorized tree traversal is backend-bound — whether the bottleneck is core execution units or memory bandwidth, adding hyperthreads increases resource contention on the shared physical core, harming performance.


Link "backend-bound" to https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html
The term is understandable to performance engineers, but for a broader README audience it is good to explain what it means.

rsiyer-intel · 2026-05-18T19:14:34Z

+| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
+| Batch Size | oneDAL excels at all batch sizes, with the largest advantage at batch size = 1 (online inference) |
+| NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access |
+| daal4py Version | Use the latest version for CatBoost support, missing values support, and performance improvements |


Do you know the minimum version that supports these?
Since "latest" changes over time, it is usually better to specify the minimum required version so the README stays accurate in the future.
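On the NUMA row in this hunk: a minimal, Linux-only sketch of what pinning a worker process to one node can look like from Python. It assumes the standard sysfs layout under `/sys/devices/system/node/`; `parse_cpulist` and `pin_to_numa_node` are hypothetical helper names. Note that `os.sched_setaffinity` restricts CPU placement only, so `numactl --cpunodebind=0 --membind=0` remains the usual choice when memory binding is also required.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_numa_node(node=0):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus

print(sorted(parse_cpulist("0-3,8-11")))
```

Each worker in a multi-process setup would call `pin_to_numa_node` with its own node index before loading the model.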

rsiyer-intel · 2026-05-18T19:23:42Z

+
+[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using Intel® oneAPI Data Analytics Library (oneDAL) via its Python interface, `daal4py`.
+
+By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.


The text claims up to 36x faster inference, but the performance table below doesn't show that.

rsiyer-intel · 2026-05-20T22:59:23Z

Since the latest changes still contain perf data, this cannot be approved until the perf claim prerequisites are fulfilled.

razdoburdin · 2026-05-26T07:44:39Z

+Install `daal4py` from PyPI:
+
+```bash
+pip install daal4py


daal4py isn't updated anymore. scikit-learn-intelex should be used

razdoburdin · 2026-05-26T07:45:06Z

+Or from conda-forge:
+
+```bash
+conda install -c conda-forge daal4py --override-channels


same as R45

david-cortes-intel

General comment: this guide says 'xgboost', but it is limited to predictions/inference, while a similar guide could also be done for training, covering details like threading, hyperparameters to try, and similar.

david-cortes-intel · 2026-05-26T07:36:28Z

+
+[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.
+
+By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.


There will be new numbers coming up soon. Note that some of the optimizations in daal4py have been upstreamed to xgboost, so the speed up will not be as large.

CC @razdoburdin

david-cortes-intel · 2026-05-26T07:37:20Z

+
+## Introduction
+
+[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.


Note that daal4py does not cover all classes of possible XGBoost models. The guide could also suggest other frameworks, such as ONNX.

david-cortes-intel · 2026-05-26T07:38:35Z

+## Prerequisites
+
+- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)
+- Python 3.9 or higher


Current versions of the scikit-learn-intelex package (which provides the daal4py module) require python>=3.10. I would reduce this to just 'a recent Python version' or 'a Python version supported by scikit-learn-intelex' (could link here: https://github.com/intel/scikit-learn-intelex)

david-cortes-intel · 2026-05-26T07:39:04Z

+
+## Prerequisites
+
+- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)


I don't think AVX512 is used by daal4py GBT inference (CC @razdoburdin).

david-cortes-intel · 2026-05-26T07:39:38Z

+
+- Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support)
+- Python 3.9 or higher
+- XGBoost installed (`xgboost` package)


Suggested change:

-- XGBoost installed (`xgboost` package)
+- XGBoost installed (`xgboost` python package from PyPI, or `py-xgboost` from conda-forge)

david-cortes-intel · 2026-05-26T07:40:16Z

+Install `daal4py` from PyPI:
+
+```bash
+pip install daal4py


This is outdated. Module daal4py is provided through package scikit-learn-intelex:
https://uxlfoundation.github.io/scikit-learn-intelex/latest/about_daal4py.html
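A quick way to confirm the point above in a given environment, as a sketch using only the standard library. `daal4py_status` is a hypothetical helper name; it reports whether the `daal4py` module is importable and which `scikit-learn-intelex` version, if any, is installed.

```python
from importlib import metadata, util

def daal4py_status():
    """Return (daal4py importable?, installed scikit-learn-intelex version or None)."""
    has_module = util.find_spec("daal4py") is not None
    try:
        sklearnex_version = metadata.version("scikit-learn-intelex")
    except metadata.PackageNotFoundError:
        sklearnex_version = None
    return has_module, sklearnex_version

has_module, sklearnex_version = daal4py_status()
print(f"daal4py importable: {has_module}, scikit-learn-intelex: {sklearnex_version}")
```

If the module is importable but no `scikit-learn-intelex` version is reported, the environment is likely still using the old standalone package.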

david-cortes-intel · 2026-05-26T07:41:44Z

+import daal4py as d4p
+
+# Using the lower-level API for more control
+daal_model = d4p.get_gbt_model_from_xgboost(clf.get_booster())


Please do not suggest that users do this. The higher-level interface described in the documentation should provide everything needed:
https://uxlfoundation.github.io/scikit-learn-intelex/latest/model_builders.html
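For reference, a minimal sketch of the higher-level interface from that page. It assumes `model` is a trained model of a type the converter supports and that `scikit-learn-intelex` is installed; `predict_with_onedal` is a hypothetical wrapper name, and the import is deferred so the sketch loads even without the package.

```python
def predict_with_onedal(model, X):
    """Convert a trained gradient boosting model to oneDAL and predict with it."""
    import daal4py as d4p  # module provided by the scikit-learn-intelex package

    onedal_model = d4p.mb.convert_model(model)  # high-level model builder
    return onedal_model.predict(X)

# Usage, assuming `clf` is a fitted XGBoost model and `X_test` a NumPy array:
# y_pred = predict_with_onedal(clf, X_test)
```

In a serving path the conversion would be done once at startup and the converted model reused, rather than converting on every call as this wrapper does.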

david-cortes-intel · 2026-05-26T07:42:11Z

+| PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x |
+| Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x |
+
+**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2


These are rather outdated versions.

david-cortes-intel · 2026-05-26T07:43:40Z

+| Data Type | Use `float32` for maximum throughput; `float64` is also supported |
+| Batch Size | oneDAL performs well across batch sizes, with the largest advantage at batch size = 1 (online inference) |
+| NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access |
+| daal4py Version | Use daal4py 2023.2 or newer (required for missing values support). Each release includes additional optimizations and bug fixes, so the latest version is recommended |


Please try to avoid hard-coding versions that will become outdated in guides like these.

david-cortes-intel · 2026-05-26T07:44:44Z

+
+#### Memory Allocator
+
+Alternative memory allocators such as jemalloc or tcmalloc can sometimes improve performance over the default glibc malloc. It is recommended to test with these enabled to see if either provides a benefit for your workload:


This would not be used by the libraries involved in this guide (daal4py, numpy).

razdoburdin · 2026-05-26T07:48:57Z

+| PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x |
+| Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x |
+
+**Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2


Software versions are outdated by several years.
Consider updating the measurements with current versions.

razdoburdin

Please update the installation instructions and consider switching to current versions of the software.

napetrov · 2026-05-27T15:07:55Z

+
+[XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`.
+
+By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware.


If this section is about conversion to oneDAL, note that conversion covers not only XGBoost but LightGBM and CatBoost as well, so the section should probably be reframed.

As for XGBoost itself, contributions to upstream XGBoost are in progress.

Added the material for XGBoost optimization

fa838d5

rsiyer-intel reviewed May 18, 2026


rsiyer-intel requested changes May 18, 2026


rsiyer-intel requested a review from adgubrud May 18, 2026 19:29

bbhattar and others added 2 commits May 19, 2026 21:05

Fixed most of the comments on PR except the result data

6bd1b6b

Merge branch 'intel:main' into xgboost

71513f0

razdoburdin reviewed May 26, 2026


david-cortes-intel suggested changes May 26, 2026


razdoburdin reviewed May 26, 2026


napetrov reviewed May 27, 2026



