Added the material for XGBoost optimization #30
|
|
||
| ## Introduction | ||
|
|
||
| [XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using Intel® oneAPI Data Analytics Library (oneDAL) via its Python interface, `daal4py`. |
You have a references section, but it is worth making the first occurrence of "Intel® oneAPI Data Analytics Library (oneDAL)" a link
to https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html or to the GitHub page, whichever you think is better.
There was a problem hiding this comment.
Note that this product was transferred to the UXL Foundation, so it no longer has Intel in the name.
Also, if you want to add a link, please use this one instead, which is up to date:
http://uxlfoundation.github.io/oneDAL/
|
|
||
| **Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2 | ||
|
|
||
| **Hardware:** Intel® Xeon® Platinum 8592+ (Emerald Rapids), 2S/64C/128T per socket, HT On, 503 GB DDR5, single NUMA node |
Write this as "2 sockets, 64 cores/socket, 256 threads", as you had it in line 186. That form is less ambiguous.
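The totals in the suggested wording can be checked with simple arithmetic. The following is a minimal sketch, assuming 2 hardware threads per core with Hyper-Threading on; the function name is illustrative and does not come from the guide.

```python
def total_logical_threads(sockets: int, cores_per_socket: int, threads_per_core: int = 2) -> int:
    """Return the number of logical CPUs for the whole system."""
    return sockets * cores_per_socket * threads_per_core


# 2 sockets x 64 cores/socket x 2 threads/core (HT on)
print(total_logical_threads(2, 64))  # 256
```

Stating sockets, cores per socket, and total threads separately avoids the question of whether "128T" is per socket or per system.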
| - **Thread scaling is sub-linear** — using 4x the cores in a single process yields only **2.1x** throughput, because cross-socket memory coherency traffic limits scaling. | ||
| - **The tradeoff is latency**: thread scaling achieves **lower per-request latency** (1,230 us at 128 cores) because all cores collaborate on each prediction. Process scaling maintains a fixed latency (~2,000 us per worker, 32 cores each) but delivers **higher aggregate throughput**. | ||
|
|
||
| #### Hyperthreading Hurts Performance |
| #### Hyperthreading Hurts Performance | |
| #### Hyper-Threading Hurts Performance |
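The process-scaling bullet quoted in this hunk (one worker per 32 cores) could be set up roughly as follows. This is a sketch under stated assumptions, not the guide's method: it assumes logical CPU IDs `0..N-1` map to distinct physical cores, which depends on the system (check `lscpu`), and the helper names `partition_cores` and `pin_current_process` are hypothetical.

```python
import os


def partition_cores(physical_cores: int, cores_per_worker: int) -> list[list[int]]:
    """Split core IDs 0..physical_cores-1 into consecutive groups, one per worker."""
    if physical_cores <= 0 or cores_per_worker <= 0 or physical_cores % cores_per_worker != 0:
        raise ValueError("physical_cores must be a positive multiple of cores_per_worker")
    return [
        list(range(start, start + cores_per_worker))
        for start in range(0, physical_cores, cores_per_worker)
    ]


def pin_current_process(core_ids: list[int]) -> bool:
    """Pin the calling process to core_ids; return False where the API is unavailable."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return False
    os.sched_setaffinity(0, core_ids)
    return True


groups = partition_cores(128, 32)
print(len(groups), groups[0][0], groups[0][-1])  # 4 0 31
```

Each worker process would call `pin_current_process` with its own group before loading the model, so that no two workers share a core.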
| |:--------|:---------------| | ||
| | Data Format | Use NumPy contiguous arrays (`np.ascontiguousarray()`) as input for best performance | | ||
| | Data Type | Use `float32` for maximum throughput; `float64` is also supported | | ||
| | Batch Size | oneDAL excels at all batch sizes, with the largest advantage at batch size = 1 (online inference) | |
Consider changing to:
oneDAL performs well across batch sizes, with the largest advantage at batch size = 1
| # model = trained XGBoost, LightGBM, or CatBoost model | ||
| # X_test = numpy float32 test array | ||
|
|
||
| # Convert the model (one line, works for all three frameworks) |
Clear, but maybe name them explicitly once:
“works for XGBoost, LightGBM, and CatBoost”
|
|
||
| #### Hyperthreading Hurts Performance | ||
|
|
||
| daal4py's AVX-512 vectorized tree traversal is backend-bound — whether the bottleneck is core execution units or memory bandwidth, adding hyperthreads increases resource contention on the shared physical core, harming performance. |
Link "backend-bound" to https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html
"Backend-bound" is understandable to performance engineers, but for a broader README audience it would be good to explain what it means.
| | Data Type | Use `float32` for maximum throughput; `float64` is also supported | | ||
| | Batch Size | oneDAL excels at all batch sizes, with the largest advantage at batch size = 1 (online inference) | | ||
| | NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access | | ||
| | daal4py Version | Use the latest version for CatBoost support, missing values support, and performance improvements | |
Do you know the minimum version that supports these features?
Since "latest" changes over time, it's usually better to specify the minimum required version so the README stays accurate in the future.
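A minimum-version requirement can also be checked at runtime instead of relying on prose alone. The sketch below uses only the standard library. The helper names are hypothetical, the comparison handles plain dotted numeric versions only, and the `2023.2` value is the one given in a later revision quoted further down in this conversation.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading dotted numeric components, e.g. '2024.7.0' -> (2024, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_meets_minimum(distribution: str, minimum: str) -> bool:
    """Return False when the distribution is not installed at all."""
    try:
        return meets_minimum(metadata.version(distribution), minimum)
    except metadata.PackageNotFoundError:
        return False


print(meets_minimum("2024.7", "2023.2"))  # True
```

The distribution name passed to `installed_meets_minimum` depends on how the module was installed, so it is left as a parameter here.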
|
|
||
| [XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using Intel® oneAPI Data Analytics Library (oneDAL) via its Python interface, `daal4py`. | ||
|
|
||
| By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware. |
The text claims "up to 36x faster", but the performance table below doesn't show that.
|
Since the latest changes still include perf data, this cannot be approved until the perf claim prerequisites are fulfilled.
| Install `daal4py` from PyPI: | ||
|
|
||
| ```bash | ||
| pip install daal4py |
daal4py isn't updated anymore. scikit-learn-intelex should be used instead.
| Or from conda-forge: | ||
|
|
||
| ```bash | ||
| conda install -c conda-forge daal4py --override-channels |
david-cortes-intel
left a comment
General comment: this guide says "XGBoost", but it is limited to prediction/inference. A similar guide could also be written for training, covering details such as threading and hyperparameters to try.
|
|
||
| [XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`. | ||
|
|
||
| By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware. |
New numbers will be coming soon. Note that some of the optimizations in daal4py have been upstreamed to XGBoost, so the speedup will not be as large.
CC @razdoburdin
|
|
||
| ## Introduction | ||
|
|
||
| [XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`. |
Note that daal4py does not cover all classes of possible XGBoost models. The guide could also suggest other frameworks such as ONNX.
| ## Prerequisites | ||
|
|
||
| - Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support) | ||
| - Python 3.9 or higher |
Current versions of the scikit-learn-intelex package (which provides the daal4py module) require python>=3.10. I would reduce this to just "recent Python version" or "Python version supported by scikit-learn-intelex" (could link here: https://github.com/intel/scikit-learn-intelex)
|
|
||
| ## Prerequisites | ||
|
|
||
| - Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support) |
I don't think AVX512 is used by daal4py GBT inference (CC @razdoburdin).
|
|
||
| - Intel® Xeon® Scalable Processor (2nd Generation or newer recommended for AVX-512 support) | ||
| - Python 3.9 or higher | ||
| - XGBoost installed (`xgboost` package) |
| - XGBoost installed (`xgboost` package) | |
| - XGBoost installed (`xgboost` python package from PyPI, or `py-xgboost` from conda-forge) |
| Install `daal4py` from PyPI: | ||
|
|
||
| ```bash | ||
| pip install daal4py |
This is outdated. The daal4py module is provided through the scikit-learn-intelex package:
https://uxlfoundation.github.io/scikit-learn-intelex/latest/about_daal4py.html
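After installing the scikit-learn-intelex package, a quick way to confirm that the `daal4py` module is present is to look it up without importing it. This is a minimal sketch using only the standard library; `module_available` is a hypothetical helper name and is meant for top-level module names only.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if module_available("daal4py"):
    print("daal4py module found")
else:
    print("daal4py module not found; install the scikit-learn-intelex package")
```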
| import daal4py as d4p | ||
|
|
||
| # Using the lower-level API for more control | ||
| daal_model = d4p.get_gbt_model_from_xgboost(clf.get_booster()) |
Please do not suggest that users do this. The higher-level interface described in the documentation should provide everything needed:
https://uxlfoundation.github.io/scikit-learn-intelex/latest/model_builders.html
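For comparison, the higher-level route might look like the sketch below. It assumes the `convert_model` entry point in `daal4py.mb` as described on the linked model-builders page; check that page for the exact interface in your installed version. The wrapper name `convert_and_predict` is hypothetical, and the imports are local so the helper can be defined even where the packages are not installed.

```python
def convert_and_predict(model, X):
    """Convert a trained gradient-boosting model with daal4py and predict on X."""
    import numpy as np
    from daal4py.mb import convert_model

    # C-contiguous float32 input, following the guide's recommendations table.
    X = np.ascontiguousarray(X, dtype=np.float32)
    # High-level conversion; no need to call get_gbt_model_from_xgboost directly.
    d4p_model = convert_model(model)
    return d4p_model.predict(X)


# Usage, assuming `clf` is a trained model and `X_test` is a test array:
# preds = convert_and_predict(clf, X_test)
```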
| | PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x | | ||
| | Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x | | ||
|
|
||
| **Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2 |
These are rather outdated versions.
| | Data Type | Use `float32` for maximum throughput; `float64` is also supported | | ||
| | Batch Size | oneDAL performs well across batch sizes, with the largest advantage at batch size = 1 (online inference) | | ||
| | NUMA | For multi-socket systems, pin processes to a single NUMA node to minimize cross-socket memory access | | ||
| | daal4py Version | Use daal4py 2023.2 or newer (required for missing values support). Each release includes additional optimizations and bug fixes, so the latest version is recommended | |
Please try to avoid hard-coding versions that will become outdated in guides like these.
|
|
||
| #### Memory Allocator | ||
|
|
||
| Alternative memory allocators such as jemalloc or tcmalloc can sometimes improve performance over the default glibc malloc. It is recommended to test with these enabled to see if either provides a benefit for your workload: |
This would not be used by the libraries involved in this guide (daal4py, numpy).
| | PLAsTiCC | 200,000 | 60 | Classification | 2.81x | 6.50x | 1.11x | | ||
| | Airline | 26,969 | 7 | Classification | 1.73x | 3.55x | 10.01x | | ||
|
|
||
| **Software versions:** XGBoost 2.1.4, LightGBM 4.6.0, CatBoost 1.2.10, daal4py 2024.7, Python 3.10.12, scikit-learn 1.5.2 |
The software versions are several years out of date.
Consider rerunning the measurements with current versions.
razdoburdin
left a comment
Please update the installation instructions and consider switching to current versions of the software.
|
|
||
| [XGBoost](https://xgboost.readthedocs.io/) is one of the most popular and efficient gradient boosting frameworks for classification and regression tasks on tabular data. This guide covers techniques to significantly accelerate XGBoost inference on Intel® Xeon® processors using [Intel® oneAPI Data Analytics Library (oneDAL)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) via its Python interface, `daal4py`. | ||
|
|
||
| By converting trained XGBoost models to oneDAL, you can achieve **up to 36x faster inference** with no loss in prediction quality and minimal code changes. oneDAL leverages Intel® Advanced Vector Extensions 512 (AVX-512) and optimized memory access patterns to maximize performance on Intel hardware. |
Conversion to oneDAL covers not only XGBoost but LightGBM and CatBoost as well, so if this section is about conversions, it should probably be reframed.
As for XGBoost itself, contributions to upstream XGBoost are in progress.
Added the materials for XGBoost optimization. Please review and give me your feedback.