From 4672221acd4da1a0848b084262e0449c90daa240 Mon Sep 17 00:00:00 2001 From: lucasmelogithub Date: Tue, 5 May 2026 13:31:24 -0500 Subject: [PATCH 1/8] Initial vLLM Recipe --- README.md | 1 + software/vllm/README.md | 465 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 466 insertions(+) create mode 100644 software/vllm/README.md diff --git a/README.md b/README.md index bffdb79..fd0bd37 100644 --- a/README.md +++ b/README.md @@ -44,6 +44,7 @@ We aim to provide a dynamic resource where users can find the latest optimizatio - [ResNet50 – Computer Vision](software/tensorflow/computer-vision-resnet50/README.md) - [BERT – NLP](software/tensorflow/nlp-transformers-bert/README.md) - [RGAT – Graph Neural Networks](software/tensorflow/graph-neural-networks-rgat/README.md) + - [vLLM](software/vllm/README.md) - Workloads - [Cassandra Stress](workloads/cassandra-stress/README.md) - [HPC](workloads/hpc/README.md) diff --git a/software/vllm/README.md b/software/vllm/README.md new file mode 100644 index 0000000..d89a53a --- /dev/null +++ b/software/vllm/README.md @@ -0,0 +1,465 @@ +# vLLM on Intel Xeon with Intel AMX + +This recipe shows how to configure vLLM for CPU inference and serving on Intel Xeon processors with Intel Advanced Matrix Extensions (Intel AMX). It focuses on the practical settings that most affect performance: BF16 execution, CPU thread binding, NUMA placement, KV cache sizing, batch limits, and model selection. + +The goal is not to replace the vLLM documentation. Use the official vLLM CPU installation guide for package-specific setup details, then use this recipe to choose a high-performance Intel Xeon configuration. 
+ +## Table of Contents + +- [vLLM on Intel Xeon with Intel AMX](#vllm-on-intel-xeon-with-intel-amx) + - [Table of Contents](#table-of-contents) + - [Overview](#overview) + - [Recommended Baseline](#recommended-baseline) + - [Why Intel Xeon for vLLM](#why-intel-xeon-for-vllm) + - [Prerequisites](#prerequisites) + - [Docker Quick Start](#docker-quick-start) + - [Quick Start](#quick-start) + - [AMX and BF16 Configuration](#amx-and-bf16-configuration) + - [CPU Threading and NUMA](#cpu-threading-and-numa) + - [Default Recommendation](#default-recommendation) + - [Manual Binding Example](#manual-binding-example) + - [NUMA Checklist](#numa-checklist) + - [KV Cache and Memory Sizing](#kv-cache-and-memory-sizing) + - [Xeon 6 for SLM Inference](#xeon-6-for-slm-inference) + - [Large Models on Xeon Memory Capacity](#large-models-on-xeon-memory-capacity) + - [Tuning Reference](#tuning-reference) + - [Validation and Benchmarking](#validation-and-benchmarking) + - [Functional Validation](#functional-validation) + - [Placement Validation](#placement-validation) + - [Benchmark Sweep](#benchmark-sweep) + - [Example Benchmark Command](#example-benchmark-command) + - [Troubleshooting](#troubleshooting) + - [FAQ](#faq) + - [What is the minimum vLLM version for Intel Xeon AMX deployments?](#what-is-the-minimum-vllm-version-for-intel-xeon-amx-deployments) + - [Should I use BF16 or FP16 on CPU?](#should-i-use-bf16-or-fp16-on-cpu) + - [How much KV cache should I allocate?](#how-much-kv-cache-should-i-allocate) + - [Should tensor parallel size always equal socket count?](#should-tensor-parallel-size-always-equal-socket-count) + - [When should I use quantization?](#when-should-i-use-quantization) + - [Disclaimer](#disclaimer) + - [References](#references) + +## Overview + +vLLM supports model inferencing and OpenAI-compatible serving on x86 CPUs with FP32, FP16, and BF16. 
On Intel Xeon processors that expose Intel AMX BF16 instructions, BF16 is the preferred dtype because it reduces memory traffic and enables matrix kernels designed for modern Xeon CPUs. + +Use this recipe when you want to: + +- Serve small language models (SLMs) without a discrete accelerator. +- Host models or context lengths that benefit from the larger DRAM capacity available on CPU servers. +- Run inference close to CPU-resident data pipelines, vector databases, or enterprise services. +- Tune vLLM CPU deployments beyond a default install. + +## Recommended Baseline + +| Item | Recommendation | +| --- | --- | +| vLLM version | Use vLLM `0.17.0` or newer as the minimum packaged x86 CPU baseline. vLLM CPU release wheels for x86 with AVX512/AVX2 are available starting with `0.17.0`; prefer the latest stable release for the newest CPU and AMX kernel work. | +| CPU | Intel Xeon 6 recommended as of May 2026. Intel Xeon 4th Gen or newer with `amx_tile` and `amx_bf16` CPU flags. | +| dtype | Use `--dtype=bfloat16` for AMX-capable Xeon systems. | +| OS | Linux. | +| Python | Python 3.10 through 3.13, following the vLLM CPU installation guide. | +| Threading | Start with `VLLM_CPU_OMP_THREADS_BIND=auto` and reserve 1-2 CPU cores for the serving process. | +| Parallelism | On multi-socket systems, start with tensor parallel size equal to the number of NUMA nodes, except values that the current vLLM release does not support. | +| Memory | Size `VLLM_CPU_KVCACHE_SPACE` per NUMA node so model weight shards, KV cache, runtime workspace, and OS headroom all fit in local memory. | + + +## Why Intel Xeon for vLLM + +Intel Xeon is a strong fit for vLLM CPU deployments when memory capacity, deployment simplicity, and CPU locality matter as much as peak accelerator throughput. 
+ +| Use case | Why Xeon helps | Tuning priority | +| --- | --- | --- | +| SLM serving | 1B-8B parameter models can be served with BF16 on AMX-capable Xeon systems, often with enough memory left for a larger KV cache and co-located application services. | BF16, thread binding, small batch limits, reserved serving cores. | +| Large model capacity | CPU servers can be configured with hundreds of GiB to TiB-class DRAM, which can hold models, quantized weights, or long-context KV caches that may not fit in a single GPU's HBM. | NUMA-aware sharding, KV cache sizing, quantization, memory headroom. | +| Enterprise integration | CPU inference can run close to existing data services, retrieval pipelines, security controls, and orchestration tools. | Stable packaging, deterministic placement, observability, repeatable benchmarks. | +| Throughput-oriented batch jobs | Offline inference can trade latency for throughput by increasing batch limits and using more sockets or NUMA nodes. | `--max-num-batched-tokens`, `--max-num-seqs`, DP/TP/PP, KV cache. | + +## Prerequisites + +Verify the platform before tuning vLLM. + +```bash +lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags" +lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16" +numactl --hardware +``` + +Expected CPU flags for the AMX BF16 path include: + +- `amx_tile` +- `amx_bf16` +- `avx512_bf16` +- `amx_int8` + +Install vLLM by following the official CPU installation guide. For release-wheel deployments, use vLLM `0.17.0` or newer and install the CPU wheel variant. 
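
The flag check can also be scripted, for example in a container entrypoint. The following is a minimal sketch and not part of vLLM: `missing_amx_flags` and `EXPECTED_FLAGS` are hypothetical names, and the script assumes a Linux host that exposes `/proc/cpuinfo`.

```python
from pathlib import Path

# Flags listed in the Prerequisites section of this recipe.
EXPECTED_FLAGS = ("amx_tile", "amx_bf16", "amx_int8", "avx512_bf16")


def missing_amx_flags(cpuinfo_text: str) -> list[str]:
    """Return the expected flags that are absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            present = set(line.split(":", 1)[1].split())
            return [flag for flag in EXPECTED_FLAGS if flag not in present]
    # No flags line found (for example, a non-x86 or non-Linux host).
    return list(EXPECTED_FLAGS)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    missing = missing_amx_flags(text)
    if missing:
        print("Missing flags:", ", ".join(missing))
    else:
        print("All expected flags present")
```

A non-empty result means the AMX BF16 path described in this recipe is unlikely to be available on that host.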
+ + +For wheel-based installs, TCMalloc and Intel OpenMP must be preloaded before running vLLM: + +```bash +# Install TCMalloc (Intel OpenMP is bundled with the vLLM CPU wheel) +sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4 + +# Locate the libraries +TC_PATH=$(find /usr -name 'libtcmalloc_minimal.so.4' | head -1) +IOMP_PATH=$(python -c "import intel_openmp, os; print(os.path.join(os.path.dirname(intel_openmp.__file__), 'lib', 'libiomp5.so'))" 2>/dev/null \ + || find / -name 'libiomp5.so' 2>/dev/null | head -1) + +# Preload them for every vLLM session +export LD_PRELOAD="${TC_PATH}:${IOMP_PATH}:${LD_PRELOAD}" +``` + +Skipping `LD_PRELOAD` can silently degrade throughput. Add the export to your shell profile or container entrypoint. + +After installation, collect the runtime environment: + +```bash +vllm collect-env +``` + +If PyTorch exposes the AMX helper in your environment, this quick check can confirm that the runtime sees AMX tile support: + +```bash +python - <<'PY' +import torch + +checker = getattr(torch.cpu, "_is_amx_tile_supported", None) +print("AMX tile supported:", checker() if checker else "not reported by this PyTorch build") +PY +``` + +### Docker Quick Start + +vLLM publishes pre-built CPU Docker images. Pull the latest x86_64 CPU image: + +```bash +docker pull vllm/vllm-openai-cpu:latest-x86_64 +``` + +Then run it with the environment variables from above: + +```bash +docker run --rm \ + --security-opt seccomp=unconfined \ + --cap-add SYS_NICE \ + --shm-size=4g \ + -p 8000:8000 \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + -e VLLM_CPU_KVCACHE_SPACE=20 \ + -e HF_TOKEN="${HF_TOKEN}" \ + vllm/vllm-openai-cpu:latest-x86_64 \ + Qwen/Qwen3-4B \ + --dtype=bfloat16 \ + --max-num-batched-tokens 2048 \ + --max-num-seqs 64 +``` + +Note: `--security-opt seccomp=unconfined` and `--cap-add SYS_NICE` are needed for NUMA memory policy calls inside the container. 
Omitting them may produce `get_mempolicy: Operation not permitted` warnings. + +## Quick Start + +Start with a CPU-validated SLM, BF16, automatic NUMA-aware thread binding, and a conservative KV cache size. Increase memory and batch settings only after the baseline is stable. + +```bash +export VLLM_CPU_KVCACHE_SPACE=20 +export VLLM_CPU_OMP_THREADS_BIND=auto +export VLLM_CPU_NUM_OF_RESERVED_CPU=1 + +vllm serve Qwen/Qwen3-4B \ + --device cpu \ + --dtype=bfloat16 \ + --max-num-batched-tokens 2048 \ + --max-num-seqs 64 +``` + +Send a test request: + +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-4B", + "messages": [{"role": "user", "content": "Give three tips for CPU inference tuning."}], + "max_tokens": 128 + }' +``` + +For a multi-NUMA system, start by matching tensor parallel size to the NUMA node count: + +```bash +NUMA_NODES=$(lscpu | awk '/NUMA node\(s\):/ {print $3}') + +export VLLM_CPU_KVCACHE_SPACE=40 +export VLLM_CPU_OMP_THREADS_BIND=auto +export VLLM_CPU_NUM_OF_RESERVED_CPU=1 + +vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --device cpu \ + --dtype=bfloat16 \ + --tensor-parallel-size "${NUMA_NODES}" \ + --max-num-batched-tokens $((2048 * NUMA_NODES)) \ + --max-num-seqs $((128 * NUMA_NODES)) +``` + +Before using a generated `NUMA_NODES` value, check the vLLM CPU documentation for currently unsupported tensor parallel sizes. For example, the current CPU guide notes that `tensor-parallel-size=6` is not supported. + + +## AMX and BF16 Configuration + +For AMX-capable Intel Xeon CPUs, the most important vLLM decision is to use BF16 explicitly: + +```bash +vllm serve --device cpu --dtype=bfloat16 +``` + +Why this matters: + +- BF16 is the recommended CPU dtype when FP16 behavior is unstable or slower on CPU. +- BF16 reduces memory traffic compared with FP32. +- AMX BF16 kernels accelerate matrix-heavy LLM operations when the CPU, PyTorch, and vLLM CPU backend can use them. 
+ +Optional small-batch kernel path: + +```bash +export VLLM_CPU_SGL_KERNEL=1 +``` + +Use `VLLM_CPU_SGL_KERNEL=1` only after the baseline works. It is x86-only and experimental. The vLLM CPU guide states that it requires AMX, BF16 weights, and weight shapes divisible by 32. It is aimed at low-latency online serving with small batches, so validate it per model and workload before using it in production. + +## CPU Threading and NUMA + +The CPU backend is sensitive to where OpenMP threads run and where memory is allocated. Start with automatic binding, then move to manual binding if utilization or latency is uneven. + +### Default Recommendation + +```bash +export VLLM_CPU_OMP_THREADS_BIND=auto +export VLLM_CPU_NUM_OF_RESERVED_CPU=1 +``` + +`auto` binds OpenMP threads for each rank to CPU cores in NUMA nodes. Reserving one or two cores prevents the vLLM API server, tokenizer work, networking, logging, and operating system tasks from competing with inference threads. + +### Manual Binding Example + +Use manual binding when you need repeatability or when `htop` shows threads crossing NUMA nodes unexpectedly. + +```bash +export VLLM_CPU_OMP_THREADS_BIND="0-55|56-111" +export VLLM_CPU_KVCACHE_SPACE=40 + +vllm serve \ + --device cpu \ + --dtype=bfloat16 \ + --tensor-parallel-size 2 +``` + +In this example, rank 0 uses CPU cores `0-55` and rank 1 uses CPU cores `56-111`. Adjust the ranges to physical cores from the same NUMA node. Avoid spreading a single rank across sockets unless you have measured that it helps your workload. + +### NUMA Checklist + +- Use `numactl --hardware` to identify NUMA nodes and memory per node. +- Keep each tensor-parallel or pipeline-parallel rank within one NUMA node when possible. +- Use `CPU_VISIBLE_MEMORY_NODES` to mask or reorder NUMA memory nodes when using automatic binding. +- Watch CPU placement with `htop` or `perf stat` during warmup and benchmark runs. 
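
When writing a manual binding by hand, it can help to generate the string from the per-node core ranges reported by `numactl --hardware`. The following is a minimal sketch, assuming each NUMA node exposes one contiguous range of physical cores. `build_bind_string` is a hypothetical helper, not a vLLM API, and it only emits the `first-last|first-last` form used in the manual binding example.

```python
def build_bind_string(
    node_core_ranges: list[tuple[int, int]],
    reserved_per_rank: int = 0,
) -> str:
    """Build a per-rank binding string from inclusive (first, last) core ranges.

    One rank is assumed per NUMA node. The last `reserved_per_rank` cores of
    each range are left out so the API server and OS tasks can use them.
    """
    if reserved_per_rank < 0:
        raise ValueError("reserved_per_rank must be non-negative")

    specs = []
    for first, last in node_core_ranges:
        last_usable = last - reserved_per_rank
        if last_usable < first:
            raise ValueError(f"range {first}-{last} has no cores left to bind")
        specs.append(f"{first}-{last_usable}")
    # Ranks are separated by '|', matching the manual binding example.
    return "|".join(specs)


if __name__ == "__main__":
    # Two NUMA nodes with 56 physical cores each, one core reserved per rank.
    print(build_bind_string([(0, 55), (56, 111)], reserved_per_rank=1))
```

Quote the resulting value when exporting it in a shell, because an unquoted `|` is interpreted as a pipe.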
+ +## KV Cache and Memory Sizing + +`VLLM_CPU_KVCACHE_SPACE` is specified in GiB and applies to each CPU worker/rank. Larger values allow more concurrent requests and longer contexts, but the allocation must fit in the local memory budget for each NUMA node. + +Use this sizing rule for each rank: + +```text +local NUMA memory > model weight shard + VLLM_CPU_KVCACHE_SPACE + runtime workspace + OS headroom +``` + +Estimate BF16 model weight memory as: + +```text +weight shard GiB ~= model parameters * 2 bytes / tensor_parallel_size / 2^30 +``` + +Then leave headroom for activation buffers, tokenizer/server processes, page cache, framework overhead, and other colocated services. A practical starting point is to reserve at least 10-20% of each NUMA node's memory instead of assigning all free memory to KV cache. + +Examples: + +| Scenario | Starting point | Why | +| --- | --- | --- | +| SLM, low concurrency | `VLLM_CPU_KVCACHE_SPACE=10` to `20` | Keeps memory pressure low while validating BF16 and thread placement. | +| SLM, higher concurrency | `VLLM_CPU_KVCACHE_SPACE=20` to `40` | Supports more simultaneous sessions and longer prompts. | +| 8B-class model on a large-memory node | `VLLM_CPU_KVCACHE_SPACE=40` or higher | Uses Xeon DRAM capacity for larger batches or context lengths. | +| Multi-NUMA tensor parallel | Size per NUMA node | Each rank needs local memory for its weight shard plus its KV cache. | + +If the worker exits with code 9 or the process is killed by the OOM killer, reduce `VLLM_CPU_KVCACHE_SPACE`, reduce batch limits, lower tensor-parallel pressure per node, or use a smaller/quantized model. + +## Xeon 6 for SLM Inference + +Intel Xeon 6 systems that expose AMX BF16 are well suited for SLM inference because the models are small enough to keep memory pressure manageable while AMX accelerates BF16 matrix operations. 
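
Before choosing a model from this class, the per-rank sizing rule from the KV cache section can be checked numerically. The following is a minimal sketch: the function names, the 4 GiB workspace default, and the 15% headroom default are illustrative assumptions (the headroom sits inside the 10-20% range suggested in that section), not values taken from vLLM.

```python
GIB = 2**30


def bf16_weight_shard_gib(num_params: float, tensor_parallel_size: int) -> float:
    """Estimate the BF16 weight shard per rank: params * 2 bytes / TP / 2^30."""
    return num_params * 2 / tensor_parallel_size / GIB


def fits_in_numa_node(
    local_mem_gib: float,
    num_params: float,
    tensor_parallel_size: int,
    kv_cache_gib: float,
    workspace_gib: float = 4.0,      # assumed runtime workspace, measure your own
    headroom_fraction: float = 0.15,  # assumed OS and page cache headroom
) -> bool:
    """Apply: local memory > weight shard + KV cache + workspace + headroom."""
    needed = (
        bf16_weight_shard_gib(num_params, tensor_parallel_size)
        + kv_cache_gib
        + workspace_gib
        + headroom_fraction * local_mem_gib
    )
    return local_mem_gib > needed


if __name__ == "__main__":
    # 8B-parameter model, two ranks, VLLM_CPU_KVCACHE_SPACE=40, 128 GiB per node.
    shard = bf16_weight_shard_gib(8e9, 2)
    print(f"weight shard per rank: {shard:.2f} GiB")
    print("fits:", fits_in_numa_node(128, 8e9, 2, kv_cache_gib=40))
```

With these assumptions, an 8B-parameter BF16 model split across two ranks needs roughly 7.5 GiB of weights per rank, so the KV cache setting usually dominates the per-node budget for SLMs.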
+ +Good first models from the vLLM CPU-validated model list include: + +| Model | Typical use | Why start here | +| --- | --- | --- | +| `Qwen/Qwen3-1.7B` | Very small assistant, routing, classification-style generation | Fast baseline for validating install, BF16, and thread binding. | +| `ibm-granite/granite-3.2-2b-instruct` | Enterprise assistant, summarization, RAG | Small enough for CPU serving experiments with room for KV cache. | +| `meta-llama/Llama-3.2-3B-Instruct` | General chat and instruction following | Common SLM shape with broad ecosystem support. | +| `Qwen/Qwen3-4B` | Higher quality SLM serving | Good step up after the 1B-3B baseline is stable. | +| `Qwen/Qwen3-8B` or `meta-llama/Llama-3.1-8B-Instruct` | Larger SLM or compact LLM serving | Useful for multi-NUMA tuning and memory-capacity validation. | + +For latency-sensitive SLM serving on Xeon 6: + +1. Use `--dtype=bfloat16`. +2. Start with `--max-num-seqs 32` to `64` and `--max-num-batched-tokens 1024` to `2048`. +3. Reserve one or two CPU cores per rank for serving overhead. +4. Validate the optional `VLLM_CPU_SGL_KERNEL=1` path only after the default path is stable. +5. Increase batch limits gradually while watching inter-token latency, time to first token, and CPU utilization. + +## Large Models on Xeon Memory Capacity + +Many GPU deployments are constrained by the memory capacity of a single accelerator. Intel Xeon servers can be configured with substantially larger system memory, which can be useful when the model, KV cache, or context length does not fit comfortably in accelerator memory. + +This does not make CPU inference universally faster than GPU inference. It changes the design space: + +- Use Xeon when capacity, cost, data locality, or CPU-only deployment constraints dominate. +- Use quantization to reduce model memory and memory bandwidth pressure. +- Use tensor parallelism across NUMA nodes when a model shard plus KV cache fits cleanly per node. 
+- Prefer smaller batch sizes for interactive latency and larger batch sizes for offline throughput. +- Avoid filling all DRAM with weights and KV cache; memory headroom is what keeps tail latency stable. + +For very large models, benchmark both BF16 and quantized variants. A quantized model may reduce memory traffic enough to improve throughput, but accuracy, prompt behavior, and supported kernels must be validated for your model. + +## Tuning Reference + +| Setting | What it controls | Starting point | Tune when | +| --- | --- | --- | --- | +| `--dtype=bfloat16` | Model compute dtype | Always use on AMX-capable Xeon unless a model requires otherwise. | Accuracy or compatibility issues appear. | +| `VLLM_CPU_KVCACHE_SPACE` | KV cache memory per CPU worker/rank, in GiB | `20` for SLMs; `40` or higher for larger models or concurrency. | You see preemption, OOM, low concurrency, or long-context failures. | +| `VLLM_CPU_OMP_THREADS_BIND` | OpenMP thread placement | `auto` | CPU utilization is uneven or threads cross NUMA nodes. | +| `VLLM_CPU_NUM_OF_RESERVED_CPU` | Cores reserved from OpenMP binding | `1` for small systems, `1-2` for serving workloads. | API server latency rises or CPU oversubscription appears. | +| `CPU_VISIBLE_MEMORY_NODES` | NUMA memory node visibility and order | Leave unset initially. | You need to mask NUMA nodes or control binding sequence. | +| `--tensor-parallel-size` | Weight sharding across ranks | Number of NUMA nodes, where supported. | Model shard plus KV cache does not fit per node or throughput scales poorly. | +| `--pipeline-parallel-size` | Layer partitioning across ranks | `1` initially. | Model is too large or TP alone does not fit cleanly. | +| `--data-parallel-size` | Independent replica count | `1` initially. | Throughput is limited and enough sockets/nodes are available. | +| `--max-num-batched-tokens` | Tokens allowed in one scheduler batch | Online: `2048 * world_size`; offline: `4096 * world_size`. 
| Time to first token or throughput misses the target. | +| `--max-num-seqs` | Sequences allowed in one scheduler batch | Online: `128 * world_size`; offline: `256 * world_size`. | Inter-token latency or output throughput misses the target. | +| `--block-size` | KV cache block granularity | Keep the default or use multiples of 32. | You are doing controlled CPU performance sweeps. | +| `VLLM_CPU_SGL_KERNEL` | Experimental small-batch optimized x86 kernels | `0` initially. | Low-latency SLM serving is stable and the model meets AMX/BF16/shape requirements. | + +`world_size` is the product of tensor, pipeline, and data parallel ranks used by the vLLM deployment. + +## Validation and Benchmarking + +Use repeatable validation before changing multiple knobs at once. + +### Functional Validation + +```bash +vllm collect-env +lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16" +numactl --hardware +``` + +Start the server with one known model and send a short request. Confirm that the response is correct before increasing batch size or parallelism. + +### Placement Validation + +While vLLM is serving traffic, check that inference threads stay on the intended cores: + +```bash +htop +``` + +For a scriptable check, run a short benchmark and record CPU, memory, and NUMA behavior with tools such as `perf stat`, `numastat`, or platform telemetry. + +### Benchmark Sweep + +For each model and hardware configuration, sweep these values independently: + +- `VLLM_CPU_KVCACHE_SPACE` +- `--max-num-batched-tokens` +- `--max-num-seqs` +- `--tensor-parallel-size` +- `VLLM_CPU_OMP_THREADS_BIND` +- quantized versus BF16 weights + +Track at least these metrics: + +- Time to first token (TTFT) +- Inter-token latency (ITL) +- Output tokens per second +- Requests per second +- Peak RSS and memory per NUMA node +- CPU utilization per socket +- Error rate and OOM events + +Use the vLLM benchmark CLI or the vLLM Benchmark Suite for repeatable comparisons. 
For CPU-supported models, the vLLM documentation points to CPU benchmark test cases that include optimized example configurations and dry-run command generation. + +### Example Benchmark Command + +Run a latency benchmark with the vLLM CLI: + +```bash +vllm bench latency \ + --model Qwen/Qwen3-4B \ + --input-len 256 \ + --output-len 128 \ + --batch-size 8 \ + --dtype bfloat16 \ + --device cpu +``` + +Or, from a [vLLM source checkout](https://github.com/vllm-project/vllm), use the Benchmark Suite dry-run to generate optimized serving commands for CPU models: + +```bash +ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 \ + MODEL_FILTER=Qwen/Qwen3-4B DTYPE_FILTER=bfloat16 \ + bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh +``` + +The generated `.commands` files in `./benchmark/results/` contain the full CLI invocations with optimized settings for each model. + +## Troubleshooting + +| Symptom | Likely cause | Fix | +| --- | --- | --- | +| Worker exits with code 9 or process is killed | Per-rank model shard plus KV cache exceeds NUMA memory. | Reduce `VLLM_CPU_KVCACHE_SPACE`, lower batch limits, use quantization, or change TP/PP layout. | +| CPU utilization is high but latency is poor | Oversubscription or API server competing with inference threads. | Reserve 1-2 cores with `VLLM_CPU_NUM_OF_RESERVED_CPU` or manual binding. | +| One socket is busy and another is idle | Thread binding or NUMA node visibility is wrong. | Use `VLLM_CPU_OMP_THREADS_BIND=auto`, set `CPU_VISIBLE_MEMORY_NODES`, or manually bind ranks. | +| TTFT is too high | Prefill batch is too large or model/context is too heavy. | Lower `--max-num-batched-tokens`, reduce prompt length, use a smaller model, or increase parallelism. | +| Inter-token latency is too high | Too many active sequences or insufficient compute per rank. | Lower `--max-num-seqs`, use a smaller SLM, tune TP/PP, or test `VLLM_CPU_SGL_KERNEL=1` where supported. 
| +| BF16 model is slower than expected | AMX not visible, unsupported CPU, wrong wheel/build, or poor binding. | Recheck CPU flags, `vllm collect-env`, PyTorch AMX helper, and thread placement. | +| Docker logs show NUMA permission warnings | Container lacks permissions needed by NUMA calls. | Use the vLLM CPU Docker guidance, including appropriate security options for your environment. | + +## FAQ + +### What is the minimum vLLM version for Intel Xeon AMX deployments? + +Use vLLM `0.17.0` or newer as the minimum packaged x86 CPU deployment baseline. The official CPU installation guide states that pre-built x86 CPU wheels with AVX512/AVX2 are available starting in `0.17.0`. AMX usage is then determined by the CPU flags, the installed CPU wheel or source build, PyTorch CPU capability detection, model dtype, and selected vLLM CPU kernels. Prefer the latest stable vLLM release when tuning AMX systems. + +### Should I use BF16 or FP16 on CPU? + +Use BF16. vLLM's CPU guide recommends explicitly setting `dtype=bfloat16` if FP16 has performance or accuracy issues on CPU, and BF16 is the natural dtype for AMX BF16 acceleration on Intel Xeon. + +### How much KV cache should I allocate? + +Allocate only what fits per NUMA node after model weights and headroom. Start with `20` GiB for SLMs, then increase gradually. For multi-rank deployments, remember that `VLLM_CPU_KVCACHE_SPACE` applies per CPU worker/rank. + +### Should tensor parallel size always equal socket count? + +Not always. It is a good first test when each socket maps cleanly to a NUMA node and the vLLM release supports that tensor parallel size. Use benchmarks to compare TP, PP, and DP layouts for your model. + +### When should I use quantization? + +Use quantization after you have a BF16 baseline. It is most valuable when memory capacity or memory bandwidth limits the deployment, or when a larger model needs to fit in available DRAM. 
+ +## Disclaimer + +Performance varies by use, configuration, and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/). No product or component can be absolutely secure. Intel technologies may require enabled hardware, software, or service activation. See [Legal Notices and Disclaimers](https://www.intel.com/LegalNoticesAndDisclaimers). + +## References + +- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) +- [vLLM CPU hardware-supported models for Intel Xeon](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) +- [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/) +- [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/) +- [vLLM CPU installation documentation source](https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/cpu.md) +- [vLLM GitHub repository](https://github.com/vllm-project/vllm) From 3dc0dc5cf7af11d95c95b7b527864776a6658bd1 Mon Sep 17 00:00:00 2001 From: lucasmelogithub Date: Wed, 6 May 2026 06:50:06 -0500 Subject: [PATCH 2/8] Enahnced Readme --- software/vllm/README.md | 170 ++++++---------------------------------- 1 file changed, 26 insertions(+), 144 deletions(-) diff --git a/software/vllm/README.md b/software/vllm/README.md index d89a53a..9965ff1 100644 --- a/software/vllm/README.md +++ b/software/vllm/README.md @@ -4,23 +4,22 @@ This recipe shows how to configure vLLM for CPU inference and serving on Intel X The goal is not to replace the vLLM documentation. Use the official vLLM CPU installation guide for package-specific setup details, then use this recipe to choose a high-performance Intel Xeon configuration. 
+vLLM CPU docs: + ## Table of Contents - [vLLM on Intel Xeon with Intel AMX](#vllm-on-intel-xeon-with-intel-amx) - [Table of Contents](#table-of-contents) - [Overview](#overview) - [Recommended Baseline](#recommended-baseline) - - [Why Intel Xeon for vLLM](#why-intel-xeon-for-vllm) - [Prerequisites](#prerequisites) - - [Docker Quick Start](#docker-quick-start) - - [Quick Start](#quick-start) + - [Serve and validate vLLM with Docker](#serve-and-validate-vllm-with-docker) - [AMX and BF16 Configuration](#amx-and-bf16-configuration) - [CPU Threading and NUMA](#cpu-threading-and-numa) - [Default Recommendation](#default-recommendation) - [Manual Binding Example](#manual-binding-example) - [NUMA Checklist](#numa-checklist) - [KV Cache and Memory Sizing](#kv-cache-and-memory-sizing) - - [Xeon 6 for SLM Inference](#xeon-6-for-slm-inference) - [Large Models on Xeon Memory Capacity](#large-models-on-xeon-memory-capacity) - [Tuning Reference](#tuning-reference) - [Validation and Benchmarking](#validation-and-benchmarking) @@ -28,19 +27,12 @@ The goal is not to replace the vLLM documentation. Use the official vLLM CPU ins - [Placement Validation](#placement-validation) - [Benchmark Sweep](#benchmark-sweep) - [Example Benchmark Command](#example-benchmark-command) - - [Troubleshooting](#troubleshooting) - - [FAQ](#faq) - - [What is the minimum vLLM version for Intel Xeon AMX deployments?](#what-is-the-minimum-vllm-version-for-intel-xeon-amx-deployments) - - [Should I use BF16 or FP16 on CPU?](#should-i-use-bf16-or-fp16-on-cpu) - - [How much KV cache should I allocate?](#how-much-kv-cache-should-i-allocate) - - [Should tensor parallel size always equal socket count?](#should-tensor-parallel-size-always-equal-socket-count) - - [When should I use quantization?](#when-should-i-use-quantization) - [Disclaimer](#disclaimer) - [References](#references) ## Overview -vLLM supports model inferencing and OpenAI-compatible serving on x86 CPUs with FP32, FP16, and BF16. 
On Intel Xeon processors that expose Intel AMX BF16 instructions, BF16 is the preferred dtype because it reduces memory traffic and enables matrix kernels designed for modern Xeon CPUs. +vLLM supports model inferencing and OpenAI-compatible serving on x86 CPUs with FP32, FP16, and BF16. On Intel Xeon processors that expose Intel AMX instructions, BF16 is the preferred dtype because it reduces memory traffic and enables AMX BF16 matrix kernels designed for modern Xeon CPUs. Intel AMX also supports INT8 operations (`amx_int8`), which vLLM can leverage through INT8 quantization paths (such as compressed-tensor W8A8) for further memory and bandwidth savings when model accuracy permits. Use this recipe when you want to: @@ -53,26 +45,13 @@ Use this recipe when you want to: | Item | Recommendation | | --- | --- | -| vLLM version | Use vLLM `0.17.0` or newer as the minimum packaged x86 CPU baseline. vLLM CPU release wheels for x86 with AVX512/AVX2 are available starting with `0.17.0`; prefer the latest stable release for the newest CPU and AMX kernel work. | -| CPU | Intel Xeon 6 recommended as of May 2026. Intel Xeon 4th Gen or newer with `amx_tile` and `amx_bf16` CPU flags. | -| dtype | Use `--dtype=bfloat16` for AMX-capable Xeon systems. | -| OS | Linux. | -| Python | Python 3.10 through 3.13, following the vLLM CPU installation guide. | -| Threading | Start with `VLLM_CPU_OMP_THREADS_BIND=auto` and reserve 1-2 CPU cores for the serving process. | +| vLLM version | Use the vLLM `0.17.0` CPU container or newer. | +| CPU | Intel Xeon 6 is recommended as of May 2026. Otherwise, use Intel Xeon 4th Gen or newer with the `amx_tile`, `amx_bf16`, and `amx_int8` CPU flags. | +| dtype | Use `--dtype=bfloat16` for AMX BF16 serving. For INT8 quantized models, AMX INT8 kernels are used automatically when the model provides a compatible quantization config. 
| +| Memory | Start with `VLLM_CPU_KVCACHE_SPACE=40` (GiB per rank) and adjust it to fit local NUMA memory. | +| Threading | Start with `VLLM_CPU_OMP_THREADS_BIND=auto`. | | Parallelism | On multi-socket systems, start with tensor parallel size equal to the number of NUMA nodes, except values that the current vLLM release does not support. | -| Memory | Size `VLLM_CPU_KVCACHE_SPACE` per NUMA node so model weight shards, KV cache, runtime workspace, and OS headroom all fit in local memory. | - - -## Why Intel Xeon for vLLM - -Intel Xeon is a strong fit for vLLM CPU deployments when memory capacity, deployment simplicity, and CPU locality matter as much as peak accelerator throughput. - -| Use case | Why Xeon helps | Tuning priority | -| --- | --- | --- | -| SLM serving | 1B-8B parameter models can be served with BF16 on AMX-capable Xeon systems, often with enough memory left for a larger KV cache and co-located application services. | BF16, thread binding, small batch limits, reserved serving cores. | -| Large model capacity | CPU servers can be configured with hundreds of GiB to TiB-class DRAM, which can hold models, quantized weights, or long-context KV caches that may not fit in a single GPU's HBM. | NUMA-aware sharding, KV cache sizing, quantization, memory headroom. | -| Enterprise integration | CPU inference can run close to existing data services, retrieval pipelines, security controls, and orchestration tools. | Stable packaging, deterministic placement, observability, repeatable benchmarks. | -| Throughput-oriented batch jobs | Offline inference can trade latency for throughput by increasing batch limits and using more sockets or NUMA nodes. | `--max-num-batched-tokens`, `--max-num-seqs`, DP/TP/PP, KV cache. | +| Python | Python 3.10 through 3.13, following the vLLM CPU installation guide. 
| ## Prerequisites @@ -84,51 +63,14 @@ lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16" numactl --hardware ``` -Expected CPU flags for the AMX BF16 path include: - -- `amx_tile` -- `amx_bf16` -- `avx512_bf16` -- `amx_int8` - -Install vLLM by following the official CPU installation guide. For release-wheel deployments, use vLLM `0.17.0` or newer and install the CPU wheel variant. - - -For wheel-based installs, TCMalloc and Intel OpenMP must be preloaded before running vLLM: - -```bash -# Install TCMalloc (Intel OpenMP is bundled with the vLLM CPU wheel) -sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4 - -# Locate the libraries -TC_PATH=$(find /usr -name 'libtcmalloc_minimal.so.4' | head -1) -IOMP_PATH=$(python -c "import intel_openmp, os; print(os.path.join(os.path.dirname(intel_openmp.__file__), 'lib', 'libiomp5.so'))" 2>/dev/null \ - || find / -name 'libiomp5.so' 2>/dev/null | head -1) - -# Preload them for every vLLM session -export LD_PRELOAD="${TC_PATH}:${IOMP_PATH}:${LD_PRELOAD}" -``` - -Skipping `LD_PRELOAD` can silently degrade throughput. Add the export to your shell profile or container entrypoint. - -After installation, collect the runtime environment: - -```bash -vllm collect-env -``` - -If PyTorch exposes the AMX helper in your environment, this quick check can confirm that the runtime sees AMX tile support: +Expected CPU flags for AMX acceleration: -```bash -python - <<'PY' -import torch - -checker = getattr(torch.cpu, "_is_amx_tile_supported", None) -print("AMX tile supported:", checker() if checker else "not reported by this PyTorch build") -PY -``` +- `amx_tile` — required base for all AMX paths +- `amx_bf16` — enables AMX BF16 matrix operations (primary inference dtype) +- `amx_int8` — enables AMX INT8 matrix operations (used by INT8 quantized models) +- `avx512_bf16` — scalar/vector BF16 support complementing AMX -### Docker Quick Start +## Serve and validate vLLM with Docker vLLM publishes pre-built CPU Docker images. 
Pull the latest x86_64 CPU image: @@ -139,13 +81,18 @@ docker pull vllm/vllm-openai-cpu:latest-x86_64 Then run it with the environment variables from above: ```bash +export VLLM_CPU_KVCACHE_SPACE=40 +export VLLM_CPU_OMP_THREADS_BIND=auto +export VLLM_CPU_NUM_OF_RESERVED_CPU=1 +export VLLM_CPU_SGL_KERNEL=1 + docker run --rm \ --security-opt seccomp=unconfined \ --cap-add SYS_NICE \ --shm-size=4g \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ - -e VLLM_CPU_KVCACHE_SPACE=20 \ + -e VLLM_CPU_KVCACHE_SPACE=40 \ -e HF_TOKEN="${HF_TOKEN}" \ vllm/vllm-openai-cpu:latest-x86_64 \ Qwen/Qwen3-4B \ @@ -156,14 +103,12 @@ docker run --rm \ Note: `--security-opt seccomp=unconfined` and `--cap-add SYS_NICE` are needed for NUMA memory policy calls inside the container. Omitting them may produce `get_mempolicy: Operation not permitted` warnings. -## Quick Start - -Start with a CPU-validated SLM, BF16, automatic NUMA-aware thread binding, and a conservative KV cache size. Increase memory and batch settings only after the baseline is stable. ```bash -export VLLM_CPU_KVCACHE_SPACE=20 +export VLLM_CPU_KVCACHE_SPACE=40 export VLLM_CPU_OMP_THREADS_BIND=auto export VLLM_CPU_NUM_OF_RESERVED_CPU=1 +export VLLM_CPU_SGL_KERNEL=1 vllm serve Qwen/Qwen3-4B \ --device cpu \ @@ -192,6 +137,7 @@ NUMA_NODES=$(lscpu | awk '/NUMA node\(s\):/ {print $3}') export VLLM_CPU_KVCACHE_SPACE=40 export VLLM_CPU_OMP_THREADS_BIND=auto export VLLM_CPU_NUM_OF_RESERVED_CPU=1 +export VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \ --device cpu \ @@ -203,7 +149,6 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \ Before using a generated `NUMA_NODES` value, check the vLLM CPU documentation for currently unsupported tensor parallel sizes. For example, the current CPU guide notes that `tensor-parallel-size=6` is not supported. 
- ## AMX and BF16 Configuration For AMX-capable Intel Xeon CPUs, the most important vLLM decision is to use BF16 explicitly: @@ -217,14 +162,7 @@ Why this matters: - BF16 is the recommended CPU dtype when FP16 behavior is unstable or slower on CPU. - BF16 reduces memory traffic compared with FP32. - AMX BF16 kernels accelerate matrix-heavy LLM operations when the CPU, PyTorch, and vLLM CPU backend can use them. - -Optional small-batch kernel path: - -```bash -export VLLM_CPU_SGL_KERNEL=1 -``` - -Use `VLLM_CPU_SGL_KERNEL=1` only after the baseline works. It is x86-only and experimental. The vLLM CPU guide states that it requires AMX, BF16 weights, and weight shapes divisible by 32. It is aimed at low-latency online serving with small batches, so validate it per model and workload before using it in production. +- AMX INT8 kernels provide additional acceleration for INT8-quantized models (e.g., W8A8 compressed-tensor), reducing both memory footprint and compute cost. Use INT8 after validating model quality with your target prompts. ## CPU Threading and NUMA @@ -291,28 +229,6 @@ Examples: If the worker exits with code 9 or the process is killed by the OOM killer, reduce `VLLM_CPU_KVCACHE_SPACE`, reduce batch limits, lower tensor-parallel pressure per node, or use a smaller/quantized model. -## Xeon 6 for SLM Inference - -Intel Xeon 6 systems that expose AMX BF16 are well suited for SLM inference because the models are small enough to keep memory pressure manageable while AMX accelerates BF16 matrix operations. - -Good first models from the vLLM CPU-validated model list include: - -| Model | Typical use | Why start here | -| --- | --- | --- | -| `Qwen/Qwen3-1.7B` | Very small assistant, routing, classification-style generation | Fast baseline for validating install, BF16, and thread binding. | -| `ibm-granite/granite-3.2-2b-instruct` | Enterprise assistant, summarization, RAG | Small enough for CPU serving experiments with room for KV cache. 
| -| `meta-llama/Llama-3.2-3B-Instruct` | General chat and instruction following | Common SLM shape with broad ecosystem support. | -| `Qwen/Qwen3-4B` | Higher quality SLM serving | Good step up after the 1B-3B baseline is stable. | -| `Qwen/Qwen3-8B` or `meta-llama/Llama-3.1-8B-Instruct` | Larger SLM or compact LLM serving | Useful for multi-NUMA tuning and memory-capacity validation. | - -For latency-sensitive SLM serving on Xeon 6: - -1. Use `--dtype=bfloat16`. -2. Start with `--max-num-seqs 32` to `64` and `--max-num-batched-tokens 1024` to `2048`. -3. Reserve one or two CPU cores per rank for serving overhead. -4. Validate the optional `VLLM_CPU_SGL_KERNEL=1` path only after the default path is stable. -5. Increase batch limits gradually while watching inter-token latency, time to first token, and CPU utilization. - ## Large Models on Xeon Memory Capacity Many GPU deployments are constrained by the memory capacity of a single accelerator. Intel Xeon servers can be configured with substantially larger system memory, which can be useful when the model, KV cache, or context length does not fit comfortably in accelerator memory. @@ -417,40 +333,6 @@ ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 \ The generated `.commands` files in `./benchmark/results/` contain the full CLI invocations with optimized settings for each model. -## Troubleshooting - -| Symptom | Likely cause | Fix | -| --- | --- | --- | -| Worker exits with code 9 or process is killed | Per-rank model shard plus KV cache exceeds NUMA memory. | Reduce `VLLM_CPU_KVCACHE_SPACE`, lower batch limits, use quantization, or change TP/PP layout. | -| CPU utilization is high but latency is poor | Oversubscription or API server competing with inference threads. | Reserve 1-2 cores with `VLLM_CPU_NUM_OF_RESERVED_CPU` or manual binding. | -| One socket is busy and another is idle | Thread binding or NUMA node visibility is wrong. 
| Use `VLLM_CPU_OMP_THREADS_BIND=auto`, set `CPU_VISIBLE_MEMORY_NODES`, or manually bind ranks. | -| TTFT is too high | Prefill batch is too large or model/context is too heavy. | Lower `--max-num-batched-tokens`, reduce prompt length, use a smaller model, or increase parallelism. | -| Inter-token latency is too high | Too many active sequences or insufficient compute per rank. | Lower `--max-num-seqs`, use a smaller SLM, tune TP/PP, or test `VLLM_CPU_SGL_KERNEL=1` where supported. | -| BF16 model is slower than expected | AMX not visible, unsupported CPU, wrong wheel/build, or poor binding. | Recheck CPU flags, `vllm collect-env`, PyTorch AMX helper, and thread placement. | -| Docker logs show NUMA permission warnings | Container lacks permissions needed by NUMA calls. | Use the vLLM CPU Docker guidance, including appropriate security options for your environment. | - -## FAQ - -### What is the minimum vLLM version for Intel Xeon AMX deployments? - -Use vLLM `0.17.0` or newer as the minimum packaged x86 CPU deployment baseline. The official CPU installation guide states that pre-built x86 CPU wheels with AVX512/AVX2 are available starting in `0.17.0`. AMX usage is then determined by the CPU flags, the installed CPU wheel or source build, PyTorch CPU capability detection, model dtype, and selected vLLM CPU kernels. Prefer the latest stable vLLM release when tuning AMX systems. - -### Should I use BF16 or FP16 on CPU? - -Use BF16. vLLM's CPU guide recommends explicitly setting `dtype=bfloat16` if FP16 has performance or accuracy issues on CPU, and BF16 is the natural dtype for AMX BF16 acceleration on Intel Xeon. - -### How much KV cache should I allocate? - -Allocate only what fits per NUMA node after model weights and headroom. Start with `20` GiB for SLMs, then increase gradually. For multi-rank deployments, remember that `VLLM_CPU_KVCACHE_SPACE` applies per CPU worker/rank. - -### Should tensor parallel size always equal socket count? - -Not always. 
It is a good first test when each socket maps cleanly to a NUMA node and the vLLM release supports that tensor parallel size. Use benchmarks to compare TP, PP, and DP layouts for your model. - -### When should I use quantization? - -Use quantization after you have a BF16 baseline. It is most valuable when memory capacity or memory bandwidth limits the deployment, or when a larger model needs to fit in available DRAM. - ## Disclaimer Performance varies by use, configuration, and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/). No product or component can be absolutely secure. Intel technologies may require enabled hardware, software, or service activation. See [Legal Notices and Disclaimers](https://www.intel.com/LegalNoticesAndDisclaimers). From e49fc003f7920ae121f54e583487d0638d0cffa9 Mon Sep 17 00:00:00 2001 From: lucasmelogithub Date: Wed, 6 May 2026 08:10:44 -0500 Subject: [PATCH 3/8] Enahnced Readme, TOC --- software/vllm/README.md | 63 +++++++++++++++++++++-------------------- 1 file changed, 32 insertions(+), 31 deletions(-) diff --git a/software/vllm/README.md b/software/vllm/README.md index 9965ff1..a62ec2d 100644 --- a/software/vllm/README.md +++ b/software/vllm/README.md @@ -1,10 +1,6 @@ # vLLM on Intel Xeon with Intel AMX -This recipe shows how to configure vLLM for CPU inference and serving on Intel Xeon processors with Intel Advanced Matrix Extensions (Intel AMX). It focuses on the practical settings that most affect performance: BF16 execution, CPU thread binding, NUMA placement, KV cache sizing, batch limits, and model selection. - -The goal is not to replace the vLLM documentation. Use the official vLLM CPU installation guide for package-specific setup details, then use this recipe to choose a high-performance Intel Xeon configuration. 
- -vLLM CPU docs: +This recipe configures vLLM for CPU inference and serving on Intel Xeon processors with Intel Advanced Matrix Extensions (Intel AMX). It focuses on the settings that most affect performance: BF16 execution, CPU thread binding, NUMA placement, KV cache sizing, batch limits, and model selection. Pair it with the official [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) for package-specific setup details. ## Table of Contents @@ -15,6 +11,7 @@ vLLM CPU docs: --device cpu --dtype=bfloat16 ``` -Why this matters: +BF16 is the recommended CPU dtype: it halves memory traffic versus FP32 and unlocks AMX BF16 matrix kernels for matrix-heavy LLM operations. For further gains, use INT8-quantized models to engage AMX INT8 kernels (see next section). + +## Quantization (INT8 / W8A8) + +INT8 quantization (e.g., `compressed-tensors` W8A8) reduces model weight memory and memory-bandwidth pressure, which is often the bottleneck for CPU LLM inference. vLLM automatically selects AMX INT8 kernels when the model ships a compatible quantization config — no extra CLI flag is required beyond pointing at the quantized model: + +```bash +vllm serve /-w8a8 \ + --device cpu \ + --dtype=bfloat16 \ + --max-num-batched-tokens 2048 \ + --max-num-seqs 64 +``` + +`--dtype=bfloat16` here sets the activation/compute dtype; INT8 weights are loaded according to the model's quantization config. + +When to consider INT8: -- BF16 is the recommended CPU dtype when FP16 behavior is unstable or slower on CPU. -- BF16 reduces memory traffic compared with FP32. -- AMX BF16 kernels accelerate matrix-heavy LLM operations when the CPU, PyTorch, and vLLM CPU backend can use them. -- AMX INT8 kernels provide additional acceleration for INT8-quantized models (e.g., W8A8 compressed-tensor), reducing both memory footprint and compute cost. Use INT8 after validating model quality with your target prompts. 
+- Memory-bandwidth-bound workloads on Xeon (most LLM decode phases). +- Larger models that don't fit comfortably per NUMA node in BF16. +- Higher-concurrency serving where KV cache competes with weights for DRAM. + +Always validate accuracy on your target prompts before adopting INT8, and benchmark BF16 vs INT8 end-to-end — INT8 reduces compute and bandwidth cost but can shift quality on some tasks. ## CPU Threading and NUMA @@ -231,17 +236,13 @@ If the worker exits with code 9 or the process is killed by the OOM killer, redu ## Large Models on Xeon Memory Capacity -Many GPU deployments are constrained by the memory capacity of a single accelerator. Intel Xeon servers can be configured with substantially larger system memory, which can be useful when the model, KV cache, or context length does not fit comfortably in accelerator memory. - -This does not make CPU inference universally faster than GPU inference. It changes the design space: - -- Use Xeon when capacity, cost, data locality, or CPU-only deployment constraints dominate. -- Use quantization to reduce model memory and memory bandwidth pressure. -- Use tensor parallelism across NUMA nodes when a model shard plus KV cache fits cleanly per node. -- Prefer smaller batch sizes for interactive latency and larger batch sizes for offline throughput. -- Avoid filling all DRAM with weights and KV cache; memory headroom is what keeps tail latency stable. +Intel Xeon servers can be configured with substantially more system memory than a single accelerator, which is useful when the model, KV cache, or context length doesn't fit comfortably in accelerator memory. This doesn't make CPU inference universally faster than GPU — it changes the design space: -For very large models, benchmark both BF16 and quantized variants. A quantized model may reduce memory traffic enough to improve throughput, but accuracy, prompt behavior, and supported kernels must be validated for your model. 
+- Use Xeon when capacity, cost, data locality, or CPU-only constraints dominate. +- Use quantization (see [Quantization](#quantization-int8--w8a8)) to cut weight memory and bandwidth pressure. +- Use tensor parallelism across NUMA nodes when a shard plus KV cache fits cleanly per node. +- Prefer smaller batches for interactive latency, larger batches for offline throughput. +- Keep DRAM headroom — filling all memory with weights and KV cache destabilizes tail latency. ## Tuning Reference From 8bc119ae3bc823b6fe088a396188e80f4b37dd7b Mon Sep 17 00:00:00 2001 From: lucasmelogithub Date: Wed, 6 May 2026 08:16:21 -0500 Subject: [PATCH 4/8] Enahnced Readme, make it simpler. --- software/vllm/README.md | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/software/vllm/README.md b/software/vllm/README.md index a62ec2d..6c8d79b 100644 --- a/software/vllm/README.md +++ b/software/vllm/README.md @@ -44,9 +44,9 @@ Use this recipe when you want to: | --- | --- | | vLLM version | Use vLLM `0.17.0` cpu container or newer. | | CPU | Intel Xeon 6 is recommended as of May 2026. Intel Xeon 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` CPU flags should be used. | -| dtype | Use `--dtype=bfloat16` for AMX BF16 serving. For INT8 quantized models, AMX INT8 kernels are used automatically when the model provides a compatible quantization config. | -| Memory | Size `VLLM_CPU_KVCACHE_SPACE=40` is a good starting point | -| Threading | Start with `VLLM_CPU_OMP_THREADS_BIND=auto` | +| dtype | Use `--dtype=bfloat16`. Also works for INT8 quantized models. | +| Memory | Size `VLLM_CPU_KVCACHE_SPACE=40` is a good starting point. | +| Threading | Start with `VLLM_CPU_OMP_THREADS_BIND=auto` . | | Parallelism | On multi-socket systems, start with tensor parallel size equal to the number of NUMA nodes, except values that the current vLLM release does not support. | | Python | Python 3.10 through 3.13, following the vLLM CPU installation guide. 
| @@ -57,7 +57,6 @@ Verify the platform before tuning vLLM. ```bash lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags" lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16" -numactl --hardware ``` Expected CPU flags for AMX acceleration: @@ -215,14 +214,6 @@ Use this sizing rule for each rank: local NUMA memory > model weight shard + VLLM_CPU_KVCACHE_SPACE + runtime workspace + OS headroom ``` -Estimate BF16 model weight memory as: - -```text -weight shard GiB ~= model parameters * 2 bytes / tensor_parallel_size / 2^30 -``` - -Then leave headroom for activation buffers, tokenizer/server processes, page cache, framework overhead, and other colocated services. A practical starting point is to reserve at least 10-20% of each NUMA node's memory instead of assigning all free memory to KV cache. - Examples: | Scenario | Starting point | Why | From 751a92da7fbc33f2e62f3b20e253f4fbffad6b34 Mon Sep 17 00:00:00 2001 From: lucasmelogithub Date: Tue, 12 May 2026 09:43:38 -0500 Subject: [PATCH 5/8] Enhance docs --- software/vllm/README.md | 355 +++++++++++++--------------------------- 1 file changed, 110 insertions(+), 245 deletions(-) diff --git a/software/vllm/README.md b/software/vllm/README.md index 6c8d79b..857bec4 100644 --- a/software/vllm/README.md +++ b/software/vllm/README.md @@ -1,339 +1,204 @@ -# vLLM on Intel Xeon with Intel AMX - -This recipe configures vLLM for CPU inference and serving on Intel Xeon processors with Intel Advanced Matrix Extensions (Intel AMX). It focuses on the settings that most affect performance: BF16 execution, CPU thread binding, NUMA placement, KV cache sizing, batch limits, and model selection. Pair it with the official [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) for package-specific setup details. 
- -## Table of Contents - -- [vLLM on Intel Xeon with Intel AMX](#vllm-on-intel-xeon-with-intel-amx) - - [Table of Contents](#table-of-contents) - - [Overview](#overview) - - [Recommended Baseline](#recommended-baseline) - - [Prerequisites](#prerequisites) - - [Serve and validate vLLM with Docker](#serve-and-validate-vllm-with-docker) - - [AMX and BF16 Configuration](#amx-and-bf16-configuration) - - [Quantization (INT8 / W8A8)](#quantization-int8--w8a8) - - [CPU Threading and NUMA](#cpu-threading-and-numa) - - [Default Recommendation](#default-recommendation) - - [Manual Binding Example](#manual-binding-example) - - [NUMA Checklist](#numa-checklist) - - [KV Cache and Memory Sizing](#kv-cache-and-memory-sizing) - - [Large Models on Xeon Memory Capacity](#large-models-on-xeon-memory-capacity) - - [Tuning Reference](#tuning-reference) - - [Validation and Benchmarking](#validation-and-benchmarking) - - [Functional Validation](#functional-validation) - - [Placement Validation](#placement-validation) - - [Benchmark Sweep](#benchmark-sweep) - - [Example Benchmark Command](#example-benchmark-command) - - [Disclaimer](#disclaimer) - - [References](#references) - -## Overview - -vLLM serves models on x86 CPUs with FP32, FP16, and BF16. On Intel Xeon processors with Intel AMX, BF16 is the preferred dtype: it cuts memory traffic and enables AMX BF16 matrix kernels. AMX also supports INT8 (`amx_int8`), which vLLM uses automatically for INT8-quantized models (e.g., compressed-tensors W8A8) to further reduce memory and bandwidth. - -Use this recipe when you want to: - -- Serve small language models (SLMs) without a discrete accelerator. -- Host models or context lengths that benefit from the larger DRAM capacity available on CPU servers. -- Run inference close to CPU-resident data pipelines, vector databases, or enterprise services. -- Tune vLLM CPU deployments beyond a default install. 
- -## Recommended Baseline +# vLLM on Intel Xeon CPUs + +This recipe gets vLLM's CPU backend running on Intel Xeon processors and captures the few settings that usually move performance: BF16, AMX, NUMA placement, KV cache size, and batch limits. Use it with the official [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) for release-specific details. + +## Requirements | Item | Recommendation | | --- | --- | -| vLLM version | Use vLLM `0.17.0` cpu container or newer. | -| CPU | Intel Xeon 6 is recommended as of May 2026. Intel Xeon 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` CPU flags should be used. | -| dtype | Use `--dtype=bfloat16`. Also works for INT8 quantized models. | -| Memory | Size `VLLM_CPU_KVCACHE_SPACE=40` is a good starting point. | -| Threading | Start with `VLLM_CPU_OMP_THREADS_BIND=auto` . | -| Parallelism | On multi-socket systems, start with tensor parallel size equal to the number of NUMA nodes, except values that the current vLLM release does not support. | -| Python | Python 3.10 through 3.13, following the vLLM CPU installation guide. | - -## Prerequisites +| OS | Linux | +| Python | 3.10 through 3.13 | +| vLLM | `0.17.0` or newer for x86 CPU wheels/images | +| CPU flags | `avx512f` recommended; `avx2` has limited features | +| Intel Xeon | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance | +| dtype | Start with `--dtype=bfloat16` | -Verify the platform before tuning vLLM. +Install the small host tools used by the commands below: ```bash -lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags" -lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16" +sudo apt-get update +sudo apt-get install -y --no-install-recommends curl git jq numactl htop ``` -Expected CPU flags for AMX acceleration: +Install [uv](https://docs.astral.sh/uv/#getting-started) before using the Python wheel or source-build path. 
+ +## Performance Knobs -- `amx_tile` — required base for all AMX paths -- `amx_bf16` — enables AMX BF16 matrix operations (primary inference dtype) -- `amx_int8` — enables AMX INT8 matrix operations (used by INT8 quantized models) -- `avx512_bf16` — scalar/vector BF16 support complementing AMX +| Setting | Starting point | Why it matters | +| --- | --- | --- | +| `--dtype=bfloat16` | Always on AMX-capable Xeon | Enables the preferred CPU dtype and AMX BF16 kernels. | +| `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB per rank | Larger values allow more concurrency and context, but must fit per NUMA node. | +| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. Use ranges such as `0-31\|32-63` for manual control. | +| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` or `2` | Leaves cores for API serving, tokenization, networking, logging, and OS work. | +| `--tensor-parallel-size` | NUMA node count, where supported | Keeps model shards close to local memory; current vLLM CPU releases do not support `6`. | +| `--max-num-batched-tokens` | Online: `2048 * world_size`; offline: `4096 * world_size` | Tune for prefill throughput and time to first token. | +| `--max-num-seqs` | Online: `128 * world_size`; offline: `256 * world_size` | Tune for decode throughput and inter-token latency. | +| `--block-size` | Default or multiples of `32` | Useful during controlled CPU sweeps. | +| `VLLM_CPU_SGL_KERNEL` | `0`, try `1` for low-latency SLM serving | Experimental x86 small-batch kernels; requires AMX, BF16 weights, and compatible shapes. | -## Serve and validate vLLM with Docker +`world_size` is the product of tensor, pipeline, and data parallel ranks. -vLLM publishes pre-built CPU Docker images. 
Pull the latest x86_64 CPU image: +## Check the Hardware ```bash -docker pull vllm/vllm-openai-cpu:latest-x86_64 +lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags" +lscpu | grep -E "avx512f|avx2|amx_(tile|bf16|int8)|avx512_bf16" +numactl --hardware ``` -Then run it with the environment variables from above: +## Fast Path: Docker ```bash -export VLLM_CPU_KVCACHE_SPACE=40 -export VLLM_CPU_OMP_THREADS_BIND=auto -export VLLM_CPU_NUM_OF_RESERVED_CPU=1 -export VLLM_CPU_SGL_KERNEL=1 +export VLLM_VERSION=0.20.2 +docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 docker run --rm \ + --name vllm-cpu \ --security-opt seccomp=unconfined \ --cap-add SYS_NICE \ --shm-size=4g \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ - -e VLLM_CPU_KVCACHE_SPACE=40 \ -e HF_TOKEN="${HF_TOKEN}" \ - vllm/vllm-openai-cpu:latest-x86_64 \ - Qwen/Qwen3-4B \ + -e VLLM_CPU_KVCACHE_SPACE=40 \ + -e VLLM_CPU_OMP_THREADS_BIND=auto \ + -e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \ + vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 \ + ibm-granite/granite-3.0-3b-a800m-instruct \ --dtype=bfloat16 \ --max-num-batched-tokens 2048 \ --max-num-seqs 64 ``` -Note: `--security-opt seccomp=unconfined` and `--cap-add SYS_NICE` are needed for NUMA memory policy calls inside the container. Omitting them may produce `get_mempolicy: Operation not permitted` warnings. +Use the `-x86_64` CPU image tag when pinning a version; `latest-x86_64` is the unpinned option. -Or run vLLM directly (same env vars apply): - -```bash -vllm serve Qwen/Qwen3-4B \ - --device cpu \ - --dtype=bfloat16 \ - --max-num-batched-tokens 2048 \ - --max-num-seqs 64 -``` +`SYS_NICE` and `seccomp=unconfined` allow vLLM's NUMA memory policy calls inside Docker. Without them, serving can still work, but NUMA placement may be weaker and logs can show `get_mempolicy: Operation not permitted`. 
-Send a test request: +Validate the OpenAI-compatible endpoint: ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen3-4B", - "messages": [{"role": "user", "content": "Give three tips for CPU inference tuning."}], + "model": "ibm-granite/granite-3.0-3b-a800m-instruct", + "messages": [{"role": "user", "content": "Give three CPU inference tuning tips."}], "max_tokens": 128 }' ``` -For a multi-NUMA system, start by matching tensor parallel size to the NUMA node count: +## Python Wheel Install + +Use this path when you need a local Python environment instead of Docker. ```bash -NUMA_NODES=$(lscpu | awk '/NUMA node\(s\):/ {print $3}') +uv venv --python 3.12 --seed --managed-python +source .venv/bin/activate -vllm serve meta-llama/Llama-3.1-8B-Instruct \ - --device cpu \ - --dtype=bfloat16 \ - --tensor-parallel-size "${NUMA_NODES}" \ - --max-num-batched-tokens $((2048 * NUMA_NODES)) \ - --max-num-seqs $((128 * NUMA_NODES)) +export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') +uv pip install \ + "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \ + --torch-backend cpu ``` -Before using a generated `NUMA_NODES` value, check the vLLM CPU documentation for currently unsupported tensor parallel sizes. For example, the current CPU guide notes that `tensor-parallel-size=6` is not supported. 
- -## AMX and BF16 Configuration - -For AMX-capable Intel Xeon CPUs, the most important vLLM decision is to use BF16 explicitly: +For latest main-branch CPU wheels: ```bash -vllm serve --device cpu --dtype=bfloat16 +uv pip install vllm \ + --extra-index-url https://wheels.vllm.ai/nightly/cpu \ + --index-strategy first-index \ + --torch-backend cpu ``` -BF16 is the recommended CPU dtype: it halves memory traffic versus FP32 and unlocks AMX BF16 matrix kernels for matrix-heavy LLM operations. For further gains, use INT8-quantized models to engage AMX INT8 kernels (see next section). - -## Quantization (INT8 / W8A8) - -INT8 quantization (e.g., `compressed-tensors` W8A8) reduces model weight memory and memory-bandwidth pressure, which is often the bottleneck for CPU LLM inference. vLLM automatically selects AMX INT8 kernels when the model ships a compatible quantization config — no extra CLI flag is required beyond pointing at the quantized model: +Before serving from CPU wheels or source builds, preload TCMalloc and Intel OpenMP: ```bash -vllm serve /-w8a8 \ - --device cpu \ - --dtype=bfloat16 \ - --max-num-batched-tokens 2048 \ - --max-num-seqs 64 -``` - -`--dtype=bfloat16` here sets the activation/compute dtype; INT8 weights are loaded according to the model's quantization config. - -When to consider INT8: - -- Memory-bandwidth-bound workloads on Xeon (most LLM decode phases). -- Larger models that don't fit comfortably per NUMA node in BF16. -- Higher-concurrency serving where KV cache competes with weights for DRAM. +sudo apt-get update +sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4 -Always validate accuracy on your target prompts before adopting INT8, and benchmark BF16 vs INT8 end-to-end — INT8 reduces compute and bandwidth cost but can shift quality on some tasks. - -## CPU Threading and NUMA - -The CPU backend is sensitive to where OpenMP threads run and where memory is allocated. 
Start with automatic binding, then move to manual binding if utilization or latency is uneven. - -### Default Recommendation - -```bash -export VLLM_CPU_OMP_THREADS_BIND=auto -export VLLM_CPU_NUM_OF_RESERVED_CPU=1 +TC_PATH=$(find /usr -name 'libtcmalloc_minimal.so.4' 2>/dev/null | head -n 1) +IOMP_PATH=$(find .venv -name 'libiomp5.so' 2>/dev/null | head -n 1) +export LD_PRELOAD="${TC_PATH}:${IOMP_PATH}:${LD_PRELOAD}" ``` -`auto` binds OpenMP threads for each rank to CPU cores in NUMA nodes. Reserving one or two cores prevents the vLLM API server, tokenizer work, networking, logging, and operating system tasks from competing with inference threads. - -### Manual Binding Example - -Use manual binding when you need repeatability or when `htop` shows threads crossing NUMA nodes unexpectedly. +Run locally: ```bash -export VLLM_CPU_OMP_THREADS_BIND=0-55|56-111 export VLLM_CPU_KVCACHE_SPACE=40 +export VLLM_CPU_OMP_THREADS_BIND=auto +export VLLM_CPU_NUM_OF_RESERVED_CPU=1 -vllm serve \ +vllm serve ibm-granite/granite-3.0-3b-a800m-instruct \ --device cpu \ --dtype=bfloat16 \ - --tensor-parallel-size 2 + --max-num-batched-tokens 2048 \ + --max-num-seqs 64 ``` -In this example, rank 0 uses CPU cores `0-55` and rank 1 uses CPU cores `56-111`. Adjust the ranges to physical cores from the same NUMA node. Avoid spreading a single rank across sockets unless you have measured that it helps your workload. - -### NUMA Checklist - -- Use `numactl --hardware` to identify NUMA nodes and memory per node. -- Keep each tensor-parallel or pipeline-parallel rank within one NUMA node when possible. -- Use `CPU_VISIBLE_MEMORY_NODES` to mask or reorder NUMA memory nodes when using automatic binding. -- Watch CPU placement with `htop` or `perf stat` during warmup and benchmark runs. +## Source Build Escape Hatch -## KV Cache and Memory Sizing +Use source only when CPU wheels/images are unavailable or you need a local vLLM change. 
-`VLLM_CPU_KVCACHE_SPACE` is specified in GiB and applies to each CPU worker/rank. Larger values allow more concurrent requests and longer contexts, but the allocation must fit in the local memory budget for each NUMA node. +```bash +sudo apt-get update +sudo apt-get install -y gcc-12 g++-12 libnuma-dev -Use this sizing rule for each rank: +git clone https://github.com/vllm-project/vllm.git vllm_source +cd vllm_source -```text -local NUMA memory > model weight shard + VLLM_CPU_KVCACHE_SPACE + runtime workspace + OS headroom +uv venv --python 3.12 --seed --managed-python +source .venv/bin/activate +uv pip install -r requirements/build/cpu.txt --torch-backend cpu +uv pip install -r requirements/cpu.txt --torch-backend cpu +VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation ``` -Examples: - -| Scenario | Starting point | Why | -| --- | --- | --- | -| SLM, low concurrency | `VLLM_CPU_KVCACHE_SPACE=10` to `20` | Keeps memory pressure low while validating BF16 and thread placement. | -| SLM, higher concurrency | `VLLM_CPU_KVCACHE_SPACE=20` to `40` | Supports more simultaneous sessions and longer prompts. | -| 8B-class model on a large-memory node | `VLLM_CPU_KVCACHE_SPACE=40` or higher | Uses Xeon DRAM capacity for larger batches or context lengths. | -| Multi-NUMA tensor parallel | Size per NUMA node | Each rank needs local memory for its weight shard plus its KV cache. | - -If the worker exits with code 9 or the process is killed by the OOM killer, reduce `VLLM_CPU_KVCACHE_SPACE`, reduce batch limits, lower tensor-parallel pressure per node, or use a smaller/quantized model. - -## Large Models on Xeon Memory Capacity +If CMake detects CUDA during a CPU build, add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON`. If NumPy breaks imports, pin `numpy<2.0`. -Intel Xeon servers can be configured with substantially more system memory than a single accelerator, which is useful when the model, KV cache, or context length doesn't fit comfortably in accelerator memory. 
This doesn't make CPU inference universally faster than GPU — it changes the design space: +## Model and Quantization Notes -- Use Xeon when capacity, cost, data locality, or CPU-only constraints dominate. -- Use quantization (see [Quantization](#quantization-int8--w8a8)) to cut weight memory and bandwidth pressure. -- Use tensor parallelism across NUMA nodes when a shard plus KV cache fits cleanly per node. -- Prefer smaller batches for interactive latency, larger batches for offline throughput. -- Keep DRAM headroom — filling all memory with weights and KV cache destabilizes tail latency. +- This README uses `ibm-granite/granite-3.0-3b-a800m-instruct` as a compact Granite MoE example: 3.3B total parameters with about 800M active parameters per token. +- Check the official [CPU-supported model list](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) before choosing a model. +- Prefer BF16 first; compare INT8 only after functional quality is acceptable. +- CPU quantization support includes AWQ and GPTQ on x86, plus compressed-tensors INT8 W8A8 on x86 and s390x. +- INT8 can reduce weight memory and bandwidth pressure, especially during decode-heavy serving. -## Tuning Reference +## Validate and Tune -| Setting | What it controls | Starting point | Tune when | -| --- | --- | --- | --- | -| `--dtype=bfloat16` | Model compute dtype | Always use on AMX-capable Xeon unless a model requires otherwise. | Accuracy or compatibility issues appear. | -| `VLLM_CPU_KVCACHE_SPACE` | KV cache memory per CPU worker/rank, in GiB | `20` for SLMs; `40` or higher for larger models or concurrency. | You see preemption, OOM, low concurrency, or long-context failures. | -| `VLLM_CPU_OMP_THREADS_BIND` | OpenMP thread placement | `auto` | CPU utilization is uneven or threads cross NUMA nodes. | -| `VLLM_CPU_NUM_OF_RESERVED_CPU` | Cores reserved from OpenMP binding | `1` for small systems, `1-2` for serving workloads. 
| API server latency rises or CPU oversubscription appears. | -| `CPU_VISIBLE_MEMORY_NODES` | NUMA memory node visibility and order | Leave unset initially. | You need to mask NUMA nodes or control binding sequence. | -| `--tensor-parallel-size` | Weight sharding across ranks | Number of NUMA nodes, where supported. | Model shard plus KV cache does not fit per node or throughput scales poorly. | -| `--pipeline-parallel-size` | Layer partitioning across ranks | `1` initially. | Model is too large or TP alone does not fit cleanly. | -| `--data-parallel-size` | Independent replica count | `1` initially. | Throughput is limited and enough sockets/nodes are available. | -| `--max-num-batched-tokens` | Tokens allowed in one scheduler batch | Online: `2048 * world_size`; offline: `4096 * world_size`. | Time to first token or throughput misses the target. | -| `--max-num-seqs` | Sequences allowed in one scheduler batch | Online: `128 * world_size`; offline: `256 * world_size`. | Inter-token latency or output throughput misses the target. | -| `--block-size` | KV cache block granularity | Keep the default or use multiples of 32. | You are doing controlled CPU performance sweeps. | -| `VLLM_CPU_SGL_KERNEL` | Experimental small-batch optimized x86 kernels | `0` initially. | Low-latency SLM serving is stable and the model meets AMX/BF16/shape requirements. | - -`world_size` is the product of tensor, pipeline, and data parallel ranks used by the vLLM deployment. - -## Validation and Benchmarking - -Use repeatable validation before changing multiple knobs at once. - -### Functional Validation +Start the Docker or Python server first. If it is running in the foreground, open another terminal for these checks: ```bash -vllm collect-env -lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16" -numactl --hardware -``` - -Start the server with one known model and send a short request. Confirm that the response is correct before increasing batch size or parallelism. 
+# Docker path, because the server above is named vllm-cpu. +docker exec vllm-cpu vllm collect-env -### Placement Validation - -While vLLM is serving traffic, check that inference threads stay on the intended cores: +# Python wheel or source-build path. +vllm collect-env -```bash +curl -s http://localhost:8000/v1/models | jq . +SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1) htop +numastat -p "${SERVER_PID}" ``` -For a scriptable check, run a short benchmark and record CPU, memory, and NUMA behavior with tools such as `perf stat`, `numastat`, or platform telemetry. - -### Benchmark Sweep - -For each model and hardware configuration, sweep these values independently: - -- `VLLM_CPU_KVCACHE_SPACE` -- `--max-num-batched-tokens` -- `--max-num-seqs` -- `--tensor-parallel-size` -- `VLLM_CPU_OMP_THREADS_BIND` -- quantized versus BF16 weights - -Track at least these metrics: +Use one known-good model and change one knob at a time. Track TTFT, TPOT, output tokens per second, requests per second, peak RSS, NUMA locality, and OOM events. -- Time to first token (TTFT) -- Inter-token latency (ITL) -- Output tokens per second -- Requests per second -- Peak RSS and memory per NUMA node -- CPU utilization per socket -- Error rate and OOM events - -Use the vLLM benchmark CLI or the vLLM Benchmark Suite for repeatable comparisons. For CPU-supported models, the vLLM documentation points to CPU benchmark test cases that include optimized example configurations and dry-run command generation. - -### Example Benchmark Command - -Run a latency benchmark with the vLLM CLI: +The vLLM Benchmark Suite lives in the vLLM source tree. Clone it, or reuse the source checkout from the source-build path, even if vLLM is already installed. Exact `MODEL_FILTER` values must exist in the CPU test JSON; the current suite includes `ibm-granite/granite-3.2-2b-instruct` as the pre-curated Granite profile. 
```bash -vllm bench latency \ - --model Qwen/Qwen3-4B \ - --input-len 256 \ - --output-len 128 \ - --batch-size 8 \ - --dtype bfloat16 \ - --device cpu -``` - -Or, from a [vLLM source checkout](https://github.com/vllm-project/vllm), use the Benchmark Suite dry-run to generate optimized serving commands for CPU models: +git clone https://github.com/vllm-project/vllm.git vllm_source +cd vllm_source -```bash ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 \ - MODEL_FILTER=Qwen/Qwen3-4B DTYPE_FILTER=bfloat16 \ + MODEL_FILTER=ibm-granite/granite-3.2-2b-instruct DTYPE_FILTER=bfloat16 \ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh -``` -The generated `.commands` files in `./benchmark/results/` contain the full CLI invocations with optimized settings for each model. - -## Disclaimer +find benchmark/results -maxdepth 2 -name "*.commands" -print +``` -Performance varies by use, configuration, and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/). No product or component can be absolutely secure. Intel technologies may require enabled hardware, software, or service activation. See [Legal Notices and Disclaimers](https://www.intel.com/LegalNoticesAndDisclaimers). +Use the generated `.commands` files as the baseline. When moving back to the Granite MoE serving model, replace the model ID and change only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. 
## References - [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) -- [vLLM CPU hardware-supported models for Intel Xeon](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) +- [vLLM CPU-supported models](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) - [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/) - [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/) -- [vLLM CPU installation documentation source](https://github.com/vllm-project/vllm/blob/main/docs/getting_started/installation/cpu.md) -- [vLLM GitHub repository](https://github.com/vllm-project/vllm) From 8a32f4b3e027de14c14b9e879e10188a80de968a Mon Sep 17 00:00:00 2001 From: lucasmelogithub Date: Fri, 29 May 2026 10:02:14 -0500 Subject: [PATCH 6/8] Enhance Readme and update skill --- software/vllm/README.md | 322 ++++++++++++------ software/vllm/skill/SKILL.md | 183 ++++++++++ .../vllm/skill/references/tuning-matrix.md | 34 ++ 3 files changed, 432 insertions(+), 107 deletions(-) create mode 100644 software/vllm/skill/SKILL.md create mode 100644 software/vllm/skill/references/tuning-matrix.md diff --git a/software/vllm/README.md b/software/vllm/README.md index 857bec4..0138f3b 100644 --- a/software/vllm/README.md +++ b/software/vllm/README.md @@ -1,44 +1,100 @@ # vLLM on Intel Xeon CPUs -This recipe gets vLLM's CPU backend running on Intel Xeon processors and captures the few settings that usually move performance: BF16, AMX, NUMA placement, KV cache size, and batch limits. Use it with the official [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) for release-specific details. +This guide provides guidance on running vLLM on Intel Xeon processors. 
-## Requirements +## Upstream First -| Item | Recommendation | +Intel invests significant effort in upstreaming code optimizations and documentation directly to the official vLLM repositories. Those upstream contributions form the foundation of Intel Xeon CPU performance in vLLM. This guide is only a small extension of that work, collecting practical deployment tips in one place. Always consult the official documentation first. + +- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) +- [vLLM SLM/LLM Recipes](https://recipes.vllm.ai/) +- [vLLM Benchmarking](https://docs.vllm.ai/en/stable/benchmarking/cli/) + +## Use with AI Coding Agents + +This recipe ships a companion [Agent Skill](./skill/SKILL.md) (`vllm-xeon-cpu`) that lets AI coding agents (GitHub Copilot, Claude Code, and other `AGENTS.md`-aware tools) deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs on a customer's behalf. The skill is a self-contained, markdown-only payload under [`skill/`](./skill/) that you copy into your own workspace or user profile. + +> **Folder name must match `name`.** When you install the skill, the destination folder **must** be named `vllm-xeon-cpu` (matching the `name:` field in the skill's frontmatter). Otherwise the agent will not discover it. + +Install the skill once per workspace or user profile. Pick the install path for your agent runtime: + +| Runtime | Install path | Notes | +| --- | --- | --- | +| GitHub Copilot (workspace) | `.github/skills/vllm-xeon-cpu/` | Shared with everyone working in the repo and with the Copilot coding agent on PRs / issues. | +| GitHub Copilot (personal) | `~/.copilot/skills/vllm-xeon-cpu/` | Available across all your workspaces; not shared. | +| Claude Code (workspace) | `.claude/skills/vllm-xeon-cpu/` | Shared via the repo.
| + +### GitHub Copilot — Repo Workspace + +```bash +mkdir -p .github/skills/vllm-xeon-cpu +curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \ + | tar -xz --strip-components=4 -C .github/skills/vllm-xeon-cpu \ + optimization-zone-main/software/vllm/skill +``` + +### GitHub Copilot — User profile + +```bash +mkdir -p ~/.copilot/skills/vllm-xeon-cpu +curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \ + | tar -xz --strip-components=4 -C ~/.copilot/skills/vllm-xeon-cpu \ + optimization-zone-main/software/vllm/skill +``` + +### Claude Code — Repo Workspace + +```bash +mkdir -p .claude/skills/vllm-xeon-cpu +curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \ + | tar -xz --strip-components=4 -C .claude/skills/vllm-xeon-cpu \ + optimization-zone-main/software/vllm/skill +``` + +After install, invoke from chat with `/vllm-xeon-cpu` or let the agent auto-load the skill when your request matches keywords like "vLLM", "Xeon". 
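The folder-name requirement for the skill install can be sanity-checked from the shell. The sketch below is only an illustration: `check_skill_name` is a hypothetical helper (not part of the skill or of any agent runtime), and it assumes `SKILL.md` carries a plain, unquoted `name:` line in its frontmatter.

```bash
# Hypothetical helper: compare a skill folder's name with the `name:` field in its SKILL.md.
# Assumption: the frontmatter contains an unquoted line of the form `name: <value>`.
check_skill_name() {
  local dir="${1%/}"                          # drop a trailing slash, if any
  local skill_file="${dir}/SKILL.md"
  local folder declared

  # Exit code 2: the skill was not copied to this path at all.
  [ -f "${skill_file}" ] || { echo "missing ${skill_file}" >&2; return 2; }

  folder=$(basename "${dir}")
  # Take the first `name:` line and strip the key plus leading whitespace.
  declared=$(sed -n 's/^name:[[:space:]]*//p' "${skill_file}" | head -n 1)

  if [ "${folder}" = "${declared}" ]; then
    echo "ok: folder matches name '${declared}'"
  else
    # Exit code 1: the folder exists but its name differs from the declared skill name.
    echo "mismatch: folder '${folder}' vs name '${declared}'" >&2
    return 1
  fi
}
```

After sourcing the function, call it with the install path chosen earlier, for example `check_skill_name .github/skills/vllm-xeon-cpu`; a non-zero exit code suggests the agent may not discover the skill.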
+ +## Intel Xeon SLM/LLM Sizing Guidance + +For SLM/LLM sizing guidance on Intel Xeon CPUs, see the Intel Xeon AI Performance Advisor and the Intel AI Software Catalog: + +- [Cloud Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor) +- [On-prem Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/on-prem-ai-performance-advisor) +- [Intel AI Software Catalog - Model Guidance](https://swcatalog.intel.com/models) + +## vLLM Requirements Guidance + +| Item | Guidance | | --- | --- | | OS | Linux | | Python | 3.10 through 3.13 | -| vLLM | `0.17.0` or newer for x86 CPU wheels/images | -| CPU flags | `avx512f` recommended; `avx2` has limited features | -| Intel Xeon | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance | -| dtype | Start with `--dtype=bfloat16` | +| vLLM | `v0.17.0` or newer | +| Intel Xeon CPU (AMX flags) | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance | + +## Performance Guidance + +| Setting | Guidance | Why it matters | +| --- | --- | --- | +| `--dtype=bfloat16` | Use `bfloat16` on Intel Xeon with Intel AMX | Enables the preferred CPU dtype and AMX BF16 kernels. | +| `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB or larger | Larger values allow more concurrency and context, but must fit per NUMA node. | +| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. Use ranges such as `0-31\|32-63` for manual control; `auto` is preferred. | +| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Reserves a core for API serving, tokenization, networking, logging, and OS work. | +| `--tensor-parallel-size` | Use default for single NUMA or set to NUMA node count | Keeps model shards close to local memory; current vLLM CPU releases do not support `6`. | +| `--max-num-batched-tokens` | Online: `2048`; offline: `4096` | Maximum number of batched tokens per iteration. Tune for prefill throughput and time to first token.
| +| `--max-num-seqs` | Online: `128`; offline: `256` | Maximum number of sequences per iteration. Tune for decode throughput and inter-token latency. | +| `VLLM_CPU_SGL_KERNEL` | `0`, or try `1` for low-latency SLM serving | Experimental x86 small-batch kernels; requires AMX, BF16 weights, and compatible shapes. | + +## Utility Tools Install the small host tools used by the commands below: ```bash sudo apt-get update -sudo apt-get install -y --no-install-recommends curl git jq numactl htop +sudo apt-get install -y --no-install-recommends curl git jq numactl htop python3-venv python3-full g++ python3-dev ``` -Install [uv](https://docs.astral.sh/uv/#getting-started) before using the Python wheel or source-build path. +## Hardware Validation -## Performance Knobs - -| Setting | Starting point | Why it matters | -| --- | --- | --- | -| `--dtype=bfloat16` | Always on AMX-capable Xeon | Enables the preferred CPU dtype and AMX BF16 kernels. | -| `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB per rank | Larger values allow more concurrency and context, but must fit per NUMA node. | -| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. Use ranges such as `0-31\|32-63` for manual control. | -| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` or `2` | Leaves cores for API serving, tokenization, networking, logging, and OS work. | -| `--tensor-parallel-size` | NUMA node count, where supported | Keeps model shards close to local memory; current vLLM CPU releases do not support `6`. | -| `--max-num-batched-tokens` | Online: `2048 * world_size`; offline: `4096 * world_size` | Tune for prefill throughput and time to first token. | -| `--max-num-seqs` | Online: `128 * world_size`; offline: `256 * world_size` | Tune for decode throughput and inter-token latency. | -| `--block-size` | Default or multiples of `32` | Useful during controlled CPU sweeps. 
| -| `VLLM_CPU_SGL_KERNEL` | `0`, try `1` for low-latency SLM serving | Experimental x86 small-batch kernels; requires AMX, BF16 weights, and compatible shapes. | - -`world_size` is the product of tensor, pipeline, and data parallel ranks. - -## Check the Hardware +Validate the CPU model, core count, thread count, NUMA topology, and important flags such as `avx512f`, `avx2`, `amx_tile`, `amx_bf16`, `amx_int8`, and `avx512_bf16`. ```bash lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags" @@ -48,157 +104,209 @@ numactl --hardware ## Fast Path: Docker +
+(Optional) Install Docker on Ubuntu 24.04 + ```bash -export VLLM_VERSION=0.20.2 +sudo apt-get update +sudo apt-get install -y docker.io +sudo systemctl enable --now docker +sudo usermod -aG docker $USER +newgrp docker # apply group without re-login +``` + +
+ +```bash +export HF_TOKEN=your_hf_token_here # <<<=== Required for gated Hugging Face models and faster downloads. +export VLLM_VERSION=0.20.2 # <<<=== Update this for newer releases! Check! docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 docker run --rm \ --name vllm-cpu \ --security-opt seccomp=unconfined \ --cap-add SYS_NICE \ - --shm-size=4g \ + --shm-size=8g \ -p 8000:8000 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ -e HF_TOKEN="${HF_TOKEN}" \ - -e VLLM_CPU_KVCACHE_SPACE=40 \ + -e VLLM_CPU_KVCACHE_SPACE=20 \ -e VLLM_CPU_OMP_THREADS_BIND=auto \ -e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \ vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 \ - ibm-granite/granite-3.0-3b-a800m-instruct \ + RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ --dtype=bfloat16 \ --max-num-batched-tokens 2048 \ - --max-num-seqs 64 + --max-num-seqs 128 ``` -Use the `-x86_64` CPU image tag when pinning a version; `latest-x86_64` is the unpinned option. - `SYS_NICE` and `seccomp=unconfined` allow vLLM's NUMA memory policy calls inside Docker. Without them, serving can still work, but NUMA placement may be weaker and logs can show `get_mempolicy: Operation not permitted`. -Validate the OpenAI-compatible endpoint: +## Validate the OpenAI-compatible endpoint + +**Open a new terminal or use a remote system.** ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "ibm-granite/granite-3.0-3b-a800m-instruct", + "model": "RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8", "messages": [{"role": "user", "content": "Give three CPU inference tuning tips."}], "max_tokens": 128 }' ``` -## Python Wheel Install +## Benchmarking Guidance -Use this path when you need a local Python environment instead of Docker. +This summarizes the official benchmarking and tuning guidance from the vLLM documentation, with a CPU focus. 
Always consult the [official benchmarking docs](https://docs.vllm.ai/en/latest/benchmarking/cli/) for the latest recommendations and tools. + +Start the Docker container first. If it is running in the foreground, open another terminal for these checks: ```bash +# Docker path, because the server above is named vllm-cpu. +docker exec vllm-cpu vllm collect-env -export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') -uv pip install \ - "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \ - --torch-backend cpu +curl -s http://localhost:8000/v1/models | jq . +SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1) +numastat -p "${SERVER_PID}" ``` -For latest main-branch CPU wheels: +### Benchmark -```bash -uv pip install vllm \ - --extra-index-url https://wheels.vllm.ai/nightly/cpu \ - --index-strategy first-index \ - --torch-backend cpu -``` +Use `vllm bench serve` to measure TTFT, TPOT, and throughput against the running server. Warm up with `--num-warmups` to avoid measuring JIT compilation overhead.
-Before serving from CPU wheels or source builds, preload TCMalloc and Intel OpenMP: +If you started the server with Docker (as shown above), run the benchmark **inside the container**: ```bash -sudo apt-get update -sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4 - -TC_PATH=$(find /usr -name 'libtcmalloc_minimal.so.4' 2>/dev/null | head -n 1) -IOMP_PATH=$(find .venv -name 'libiomp5.so' 2>/dev/null | head -n 1) -export LD_PRELOAD="${TC_PATH}:${IOMP_PATH}:${LD_PRELOAD}" +docker exec vllm-cpu vllm bench serve \ + --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ + --dataset-name random \ + --random-input-len 128 \ + --random-output-len 128 \ + --num-prompts 100 \ + --num-warmups 5 \ + --request-rate inf \ + --save-result \ + --result-dir ./bench-results \ + --percentile-metrics ttft,tpot,itl ``` -Run locally: +If you installed vLLM natively (via `pip install vllm`), run directly on the host: ```bash -export VLLM_CPU_KVCACHE_SPACE=40 -export VLLM_CPU_OMP_THREADS_BIND=auto -export VLLM_CPU_NUM_OF_RESERVED_CPU=1 - -vllm serve ibm-granite/granite-3.0-3b-a800m-instruct \ - --device cpu \ - --dtype=bfloat16 \ - --max-num-batched-tokens 2048 \ - --max-num-seqs 64 +vllm bench serve \ + --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ + --dataset-name random \ + --random-input-len 128 \ + --random-output-len 128 \ + --num-prompts 100 \ + --num-warmups 5 \ + --request-rate inf \ + --save-result \ + --result-dir ./bench-results \ + --percentile-metrics ttft,tpot,itl ``` -## Source Build Escape Hatch +> **Troubleshooting: "Failed to infer device type"** — This error means vLLM's platform detection cannot find the CPU backend. The most common cause is installing the generic (CUDA) wheel from PyPI via `pip install vllm` instead of the CPU-specific wheel. The CPU wheel includes `+cpu` in its version string (e.g., `0.20.2+cpu`), which the platform detector requires. 
Fix by reinstalling the CPU wheel directly: +> +> ```bash +> export VLLM_VERSION=0.20.2 +> pip install --force-reinstall --extra-index-url https://download.pytorch.org/whl/cpu \ +> "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" +> ``` -Use source only when CPU wheels/images are unavailable or you need a local vLLM change. +### Concurrency Sweep -```bash -sudo apt-get update -sudo apt-get install -y gcc-12 g++-12 libnuma-dev +With `--request-rate inf`, all prompts fire simultaneously so `--num-prompts` directly controls concurrency. Sweep to see how latency and throughput scale under increasing batch pressure: -git clone https://github.com/vllm-project/vllm.git vllm_source -cd vllm_source - -uv venv --python 3.12 --seed --managed-python -source .venv/bin/activate -uv pip install -r requirements/build/cpu.txt --torch-backend cpu -uv pip install -r requirements/cpu.txt --torch-backend cpu -VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation +```bash +for N in 10 50 100 200 500; do + vllm bench serve \ + --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ + --dataset-name random \ + --random-input-len 128 \ + --random-output-len 128 \ + --num-prompts "${N}" \ + --num-warmups 5 \ + --request-rate inf \ + --save-result \ + --result-dir ./bench-results \ + --percentile-metrics ttft,tpot,itl +done ``` -If CMake detects CUDA during a CPU build, add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON`. If NumPy breaks imports, pin `numpy<2.0`. +### Testing & Tuning Methodology -## Model and Quantization Notes +- Test with different input/output lengths to understand how the model performs under different prompt and generation sizes. For example, try `--random-input-len` and `--random-output-len` values of `64`, `128`, `256`, and `512`. +- Test with different user concurrency levels using `--num-prompts` values of `10`, `50`, `100`, `200`, and `500` with `--request-rate inf`. 
+- Use one known-good model and change one knob at a time. Track TTFT, TPOT, output tokens per second, requests per second, peak RSS, NUMA locality, and OOM events. +- Vary only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. Compare results across runs using the saved JSON files in `./bench-results`. -- This README uses `ibm-granite/granite-3.0-3b-a800m-instruct` as a compact Granite MoE example: 3.3B total parameters with about 800M active parameters per token. -- Check the official [CPU-supported model list](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) before choosing a model. -- Prefer BF16 first; compare INT8 only after functional quality is acceptable. -- CPU quantization support includes AWQ and GPTQ on x86, plus compressed-tensors INT8 W8A8 on x86 and s390x. -- INT8 can reduce weight memory and bandwidth pressure, especially during decode-heavy serving. +### Using the vLLM Benchmark Suite -## Validate and Tune +The vLLM source tree includes a full performance benchmark harness at `.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh`. This is the same script used in vLLM's CI to gate regressions. It reads a JSON test definition, generates concrete benchmark commands, and (optionally) executes them. -Start the Docker or Python server first. If it is running in the foreground, open another terminal for these checks: +Prepare the environment by installing the vLLM CPU wheel into a Python virtual environment: ```bash -# Docker path, because the server above is named vllm-cpu. -docker exec vllm-cpu vllm collect-env - -# Python wheel or source-build path. -vllm collect-env - -curl -s http://localhost:8000/v1/models | jq . -SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1) -htop -numastat -p "${SERVER_PID}" +export HF_TOKEN=your_hf_token_here # <<<=== Required for gated Hugging Face models and faster downloads.
+export VLLM_VERSION=0.20.2 +python3 -m venv ~/vllm-venv +source ~/vllm-venv/bin/activate +pip install --extra-index-url https://download.pytorch.org/whl/cpu \ + "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \ + tabulate pandas ``` -Use one known-good model and change one knob at a time. Track TTFT, TPOT, output tokens per second, requests per second, peak RSS, NUMA locality, and OOM events. - -The vLLM Benchmark Suite lives in the vLLM source tree. Clone it, or reuse the source checkout from the source-build path, even if vLLM is already installed. Exact `MODEL_FILTER` values must exist in the CPU test JSON; the current suite includes `ibm-granite/granite-3.2-2b-instruct` as the pre-curated Granite profile. +Clone the source tree (or reuse the checkout from a source build): ```bash git clone https://github.com/vllm-project/vllm.git vllm_source cd vllm_source +export VLLM_TARGET_DEVICE=cpu +``` + +Run a dry-run first to inspect the generated commands without executing them: -ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 \ - MODEL_FILTER=ibm-granite/granite-3.2-2b-instruct DTYPE_FILTER=bfloat16 \ +```bash +source ~/vllm-venv/bin/activate +HF_TOKEN="${HF_TOKEN}" \ +ON_CPU=1 \ +SERVING_JSON=serving-tests-cpu-text.json \ +DRY_RUN=1 \ +MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct \ +DTYPE_FILTER=bfloat16 \ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh +``` +To execute the benchmark (remove `DRY_RUN=1`): -find benchmark/results -maxdepth 2 -name "*.commands" -print +```bash +source ~/vllm-venv/bin/activate +HF_TOKEN="${HF_TOKEN}" \ +ON_CPU=1 \ +SERVING_JSON=serving-tests-cpu-text.json \ +MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct \ +DTYPE_FILTER=bfloat16 \ + bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh ``` -Use the generated `.commands` files as the baseline. 
When moving back to the Granite MoE serving model, replace the model ID and change only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. +Key environment variables: + +| Variable | Purpose | +| -------- | ------- | +| `HF_TOKEN` | Hugging Face token — required by the script's `check_hf_token` gate | +| `ON_CPU` | Set to `1` to use CPU-specific test configs | +| `SERVING_JSON` | JSON file defining test matrix (e.g., `serving-tests-cpu-text.json`) | +| `DRY_RUN` | Set to `1` to generate commands without executing | +| `MODEL_FILTER` | Run only benchmarks matching this model ID | +| `DTYPE_FILTER` | Run only benchmarks matching this dtype (e.g., `bfloat16`) | + +> **Note:** The `MODEL_FILTER` value must match an entry in the JSON test definition. If the model is not pre-curated in the CPU test JSON, you can add an entry or use the `vllm bench serve` approach above instead. ## References - [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) - [vLLM CPU-supported models](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) - [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/) +- [vLLM bench serve CLI](https://docs.vllm.ai/en/latest/cli/bench/serve.html) +- [vLLM bench latency CLI](https://docs.vllm.ai/en/latest/cli/bench/latency.html) - [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/) diff --git a/software/vllm/skill/SKILL.md b/software/vllm/skill/SKILL.md new file mode 100644 index 0000000..a66278f --- /dev/null +++ b/software/vllm/skill/SKILL.md @@ -0,0 +1,183 @@ +--- +name: vllm-xeon-cpu +description: "Deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs (CPU-only inference, no GPU). 
USE FOR: serving and performance optimizing LLMs on Intel Xeon, vLLM CPU install, CPU inference tuning, AMX bfloat16 setup, NUMA pinning, VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, --dtype=bfloat16, vllm/vllm-openai-cpu Docker image, hardware validation for AMX (amx_tile, amx_bf16, amx_int8), KV cache sizing per NUMA node, --max-num-batched-tokens / --max-num-seqs tuning, vllm bench serve on CPU, TTFT/TPOT measurement. DO NOT USE FOR: GPU vLLM (use upstream vLLM docs), training, quantization tuning beyond INT8/AWQ pointers, model architecture selection (use Intel Xeon AI Performance Advisor), non-Xeon CPUs, vLLM source build deep-dives." +--- + +# vLLM on Intel Xeon CPUs + +- **Skill version**: 1.0 +- **Tested against vLLM**: `v0.20.2` +- **Minimum vLLM**: `v0.17.0` + +## Upstream First + +Intel upstreams Xeon CPU optimizations directly to vLLM. This skill encodes deployment, tuning, validation, and a short benchmarking walkthrough — always consult upstream for the latest: + +- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) +- [vLLM SLM/LLM Recipes](https://recipes.vllm.ai/) +- [vLLM optimization and tuning](https://docs.vllm.ai/en/stable/configuration/optimization/) +- [vLLM bench serve CLI](https://docs.vllm.ai/en/latest/cli/bench/serve.html) + +## When to Use + +Invoke this skill when the user wants to: +- Deploy or serve vLLM on an Intel Xeon CPU (no GPU). +- Tune CPU-serving performance knobs (KV cache, OMP bind, batched tokens, num seqs). +- Validate Xeon hardware (AMX flags, NUMA topology) before deploying. +- Run a minimal CPU benchmark to measure TTFT / TPOT / throughput. + +**Do not use** for GPU vLLM, model training, deep quantization tuning, model selection (point users at the [Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor)), or non-Xeon CPUs. 
+ +## Prerequisites + +| Item | Requirement | +| --- | --- | +| OS | Linux | +| Python | 3.10–3.13 (only if not using Docker) | +| CPU | 4th Gen Intel Xeon or newer; must expose `amx_tile`, `amx_bf16`, `amx_int8` for best BF16/INT8 performance | +| Tools | `curl`, `numactl`, `jq`, `g++`, `python3-dev` (`sudo apt-get install -y --no-install-recommends curl git jq numactl htop g++ python3-dev`). `g++` and Python headers are required by PyTorch inductor to JIT-compile CPU kernels. | +| Docker | Recent Docker with `--cap-add SYS_NICE` and `--security-opt seccomp=unconfined` permitted | + +## Procedure 1 — Validate Hardware + +Goal: confirm Xeon generation, AMX support, and NUMA topology before deploying. + +1. Inspect CPU model, sockets, cores, threads, NUMA nodes: + ```bash + lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node" + ``` +2. Check for AMX and AVX-512 flags: + ```bash + lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16|avx512f|avx2" + ``` + - **All of `amx_tile`, `amx_bf16`, `amx_int8` present** → proceed; BF16 AMX kernels will activate. + - **AMX missing** --> Warn user. vLLM will still run, but inference throughput will be substantially lower. Recommend a 4th Gen Xeon (Sapphire Rapids) or newer. +3. Inspect NUMA topology: + ```bash + numactl --hardware + ``` + Record the NUMA node count `N` — it drives `--tensor-parallel-size` and KV cache sizing. + +## Procedure 2 — Deploy (Docker Fast Path) + +Goal: serve a model via the official `vllm/vllm-openai-cpu` image with Xeon-tuned env vars. + +0. (Optional) Install Docker if not present (Ubuntu 24.04): + ```bash + sudo apt-get update + sudo apt-get install -y docker.io + sudo systemctl enable --now docker + sudo usermod -aG docker $USER + newgrp docker # apply group without re-login + ``` +1. 
Pin a release tag (do not use `latest-x86_64` in production):
+   ```bash
+   export VLLM_VERSION=0.20.2 # update to the latest release that meets the minimum above
+   docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64
+   ```
+2. Run the container with the required Xeon env vars and Docker capabilities:
+   ```bash
+   # MODEL is a placeholder variable: set it to the Hugging Face model ID to serve.
+   docker run --rm \
+     --name vllm-cpu \
+     --security-opt seccomp=unconfined \
+     --cap-add SYS_NICE \
+     --shm-size=8g \
+     -p 8000:8000 \
+     -e HF_TOKEN="${HF_TOKEN}" \
+     -e VLLM_CPU_KVCACHE_SPACE=40 \
+     -e VLLM_CPU_OMP_THREADS_BIND=auto \
+     -e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
+     vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 \
+     "${MODEL:?set MODEL to a Hugging Face model ID}" \
+     --dtype=bfloat16 \
+     --max-num-batched-tokens 2048 \
+     --max-num-seqs 128
+   ```
+   - `SYS_NICE` + `seccomp=unconfined` enable vLLM's NUMA memory-policy calls. Without them, serving still works, but logs may show `get_mempolicy: Operation not permitted` and NUMA placement weakens.
+   - `VLLM_CPU_KVCACHE_SPACE` is in GiB **per NUMA node** — must fit in node-local memory.
+   - `VLLM_CPU_OMP_THREADS_BIND=auto` binds OpenMP workers to NUMA-local cores. For manual control use ranges like `0-31|32-63`.
+   - `VLLM_CPU_NUM_OF_RESERVED_CPU=1` keeps a core free for API serving, tokenization, networking, and OS work.
+3. Validate the OpenAI-compatible endpoint:
+   ```bash
+   # Assumes MODEL holds the same model ID that was passed to the server.
+   curl http://localhost:8000/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+       "model": "'"${MODEL}"'",
+       "messages": [{"role": "user", "content": "Give three CPU inference tuning tips."}],
+       "max_tokens": 128
+     }'
+   ```
+
+## Procedure 3 — Tune Performance Knobs
+
+Goal: improve TTFT / TPOT / throughput methodically. **Change one knob per run** and compare.
+
+1. Pick the use case:
+   - **Online serving** → start with `--max-num-batched-tokens 2048`, `--max-num-seqs 128`.
+   - **Offline batch** → start with `--max-num-batched-tokens 4096`, `--max-num-seqs 256`.
+2. Set `--tensor-parallel-size`:
+   - Single NUMA node → leave default.
+   - Multiple NUMA nodes → set to the NUMA node count `N` from Procedure 1.
+   - **`--tensor-parallel-size=6` is currently unsupported on CPU; avoid it.**
+3. Size `VLLM_CPU_KVCACHE_SPACE` (GiB per NUMA node):
+   - Larger value → more concurrency / longer context, but must fit in node-local RAM.
+   - If the server OOMs or pages, halve and retry.
+4. Bind OpenMP threads with `VLLM_CPU_OMP_THREADS_BIND`:
+   - Prefer `auto`. Use manual ranges (`0-31|32-63`) only when `auto` mis-pins (verify with `numastat -p $(pgrep -f 'vllm serve|api_server' | head -n1)`).
+5. (Experimental) Low-latency small-batch serving:
+   - `VLLM_CPU_SGL_KERNEL=1` enables x86 small-batch kernels. Requires AMX, BF16 weights, and compatible shapes.
+6. Quantization (when functional quality is acceptable):
+   - Try INT8 or AWQ to reduce weight memory and memory-bandwidth pressure. Validate quality before promoting.
+
+Full knob reference: [tuning matrix](./references/tuning-matrix.md).
+
+## Procedure 4 — Benchmark (Minimal CPU Walkthrough)
+
+Goal: measure TTFT, TPOT, and throughput against the running server with a reproducible warm-up.
+
+1. Confirm the server is reachable and inspect environment:
+   ```bash
+   curl -s http://localhost:8000/v1/models | jq .
+   docker exec vllm-cpu vllm collect-env # or `vllm collect-env` for native installs
+   SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1)
+   numastat -p "${SERVER_PID}" # verify NUMA locality
+   ```
+2. Run `vllm bench serve` with warm-ups (warm-ups avoid measuring JIT/compile overhead):
+   ```bash
+   # MODEL is a placeholder variable: set it to the model ID the server is serving.
+   vllm bench serve \
+     --model "${MODEL:?set MODEL to the served model ID}" \
+     --dataset-name random \
+     --random-input-len 128 \
+     --random-output-len 128 \
+     --num-prompts 100 \
+     --num-warmups 5 \
+     --request-rate inf \
+     --save-result \
+     --result-dir ./bench-results \
+     --percentile-metrics ttft,tpot,itl
+   ```
+3. Sweep methodically — **one variable per run**:
+   - Vary input/output lengths: `64`, `128`, `256`, `512`.
+   - Vary concurrency via `--num-prompts`: `10`, `100`, `1000`.
+   - Track TTFT, TPOT, output tokens/sec, requests/sec, peak RSS, NUMA locality, and any OOM events.
+   - When tuning, change only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. Compare results across runs using the saved JSON files in `./bench-results`.
+4. For the full CI-grade harness (`run-performance-benchmarks.sh` with `ON_CPU=1`, `SERVING_JSON`, `DRY_RUN`, `MODEL_FILTER`, `DTYPE_FILTER`), see the upstream [vLLM benchmarking docs](https://docs.vllm.ai/en/latest/benchmarking/cli/).
+
+## Output the Agent Should Produce
+
+After running these procedures, return to the user:
+- The hardware validation summary (Xeon generation, AMX flags present, NUMA node count).
+- The exact `docker run` command used, with values chosen for their hardware.
+- Any tuning recommendations with the **one knob changed** per recommendation and the expected metric impact.
+- Benchmark numbers (TTFT, TPOT, throughput) with the corresponding configuration.
+
+## References
+
+- [Tuning matrix](./references/tuning-matrix.md) — full env-var / CLI knob table with guard rails.
+- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/)
+- [vLLM CPU-supported models](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/)
+- [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/)
+- [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/)
+- [Intel Xeon AI Performance Advisor (cloud)](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor)
+- [Intel Xeon AI Performance Advisor (on-prem)](https://xeonprocessoradvisor.intel.com/on-prem-ai-performance-advisor)
+- [Intel AI Software Catalog — Model Guidance](https://swcatalog.intel.com/models)
diff --git a/software/vllm/skill/references/tuning-matrix.md b/software/vllm/skill/references/tuning-matrix.md
new file mode 100644
index 0000000..5467e32
--- /dev/null
+++ b/software/vllm/skill/references/tuning-matrix.md
@@ -0,0 +1,34 @@
+# vLLM Xeon CPU Tuning Matrix
+
+Full reference for environment variables and CLI flags relevant to vLLM CPU serving on Intel Xeon, with guard rails. Use alongside [SKILL.md](../SKILL.md) Procedure 3.
+
+## Environment Variables
+
+| Variable | Recommended | Guidance | Why it matters |
+| --- | --- | --- | --- |
+| `VLLM_CPU_KVCACHE_SPACE` | `40` (GiB) | Per **NUMA node**. Increase for more concurrency / longer context; must fit in node-local memory. Halve if the server OOMs or pages. | KV cache is the dominant CPU memory consumer; under-sizing throttles batching, over-sizing causes paging or OOM. |
+| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP workers to NUMA-local cores. Manual ranges look like `0-31\|32-63` (one range per NUMA node). Verify with `numastat -p <pid>`, where `<pid>` is the server process ID. | Cross-NUMA memory traffic degrades decode throughput. |
+| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Reserves cores for the API server, tokenization, networking, logging, and OS work. Raise on noisy hosts.
| Prevents OS / serving overhead from preempting OMP workers. |
+| `VLLM_CPU_SGL_KERNEL` | `0` (try `1` for low-latency SLM) | Experimental x86 small-batch kernels. Requires AMX, BF16 weights, and compatible shapes. | Can reduce latency for small-batch serving, but is shape-sensitive. |
+| `HF_TOKEN` | *(secret)* | Required for gated Hugging Face models. | Authentication. |
+
+## CLI Flags (`vllm serve` / Docker CMD)
+
+| Flag | Recommended | Guidance | Why it matters |
+| --- | --- | --- | --- |
+| `--dtype=bfloat16` | always on AMX-capable Xeon | Enables AMX BF16 kernels — the preferred CPU dtype. | Largest single performance lever on 4th Gen+ Xeon. |
+| `--tensor-parallel-size` | default for single NUMA; `N` for `N` NUMA nodes | Keeps shards local to NUMA memory. **`6` is currently unsupported on CPU.** | Wrong value forces cross-NUMA traffic or fails to start. |
+| `--max-num-batched-tokens` | `2048` online / `4096` offline | Cap on batched tokens per iteration. Higher → better prefill throughput, worse TTFT. | Tradeoff between TTFT and prefill throughput. |
+| `--max-num-seqs` | `128` online / `256` offline | Cap on concurrent sequences. Higher → better decode throughput, worse ITL. | Tradeoff between ITL and decode throughput. |
+| `--block-size` | leave default until baseline is recorded | Tune only after KV cache / OMP / batched-tokens are stable. | Interacts with KV cache layout; change last. |
+
+## Guard Rails
+
+- **One knob per run.** Vary only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` between benchmark runs. Save results to JSON and compare.
+- **Per-NUMA fit.** `VLLM_CPU_KVCACHE_SPACE` is per NUMA node — total memory consumption is `value × NUMA_node_count`. Confirm against `numactl --hardware` output.
+- **NUMA locality check.** After starting the server: `numastat -p $(pgrep -f 'vllm serve|api_server' | head -n1)`.
Memory should be concentrated on the expected node(s); large `other_node` numbers indicate mis-binding.
+- **Docker capabilities.** Without `--cap-add SYS_NICE` and `--security-opt seccomp=unconfined`, vLLM cannot set NUMA memory policy; you will see `get_mempolicy: Operation not permitted` in logs and weaker placement.
+- **Unsupported TP.** `--tensor-parallel-size=6` is currently unsupported on CPU. Use `2`, `4`, or `8` depending on socket / NUMA layout.
+- **Reserved cores.** With `VLLM_CPU_NUM_OF_RESERVED_CPU=1`, OMP workers will land on the remaining cores. If serving latency spikes under load, raise the reserved count before re-running benchmarks.
+- **AMX absent.** If `lscpu` does not list `amx_tile`, `amx_bf16`, `amx_int8`, BF16 throughput collapses to AVX-512 paths. Warn the user and recommend 4th Gen Xeon or newer instead of further tuning.
+- **Quantization order.** Validate functional quality with BF16 first; only then evaluate INT8 / AWQ to reduce weight memory and bandwidth.

From de30726a698173eb5e2751f2fcb0825b69cec4cb Mon Sep 17 00:00:00 2001
From: lucasmelogithub
Date: Fri, 29 May 2026 10:04:49 -0500
Subject: [PATCH 7/8] Fix title

---
 software/vllm/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/software/vllm/README.md b/software/vllm/README.md
index 0138f3b..b3df362 100644
--- a/software/vllm/README.md
+++ b/software/vllm/README.md
@@ -1,4 +1,4 @@
-# vLLM on Intel Xeon CPUs
+# vLLM on Intel Xeon Processors
 
 This guide provides guidance on running vLLM on Intel Xeon processors.
From 14d64c5e9607ef0ce58a8ed0d0c00bec5623c54b Mon Sep 17 00:00:00 2001
From: lucasmelogithub
Date: Fri, 29 May 2026 10:36:26 -0500
Subject: [PATCH 8/8] Fix wording

---
 software/vllm/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/software/vllm/README.md b/software/vllm/README.md
index b3df362..9a473be 100644
--- a/software/vllm/README.md
+++ b/software/vllm/README.md
@@ -1,6 +1,6 @@
 # vLLM on Intel Xeon Processors
 
-This guide provides guidance on running vLLM on Intel Xeon processors.
+This guide provides recommendations for running vLLM on Intel Xeon processors.
 
 ## Upstream First