112 changes: 39 additions & 73 deletions demos/continuous_batching/rag/README.md
@@ -1,8 +1,7 @@
# RAG demo with OpenVINO Model Server {#ovms_demos_continuous_batching_rag}

## Creating a models repository for all the endpoints with ovms --pull or the python export_model.py script
## Creating a models repository for all the endpoints

### 1. Download the preconfigured models with the ovms --pull option from the [Hugging Face Hub OpenVINO organization](https://huggingface.co/OpenVINO) (simple usage)
::::{tab-set}

:::{tab-item} With Docker
@@ -20,110 +19,77 @@ docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/mo
```
:::

:::{tab-item} On Baremetal Host
:::{tab-item} On Baremetal Windows
**Required:** OpenVINO Model Server package - see [deployment instructions](../../../docs/deploying_server_baremetal.md) for details.

```bat
mkdir models

ovms --pull --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation
ovms --pull --model_repository_path models --source_model OpenVINO/bge-base-en-v1.5-fp16-ov --task embeddings
ovms --pull --model_repository_path models --source_model OpenVINO/bge-reranker-base-fp16-ov --task rerank
ovms --pull --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation --target_device GPU
ovms --pull --model_repository_path models --source_model OpenVINO/bge-base-en-v1.5-fp16-ov --task embeddings --target_device GPU
ovms --pull --model_repository_path models --source_model OpenVINO/bge-reranker-base-fp16-ov --task rerank --target_device GPU

ovms --add_to_config --config_path models/config.json --model_name OpenVINO/Qwen3-8B-int4-ov --model_path OpenVINO/Qwen3-8B-int4-ov
ovms --add_to_config --config_path models/config.json --model_name OpenVINO/bge-base-en-v1.5-fp16-ov --model_path OpenVINO/bge-base-en-v1.5-fp16-ov
ovms --add_to_config --config_path models/config.json --model_name OpenVINO/bge-reranker-base-fp16-ov --model_path OpenVINO/bge-reranker-base-fp16-ov
```
:::
::::

:::{tab-item} Windows service
**Required:** OpenVINO Model Server package - see [deployment instructions](../../../docs/deploying_server_baremetal.md) for details.
**Assumption:** install_ovms_service.bat was called without additional parameters, so the default c:\models config path is used.
```bat
mkdir c:\models

ovms --pull --model_repository_path c:\models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation
ovms --pull --model_repository_path c:\models --source_model OpenVINO/bge-base-en-v1.5-fp16-ov --task embeddings
ovms --pull --model_repository_path c:\models --source_model OpenVINO/bge-reranker-base-fp16-ov --task rerank

ovms --add_to_config --config_path c:\models\config.json --model_name OpenVINO/Qwen3-8B-int4-ov --model_path OpenVINO/Qwen3-8B-int4-ov
ovms --add_to_config --config_path c:\models\config.json --model_name OpenVINO/bge-base-en-v1.5-fp16-ov --model_path OpenVINO/bge-base-en-v1.5-fp16-ov
ovms --add_to_config --config_path c:\models\config.json --model_name OpenVINO/bge-reranker-base-fp16-ov --model_path OpenVINO/bge-reranker-base-fp16-ov
```
:::
::::
> NOTE: If you want to deploy models in PyTorch format, you can use the built-in OVMS optimum-cli functionality of `openvino/model_server:latest-py`, described in [pull mode with optimum cli](../../../docs/pull_optimum_cli.md).

> NOTE: You can also use [the windows service](../../../docs/windows_service.md) setup for ease of use and shorter commands, with the default model_repository_path and config_path.
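For reference, after the `--add_to_config` calls above, `models/config.json` should contain one entry per served model. A rough sketch of the expected shape (illustrative only; the exact schema can differ between OVMS releases):

```json
{
  "mediapipe_config_list": [
    {"name": "OpenVINO/Qwen3-8B-int4-ov", "base_path": "OpenVINO/Qwen3-8B-int4-ov"},
    {"name": "OpenVINO/bge-base-en-v1.5-fp16-ov", "base_path": "OpenVINO/bge-base-en-v1.5-fp16-ov"},
    {"name": "OpenVINO/bge-reranker-base-fp16-ov", "base_path": "OpenVINO/bge-reranker-base-fp16-ov"}
  ],
  "model_config_list": []
}
```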

## Deploying the model server

### 2. Download models hosted outside the [OpenVINO organization](https://huggingface.co/OpenVINO) on Hugging Face Hub using the ovms --pull option (advanced usage)
::::{tab-set}

:::{tab-item} With Docker
**Required:** Docker Engine installed
```bash
mkdir models
docker run --user $(id -u):$(id -g) -e HF_HOME=/hf_home/cache --rm -v $(pwd)/models:/models:rw -v ${HOME}/.cache/huggingface/:/hf_home/cache openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Meta-Llama-3-8B-Instruct --task text_generation --weight-format int8
docker run --user $(id -u):$(id -g) -e HF_HOME=/hf_home/cache --rm -v $(pwd)/models:/models:rw -v ${HOME}/.cache/huggingface/:/hf_home/cache openvino/model_server:latest-py --pull --model_repository_path /models --source_model Alibaba-NLP/gte-large-en-v1.5 --task embeddings --weight-format int8
docker run --user $(id -u):$(id -g) -e HF_HOME=/hf_home/cache --rm -v $(pwd)/models:/models:rw -v ${HOME}/.cache/huggingface/:/hf_home/cache openvino/model_server:latest-py --pull --model_repository_path /models --source_model BAAI/bge-reranker-large --task rerank --weight-format int8

docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --add_to_config --config_path /models/config.json --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path meta-llama/Meta-Llama-3-8B-Instruct
docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --add_to_config --config_path /models/config.json --model_name Alibaba-NLP/gte-large-en-v1.5 --model_path Alibaba-NLP/gte-large-en-v1.5
docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --add_to_config --config_path /models/config.json --model_name BAAI/bge-reranker-large --model_path BAAI/bge-reranker-large
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
```
:::

:::{tab-item} On Baremetal Host
**Required:** OpenVINO Model Server package - see [deployment instructions](../../../docs/deploying_server_baremetal.md) for details.

:::{tab-item} On Baremetal Windows
```bat
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
pip3 install -q -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/rag/requirements.txt
mkdir models
set HF_HOME=C:\hf_home\cache
REM On Linux use: export HF_HOME=/hf_home/cache
ovms --pull --model_repository_path models --source_model meta-llama/Meta-Llama-3-8B-Instruct --task text_generation --weight-format int8
ovms --pull --model_repository_path models --source_model Alibaba-NLP/gte-large-en-v1.5 --task embeddings --weight-format int8
ovms --pull --model_repository_path models --source_model BAAI/bge-reranker-large --task rerank --weight-format int8

ovms --add_to_config --config_path models\config.json --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path meta-llama/Meta-Llama-3-8B-Instruct
ovms --add_to_config --config_path models\config.json --model_name Alibaba-NLP/gte-large-en-v1.5 --model_path Alibaba-NLP/gte-large-en-v1.5
ovms --add_to_config --config_path models\config.json --model_name BAAI/bge-reranker-large --model_path BAAI/bge-reranker-large
ovms --rest_port 8000 --config_path models\config.json
```
:::
::::
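Note that meta-llama/Meta-Llama-3-8B-Instruct is a gated model, so the pull can fail with an authorization error unless a Hugging Face token is provided. A sketch of passing the token through the environment (HF_TOKEN is the standard Hugging Face variable; substitute your own value):

```bash
# Hypothetical token placeholder; generate one in your Hugging Face account settings
export HF_TOKEN=<your_hf_token>
docker run --user $(id -u):$(id -g) -e HF_TOKEN -e HF_HOME=/hf_home/cache --rm \
  -v $(pwd)/models:/models:rw -v ${HOME}/.cache/huggingface/:/hf_home/cache \
  openvino/model_server:latest-py --pull --model_repository_path /models \
  --source_model meta-llama/Meta-Llama-3-8B-Instruct --task text_generation --weight-format int8
```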

## Readiness Check

### 3. Export models from HuggingFace Hub including conversion to OpenVINO format using the python script

Use this procedure for all the models outside of the OpenVINO organization in the Hugging Face Hub.

Wait for the models to load. You can check the status with a simple command:
```console
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt

mkdir models
python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config.json
python export_model.py rerank_ov --source_model BAAI/bge-reranker-large --weight-format int8 --config_file_path models/config.json
```
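If the export succeeds, the models directory should roughly match the layout below (an illustrative sketch; the exact directory names follow the model paths written to the config):

```console
models
├── config.json
├── meta-llama
│   └── Meta-Llama-3-8B-Instruct
├── Alibaba-NLP
│   └── gte-large-en-v1.5
└── BAAI
    └── bge-reranker-large
```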

## Deploying the model server

### With Docker
```bash
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
curl http://localhost:8000/v3/models
```
### On Baremetal Unix
```bash
ovms --rest_port 8000 --config_path models/config.json
```
### Windows
```bat
ovms --rest_port 8000 --config_path models\config.json
```

Example response from the readiness check:
```json
{
"data": [
{
"id": "OpenVINO/Qwen3-8B-int4-ov",
"object": "model",
"created": 1775552853,
"owned_by": "OVMS"
},
{
"id": "OpenVINO/bge-base-en-v1.5-fp16-ov",
"object": "model",
"created": 1775552853,
"owned_by": "OVMS"
},
{
"id": "OpenVINO/bge-reranker-base-fp16-ov",
"object": "model",
"created": 1775552853,
"owned_by": "OVMS"
}
],
"object": "list"
}
```
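Beyond listing the models, each endpoint can be smoke-tested individually. For example, a minimal request to the OpenAI-compatible embeddings endpoint (model name as registered above; adjust host and port to your deployment):

```console
curl http://localhost:8000/v3/embeddings -H "Content-Type: application/json" -d '{"model": "OpenVINO/bge-base-en-v1.5-fp16-ov", "input": "hello world"}'
```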

### Server as Windows Service
```bat
sc start ovms
```
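To confirm the service is running, query its state with the standard Windows service tooling (service name as registered by install_ovms_service.bat):

```bat
sc query ovms
```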
## Using RAG

When the model server is deployed and serving all 3 endpoints, run the [jupyter notebook](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb) to use a RAG chain with fully remote execution.
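For a quick end-to-end sanity check before opening the notebook, you can send a minimal chat request (OpenAI-compatible chat/completions route; model name as deployed above):

```console
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d '{"model": "OpenVINO/Qwen3-8B-int4-ov", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```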
70 changes: 4 additions & 66 deletions demos/continuous_batching/rag/rag_demo.ipynb
@@ -15,10 +15,6 @@
"OpenVINO models:\n",
" `OpenVINO/Qwen3-8B-int4-ov` for `chat/completions`, `OpenVINO/bge-base-en-v1.5-fp16-ov` for `embeddings`, and `OpenVINO/bge-reranker-base-fp16-ov` for the `rerank` endpoint.\n",
"\n",
"or\n",
"Converted models:\n",
" `meta-llama/Meta-Llama-3-8B-Instruct` for `chat/completions` and `Alibaba-NLP/gte-large-en-v1.5` for `embeddings` and `BAAI/bge-reranker-large` for `rerank` endpoint. \n",
"\n",
"Check https://github.com/openvinotoolkit/model_server/tree/main/demos/continuous_batching/rag/README.md to see how they can be deployed.\n",
"The LLM, embeddings and rerank models can be hosted on the same model server instance or separately, as needed.\n",
"The openai_api_base and base_url parameters (the target URL) and the model names in the commands might need to be adjusted.\n",
@@ -47,58 +43,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7212515f-b59b-498c-a66a-f6c59de8fcab",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f97b31c1ba61476fa8d43eb48812691c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"RadioButtons(description='Radio Selector:', options=('OpenVINO models', 'Converted models'), value='OpenVINO m…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ee7b21f697bf4063a90a985e26b1b3f2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Text(value='OpenVINO models', disabled=True)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from ipywidgets import widgets, link\n",
"from IPython.display import display\n",
"options = [\"OpenVINO models\", \"Converted models\"]\n",
"\n",
"# Create the radio buttons and a text box for output\n",
"radio_button = widgets.RadioButtons(options=options, description='Radio Selector:')\n",
"output_text = widgets.Text(disabled=True)\n",
"\n",
"# Link the value of the radio buttons to the text box\n",
"link((radio_button, 'value'), (output_text, 'value'))\n",
"\n",
"# Display both widgets\n",
"display(radio_button, output_text)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "b085cd3f-5473-474e-b35c-a1a548d50f0e",
"metadata": {},
"outputs": [
@@ -111,16 +56,9 @@
}
],
"source": [
"print(output_text.value)\n",
"if output_text.value == \"OpenVINO models\":\n",
" embeddings_model = \"OpenVINO/bge-base-en-v1.5-fp16-ov\"\n",
" rerank_model = \"OpenVINO/bge-reranker-base-fp16-ov\"\n",
" chat_model = \"OpenVINO/Qwen3-8B-int4-ov\"\n",
"else:\n",
" embeddings_model = \"Alibaba-NLP/gte-large-en-v1.5\"\n",
" rerank_model = \"BAAI/bge-reranker-large\"\n",
" chat_model = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
" "
"embeddings_model = \"OpenVINO/bge-base-en-v1.5-fp16-ov\"\n",
"rerank_model = \"OpenVINO/bge-reranker-base-fp16-ov\"\n",
"chat_model = \"OpenVINO/Qwen3-8B-int4-ov\" "
]
},
{
4 changes: 2 additions & 2 deletions docs/pull_optimum_cli.md
@@ -56,7 +56,7 @@ ovms --pull --source_model "Qwen/Qwen3-4B" --model_repository_path /models --mod

```bash
mkdir -p models
docker run -u $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --source_model "Qwen/Qwen3-4B" --model_repository_path /models --model_name Qwen3-4B --task text_generation --weight-format int8
docker run -u $(id -u):$(id -g) -e HF_HOME=/tmp -e TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --source_model "Qwen/Qwen3-4B" --model_repository_path /models --model_name Qwen3-4B --task text_generation --weight-format int8
```

> Collaborator: why TORCHINDUCTOR_CACHE_DIR?
> was this command tested?
:::

@@ -85,7 +85,7 @@ You can mount the HuggingFace cache to avoid downloading the original model in c
Below is an example pull command that shares the optimum model cache directory for the model download:

```bash
docker run -v /etc/passwd:/etc/passwd -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v ${HOME}/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Llama-3.2-1B-Instruct --task text_generation --weight-format int8
docker run -e TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v ${HOME}/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Llama-3.2-1B-Instruct --task text_generation --weight-format int8
```

or deploy without caching the model files, with HF_TOKEN passed for authorization: