diff --git a/demos/continuous_batching/rag/README.md b/demos/continuous_batching/rag/README.md
index a7951647de..fc7b9f578d 100644
--- a/demos/continuous_batching/rag/README.md
+++ b/demos/continuous_batching/rag/README.md
@@ -1,8 +1,7 @@
 # RAG demo with OpenVINO Model Server {#ovms_demos_continuous_batching_rag}
 
-## Creating models repository for all the endpoints with ovms --pull or python export_model.py script
+## Creating models repository for all the endpoints
 
-### 1. Download the preconfigured models using ovms --pull option from [HugginFaces Hub OpenVINO organization](https://huggingface.co/OpenVINO) (Simple usage)
 
 ::::{tab-set}
 :::{tab-item} With Docker
@@ -20,110 +19,77 @@ docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/mo
 ```
 :::
-:::{tab-item} On Baremetal Host
+:::{tab-item} On Baremetal Windows
 **Required:** OpenVINO Model Server package - see [deployment instructions](../../../docs/deploying_server_baremetal.md) for details.
 ```bat
 mkdir models
-ovms --pull --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation
-ovms --pull --model_repository_path models --source_model OpenVINO/bge-base-en-v1.5-fp16-ov --task embeddings
-ovms --pull --model_repository_path models --source_model OpenVINO/bge-reranker-base-fp16-ov --task rerank
+ovms --pull --model_repository_path models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation --target_device GPU
+ovms --pull --model_repository_path models --source_model OpenVINO/bge-base-en-v1.5-fp16-ov --task embeddings --target_device GPU
+ovms --pull --model_repository_path models --source_model OpenVINO/bge-reranker-base-fp16-ov --task rerank --target_device GPU
 
 ovms --add_to_config --config_path models/config.json --model_name OpenVINO/Qwen3-8B-int4-ov --model_path OpenVINO/Qwen3-8B-int4-ov
 ovms --add_to_config --config_path models/config.json --model_name OpenVINO/bge-base-en-v1.5-fp16-ov --model_path OpenVINO/bge-base-en-v1.5-fp16-ov
 ovms --add_to_config --config_path models/config.json --model_name OpenVINO/bge-reranker-base-fp16-ov --model_path OpenVINO/bge-reranker-base-fp16-ov
 ```
 :::
+::::
-:::{tab-item} Windows service
-**Required:** OpenVINO Model Server package - see [deployment instructions](../../../docs/deploying_server_baremetal.md) for details.
-**Assumption:** install_ovms_service.bat was called without additional parameters - using default c:\models config path.
-```bat
-mkdir c:\models
-ovms --pull --model_repository_path c:\models --source_model OpenVINO/Qwen3-8B-int4-ov --task text_generation
-ovms --pull --model_repository_path c:\models --source_model OpenVINO/bge-base-en-v1.5-fp16-ov --task embeddings
-ovms --pull --model_repository_path c:\models --source_model OpenVINO/bge-reranker-base-fp16-ov --task rerank
+> NOTE: To deploy models in PyTorch format, you can use the optimum-cli functionality built into the `openvino/model_server:latest-py` image, described in [pull mode with optimum cli](../../../docs/pull_optimum_cli.md).
-ovms --add_to_config --config_path c:\models\config.json --model_name OpenVINO/Qwen3-8B-int4-ov --model_path OpenVINO/Qwen3-8B-int4-ov
-ovms --add_to_config --config_path c:\models\config.json --model_name OpenVINO/bge-base-en-v1.5-fp16-ov --model_path OpenVINO/bge-base-en-v1.5-fp16-ov
-ovms --add_to_config --config_path c:\models\config.json --model_name OpenVINO/bge-reranker-base-fp16-ov --model_path OpenVINO/bge-reranker-base-fp16-ov
-```
-:::
-::::
-::::
+> NOTE: You can also use the [Windows service](../../../docs/windows_service.md) setup for ease of use and shorter commands, with the default model_repository_path and config_path.
+
+## Deploying the model server
-### 2. Download the preconfigured models using ovms --pull option for models outside [HugginFaces Hub OpenVINO organization](https://huggingface.co/OpenVINO) in HuggingFace Hub.
(Advanced usage)
 ::::{tab-set}
 :::{tab-item} With Docker
-**Required:** Docker Engine installed
 ```bash
-mkdir models
-docker run --user $(id -u):$(id -g) -e HF_HOME=/hf_home/cache --rm -v $(pwd)/models:/models:rw -v /opt/home/user/.cache/huggingface/:/hf_home/cache openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Meta-Llama-3-8B-Instruct --task text_generation --weight-format int8
-docker run --user $(id -u):$(id -g) -e HF_HOME=/hf_home/cache --rm -v $(pwd)/models:/models:rw -v /opt/home/user/.cache/huggingface/:/hf_home/cache openvino/model_server:latest-py --pull --model_repository_path /models --source_model Alibaba-NLP/gte-large-en-v1.5 --task embeddings --weight-format int8
-docker run --user $(id -u):$(id -g) -e HF_HOME=/hf_home/cache --rm -v $(pwd)/models:/models:rw -v /opt/home/user/.cache/huggingface/:/hf_home/cache openvino/model_server:latest-py --pull --model_repository_path /models --source_model BAAI/bge-reranker-large --task rerank --weight-format int8
-
-docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --add_to_config --config_path /models/config.json --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8
-docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --add_to_config --config_path /models/config.json --model_name Alibaba-NLP/gte-large-en-v1.5 --model_path Alibaba-NLP/gte-large-en-v1.5 --weight-format int8
-docker run --user $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --add_to_config --config_path /models/config.json --model_name BAAI/bge-reranker-large --model_path BAAI/bge-reranker-large --weight-format int8
+docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
 ```
 :::
-:::{tab-item} On Baremetal Host
-**Required:** OpenVINO Model Server package - see [deployment instructions](../../../docs/deploying_server_baremetal.md) for details.
-
+:::{tab-item} On Baremetal Windows
 ```bat
-pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
-pip3 install -q -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/continuous_batching/rag/requirements.txt
-mkdir models
-set HF_HOME=C:\hf_home\cache # export HF_HOME=/hf_home/cache if using linux
-ovms --pull --model_repository_path models --source_model meta-llama/Meta-Llama-3-8B-Instruct --task text_generation --weight-format int8
-ovms --pull --model_repository_path models --source_model Alibaba-NLP/gte-large-en-v1.5 --task embeddings --weight-format int8
-ovms --pull --model_repository_path models --source_model BAAI/bge-reranker-large --task rerank --weight-format int8
-
-ovms --add_to_config --config_path /models/config.json --model_name meta-llama/Meta-Llama-3-8B-Instruct --model_path meta-llama/Meta-Llama-3-8B-Instruct
-ovms --add_to_config --config_path /models/config.json --model_name Alibaba-NLP/gte-large-en-v1.5 --model_path Alibaba-NLP/gte-large-en-v1.5
-ovms --add_to_config --config_path /models/config.json --model_name BAAI/bge-reranker-large --model_path BAAI/bge-reranker-large
+ovms --rest_port 8000 --config_path models\config.json
 ```
 :::
 ::::
+## Readiness Check
-### 3. Export models from HuggingFace Hub including conversion to OpenVINO format using the python script
-
-Use this procedure for all the models outside of OpenVINO organization in HuggingFace Hub.
-
+Wait for the models to load.
+You can check the status with a simple command:
 ```console
-curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
-pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
-
-mkdir models
-python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models
-python export_model.py embeddings_ov --source_model Alibaba-NLP/gte-large-en-v1.5 --weight-format int8 --config_file_path models/config.json
-python export_model.py rerank_ov --source_model BAAI/bge-reranker-large --weight-format int8 --config_file_path models/config.json
-```
-
-## Deploying the model server
-
-### With Docker
-```bash
-docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json
+curl http://localhost:8000/v3/models
 ```
-### On Baremetal Unix
-```bash
-ovms --rest_port 8000 --config_path models/config.json
 ```
-### Windows
-```bat
-ovms --rest_port 8000 --config_path models\config.json
+{
+  "data": [
+    {
+      "id": "OpenVINO/Qwen3-8B-int4-ov",
+      "object": "model",
+      "created": 1775552853,
+      "owned_by": "OVMS"
+    },
+    {
+      "id": "OpenVINO/bge-base-en-v1.5-fp16-ov",
+      "object": "model",
+      "created": 1775552853,
+      "owned_by": "OVMS"
+    },
+    {
+      "id": "OpenVINO/bge-reranker-base-fp16-ov",
+      "object": "model",
+      "created": 1775552853,
+      "owned_by": "OVMS"
+    }
+  ],
+  "object": "list"
+}
 ```
-### Server as Windows Service
-```bat
-sc start ovms
-```
 ## Using RAG
 When the model server is deployed and serving all 3 endpoints, run the [jupyter notebook](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/rag/rag_demo.ipynb) to use RAG chain with a fully remote execution.
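The readiness check above returns an OpenAI-style model list, which a client can use to gate the RAG notebook until all three endpoints are up. A minimal sketch in Python (editor's illustration, not part of the patch; the parsing logic and `required` set are assumptions, and the JSON literal is trimmed from the example response above):

```python
import json

# Example payload of `curl http://localhost:8000/v3/models`, trimmed from the
# readiness-check output above to the fields the check needs.
response_body = """
{
  "data": [
    {"id": "OpenVINO/Qwen3-8B-int4-ov", "object": "model"},
    {"id": "OpenVINO/bge-base-en-v1.5-fp16-ov", "object": "model"},
    {"id": "OpenVINO/bge-reranker-base-fp16-ov", "object": "model"}
  ],
  "object": "list"
}
"""

# Collect the served model ids and compare against the three models the demo expects.
served = {model["id"] for model in json.loads(response_body)["data"]}
required = {
    "OpenVINO/Qwen3-8B-int4-ov",           # chat/completions
    "OpenVINO/bge-base-en-v1.5-fp16-ov",   # embeddings
    "OpenVINO/bge-reranker-base-fp16-ov",  # rerank
}
missing = required - served
print("all endpoints ready" if not missing else f"still loading: {sorted(missing)}")
```

With the example payload, all three ids are present, so nothing is reported missing.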
diff --git a/demos/continuous_batching/rag/rag_demo.ipynb b/demos/continuous_batching/rag/rag_demo.ipynb
index 7bd1503e5a..732347e3bc 100644
--- a/demos/continuous_batching/rag/rag_demo.ipynb
+++ b/demos/continuous_batching/rag/rag_demo.ipynb
@@ -15,10 +15,6 @@
     "OpenVINO models:\n",
     " `OpenVINO/Qwen3-8B-int4-ov` for `chat/completions` and `OpenVINO/bge-base-en-v1.5-fp16-ov` for `embeddings` and `OpenVINO/bge-reranker-base-fp16-ov` for `rerank` endpoint.\n",
     "\n",
-    "or\n",
-    "Converted models:\n",
-    " `meta-llama/Meta-Llama-3-8B-Instruct` for `chat/completions` and `Alibaba-NLP/gte-large-en-v1.5` for `embeddings` and `BAAI/bge-reranker-large` for `rerank` endpoint. \n",
-    "\n",
     "Check https://github.com/openvinotoolkit/model_server/tree/main/demos/continuous_batching/rag/README.md to see how they can be deployed.\n",
     "LLM model, embeddings and rerank can be on hosted on the same model server instance or separately as needed.\n",
     "openai_api_base , base_url parameters with the target url and model names in the commands might need to be adjusted.
\n",
@@ -47,58 +43,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
-   "id": "7212515f-b59b-498c-a66a-f6c59de8fcab",
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "f97b31c1ba61476fa8d43eb48812691c",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "RadioButtons(description='Radio Selector:', options=('OpenVINO models', 'Converted models'), value='OpenVINO m…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "ee7b21f697bf4063a90a985e26b1b3f2",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Text(value='OpenVINO models', disabled=True)"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "from ipywidgets import widgets, link\n",
-    "from IPython.display import display\n",
-    "options = [\"OpenVINO models\", \"Converted models\"]\n",
-    "\n",
-    "# Create the radio buttons and a text box for output\n",
-    "radio_button = widgets.RadioButtons(options=options, description='Radio Selector:')\n",
-    "output_text = widgets.Text(disabled=True)\n",
-    "\n",
-    "# Link the value of the radio buttons to the text box\n",
-    "link((radio_button, 'value'), (output_text, 'value'))\n",
-    "\n",
-    "# Display both widgets\n",
-    "display(radio_button, output_text)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "id": "b085cd3f-5473-474e-b35c-a1a548d50f0e",
    "metadata": {},
    "outputs": [
@@ -111,16 +56,9 @@
     }
    ],
    "source": [
-    "print(output_text.value)\n",
-    "if output_text.value == \"OpenVINO models\":\n",
-    "    embeddings_model = \"OpenVINO/bge-base-en-v1.5-fp16-ov\"\n",
-    "    rerank_model = \"OpenVINO/bge-reranker-base-fp16-ov\"\n",
-    "    chat_model = \"OpenVINO/Qwen3-8B-int4-ov\"\n",
-    "else:\n",
-    "    embeddings_model = \"Alibaba-NLP/gte-large-en-v1.5\"\n",
-    "    rerank_model = \"BAAI/bge-reranker-large\"\n",
-    "    chat_model = 
\"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
-    "    "
+    "embeddings_model = \"OpenVINO/bge-base-en-v1.5-fp16-ov\"\n",
+    "rerank_model = \"OpenVINO/bge-reranker-base-fp16-ov\"\n",
+    "chat_model = \"OpenVINO/Qwen3-8B-int4-ov\" "
    ]
   },
   {
diff --git a/docs/pull_optimum_cli.md b/docs/pull_optimum_cli.md
index 0c046d3932..e44be7541f 100644
--- a/docs/pull_optimum_cli.md
+++ b/docs/pull_optimum_cli.md
@@ -56,7 +56,7 @@ ovms --pull --source_model "Qwen/Qwen3-4B" --model_repository_path /models --mod
 ```bash
 mkdir -p models
-docker run -u $(id -u):$(id -g) --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --source_model "Qwen/Qwen3-4B" --model_repository_path /models --model_name Qwen3-4B --task text_generation --weight-format int8
+docker run -u $(id -u):$(id -g) -e HF_HOME=/tmp -e TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor --rm -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --source_model "Qwen/Qwen3-4B" --model_repository_path /models --model_name Qwen3-4B --task text_generation --weight-format int8
 ```
 :::
@@ -85,7 +85,7 @@ You can mount the HuggingFace cache to avoid downloading the original model in c
 Below is an example pull command with optimum model cache directory sharing for model download:
 ```bash
-docker run -v /etc/passwd:/etc/passwd -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v ${HOME}/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Llama-3.2-1B-Instruct --task text_generation --weight-format int8
+docker run -e TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v ${HOME}/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Llama-3.2-1B-Instruct --task text_generation --weight-format int8
 ```
 or deploy without
caching the model files with passed HF_TOKEN for authorization: