-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathhtml_response.html
More file actions
204 lines (198 loc) · 9.59 KB
/
html_response.html
File metadata and controls
204 lines (198 loc) · 9.59 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
<p>To use KServe for LLaMA 4.1, you will need to follow these steps:</p>
<ol>
<li><strong>Deploy a KServe LLM Inference Service</strong>: First, deploy a KServe LLM Inference Service using the <code>llama4</code> model. You can do this by creating a YAML file with the following configuration:</li>
</ol>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama4
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama4
- --model_id=meta-llama/meta-llama-4-8b-instruct
- --backend=huggingface
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: HF_TOKEN
</code></pre>
<p>Apply this configuration using <code>kubectl apply -f <filename>.yaml</code>.</p>
<ol start="2">
<li><strong>Get the SERVICE_HOSTNAME</strong>: Run the following command to get the <code>SERVICE_HOSTNAME</code>:</li>
</ol>
<pre><code class="language-bash">SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama4 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
</code></pre>
<ol start="3">
<li><strong>Install the OpenAI SDK</strong>: Install the OpenAI SDK using pip:</li>
</ol>
<pre><code class="language-bash">pip3 install openai
</code></pre>
<ol start="4">
<li><strong>Create a Python script to interact with the KServe LLM Inference Service</strong>: Create a Python script (e.g., <code>sample_openai.py</code>) to interact with the KServe LLM Inference Service:</li>
</ol>
<pre><code class="language-python">import os
import json
from openai import OpenAI
# Set the API key and service hostname
api_key = "YOUR_API_KEY"
service_hostname = "YOUR_SERVICE_HOSTNAME"
# Create an OpenAI client
client = OpenAI(api_key=api_key)
# Define a function to send requests to the KServe LLM Inference Service
def send_request(prompt):
# Set the request headers and data
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {api_key}"
}
data = json.dumps({"prompt": prompt})
# Send the request to the KServe LLM Inference Service
response = client.post(
f"https://{service_hostname}/v1/models/llama4/infer",
headers=headers,
data=data
)
# Return the response
return response.json()
# Test the function
prompt = "Hello, world!"
response = send_request(prompt)
print(response)
</code></pre>
<p>Replace <code>YOUR_API_KEY</code> and <code>YOUR_SERVICE_HOSTNAME</code> with your actual API key and service hostname.</p>
<ol start="5">
<li><strong>Run the Python script</strong>: Run the Python script using <code>python sample_openai.py</code>. This will send a request to the KServe LLM Inference Service and print the response.</li>
</ol>
<p>Note: Make sure you have the necessary dependencies installed, including <code>openai</code> and <code>kubectl</code>. Additionally, ensure that your API key and service hostname are correct.</p>
<br><br><h2>RAG Context</h2><h3>Local Database Search</h3><ul>
<li>The model name for the above example is llama3.</li>
</ul>
<p>How to integrate with OpenAI SDK</p>
<p>Install the OpenAI SDK:</p>
<p>bash pip3 install openai</p>
<p>Create a Python script to interact with the KServe LLM Inference Service and save it as sample_openai.py:</p>
<p>=== "python" ```python from openai import OpenAI</p>
<p>Deployment_url = "</p>
<p>typial chat completion response</p>
<ul>
<li>Integrate KServe LLM Deployment with LLM SDKs</li>
</ul>
<p>This document provides the example of how to integrate KServe LLM Inference Service with the popular LLM SDKs.</p>
<p>Deploy a KServe LLM Inference Service</p>
<p>Please follow this example: Text Generation using LLama3 to deploy a KServe LLM Inference Service.</p>
<p>Get the SERVICE_HOSTNAME by running the following command:</p>
<p>bash SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)</p>
<ul>
<li>
<pre><code class="language-yaml"></code></pre>
</li>
</ul>
<p>kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama3
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama3
- --model_id=meta-llama/meta-llama-3-8b-instruct
- --backend=huggingface
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: HF_TOKEN</p>
<ul>
<li>
<pre><code class="language-yaml"></code></pre>
</li>
</ul>
<p>kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama3
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama3
- --model_id=meta-llama/meta-llama-3-8b-instruct
storageUri: hf://meta-llama/meta-llama-3-8b-instruct
resources:
limits:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
requests:</p>
<ul>
<li>=== "yaml" yaml cat <<EOF | kubectl apply -f - apiVersion: serving.kserve.io/v1beta1 kind: InferenceService metadata: name: huggingface-llama3 spec: predictor: model: modelFormat: name: huggingface args: - --model_name=llama3 - --model_dir=/mnt/models storageUri: hf://meta-llama/meta-llama-3-8b-instruct resources: limits: cpu: "6" memory: 24Gi nvidia.com/gpu: "1" requests: cpu: "6" memory: 24Gi nvidia.com/gpu: "1" env: - name: HF_TOKEN # Option 2 for authenticating with HF_TOKEN valueFrom:</li>
</ul>
<h3>Web Search Results</h3><h3>title</h3>
<p>ktransformers/doc/en/llama4.md at main - GitHub</p>
<h3>url</h3>
<p>https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/llama4.md</p>
<h3>content</h3>
<p>We are pleased to announce that KTransformers now provides experimental support for LLaMA 4 models through the powerful balance_serve backend introduced in v0.2.4.This update is available under the dedicated development branch: support-llama4, specifically targeting the newly released Meta LLaMA 4 model architecture. ⚠️ This support is currently not available on the main branch due to</p>
<h3>score</h3>
<p>0.63615024</p>
<h3>raw_content</h3>
<p>None</p>
<hr />
<h3>title</h3>
<p>Introduction - Kubeflow</p>
<h3>url</h3>
<p>https://www.kubeflow.org/docs/external-add-ons/kserve/introduction/</p>
<h3>content</h3>
<p>KServe is an open-source project that enables serverless inferencing on Kubernetes. KServe provides performant, high abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases. The following diagram shows the architecture of KServe:</p>
<h3>score</h3>
<p>0.40344992</p>
<h3>raw_content</h3>
<p>None</p>
<hr />
<h3>title</h3>
<p>Short guide to hosting your own llama.cpp openAI compatible ... - Reddit</p>
<h3>url</h3>
<p>https://www.reddit.com/r/LocalLLaMA/comments/15ak5k4/short_guide_to_hosting_your_own_llamacpp_openai/</p>
<h3>content</h3>
<p>Subreddit to discuss about Llama, the large language model created by Meta AI. Short guide to hosting your own llama.cpp openAI compatible web-server llama.cpp added a server component, this server is compiled when you run make as usual. Run the server ./server -m models/wizard-2-13b/ggml-model-q4_1.bin There's a bug with the openAI api unfortunately, you need the api_like_OAI.py file from this branch: https://github.com/ggerganov/llama.cpp/pull/2383, this is it as raw txt: https://raw.githubusercontent.com/ggerganov/llama.cpp/d8a8d0e536cfdaca0135f22d43fda80dc5e47cd8/examples/server/api_like_OAI.py. Replace the examples/server/api_like_OAI.py with the downloaded file Run the openai compatibility server, cd examples/server and python api_like_OAI.py You can access llama's built-in web server by going to localhost:8080 (port from ./server) Reddit Reddit Reddit Communities Best of Reddit Topics Reddit, Inc.</p>
<h3>score</h3>
<p>0.18732621</p>
<h3>raw_content</h3>
<p>None</p>
<hr />
<h3>title</h3>
<p>Documentation | Llama</p>
<h3>url</h3>
<p>https://www.llama.com/docs/get-started/</p>
<h3>content</h3>
<p>Documentation | Llama OverviewModels Getting the Models Running Llama How-To Guides Integration Guides Community Support Running Llama Running Llama Running Llama ModelsLlama 3.3Llama 3.2Llama 3.1Llama Guard 3Prompt GuardOther models ModelsLlama 3.3Llama 3.2Llama 3.1Llama Guard 3Prompt GuardOther models This guide provides information and resources to help you set up Llama including how to access the model, hosting, how-to and integration guides. Llama 3.3 70B Llama 3.2 Quantized Models Llama 3.3 70B (New) Llama 3.2 Lightweight Llama models with new multimodal capabilities that enable Llama to interpret visual information. The Llama 3.2 lightweight models enable Llama to run on phones, tablets, and edge devices. To discover more about what's possible with the Llama family of models, explore the topics below. Llama Everywhere</p>
<h3>score</h3>
<p>0.093620166</p>
<h3>raw_content</h3>
<p>None</p>
<hr />
<h3>title</h3>
<p>Running Meta Llama on Windows</p>
<h3>url</h3>
<p>https://www.llama.com/docs/llama-everywhere/running-meta-llama-on-windows/</p>
<h3>content</h3>
<p>This tutorial is a part of our Build with Meta Llama series, where we demonstrate the capabilities and practical applications of Llama for developers like you, so that you can leverage the benefits that Llama has to offer and incorporate it into your own applications. This tutorial supports the video Running Llama on Windows | Build with Meta Llama, where we learn how to run Llama on Windows</p>
<h3>score</h3>
<p>0.049907673</p>
<h3>raw_content</h3>
<p>None</p>
<hr />