On-Premise AI Infrastructure for Intel Arc GPUs
NetAI Stack SE is built for organizations that cannot afford cloud data leakage. Law firms, SMEs, and compliance-driven businesses run this stack entirely on-premise:
- No cloud inference: All LLM queries execute locally on your Intel Arc Pro B50.
- No telemetry: No data leaves your network unless you explicitly configure external integrations.
- GDPR-ready: Patient/client data, legal documents, and internal knowledge bases remain under your physical control. Enhanced with Microsoft Presidio for automatic PII redaction.
- EU AI Act compliant: Full transparency documentation and prompt injection protection via Mezzo-Prompt-Guard-v2-Base.
- Dual-GPU Partitioning: Compliance services (PII-Guard, Security-Guard) offloaded to Alder Lake iGPU to preserve Battlemage dGPU VRAM for the main LLM.
The stack includes automated PII (Personally Identifiable Information) detection and redaction:
- PII-Guard Service: Intercepts all web search queries before they reach SearXNG.
- Microsoft Presidio: Detects names, locations, IBANs, phone numbers, emails, and more. Optimized for German and English.
- Automatic Redaction: Replaces PII with generic placeholders (e.g.,
[REDACTED_NAME],[REDACTED_LOCATION]). - Data Minimization: Ensures only anonymized data leaves your network.
The stack includes AI safety measures for professional use:
- Security-Guard Service: Filters all inference requests for prompt injection.
- Mezzo-Prompt-Guard-v2-Base (IQ4_XS): Highly specialized safety model (~450MB VRAM).
- iGPU Offloading: Runs on Intel Alder Lake iGPU (
/dev/dri/card0) using SYCL. - Prompt Injection Detection: Blocks jailbreak attempts, system prompt extraction, malicious code.
- Human-in-the-Loop: All blocked requests are logged for review with transparency metadata.
User → Caddy (:443) → LibreChat (:3080) → Security-Guard (:8778) → llama.cpp (:8080/v1)
→ SearXNG (:8080) via PII-Guard (:8777)
→ Hermes Agent (:9119 dashboard, :8642 API)
→ Hermes WebUI (:8787)
→ Beszel (:8090 monitoring)
→ SuperTonic TTS (:8800)
→ Parakeet STT (:5092)
→ Auth-Validator (:8081) → LibreChat (:3080) [forward_auth backend]
| Service | Container | Port(s) | Purpose |
|---|---|---|---|
| Inference | netai-inference |
8080 | llama.cpp with Intel SYCL, hosts Qwen3.6-35B |
| Auxiliary | netai-auxiliary |
8080 | llama.cpp CPU, hosts Qwen3-0.6B for Hermes Agent compression |
| Frontend | netai-librechat |
3080 | LibreChat (chat interface, RAG, multi-model, MCP) |
| Agent | netai-hermes |
8642, 9119 | Hermes Agent (Telegram bot + dashboard) |
| Hermes WebUI | netai-hermes-webui |
8787 | Full web interface for Hermes |
| Reverse Proxy | netai-caddy |
80, 443 | TLS termination, path-based routing, URL rewriting |
| Web Search | netai-searxng |
8080 | Privacy-respecting meta search engine |
| PII-Guard | netai-pii-guard |
8777 | GDPR PII redaction proxy between LibreChat and SearXNG |
| Security-Guard | netai-security-guard |
8778 | Mezzo Prompt Guard for prompt injection protection (+ internal llama-server :8779) |
| TTS | netai-tts |
8800 | SuperTonic TTS (OpenAI-compatible) |
| STT | netai-stt |
5092 | Parakeet TDT (OpenAI-compatible) |
| Monitoring | netai-beszel |
8090 | Lightweight system monitoring (CPU, RAM, disk, Docker) |
| Auth-Validator | netai-auth-validator |
8081 | Validates LibreChat sessions for Caddy forward_auth |
- User submits inference request.
- Request routed to Security-Guard service (running on Alder Lake iGPU).
- Mezzo-Prompt-Guard classifies input (safe/unsafe).
- SAFE: Prompt forwarded to Inference Server (Qwen-35B on Arc B50 dGPU).
- UNSAFE: Request blocked, logged, transparency report returned.
Safety Guarantee: All adversarial inputs filtered on isolated hardware before reaching the main LLM.
The netai-auxiliary container runs Qwen3-0.6B on CPU via the standard ghcr.io/ggml-org/llama.cpp:server image (no SYCL). Benchmarked at ~65 tok/s generation and ~232 tok/s prompt processing on i5-12600K. Used exclusively by Hermes Agent for context compression — not exposed externally.
- Ubuntu 24.04 LTS (kernel 6.8+)
- Intel Arc Pro B50 (Battlemage) GPU
- Model files in
models/Qwen3.6/:Qwen3.6-35B-A3B-UD-IQ2_M.ggufmmproj-F16.gguf
- Model file in
models/Qwen3-0.6B/:Qwen_Qwen3-0.6B-IQ4_XS.gguf
cp .env.example .env
# Edit .env and set:
# DOMAIN, ADMIN_EMAIL, ADMIN_PASSWORD, TELEGRAM_TOKEN (optional)
# SSL_CERT_PATH and SSL_KEY_PATH (Let's Encrypt paths)
# SEARXNG_SECRET (generate with: openssl rand -hex 32)
# LIBRECHAT_JWT_SECRET and JWT_REFRESH_SECRET (generate with: openssl rand -hex 32)./setup.shThis will:
- Update system packages
- Install Intel GPU drivers (
intel-opencl-icd,intel-level-zero-gpu,level-zero) - Install Docker & Docker Compose if missing
- Add your user to
renderandvideogroups - Auto-detect Intel GPU DRI devices and write them to
.env - Validate that model files are present
- Validate SSL certificates
Log out and back in after setup completes so group membership takes effect.
docker compose build auth-validator caddy security-guard pii-guard speech-sttdocker compose up -dOpen your browser to https://<your-domain> and register the first user.
The first registered user becomes the admin (requires ALLOW_REGISTRATION=true and ALLOW_UNVERIFIED_EMAIL_LOGIN=true).
| Endpoint | Description |
|---|---|
https://<your-domain>/ |
LibreChat |
https://<your-domain>/agent/ |
Hermes Agent Dashboard |
https://<your-domain>/agent-api/ |
Hermes Agent API (OpenAI-compatible) |
https://<your-domain>/hermes-webui/ |
Hermes Web UI (full web interface) |
https://<your-domain>/search/ |
SearXNG (Web Search) |
https://<your-domain>/tts/ |
SuperTonic TTS API (OpenAI-compatible) |
https://<your-domain>/speech-stt/ |
Parakeet STT API (OpenAI-compatible) |
https://<your-domain>/beszel/ |
Beszel Monitoring Dashboard |
https://<your-domain>/inference/ |
llama.cpp Inference API |
https://<your-domain>/pii-guard/ |
PII-Guard API |
https://<your-domain>/security-guard/ |
Security-Guard API |
Note: HTTP (port 80) automatically redirects to HTTPS (port 443).
# Fast smoke test (~30-40s) — containers, network, auth, data stores, backup
pytest tests/ -m "not slow" -v
# Full suite including chat completion (~2-3 min)
pytest tests/ -v
# Compliance-specific tests
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v| Path | Upstream | Notes |
|---|---|---|
/ |
librechat:3080 |
LibreChat |
/search/ |
searxng:8080 |
With HTML URL rewriting via replace |
/agent/ |
agent-hermes:9119 |
Dashboard, HTML/JS/CSS rewriting, auth via LibreChat |
/agent-api/ |
agent-hermes:8642 |
Gateway API, no auth |
/hermes-webui/ |
hermes-webui:8787 |
HTML rewriting, auth via LibreChat |
/beszel/ |
netai-beszel:8090 |
Beszel monitoring dashboard (WebSocket for real-time), auth via LibreChat |
/tts/ |
supertonic-tts:8800 |
SuperTonic TTS API |
/inference/ |
netai-inference:8080 |
llama.cpp API |
/pii-guard/ |
pii-guard:8777 |
PII-Guard API |
/security-guard/ |
security-guard:8778 |
Security-Guard API |
/speech-stt/ |
speech-stt:5092 |
Parakeet STT API |
- This stack uses the SYCL backend via the official
ghcr.io/ggml-org/llama.cpp:server-intel-b8967image. - Ensure your kernel is 6.8 or newer for native Xe/i915 support on Battlemage.
- The environment variable
ONEAPI_DEVICE_SELECTOR=*:gpuis passed to the inference container. - Important: llama.cpp SYCL backend only supports discrete Intel Arc GPUs (Xe-HPG+, like Arc Pro B50). Integrated Xe-LP GPUs (UHD 770) are not enumerated by the SYCL backend and cannot be used for inference. The
inference-servercontainer is strictly bound to the discrete Arc GPU. - There are no NVIDIA/CUDA dependencies in this stack.
Caddy handles TLS termination. For production, configure Let's Encrypt certificates. For development, self-signed certificates can be used:
# Place certificates at paths specified in .env
# Caddy will use them for TLSA pytest-based integration test suite validates containers, network paths, APIs, authentication, chat completions, data stores, and backups.
Prerequisites: pytest and requests must be installed (pip install pytest requests).
# Fast smoke test (containers, network, auth, data stores, backup — ~30-40s)
pytest tests/ -m "not slow" -v
# Full suite including chat completion (~2-3 min)
pytest tests/ -v
# Parallel execution (requires pytest-xdist)
pytest tests/ -m "not slow" -v -n auto
# Compliance-specific
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v| Test File | Coverage |
|---|---|
test_containers.py |
Docker container status and health checks |
test_network_paths.py |
Caddy routing and inter-service DNS |
test_api_endpoints.py |
LLM inference, LibreChat, Hermes API, SearXNG |
test_authentication.py |
LibreChat login and token validation |
test_chat_completion.py |
End-to-end chat via LibreChat API |
test_data_stores.py |
SQLite integrity, ChromaDB, uploads |
test_backup.py |
Backup script execution and archive validation |
test_pii_guard.py |
GDPR compliance, PII redaction, search proxy |
test_security_guard.py |
Prompt injection detection, filtering, Article 52 |
test_beszel.py |
Beszel monitoring metrics |
test_hermes_webui.py |
Hermes compose config, Caddy proxy, .env.example |
test_hermes_playwright.py |
Hermes API server and LibreChat browser tests |
LibreChat stores all data in MongoDB (persistent volume mongo-data).
docker compose exec mongo mongodump --archive=/backups/librechat-$(date +%Y%m%d).archivedocker compose exec -T mongo mongorestore --archive=< backup-file.archivelspci | grep -i vga | grep -i intelIf empty, verify the GPU is seated and the kernel module is loaded:
sudo dmesg | grep i915
sudo intel_gpu_topEnsure your user is in the video and render groups, then log out and back in:
sudo usermod -aG video,render $USERsetup.sh will warn you if the expected GGUF files are absent. Download them via Hugging Face CLI or wget and place them in models/Qwen3.6/.
This deployment configuration is provided as-is for B2B on-premise deployments. Model weights and upstream container images are subject to their respective licenses.