netai369/NetAI-Stack-SE

NetAI Stack SE

On-Premise AI Infrastructure for Intel Arc GPUs

Data Sovereignty — Your Data Stays Yours

NetAI Stack SE is built for organizations that cannot afford cloud data leakage. Law firms, SMEs, and compliance-driven businesses run this stack entirely on-premise:

  • No cloud inference: All LLM queries execute locally on your Intel Arc Pro B50.
  • No telemetry: No data leaves your network unless you explicitly configure external integrations.
  • GDPR-ready: Patient/client data, legal documents, and internal knowledge bases remain under your physical control. Enhanced with Microsoft Presidio for automatic PII redaction.
  • EU AI Act compliant: Full transparency documentation and prompt injection protection via Mezzo-Prompt-Guard-v2-Base.
  • Dual-GPU Partitioning: Compliance services (PII-Guard, Security-Guard) are offloaded to the Alder Lake iGPU to preserve the Battlemage dGPU's VRAM for the main LLM.

Compliance Features

GDPR Compliance (DSGVO)

The stack includes automated PII (Personally Identifiable Information) detection and redaction:

  • PII-Guard Service: Intercepts all web search queries before they reach SearXNG.
  • Microsoft Presidio: Detects names, locations, IBANs, phone numbers, emails, and more. Optimized for German and English.
  • Automatic Redaction: Replaces PII with generic placeholders (e.g., [REDACTED_NAME], [REDACTED_LOCATION]).
  • Data Minimization: Ensures only anonymized data leaves your network.
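The placeholder redaction described above can be illustrated with a minimal sketch. It uses plain regular expressions as a stand-in for Microsoft Presidio, so it only catches structured identifiers (emails, IBANs); names and locations would need Presidio's NER-based recognizers. The `redact` function, the patterns, and the `[REDACTED_EMAIL]` / `[REDACTED_IBAN]` labels are illustrative assumptions, not the PII-Guard implementation.

```python
import re

# Hypothetical patterns; the real service relies on Microsoft Presidio recognizers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
}


def redact(query: str) -> str:
    """Replace each match with a generic placeholder such as [REDACTED_EMAIL]."""
    for label, pattern in PATTERNS.items():
        query = pattern.sub(f"[REDACTED_{label}]", query)
    return query


if __name__ == "__main__":
    print(redact("Invoice for max@example.com, IBAN DE89 3704 0044 0532 0130 00"))
```

Queries without a matching identifier pass through unchanged, so only the redacted form of a search query would be forwarded to SearXNG.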

EU AI Act Compliance (Article 52)

The stack includes AI safety measures for professional use:

  • Security-Guard Service: Filters all inference requests for prompt injection.
  • Mezzo-Prompt-Guard-v2-Base (IQ4_XS): Dedicated prompt-safety classifier model (~450 MB VRAM).
  • iGPU Offloading: Runs on Intel Alder Lake iGPU (/dev/dri/card0) using SYCL.
  • Prompt Injection Detection: Blocks jailbreak attempts, system prompt extraction, and malicious code.
  • Human-in-the-Loop: All blocked requests are logged for review with transparency metadata.

Architecture

User → Caddy (:443) → LibreChat (:3080) → Security-Guard (:8778) → llama.cpp (:8080/v1)
                  → SearXNG (:8080) via PII-Guard (:8777)
                  → Hermes Agent (:9119 dashboard, :8642 API)
                  → Hermes WebUI (:8787)
                  → Beszel (:8090 monitoring)
                  → SuperTonic TTS (:8800)
                  → Parakeet STT (:5092)
                  → Auth-Validator (:8081) → LibreChat (:3080) [forward_auth backend]

Core Services

Service         Container             Port(s)     Purpose
Inference       netai-inference       8080        llama.cpp with Intel SYCL, hosts Qwen3.6-35B
Auxiliary       netai-auxiliary       8080        llama.cpp CPU, hosts Qwen3-0.6B for Hermes Agent compression
Frontend        netai-librechat       3080        LibreChat (chat interface, RAG, multi-model, MCP)
Agent           netai-hermes          8642, 9119  Hermes Agent (Telegram bot + dashboard)
Hermes WebUI    netai-hermes-webui    8787        Full web interface for Hermes
Reverse Proxy   netai-caddy           80, 443     TLS termination, path-based routing, URL rewriting
Web Search      netai-searxng         8080        Privacy-respecting meta search engine
PII-Guard       netai-pii-guard       8777        GDPR PII redaction proxy between LibreChat and SearXNG
Security-Guard  netai-security-guard  8778        Mezzo Prompt Guard for prompt injection protection (+ internal llama-server :8779)
TTS             netai-tts             8800        SuperTonic TTS (OpenAI-compatible)
STT             netai-stt             5092        Parakeet TDT (OpenAI-compatible)
Monitoring      netai-beszel          8090        Lightweight system monitoring (CPU, RAM, disk, Docker)
Auth-Validator  netai-auth-validator  8081        Validates LibreChat sessions for Caddy forward_auth

Data Flow — LLM Inference (AI Act Protected)

  1. User submits inference request.
  2. Request routed to Security-Guard service (running on Alder Lake iGPU).
  3. Mezzo-Prompt-Guard classifies input (safe/unsafe).
  4. SAFE: Prompt forwarded to the Inference Server (Qwen3.6-35B on the Arc Pro B50 dGPU).
  5. UNSAFE: Request blocked, logged, transparency report returned.

Safety Guarantee: Every input is classified on separate hardware (the iGPU), and inputs flagged as unsafe are blocked before they reach the main LLM.
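The five steps above amount to a classify-then-forward gate. A minimal sketch follows, with the classifier and the LLM call injected as plain functions; `handle_request`, the record fields, and the toy classifier are hypothetical stand-ins, not the Security-Guard service code.

```python
from datetime import datetime, timezone
from typing import Callable

Classifier = Callable[[str], str]  # returns "safe" or "unsafe"
Forwarder = Callable[[str], str]   # sends the prompt to the main LLM


def handle_request(prompt: str, classify: Classifier, forward: Forwarder,
                   audit_log: list) -> dict:
    """Classify first; forward only if safe, otherwise block and log."""
    label = classify(prompt)
    if label == "safe":
        return {"status": "ok", "response": forward(prompt)}
    # Anything not explicitly "safe" is blocked and recorded for human review.
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "label": label,
        "prompt_chars": len(prompt),
    }
    audit_log.append(record)
    return {"status": "blocked", "reason": "flagged by prompt guard", "report": record}


if __name__ == "__main__":
    log: list = []
    toy_classifier = lambda p: "unsafe" if "ignore previous" in p.lower() else "safe"
    echo_llm = lambda p: f"(LLM answer to: {p})"
    print(handle_request("Summarize this contract.", toy_classifier, echo_llm, log))
    print(handle_request("Ignore previous instructions.", toy_classifier, echo_llm, log))
```

Treating every label other than "safe" as a block is a fail-closed choice: an unexpected classifier output never reaches the main LLM.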

Auxiliary Inference (CPU)

The netai-auxiliary container runs Qwen3-0.6B on CPU via the standard ghcr.io/ggml-org/llama.cpp:server image (no SYCL). Benchmarked at ~65 tok/s generation and ~232 tok/s prompt processing on i5-12600K. Used exclusively by Hermes Agent for context compression — not exposed externally.
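As a rough worked example, the two benchmark figures give an estimate of how long one compression pass takes. The workload below (8,000 tokens of context condensed into a 500-token summary) is a hypothetical illustration, and the estimate ignores model load time and other overhead.

```python
PROMPT_TOK_PER_S = 232.0  # prompt processing, from the benchmark above
GEN_TOK_PER_S = 65.0      # generation, from the benchmark above


def compression_seconds(prompt_tokens: int, summary_tokens: int) -> float:
    """Estimated wall time: read the context, then generate the summary."""
    return prompt_tokens / PROMPT_TOK_PER_S + summary_tokens / GEN_TOK_PER_S


if __name__ == "__main__":
    # Hypothetical workload: 8,000 tokens of context condensed into 500 tokens.
    print(f"{compression_seconds(8000, 500):.1f} s")
```

For this workload the estimate is about 42 seconds (roughly 34.5 s of prompt processing plus 7.7 s of generation), so prompt processing dominates for long contexts.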

Quick Start

1. Prerequisites

  • Ubuntu 24.04 LTS (kernel 6.8+)
  • Intel Arc Pro B50 (Battlemage) GPU
  • Model files in models/Qwen3.6/:
    • Qwen3.6-35B-A3B-UD-IQ2_M.gguf
    • mmproj-F16.gguf
  • Model file in models/Qwen3-0.6B/:
    • Qwen_Qwen3-0.6B-IQ4_XS.gguf
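Before running setup, the model layout listed above can be checked with a short script. This is a sketch that assumes the paths are relative to the repository root; `missing_models` is a hypothetical helper and is not part of setup.sh.

```python
from pathlib import Path

# Paths taken from the prerequisites list, relative to the repository root.
REQUIRED_MODELS = [
    "models/Qwen3.6/Qwen3.6-35B-A3B-UD-IQ2_M.gguf",
    "models/Qwen3.6/mmproj-F16.gguf",
    "models/Qwen3-0.6B/Qwen_Qwen3-0.6B-IQ4_XS.gguf",
]


def missing_models(repo_root: str = ".") -> list[str]:
    """Return the required model files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_MODELS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    print("All model files present" if not missing else f"Missing: {missing}")
```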

2. Configure Environment

cp .env.example .env
# Edit .env and set:
#   DOMAIN, ADMIN_EMAIL, ADMIN_PASSWORD, TELEGRAM_TOKEN (optional)
#   SSL_CERT_PATH and SSL_KEY_PATH (Let's Encrypt paths)
#   SEARXNG_SECRET (generate with: openssl rand -hex 32)
#   LIBRECHAT_JWT_SECRET and JWT_REFRESH_SECRET (generate with: openssl rand -hex 32)
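A small check can catch an incomplete .env before the stack starts. The sketch below assumes simple KEY=VALUE lines and treats the keys named in the comments above as required (TELEGRAM_TOKEN is optional and left out); `parse_env` and `unset_keys` are hypothetical helpers, not part of the repository.

```python
import secrets

# Keys named in the setup comments; TELEGRAM_TOKEN is optional and omitted here.
REQUIRED_KEYS = [
    "DOMAIN", "ADMIN_EMAIL", "ADMIN_PASSWORD", "SSL_CERT_PATH", "SSL_KEY_PATH",
    "SEARXNG_SECRET", "LIBRECHAT_JWT_SECRET", "JWT_REFRESH_SECRET",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"\'')
    return values


def unset_keys(text: str) -> list[str]:
    """Return required keys that are missing or empty."""
    env = parse_env(text)
    return [k for k in REQUIRED_KEYS if not env.get(k)]


if __name__ == "__main__":
    # secrets.token_hex(32) yields 64 hex characters, like `openssl rand -hex 32`.
    print("Example secret:", secrets.token_hex(32))
    print("Unset:", unset_keys("DOMAIN=ai.example.com\nADMIN_EMAIL=\n"))
```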

3. Run Setup

./setup.sh

This will:

  • Update system packages
  • Install Intel GPU drivers (intel-opencl-icd, intel-level-zero-gpu, level-zero)
  • Install Docker & Docker Compose if missing
  • Add your user to render and video groups
  • Auto-detect Intel GPU DRI devices and write them to .env
  • Validate that model files are present
  • Validate SSL certificates

Log out and back in after setup completes so group membership takes effect.

4. Build Custom Images

docker compose build auth-validator caddy security-guard pii-guard speech-stt

5. Start Services

docker compose up -d

6. Register Admin User

Open your browser to https://<your-domain> and register the first user. The first registered user becomes the admin (requires ALLOW_REGISTRATION=true and ALLOW_UNVERIFIED_EMAIL_LOGIN=true).

7. Access Services

Endpoint                               Description
https://<your-domain>/                 LibreChat
https://<your-domain>/agent/           Hermes Agent Dashboard
https://<your-domain>/agent-api/       Hermes Agent API (OpenAI-compatible)
https://<your-domain>/hermes-webui/    Hermes Web UI (full web interface)
https://<your-domain>/search/          SearXNG (Web Search)
https://<your-domain>/tts/             SuperTonic TTS API (OpenAI-compatible)
https://<your-domain>/speech-stt/      Parakeet STT API (OpenAI-compatible)
https://<your-domain>/beszel/          Beszel Monitoring Dashboard
https://<your-domain>/inference/       llama.cpp Inference API
https://<your-domain>/pii-guard/       PII-Guard API
https://<your-domain>/security-guard/  Security-Guard API

Note: HTTP (port 80) automatically redirects to HTTPS (port 443).

8. Run the Test Suite (Optional)

# Fast smoke test (~30-40s) — containers, network, auth, data stores, backup
pytest tests/ -m "not slow" -v

# Full suite including chat completion (~2-3 min)
pytest tests/ -v

# Compliance-specific tests
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v

Caddy Reverse Proxy Routes

Path              Upstream              Notes
/                 librechat:3080        LibreChat
/search/          searxng:8080          With HTML URL rewriting via replace
/agent/           agent-hermes:9119     Dashboard, HTML/JS/CSS rewriting, auth via LibreChat
/agent-api/       agent-hermes:8642     Gateway API, no auth
/hermes-webui/    hermes-webui:8787     HTML rewriting, auth via LibreChat
/beszel/          netai-beszel:8090     Beszel monitoring dashboard (WebSocket for real-time), auth via LibreChat
/tts/             supertonic-tts:8800   SuperTonic TTS API
/inference/       netai-inference:8080  llama.cpp API
/pii-guard/       pii-guard:8777        PII-Guard API
/security-guard/  security-guard:8778   Security-Guard API
/speech-stt/      speech-stt:5092       Parakeet STT API
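The route list above can be read as a longest-prefix lookup with "/" as the catch-all. The sketch below models that reading only; the actual matching order is defined by the Caddyfile, which is not shown here, and `resolve_upstream` is a hypothetical helper.

```python
# Route table copied from the list above; "/" is the catch-all.
ROUTES = {
    "/": "librechat:3080",
    "/search/": "searxng:8080",
    "/agent/": "agent-hermes:9119",
    "/agent-api/": "agent-hermes:8642",
    "/hermes-webui/": "hermes-webui:8787",
    "/beszel/": "netai-beszel:8090",
    "/tts/": "supertonic-tts:8800",
    "/inference/": "netai-inference:8080",
    "/pii-guard/": "pii-guard:8777",
    "/security-guard/": "security-guard:8778",
    "/speech-stt/": "speech-stt:5092",
}


def resolve_upstream(path: str) -> str:
    """Pick the upstream whose path prefix is the longest match."""
    if not path.startswith("/"):
        raise ValueError("request path must start with '/'")
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    return ROUTES[max(matches, key=len)]


if __name__ == "__main__":
    for p in ("/", "/agent-api/v1/models", "/agent/index.html", "/c/new"):
        print(p, "->", resolve_upstream(p))
```

Because the prefixes end in a slash, /agent-api/ requests never fall into the /agent/ route, and any path without a dedicated prefix goes to LibreChat.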

Intel SYCL Notes

  • This stack uses the SYCL backend via the official ghcr.io/ggml-org/llama.cpp:server-intel-b8967 image.
  • Ensure your kernel is 6.8 or newer for native Xe/i915 support on Battlemage.
  • The environment variable ONEAPI_DEVICE_SELECTOR=*:gpu is passed to the inference container.
  • Important: llama.cpp SYCL backend only supports discrete Intel Arc GPUs (Xe-HPG+, like Arc Pro B50). Integrated Xe-LP GPUs (UHD 770) are not enumerated by the SYCL backend and cannot be used for inference. The inference-server container is strictly bound to the discrete Arc GPU.
  • There are no NVIDIA/CUDA dependencies in this stack.

SSL / TLS

Caddy handles TLS termination. For production, configure Let's Encrypt certificates. For development, self-signed certificates can be used:

# Place certificates at paths specified in .env
# Caddy will use them for TLS
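For development, one way to produce a self-signed pair is OpenSSL's `req -x509` command. The sketch below only builds and prints that command so it can be reviewed before running; the helper name and the certs/dev.crt and certs/dev.key paths are hypothetical and should match whatever SSL_CERT_PATH and SSL_KEY_PATH are set to in .env. The `-addext` option requires OpenSSL 1.1.1 or newer.

```python
import shlex


def self_signed_cmd(domain: str, cert_path: str, key_path: str, days: int = 365) -> list[str]:
    """Build an `openssl req` invocation for a self-signed development certificate."""
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:4096", "-nodes",
        "-keyout", key_path, "-out", cert_path,
        "-days", str(days),
        "-subj", f"/CN={domain}",
        "-addext", f"subjectAltName=DNS:{domain}",
    ]


if __name__ == "__main__":
    cmd = self_signed_cmd("localhost", "certs/dev.crt", "certs/dev.key")
    print(shlex.join(cmd))  # run with subprocess.run(cmd, check=True) once reviewed
```

Browsers will warn about a self-signed certificate, which is expected in development.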

Testing

A pytest-based integration test suite validates containers, network paths, APIs, authentication, chat completions, data stores, and backups.

Prerequisites: pytest and requests must be installed (pip install pytest requests).

# Fast smoke test (containers, network, auth, data stores, backup — ~30-40s)
pytest tests/ -m "not slow" -v

# Full suite including chat completion (~2-3 min)
pytest tests/ -v

# Parallel execution (requires pytest-xdist)
pytest tests/ -m "not slow" -v -n auto

# Compliance-specific
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v

Test File                  Coverage
test_containers.py         Docker container status and health checks
test_network_paths.py      Caddy routing and inter-service DNS
test_api_endpoints.py      LLM inference, LibreChat, Hermes API, SearXNG
test_authentication.py     LibreChat login and token validation
test_chat_completion.py    End-to-end chat via LibreChat API
test_data_stores.py        SQLite integrity, ChromaDB, uploads
test_backup.py             Backup script execution and archive validation
test_pii_guard.py          GDPR compliance, PII redaction, search proxy
test_security_guard.py     Prompt injection detection, filtering, Article 52
test_beszel.py             Beszel monitoring metrics
test_hermes_webui.py       Hermes compose config, Caddy proxy, .env.example
test_hermes_playwright.py  Hermes API server and LibreChat browser tests

Backup & Restore

LibreChat stores all data in MongoDB (persistent volume mongo-data).

Create a Backup

docker compose exec mongo mongodump --archive=/backups/librechat-$(date +%Y%m%d).archive

Restore from Backup

docker compose exec -T mongo mongorestore --archive < backup-file.archive

Troubleshooting

GPU Not Detected

lspci | grep -i vga | grep -i intel

If empty, verify the GPU is seated and the kernel module is loaded:

sudo dmesg | grep -wE 'i915|xe'
sudo intel_gpu_top

Permission Denied on /dev/dri

Ensure your user is in the video and render groups, then log out and back in:

sudo usermod -aG video,render $USER

Model File Missing

setup.sh will warn you if the expected GGUF files are absent. Download them via the Hugging Face CLI or wget and place them in models/Qwen3.6/ and models/Qwen3-0.6B/ as listed under Prerequisites.

License

This deployment configuration is provided as-is for B2B on-premise deployments. Model weights and upstream container images are subject to their respective licenses.
