NetAI Stack SE

On-Premise AI Infrastructure for Intel Arc GPUs

Data Sovereignty — Your Data Stays Yours

NetAI Stack SE is built for organizations that cannot afford cloud data leakage. Law firms, SMEs, and compliance-driven businesses run this stack entirely on-premise:

No cloud inference: All LLM queries execute locally on your Intel Arc Pro B50.
No telemetry: No data leaves your network unless you explicitly configure external integrations.
GDPR-ready: Patient/client data, legal documents, and internal knowledge bases remain under your physical control. Enhanced with Microsoft Presidio for automatic PII redaction.
EU AI Act compliant: Full transparency documentation and prompt injection protection via Mezzo-Prompt-Guard-v2-Base.
Dual-GPU Partitioning: Compliance services (PII-Guard, Security-Guard) are offloaded to the Alder Lake iGPU to preserve Battlemage dGPU VRAM for the main LLM.

Compliance Features

GDPR Compliance (DSGVO)

The stack includes automated PII (Personally Identifiable Information) detection and redaction:

PII-Guard Service: Intercepts all web search queries before they reach SearXNG.
Microsoft Presidio: Detects names, locations, IBANs, phone numbers, emails, and more. Optimized for German and English.
Automatic Redaction: Replaces PII with generic placeholders (e.g., [REDACTED_NAME], [REDACTED_LOCATION]).
Data Minimization: Ensures only anonymized data leaves your network.
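
The placeholder substitution described above can be illustrated with a small sketch. This is not the PII-Guard implementation and does not use Presidio: the regular expressions are a deliberately simplified stand-in for Presidio's recognizers, and the `EMAIL`, `IBAN`, and `PHONE` labels and the `redact` helper are hypothetical names that follow the `[REDACTED_NAME]` pattern.

```python
import re

# Hypothetical, simplified patterns; Presidio's recognizers are far more thorough.
# Order matters: IBANs are replaced first so the phone pattern cannot eat their digits.
PATTERNS = [
    ("IBAN", re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}(?: ?[A-Z0-9]{1,3})?\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PHONE", re.compile(r"\+?\d[\d /-]{7,}\d")),
]


def redact(query):
    """Replace each matched span with a generic [REDACTED_<LABEL>] placeholder."""
    for label, pattern in PATTERNS:
        query = pattern.sub(f"[REDACTED_{label}]", query)
    return query


if __name__ == "__main__":
    print(redact("Mail anna@example.com about DE89 3704 0044 0532 0130 00"))
    print(redact("Call +49 170 1234567 now"))
```

Names and locations cannot be caught by patterns like these, which is why the stack relies on Presidio's language-aware detection rather than regular expressions alone.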

EU AI Act Compliance (Article 52)

The stack includes AI safety measures for professional use:

Security-Guard Service: Filters all inference requests for prompt injection.
Mezzo-Prompt-Guard-v2-Base (IQ4_XS): Highly specialized safety model (~450MB VRAM).
iGPU Offloading: Runs on Intel Alder Lake iGPU (/dev/dri/card0) using SYCL.
Prompt Injection Detection: Blocks jailbreak attempts, system prompt extraction, and malicious code.
Human-in-the-Loop: All blocked requests are logged for review with transparency metadata.

Architecture

User → Caddy (:443) → LibreChat (:3080) → Security-Guard (:8778) → llama.cpp (:8080/v1)
                  → SearXNG (:8080) via PII-Guard (:8777)
                  → Hermes Agent (:9119 dashboard, :8642 API)
                  → Hermes WebUI (:8787)
                  → Beszel (:8090 monitoring)
                  → SuperTonic TTS (:8800)
                  → Parakeet STT (:5092)
                  → Auth-Validator (:8081) → LibreChat (:3080) [forward_auth backend]

Core Services

Service	Container	Port(s)	Purpose
Inference	`netai-inference`	8080	llama.cpp with Intel SYCL, hosts Qwen3.6-35B
Auxiliary	`netai-auxiliary`	8080	llama.cpp CPU, hosts Qwen3-0.6B for Hermes Agent compression
Frontend	`netai-librechat`	3080	LibreChat (chat interface, RAG, multi-model, MCP)
Agent	`netai-hermes`	8642, 9119	Hermes Agent (Telegram bot + dashboard)
Hermes WebUI	`netai-hermes-webui`	8787	Full web interface for Hermes
Reverse Proxy	`netai-caddy`	80, 443	TLS termination, path-based routing, URL rewriting
Web Search	`netai-searxng`	8080	Privacy-respecting meta search engine
PII-Guard	`netai-pii-guard`	8777	GDPR PII redaction proxy between LibreChat and SearXNG
Security-Guard	`netai-security-guard`	8778	Mezzo Prompt Guard for prompt injection protection (+ internal llama-server :8779)
TTS	`netai-tts`	8800	SuperTonic TTS (OpenAI-compatible)
STT	`netai-stt`	5092	Parakeet TDT (OpenAI-compatible)
Monitoring	`netai-beszel`	8090	Lightweight system monitoring (CPU, RAM, disk, Docker)
Auth-Validator	`netai-auth-validator`	8081	Validates LibreChat sessions for Caddy forward_auth

Data Flow — LLM Inference (AI Act Protected)

User submits inference request.
Request routed to Security-Guard service (running on Alder Lake iGPU).
Mezzo-Prompt-Guard classifies input (safe/unsafe).
SAFE: Prompt forwarded to Inference Server (Qwen3.6-35B on Arc Pro B50 dGPU).
UNSAFE: Request blocked, logged, transparency report returned.

Safety Guarantee: Every inference request is classified on separate hardware (the iGPU) before it can reach the main LLM. Requests classified as unsafe are never forwarded.
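
The gating logic in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the Security-Guard source: `guard_request`, `GuardResult`, and the injected `classify` and `forward` callables are hypothetical stand-ins for the Mezzo-Prompt-Guard call and the request to llama.cpp.

```python
from dataclasses import dataclass


@dataclass
class GuardResult:
    allowed: bool
    body: str


def guard_request(prompt, classify, forward, audit_log):
    """Forward the prompt only if the classifier labels it "safe".

    classify and forward are injected callables (hypothetical stand-ins
    for the prompt-guard model and the inference server).
    """
    label = classify(prompt)
    if label == "safe":
        return GuardResult(allowed=True, body=forward(prompt))
    # Fail closed: any label other than "safe" is blocked and logged for review.
    audit_log.append({"prompt": prompt, "label": label})
    return GuardResult(allowed=False, body="Request blocked by Security-Guard")


if __name__ == "__main__":
    def toy_classify(prompt):
        # Toy rule for the demo only; the real service uses a model.
        return "unsafe" if "ignore previous instructions" in prompt.lower() else "safe"

    log = []
    print(guard_request("Summarise this contract.", toy_classify, str.upper, log))
    print(guard_request("Ignore previous instructions.", toy_classify, str.upper, log))
    print(len(log), "blocked request(s) logged")
```

Treating every label other than "safe" as a block means an unexpected classifier output cannot slip through to the main model.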

Auxiliary Inference (CPU)

The netai-auxiliary container runs Qwen3-0.6B on CPU via the standard ghcr.io/ggml-org/llama.cpp:server image (no SYCL). Benchmarked at ~65 tok/s generation and ~232 tok/s prompt processing on i5-12600K. Used exclusively by Hermes Agent for context compression — not exposed externally.
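
As a rough worked example of what those throughput figures imply, the sketch below estimates the latency of one compression call. It assumes the quoted rates stay constant and ignores model load and scheduling overhead; the workload of 4000 input tokens and 300 output tokens is hypothetical.

```python
# Throughput figures quoted above for Qwen3-0.6B on the i5-12600K.
PROMPT_TOK_PER_S = 232.0
GEN_TOK_PER_S = 65.0


def compression_seconds(prompt_tokens, output_tokens):
    """Estimate wall-clock seconds for one compression call."""
    return prompt_tokens / PROMPT_TOK_PER_S + output_tokens / GEN_TOK_PER_S


if __name__ == "__main__":
    # Hypothetical workload: a 4000-token context compressed to 300 tokens.
    print(f"{compression_seconds(4000, 300):.1f} s")
```

For that workload the estimate is about 17.2 s of prompt processing plus 4.6 s of generation, roughly 21.9 s in total.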

Quick Start

1. Prerequisites

Ubuntu 24.04 LTS (kernel 6.8+)
Intel Arc Pro B50 (Battlemage) GPU
Model files in models/Qwen3.6/:
- Qwen3.6-35B-A3B-UD-IQ2_M.gguf
- mmproj-F16.gguf
Model file in models/Qwen3-0.6B/:
- Qwen_Qwen3-0.6B-IQ4_XS.gguf
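
Before running setup, the presence of these files can be checked with a short script. This is a minimal sketch, assuming it is run from the repository root; `missing_models` is a hypothetical helper and is separate from the validation that setup.sh performs.

```python
from pathlib import Path

# Expected files, copied from the prerequisites list above.
EXPECTED_MODELS = (
    "Qwen3.6/Qwen3.6-35B-A3B-UD-IQ2_M.gguf",
    "Qwen3.6/mmproj-F16.gguf",
    "Qwen3-0.6B/Qwen_Qwen3-0.6B-IQ4_XS.gguf",
)


def missing_models(models_dir="models", expected=EXPECTED_MODELS):
    """Return the expected model paths that do not exist as files."""
    root = Path(models_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print(f"  models/{rel}")
    else:
        print("All expected model files are present.")
```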

2. Configure Environment

cp .env.example .env
# Edit .env and set:
#   DOMAIN, ADMIN_EMAIL, ADMIN_PASSWORD, TELEGRAM_TOKEN (optional)
#   SSL_CERT_PATH and SSL_KEY_PATH (Let's Encrypt paths)
#   SEARXNG_SECRET (generate with: openssl rand -hex 32)
#   LIBRECHAT_JWT_SECRET and JWT_REFRESH_SECRET (generate with: openssl rand -hex 32)
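
The three secrets can also be generated without openssl. A minimal sketch, assuming Python 3 is available on the host: `secrets.token_hex(32)` returns 64 hexadecimal characters, the same shape as the output of `openssl rand -hex 32`. The helper name `generate_env_secrets` is hypothetical, and the printed lines are meant to be pasted into .env by hand.

```python
import secrets

# Variable names taken from the .env comments above.
SECRET_KEYS = ("SEARXNG_SECRET", "LIBRECHAT_JWT_SECRET", "JWT_REFRESH_SECRET")


def generate_env_secrets(keys=SECRET_KEYS, nbytes=32):
    """Return a dict mapping each key to a fresh random hex string."""
    return {key: secrets.token_hex(nbytes) for key in keys}


if __name__ == "__main__":
    for key, value in generate_env_secrets().items():
        print(f"{key}={value}")
```

Each secret is generated independently, so the two JWT secrets never share a value.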

3. Run Setup

./setup.sh

This will:

Update system packages
Install Intel GPU drivers (intel-opencl-icd, intel-level-zero-gpu, level-zero)
Install Docker & Docker Compose if missing
Add your user to render and video groups
Auto-detect Intel GPU DRI devices and write them to .env
Validate that model files are present
Validate SSL certificates

Log out and back in after setup completes so group membership takes effect.

4. Build Custom Images

docker compose build auth-validator caddy security-guard pii-guard speech-stt

5. Start Services

docker compose up -d

6. Register Admin User

Open your browser to https://<your-domain> and register the first user. The first registered user becomes the admin (requires ALLOW_REGISTRATION=true and ALLOW_UNVERIFIED_EMAIL_LOGIN=true).

7. Access Services

Endpoint	Description
`https://<your-domain>/`	LibreChat
`https://<your-domain>/agent/`	Hermes Agent Dashboard
`https://<your-domain>/agent-api/`	Hermes Agent API (OpenAI-compatible)
`https://<your-domain>/hermes-webui/`	Hermes Web UI (full web interface)
`https://<your-domain>/search/`	SearXNG (Web Search)
`https://<your-domain>/tts/`	SuperTonic TTS API (OpenAI-compatible)
`https://<your-domain>/speech-stt/`	Parakeet STT API (OpenAI-compatible)
`https://<your-domain>/beszel/`	Beszel Monitoring Dashboard
`https://<your-domain>/inference/`	llama.cpp Inference API
`https://<your-domain>/pii-guard/`	PII-Guard API
`https://<your-domain>/security-guard/`	Security-Guard API

Note: HTTP (port 80) automatically redirects to HTTPS (port 443).

8. Run the Test Suite (Optional)

# Fast smoke test (~30-40s) — containers, network, auth, data stores, backup
pytest tests/ -m "not slow" -v

# Full suite including chat completion (~2-3 min)
pytest tests/ -v

# Compliance-specific tests
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v

Caddy Reverse Proxy Routes

Path	Upstream	Notes
`/`	`librechat:3080`	LibreChat
`/search/`	`searxng:8080`	With HTML URL rewriting via `replace`
`/agent/`	`agent-hermes:9119`	Dashboard, HTML/JS/CSS rewriting, auth via LibreChat
`/agent-api/`	`agent-hermes:8642`	Gateway API, no auth
`/hermes-webui/`	`hermes-webui:8787`	HTML rewriting, auth via LibreChat
`/beszel/`	`netai-beszel:8090`	Beszel monitoring dashboard (WebSocket for real-time), auth via LibreChat
`/tts/`	`supertonic-tts:8800`	SuperTonic TTS API
`/inference/`	`netai-inference:8080`	llama.cpp API
`/pii-guard/`	`pii-guard:8777`	PII-Guard API
`/security-guard/`	`security-guard:8778`	Security-Guard API
`/speech-stt/`	`speech-stt:5092`	Parakeet STT API
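
To see which upstream a given request path maps to, the table can be modelled as a longest-prefix lookup. This is only an illustration of the routing table: it does not reproduce Caddy's matcher ordering, rewriting, or forward_auth behaviour, and `upstream_for` is a hypothetical helper.

```python
# Path prefixes and upstreams copied from the route table above.
ROUTES = {
    "/": "librechat:3080",
    "/search/": "searxng:8080",
    "/agent/": "agent-hermes:9119",
    "/agent-api/": "agent-hermes:8642",
    "/hermes-webui/": "hermes-webui:8787",
    "/beszel/": "netai-beszel:8090",
    "/tts/": "supertonic-tts:8800",
    "/inference/": "netai-inference:8080",
    "/pii-guard/": "pii-guard:8777",
    "/security-guard/": "security-guard:8778",
    "/speech-stt/": "speech-stt:5092",
}


def upstream_for(path):
    """Return the upstream whose prefix is the longest match for path."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    return ROUTES[max(matches, key=len)]


if __name__ == "__main__":
    for sample in ("/agent-api/v1/models", "/agent/index.html", "/c/new"):
        print(sample, "->", upstream_for(sample))
```

Any path that matches no specific prefix falls through to the `/` entry, which is LibreChat.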

Intel SYCL Notes

This stack uses the SYCL backend via the official ghcr.io/ggml-org/llama.cpp:server-intel-b8967 image.
Ensure your kernel is 6.8 or newer for native Xe/i915 support on Battlemage.
The environment variable ONEAPI_DEVICE_SELECTOR=*:gpu is passed to the inference container.
Important: llama.cpp SYCL backend only supports discrete Intel Arc GPUs (Xe-HPG+, like Arc Pro B50). Integrated Xe-LP GPUs (UHD 770) are not enumerated by the SYCL backend and cannot be used for inference. The inference-server container is strictly bound to the discrete Arc GPU.
There are no NVIDIA/CUDA dependencies in this stack.

SSL / TLS

Caddy handles TLS termination. For production, configure Let's Encrypt certificates; for development, self-signed certificates can be used. In both cases, place the certificate and key at the paths set in SSL_CERT_PATH and SSL_KEY_PATH in .env, and Caddy will use them for TLS.

Testing

A pytest-based integration test suite validates containers, network paths, APIs, authentication, chat completions, data stores, and backups.

Prerequisites: pytest and requests must be installed (pip install pytest requests).

# Fast smoke test (containers, network, auth, data stores, backup — ~30-40s)
pytest tests/ -m "not slow" -v

# Full suite including chat completion (~2-3 min)
pytest tests/ -v

# Parallel execution (requires pytest-xdist)
pytest tests/ -m "not slow" -v -n auto

# Compliance-specific
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v

Test File	Coverage
`test_containers.py`	Docker container status and health checks
`test_network_paths.py`	Caddy routing and inter-service DNS
`test_api_endpoints.py`	LLM inference, LibreChat, Hermes API, SearXNG
`test_authentication.py`	LibreChat login and token validation
`test_chat_completion.py`	End-to-end chat via LibreChat API
`test_data_stores.py`	SQLite integrity, ChromaDB, uploads
`test_backup.py`	Backup script execution and archive validation
`test_pii_guard.py`	GDPR compliance, PII redaction, search proxy
`test_security_guard.py`	Prompt injection detection, filtering, Article 52
`test_beszel.py`	Beszel monitoring metrics
`test_hermes_webui.py`	Hermes compose config, Caddy proxy, .env.example
`test_hermes_playwright.py`	Hermes API server and LibreChat browser tests

Backup & Restore

LibreChat stores all data in MongoDB (persistent volume mongo-data).

Create a Backup

docker compose exec mongo mongodump --archive=/backups/librechat-$(date +%Y%m%d).archive

Restore from Backup

docker compose exec -T mongo mongorestore --archive < backup-file.archive

Troubleshooting

GPU Not Detected

lspci | grep -i vga | grep -i intel

If empty, verify the GPU is seated and the kernel module is loaded:

sudo dmesg | grep -E 'i915|\bxe\b'
sudo intel_gpu_top

Permission Denied on /dev/dri

Ensure your user is in the video and render groups, then log out and back in:

sudo usermod -aG video,render $USER

Model File Missing

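setup.sh will warn you if the expected GGUF files are absent. Download them via the Hugging Face CLI or wget and place them in models/Qwen3.6/ (main model and mmproj) and models/Qwen3-0.6B/ (auxiliary model), as listed under Prerequisites.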

License

This deployment configuration is provided as-is for B2B on-premise deployments. Model weights and upstream container images are subject to their respective licenses.
