Version: 1.0
Last Updated: April 2026
Target Audience: Operations Teams, SREs, Management
- Service Level Objectives
- SLA Definitions
- Monitoring Strategy
- Prometheus Alerting Rules
- Grafana Dashboards
- SLA Reporting
- Incident Response
ThemisDB uses the following SLO tiers based on service criticality:
Tier 1:
Target Availability: 99.9%
Allowed Downtime: 43.8 minutes/month (8.76 hours/year)
Services:
- Production inference endpoints
- Real-time LLM APIs
- Critical customer-facing services
Performance Targets:
- P50 Latency: < 50ms
- P95 Latency: < 100ms
- P99 Latency: < 200ms
- Error Rate: < 0.1%
Tier 2:
Target Availability: 99.5%
Allowed Downtime: 3.65 hours/month (43.8 hours/year)
Services:
- Model training workloads
- Batch processing
- Scheduled inference jobs
Performance Targets:
- P50 Latency: < 100ms
- P95 Latency: < 500ms
- P99 Latency: < 1000ms
- Error Rate: < 0.5%
Tier 3:
Target Availability: 99.0%
Allowed Downtime: 7.3 hours/month (87.6 hours/year)
Services:
- Development environments
- Experimental features
- Non-critical batch jobs
Performance Targets:
- P50 Latency: < 500ms
- P95 Latency: < 2000ms
- P99 Latency: < 5000ms
- Error Rate: < 1.0%
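The allowed-downtime figures for each tier follow directly from the availability target. A minimal sketch of that arithmetic, assuming an average month of 365/12 days (which is what the 43.8 minutes/month figure for 99.9% implies); the function and constant names are illustrative and not part of ThemisDB. Assuming a flat 30-day month instead gives slightly smaller monthly values.

```python
def allowed_downtime_minutes(target_pct: float, period_days: float) -> float:
    """Minutes of downtime permitted over `period_days` at the given availability target."""
    return (1 - target_pct / 100) * period_days * 24 * 60


# Assumption: an "average" month of 365/12 days, consistent with 43.8 min/month at 99.9%.
AVG_MONTH_DAYS = 365 / 12

if __name__ == "__main__":
    for tier, target in {"Tier 1": 99.9, "Tier 2": 99.5, "Tier 3": 99.0}.items():
        per_month = allowed_downtime_minutes(target, AVG_MONTH_DAYS)
        per_year_hours = allowed_downtime_minutes(target, 365) / 60
        print(f"{tier}: {per_month:.1f} min/month, {per_year_hours:.2f} h/year")
```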
Measurement Period: Calendar month (UTC)
Calculation:
Availability % = ((Total Minutes - Downtime Minutes) / Total Minutes) × 100
Exclusions (not counted as downtime):
- Scheduled maintenance (with 7 days notice)
- Customer-caused issues
- Force majeure events
- Beta/experimental features
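The availability formula and its exclusions can be sketched as follows. This is an illustration under one stated assumption: excluded minutes (for example, announced maintenance) are subtracted from the counted downtime only, not from the total minutes in the period. The source does not say which convention applies, and the function name is hypothetical.

```python
def availability_pct(total_minutes: float, downtime_minutes: float,
                     excluded_minutes: float = 0.0) -> float:
    """Availability % = ((Total - Downtime) / Total) * 100.

    Assumption: excluded downtime reduces the counted downtime but leaves
    the total measurement period unchanged.
    """
    counted = max(downtime_minutes - excluded_minutes, 0.0)
    return (total_minutes - counted) / total_minutes * 100


if __name__ == "__main__":
    # A 30-day month with 60 minutes of downtime, 20 of which fell inside
    # a maintenance window announced 7 days ahead.
    total = 30 * 24 * 60
    print(f"{availability_pct(total, 60, 20):.3f}%")
```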
SLA Credits (for Tier 1):
- 99.0% to < 99.9%: 10% monthly credit
- 95.0% to < 99.0%: 25% monthly credit
- < 95.0%: 50% monthly credit
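The credit schedule above maps a measured monthly availability to a credit percentage. A small sketch of that lookup, assuming a value that sits exactly on a boundary (for example 99.0%) falls into the higher-availability band; the function name is illustrative only.

```python
def sla_credit_pct(availability_pct: float, target_pct: float = 99.9) -> int:
    """Return the Tier 1 monthly credit (in percent) for a measured availability."""
    if availability_pct >= target_pct:
        return 0   # SLA met, no credit owed
    if availability_pct >= 99.0:
        return 10
    if availability_pct >= 95.0:
        return 25
    return 50


if __name__ == "__main__":
    for measured in (99.95, 99.5, 97.0, 94.0):
        print(f"{measured}% -> {sla_credit_pct(measured)}% credit")
```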
Inference Latency (P95):
- Tier 1: < 100ms
- Tier 2: < 500ms
- Tier 3: < 2000ms
Throughput:
- Tier 1: > 1000 req/sec per GPU
- Tier 2: > 500 req/sec per GPU
- Tier 3: Best effort
Training Performance:
- GPU Utilization: > 85%
- Training Steps: As estimated ± 10%
Durability:
- 99.999999999% (11 nines) - Multi-region replication
- Zero data loss for committed transactions
Backup SLA:
- RPO: 5 minutes (Tier 1), 1 hour (Tier 2), 24 hours (Tier 3)
- RTO: 1 hour (Tier 1), 4 hours (Tier 2), 24 hours (Tier 3)
- Backup success rate: > 99.9%
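To make the RPO targets concrete: a tier's RPO is at risk whenever the most recent successful backup is older than the tier's RPO window. A minimal sketch of that check, with hypothetical names and the RPO values copied from the list above:

```python
from datetime import datetime, timedelta, timezone

# RPO targets per tier, as listed above.
RPO = {
    "tier1": timedelta(minutes=5),
    "tier2": timedelta(hours=1),
    "tier3": timedelta(hours=24),
}


def rpo_violated(tier: str, last_backup: datetime, now: datetime) -> bool:
    """True if the newest backup is older than the tier's RPO window."""
    return now - last_backup > RPO[tier]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_backup = now - timedelta(minutes=12)
    for tier in RPO:
        print(tier, "violated" if rpo_violated(tier, last_backup, now) else "ok")
```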
Infrastructure Metrics:
# Prometheus scrape configuration
scrape_configs:
  - job_name: 'themisdb-tier1'
    scrape_interval: 10s
    scrape_timeout: 5s
    static_configs:
      - targets: ['themisdb-tier1:9091']
        labels:
          tier: 'tier1'
          service: 'inference'
  - job_name: 'themisdb-tier2'
    scrape_interval: 30s
    static_configs:
      - targets: ['themisdb-tier2:9092']
        labels:
          tier: 'tier2'
          service: 'training'
Key Metrics:
- Availability Metrics:
  - themisdb_up: Service health (0/1)
  - themisdb_http_requests_total: Total requests
  - themisdb_http_requests_failed: Failed requests
- Performance Metrics:
  - themisdb_request_duration_seconds: Request latency histogram
  - themisdb_inference_duration_seconds: Inference time
  - themisdb_gpu_utilization: GPU usage percentage
- Resource Metrics:
  - themisdb_gpu_memory_used_bytes: GPU memory usage
  - themisdb_gpu_temperature_celsius: GPU temperature
  - themisdb_disk_usage_bytes: Disk usage
- Business Metrics:
  - themisdb_tokens_generated_total: Tokens generated count
  - themisdb_models_loaded: Active models count
  - themisdb_active_users: Concurrent users
# /etc/prometheus/rules/slo.yml
groups:
  - name: slo_rules
    interval: 1m
    rules:
      # Availability SLO: successful requests / total requests, per tier
      - record: slo:availability:ratio
        expr: |
          (
            sum by (tier) (rate(themisdb_http_requests_total[5m]))
            -
            sum by (tier) (rate(themisdb_http_requests_failed[5m]))
          )
          /
          sum by (tier) (rate(themisdb_http_requests_total[5m]))
      # Latency SLO (P95 < 100ms for Tier 1), per tier
      - record: slo:latency:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, tier) (rate(themisdb_request_duration_seconds_bucket[5m]))
          )
      # Error rate SLO: failed requests / total requests, per tier
      - record: slo:error_rate:ratio
        expr: |
          sum by (tier) (rate(themisdb_http_requests_failed[5m]))
          /
          sum by (tier) (rate(themisdb_http_requests_total[5m]))
      # Error budget remaining (30-day window, Tier 1)
      - record: slo:error_budget:remaining
        expr: |
          1 - (
            sum(increase(themisdb_downtime_seconds_total{tier="tier1"}[30d]))
            /
            (30 * 24 * 60 * 60 * 0.001)  # seconds of downtime allowed by the 0.1% error budget
          )
        labels:
          tier: "tier1"
# /etc/prometheus/rules/alerts_sla.yml
groups:
  - name: sla_critical
    interval: 1m
    rules:
      # Availability SLA violation
      - alert: SLAAvailabilityTier1Critical
        expr: |
          slo:availability:ratio{tier="tier1"} < 0.999
        for: 5m
        labels:
          severity: critical
          tier: tier1
          category: sla
        annotations:
          summary: "Tier 1 availability SLA violation"
          description: "Availability is {{ $value | humanizePercentage }}, below 99.9% target. Error budget burning."
          runbook: "https://docs.themisdb.io/runbooks/#sla-availability-violation"
      # Latency SLA violation
      - alert: SLALatencyTier1Critical
        expr: |
          slo:latency:p95{tier="tier1"} > 0.1  # 100ms
        for: 5m
        labels:
          severity: critical
          tier: tier1
          category: sla
        annotations:
          summary: "Tier 1 P95 latency SLA violation"
          description: "P95 latency is {{ $value | humanizeDuration }}, above 100ms target."
          runbook: "https://docs.themisdb.io/runbooks/#sla-latency-violation"
      # Error rate SLA violation
      - alert: SLAErrorRateTier1Critical
        expr: |
          slo:error_rate:ratio{tier="tier1"} > 0.001  # 0.1%
        for: 5m
        labels:
          severity: critical
          tier: tier1
          category: sla
        annotations:
          summary: "Tier 1 error rate SLA violation"
          description: "Error rate is {{ $value | humanizePercentage }}, above 0.1% target."
          runbook: "https://docs.themisdb.io/runbooks/#sla-error-rate-violation"
      # Error budget burn rate (fast burn)
      - alert: SLAErrorBudgetFastBurn
        expr: |
          slo:error_budget:remaining{tier="tier1"} < 0.9
          and
          delta(slo:error_budget:remaining{tier="tier1"}[1h]) < -0.01  # Burning >1% per hour
        for: 5m
        labels:
          severity: critical
          tier: tier1
          category: sla
        annotations:
          summary: "Error budget burning rapidly"
          description: "Error budget at {{ $value | humanizePercentage }}, burning fast. Immediate action required."
          runbook: "https://docs.themisdb.io/runbooks/#error-budget-management"
  - name: sla_warning
    interval: 1m
    rules:
      # Error budget warning (50% remaining)
      - alert: SLAErrorBudgetLow
        expr: |
          slo:error_budget:remaining{tier="tier1"} < 0.5
        for: 10m
        labels:
          severity: warning
          tier: tier1
          category: sla
        annotations:
          summary: "Error budget low"
          description: "Error budget at {{ $value | humanizePercentage }}, below 50%. Review incidents and consider freezing deployments."
      # GPU utilization below target
      - alert: SLAGPUUtilizationLow
        expr: |
          avg(themisdb_gpu_utilization) < 85
        for: 30m
        labels:
          severity: warning
          category: sla
        annotations:
          summary: "GPU utilization below SLA target"
          description: "Average GPU utilization is {{ $value }}%, below 85% target for training workloads."
      # Backup SLA warning (Tier 1 RPO: 5 minutes)
      - alert: SLABackupDelayed
        expr: |
          time() - themisdb_last_backup_timestamp_seconds{tier="tier1"} > 300  # 5 minutes
        for: 5m
        labels:
          severity: warning
          category: sla
          tier: tier1
        annotations:
          summary: "Backup SLA at risk"
          description: "Last backup was {{ $value | humanizeDuration }} ago, exceeding the 5 minute Tier 1 RPO target."
          runbook: "https://docs.themisdb.io/runbooks/#backup-failure"
  - name: sla_tier2
    interval: 1m
    rules:
      # Tier 2 availability
      - alert: SLAAvailabilityTier2Warning
        expr: |
          slo:availability:ratio{tier="tier2"} < 0.995
        for: 10m
        labels:
          severity: warning
          tier: tier2
          category: sla
        annotations:
          summary: "Tier 2 availability SLA at risk"
          description: "Availability is {{ $value | humanizePercentage }}, below 99.5% target."
# /etc/prometheus/rules/alerts_gpu.yml
groups:
  - name: gpu_sla
    interval: 30s
    rules:
      # GPU failure
      - alert: SLAGPUFailure
        expr: |
          themisdb_gpu_health_status == 0
        for: 1m
        labels:
          severity: critical
          category: sla
        annotations:
          summary: "GPU failure detected"
          description: "GPU {{ $labels.gpu_id }} on {{ $labels.instance }} has failed."
          runbook: "https://docs.themisdb.io/runbooks/#gpu-failure-response"
      # GPU temperature critical
      - alert: SLAGPUTemperatureCritical
        expr: |
          themisdb_gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: critical
          category: sla
        annotations:
          summary: "GPU temperature critical"
          description: "GPU {{ $labels.gpu_id }} temperature is {{ $value }}°C, above 85°C threshold."
          runbook: "https://docs.themisdb.io/runbooks/#gpu-thermal-issues"
      # GPU memory exhausted
      - alert: SLAGPUMemoryExhausted
        expr: |
          (themisdb_gpu_memory_used_bytes / themisdb_gpu_memory_total_bytes) > 0.95
        for: 5m
        labels:
          severity: warning
          category: sla
        annotations:
          summary: "GPU memory usage critical"
          description: "GPU {{ $labels.gpu_id }} memory usage is {{ $value | humanizePercentage }}, above 95%."
Create dashboard at /etc/grafana/provisioning/dashboards/sla-overview.json:
{
"dashboard": {
"title": "SLA Overview",
"tags": ["sla", "overview"],
"timezone": "UTC",
"panels": [
{
"id": 1,
"title": "Tier 1 Availability (Monthly)",
"type": "stat",
"targets": [
{
"expr": "slo:availability:ratio{tier=\"tier1\"} * 100",
"legendFormat": "Availability %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"value": 0, "color": "red"},
{"value": 99.5, "color": "yellow"},
{"value": 99.9, "color": "green"}
]
}
}
}
},
{
"id": 2,
"title": "P95 Latency (5m)",
"type": "graph",
"targets": [
{
"expr": "slo:latency:p95{tier=\"tier1\"} * 1000",
"legendFormat": "Tier 1"
},
{
"expr": "slo:latency:p95{tier=\"tier2\"} * 1000",
"legendFormat": "Tier 2"
}
],
"yaxes": [
{
"label": "Latency (ms)",
"format": "short"
}
]
},
{
"id": 3,
"title": "Error Budget Remaining",
"type": "gauge",
"targets": [
{
"expr": "slo:error_budget:remaining{tier=\"tier1\"} * 100",
"legendFormat": "Tier 1 Budget"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"value": 0, "color": "red"},
{"value": 25, "color": "orange"},
{"value": 50, "color": "yellow"},
{"value": 75, "color": "green"}
]
}
}
}
},
{
"id": 4,
"title": "Request Success Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(themisdb_http_requests_total[5m])) - sum(rate(themisdb_http_requests_failed[5m]))",
"legendFormat": "Successful Requests/sec"
},
{
"expr": "sum(rate(themisdb_http_requests_failed[5m]))",
"legendFormat": "Failed Requests/sec"
}
]
},
{
"id": 5,
"title": "GPU Utilization",
"type": "graph",
"targets": [
{
"expr": "avg(themisdb_gpu_utilization) by (gpu_id)",
"legendFormat": "GPU {{ gpu_id }}"
}
],
"yaxes": [
{
"label": "Utilization %",
"format": "percent",
"max": 100,
"min": 0
}
]
},
{
"id": 6,
"title": "Downtime (Last 30 Days)",
"type": "stat",
"targets": [
{
"expr": "sum(increase(themisdb_downtime_seconds_total[30d])) / 60",
"legendFormat": "Downtime (minutes)"
}
],
"fieldConfig": {
"defaults": {
"unit": "m",
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 43.8, "color": "yellow"},
{"value": 87.6, "color": "red"}
]
}
}
}
}
]
}
}
Deploy dashboards:
# Copy dashboard configurations
sudo cp /path/to/dashboards/*.json /etc/grafana/provisioning/dashboards/
# Restart Grafana
sudo systemctl restart grafana-server
Available Dashboards:
- sla-overview.json - High-level SLA metrics
- sla-detailed.json - Detailed per-service SLA tracking
- error-budget.json - Error budget burn rate and remaining
- gpu-performance.json - GPU-specific performance metrics
- inference-latency.json - Inference latency breakdown
#!/bin/bash
# /usr/local/bin/generate-sla-report.sh
MONTH=$(date -d "last month" +%Y-%m)
OUTPUT="/reports/sla-report-${MONTH}.pdf"
# Generate report
themisdb-cli sla report \
--month "${MONTH}" \
--output "${OUTPUT}" \
--include-metrics \
--include-incidents \
--format pdf
# Email report
mail -s "ThemisDB SLA Report - ${MONTH}" \
-A "${OUTPUT}" \
stakeholders@example.com < /dev/null
Executive Summary:
- Overall availability percentage
- SLA target vs. actual
- Notable incidents
- Trend analysis
Detailed Metrics:
- Availability per tier
- Latency percentiles (P50, P95, P99)
- Error rates
- GPU utilization
- Backup success rates
Incident Analysis:
- Number of incidents
- Mean Time To Detect (MTTD)
- Mean Time To Resolve (MTTR)
- Root causes
- Preventive actions
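MTTD and MTTR in the incident analysis are averages over the incidents in the reporting period. A minimal sketch of how they could be derived from incident timestamps; the `Incident` record is hypothetical, and MTTR is measured here from incident start to resolution (some teams measure from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("no incidents in the reporting period")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time To Detect: average of (detected - started)."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Resolve: average of (resolved - started), by assumption."""
    return _mean([i.resolved - i.started for i in incidents])


if __name__ == "__main__":
    day = datetime(2026, 1, 15)
    incidents = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=4), day.replace(hour=10, minute=30)),
        Incident(day.replace(hour=12), day.replace(hour=12, minute=2), day.replace(hour=12, minute=20)),
    ]
    print("MTTD:", mttd(incidents), "MTTR:", mttr(incidents))
```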
Error Budget:
- Starting budget
- Budget consumed
- Remaining budget
- Burn rate trend
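The error budget figures in the report can be derived from the availability target and the downtime recorded so far. The sketch below is illustrative: the names are hypothetical, and burn rate is defined here as the fraction of budget consumed divided by the fraction of the period elapsed, so a value of 1.0 means the budget would run out exactly at the end of the period.

```python
def error_budget(target_pct: float, period_minutes: float,
                 downtime_minutes: float, elapsed_minutes: float) -> dict:
    """Summarise the error budget for one measurement period."""
    starting = (1 - target_pct / 100) * period_minutes  # total allowed downtime
    consumed_fraction = downtime_minutes / starting
    elapsed_fraction = elapsed_minutes / period_minutes
    return {
        "starting_minutes": starting,
        "consumed_minutes": downtime_minutes,
        "remaining_fraction": 1 - consumed_fraction,
        "burn_rate": consumed_fraction / elapsed_fraction,
    }


if __name__ == "__main__":
    # Tier 1 (99.9%), 30-day period, 10.8 minutes of downtime after 15 days.
    period = 30 * 24 * 60
    summary = error_budget(99.9, period, downtime_minutes=10.8, elapsed_minutes=period / 2)
    for key, value in summary.items():
        print(f"{key}: {value:.3f}")
```

A remaining fraction computed this way is on the same 0 to 1 scale as the thresholds in the error budget policy (for example, 0.25 for a deployment freeze).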
# Check current month SLA status
themisdb-cli sla status --tier tier1 --month current
# Calculate availability for specific date range
themisdb-cli sla calculate \
--start "2026-01-01" \
--end "2026-01-31" \
--tier tier1
# Export SLA data for analysis
themisdb-cli sla export \
--month 2026-01 \
--format csv \
--output /tmp/sla-data-2026-01.csv
When an SLA alert fires:
# 1. Acknowledge alert
themisdb-cli alert ack --alert-id <alert-id>
# 2. Assess impact
themisdb-cli sla impact-assessment
# Expected output:
# Current Availability: 99.7% (target: 99.9%)
# Error Budget: 45% remaining
# Affected Services: Tier 1 inference
# Estimated Users Impacted: 1,234
# 3. Follow incident runbook
# See RUNBOOKS.md for specific procedures
# 4. Track SLA impact during incident
themisdb-cli sla track-incident \
--incident-id INC-2026-001 \
--real-time
# 5. After resolution, document SLA impact
themisdb-cli sla incident-report \
--incident-id INC-2026-001 \
--output /reports/inc-2026-001-sla-impact.pdf
Error budget policies:
# /etc/themisdb/error-budget-policy.yaml
error_budget_policy:
  tier1:
    freeze_deployments_at: 0.25  # 25% remaining
    high_alert_at: 0.50
    warning_at: 0.75
    actions:
      - threshold: 0.25
        action: freeze_all_deployments
        notification: critical
      - threshold: 0.50
        action: freeze_risky_deployments
        notification: warning
      - threshold: 0.75
        action: increase_monitoring
        notification: info
Check error budget:
# View current error budget
themisdb-cli sla error-budget --tier tier1
# Expected output:
# Error Budget Status:
# Target: 99.9% availability (0.1% error budget)
# Current: 99.92% availability
# Budget Remaining: 78%
# Burn Rate: 2.3% per week
# Projected Budget End: 2026-03-15
# If budget low, trigger policy
BUDGET_REMAINING=$(themisdb-cli sla error-budget --tier tier1 --json | jq -r '.remaining')
if (( $(echo "$BUDGET_REMAINING < 0.25" | bc -l) )); then
  themisdb-cli deployment freeze --reason "Error budget below 25%"
fi
| Metric | Definition | Target (Tier 1) |
|---|---|---|
| Availability | % of time service is operational | 99.9% |
| P50 Latency | 50th percentile request latency | < 50ms |
| P95 Latency | 95th percentile request latency | < 100ms |
| P99 Latency | 99th percentile request latency | < 200ms |
| Error Rate | % of failed requests | < 0.1% |
| MTTD | Mean Time To Detect | < 5 min |
| MTTR | Mean Time To Resolve | < 30 min |
| GPU Utilization | % of GPU compute used | > 85% |
Document Version: 1.0
Last Updated: April 2026
Next Review: April 2026
Owner: SRE Team