Version: 1.8.0-rc1
Last Updated: April 2026
Purpose: Ensure production readiness before ThemisDB GPU deployment
- GPU models verified against compatibility matrix
- GPU drivers installed (version 525.x+ for RTX/A100/H100)
- All GPUs detected by `nvidia-smi`
- GPU memory capacity meets requirements
- GPU topology verified (NVLink/PCIe connections)
- GPU persistence mode enabled (`nvidia-smi -pm 1`)
- Application clocks set if applicable
- ECC memory enabled on supported GPUs
- GPU thermal management adequate (cooling, fans)
- Power supply sufficient for total GPU TDP
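
Several of the GPU hardware items above can be spot-checked from a script. The following is a minimal sketch, not an official ThemisDB tool: it assumes `nvidia-smi` is on the PATH, and the helper names (`query_gpus`, `parse_gpu_query`, `check_gpus`) are hypothetical. The 525 driver floor comes from the checklist; the memory threshold is left to the caller because the checklist does not state one.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the order must match the parser below.
QUERY_FIELDS = ["index", "name", "driver_version", "memory.total", "persistence_mode"]


def query_gpus():
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_query(text):
    """Turn the CSV output into one dict per GPU."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:  # skip blank lines
            gpus.append(dict(zip(QUERY_FIELDS, (field.strip() for field in record))))
    return gpus


def check_gpus(gpus, min_driver_major=525, min_memory_mib=0):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for gpu in gpus:
        label = f"GPU {gpu['index']} ({gpu['name']})"
        if int(gpu["driver_version"].split(".")[0]) < min_driver_major:
            problems.append(f"{label}: driver {gpu['driver_version']} is older than {min_driver_major}.x")
        if gpu["persistence_mode"] != "Enabled":
            problems.append(f"{label}: persistence mode is not enabled")
        if int(gpu["memory.total"]) < min_memory_mib:
            problems.append(f"{label}: {gpu['memory.total']} MiB is below {min_memory_mib} MiB")
    return problems
```

On a GPU host, `check_gpus(parse_gpu_query(query_gpus()))` returns an empty list when the driver, persistence mode, and memory checks all pass.
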
- CPU: Minimum 8 cores (32+ for multi-GPU)
- RAM: Minimum 64GB (256GB+ for multi-GPU)
- Storage: NVMe SSD with sufficient capacity
- Network: 10 GbE or faster (25 GbE/InfiniBand for multi-node)
- Network switches support required bandwidth
- Adequate rack space and power circuits
- UPS configured for graceful shutdown
- Hardware monitoring in place (IPMI, BMC)
- OS: Ubuntu 22.04 LTS, CentOS 8, or RHEL 8/9
- Kernel version 5.x or later (the stock CentOS 8 and RHEL 8 kernel is 4.18, so verify those hosts against this requirement)
- OS fully patched and updated
- Time synchronization configured (NTP/chrony)
- System logging operational
- SELinux/AppArmor configured appropriately
- NVIDIA driver installed and verified
- CUDA Toolkit 12.3+ installed
- cuDNN 9.x installed
- `nvcc --version` shows correct CUDA version
- CUDA samples compile and run successfully
- LD_LIBRARY_PATH configured correctly
- NVIDIA Container Toolkit installed (if using Docker)
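
The CUDA Toolkit version item above can be verified by reading the release number that `nvcc --version` prints. This is a minimal sketch under the assumption that `nvcc` is on the PATH; the function names are hypothetical, and the `(12, 3)` floor is taken from the checklist.

```python
import re
import subprocess


def nvcc_version_output():
    """Return the text printed by `nvcc --version` (requires the CUDA Toolkit)."""
    return subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout


def parse_cuda_release(text):
    """Extract (major, minor) from a line such as '... release 12.3, V12.3.107'."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def cuda_toolkit_ok(text, minimum=(12, 3)):
    """True when the reported CUDA release is at least the required minimum."""
    return parse_cuda_release(text) >= minimum
```

Typical use is `cuda_toolkit_ok(nvcc_version_output())`. Comparing tuples rather than strings keeps a release such as 12.10 from sorting below 12.3.
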
- RocksDB installed or bundled
- OpenSSL 1.1.1+ installed
- Build tools installed (gcc/g++ 11+, cmake 3.20+)
- Python 3.8+ installed (if using Python clients)
- Required system libraries present
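
The minimum versions listed for the build tools and OpenSSL can be compared mechanically. The sketch below is illustrative only: the `TOOLS` table and function names are hypothetical, and it assumes the tools report their versions through `gcc -dumpfullversion`, `cmake --version`, and `openssl version`.

```python
import re
import shutil
import subprocess

# Command to run and minimum version for each tool, taken from the checklist.
TOOLS = {
    "gcc": (["gcc", "-dumpfullversion"], (11,)),
    "cmake": (["cmake", "--version"], (3, 20)),
    "openssl": (["openssl", "version"], (1, 1, 1)),
}


def parse_version(text):
    """Return the first dotted number in text as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(text, minimum):
    """True when the version found in text is at least the minimum tuple."""
    version = parse_version(text)
    return version is not None and version >= minimum


def check_tools(tools=TOOLS):
    """Map each tool name to 'ok', 'missing', or 'too old or unreadable'."""
    report = {}
    for name, (command, minimum) in tools.items():
        if shutil.which(command[0]) is None:
            report[name] = "missing"
            continue
        output = subprocess.run(command, capture_output=True, text=True).stdout
        report[name] = "ok" if meets_minimum(output, minimum) else "too old or unreadable"
    return report
```

For the Python client requirement, the equivalent check inside the interpreter is `sys.version_info >= (3, 8)`.
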
- ThemisDB binary/package installed
- Installation directory permissions correct
- Configuration directory created (`/etc/themisdb/`)
- Data directory created (`/data/themisdb/`)
- Log directory created (`/var/log/themisdb/`)
- Checkpoint directory created
- Model directory created
- Service user created (`themisdb`)
- Service user has GPU access
- Systemd service file installed
- Binary in PATH or symlinked
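
The directory and ownership items above lend themselves to a quick scripted check. This is a sketch under stated assumptions: the three paths and the `themisdb` user come from the checklist, the checkpoint and model directories are omitted because their locations are not given here, and the function names are hypothetical. Whether each directory must be owned by the service user is itself an assumption.

```python
from pathlib import Path

# Paths named in the checklist; add the checkpoint and model directories
# once their locations are decided.
REQUIRED_DIRS = ["/etc/themisdb/", "/data/themisdb/", "/var/log/themisdb/"]


def missing_directories(paths=REQUIRED_DIRS):
    """Return the paths that do not exist as directories."""
    return [str(path) for path in paths if not Path(path).is_dir()]


def directories_not_owned_by(user, paths=REQUIRED_DIRS):
    """Return existing directories whose owner is not the given user (POSIX only)."""
    return [
        str(path)
        for path in paths
        if Path(path).is_dir() and Path(path).owner() != user
    ]
```

Both `missing_directories()` and `directories_not_owned_by("themisdb")` return empty lists when the layout matches the checklist.
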
- Main config file created (`config.yaml`)
- GPU configuration validated
- Storage paths configured
- Network ports configured
- Logging configured
- Security settings applied
- Resource limits set
- Backup configuration prepared
- Configuration syntax validated
- Configuration backed up
- Port 8080 open for HTTP API
- Port 18765 open for binary protocol
- Port 4318 open for metrics (Prometheus)
- Port 9000 open for distributed coordinator (if multi-node)
- Additional ports for workers (if multi-node)
- Firewall rules documented
- SSH access configured and secured
- Internal network connectivity verified
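
Once ThemisDB (or a temporary listener) is running, the port items above can be probed from another host. The following is a minimal sketch using only the standard library; the port numbers come from the checklist, while the function names are hypothetical. A failed connection means either that nothing is listening or that a firewall blocks the port, so the result needs interpreting in context.

```python
import socket

# Ports and roles as listed in the checklist.
THEMISDB_PORTS = {
    8080: "HTTP API",
    18765: "binary protocol",
    4318: "metrics (Prometheus)",
    9000: "distributed coordinator (multi-node only)",
}


def port_is_open(host, port, timeout=2.0):
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host, ports=THEMISDB_PORTS):
    """Return the subset of ports (with their roles) that could not be reached."""
    return {port: role for port, role in ports.items() if not port_is_open(host, port)}
```

Running `closed_ports("10.0.0.5")` from a client subnet (the address is a placeholder) lists the ports that still need a firewall rule or a running service.
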
- Hostname configured correctly
- DNS resolution working
- Reverse DNS configured (if required)
- Network routing verified
- Load balancer configured (if applicable)
- SSL/TLS certificates obtained
- Certificate chain validated
- All nodes can reach coordinator
- InfiniBand configured and tested (if applicable)
- NCCL environment variables set
- Node-to-node bandwidth verified
- GPU Direct RDMA tested (if applicable)
- Network latency acceptable (<1ms within datacenter)
- Data directory on fast storage (NVMe preferred)
- Sufficient disk space (minimum 500GB, 2TB+ recommended)
- Filesystem: ext4 or XFS
- Mount options optimized (noatime)
- Disk encryption configured (if required)
- SMART monitoring enabled
- Disk I/O performance tested
- Separate partition for checkpoints (optional)
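
The filesystem type, mount option, and capacity items above can be read from `/proc/mounts` and the filesystem itself. This sketch makes two assumptions that should be adjusted locally: that the data directory lives on its own mount point (passed in by the caller), and that the 500GB minimum refers to total capacity rather than free space. The function names are hypothetical.

```python
import shutil


def capacity_gb(path):
    """Total size of the filesystem holding path, in decimal gigabytes."""
    return shutil.disk_usage(path).total / 1e9


def find_mount(mounts_text, mount_point):
    """Return (fstype, options) for mount_point from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = (fields[2], fields[3].split(","))  # the last matching entry wins
    return found


def mount_problems(mounts_text, mount_point):
    """Check the filesystem type (ext4 or XFS) and the noatime mount option."""
    entry = find_mount(mounts_text, mount_point)
    if entry is None:
        return [f"{mount_point} is not a mount point"]
    fstype, options = entry
    problems = []
    if fstype not in ("ext4", "xfs"):
        problems.append(f"{mount_point} uses {fstype}, expected ext4 or xfs")
    if "noatime" not in options:
        problems.append(f"{mount_point} is not mounted with noatime")
    return problems
```

On Linux, pass the contents of `/proc/mounts` as `mounts_text`, and compare `capacity_gb("/data/themisdb/")` against the 500 GB minimum.
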
- RAID level selected (RAID 5/6 recommended)
- RAID array created and initialized
- RAID monitoring configured (`mdadm`)
- Hot spare configured
- RAID performance tested
- Rebuild procedure documented
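
For Linux software RAID managed with `mdadm`, array health is visible in `/proc/mdstat`. The sketch below is a hypothetical helper, not a replacement for `mdadm --monitor`: it flags any array whose status line reports fewer active members than expected, for example `[3/2] [UU_]`.

```python
import re

# Matches the member summary on an array's status line, e.g. "[3/3] [UUU]".
_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches the first line of an array entry, e.g. "md0 : active raid5 ...".
_HEADER = re.compile(r"^(md\w+)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that are missing one or more members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = _HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None  # one status line per array
    return degraded
```

A monitoring job could call `degraded_arrays(open("/proc/mdstat").read())` and alert when the returned list is not empty.
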
- Backup destination configured
- Backup authentication configured
- Test backup successful
- Test restore successful
- Backup schedule configured
- Backup retention policy set
- Off-site backup configured (recommended)
- Admin accounts created
- User accounts created
- Strong passwords enforced
- SSH key-based authentication configured
- Multi-factor authentication enabled (recommended)
- RBAC policies configured
- API keys generated and secured
- Service accounts configured
- Password policy enforced
- Disk encryption enabled (LUKS)
- TLS 1.3 configured
- TLS certificates installed
- TLS private keys secured (0600 permissions)
- mTLS configured (if required)
- Application-level encryption enabled
- Encryption keys backed up securely
- HSM integrated (if required)
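
Two of the encryption items above, private key permissions and TLS 1.3, can be verified from a client machine with the standard library. This is a sketch only: the function names are hypothetical, and the probe assumes the endpoint presents a certificate that the system trust store accepts.

```python
import os
import socket
import ssl
import stat


def key_permissions_ok(path):
    """True when the key file grants no access to group or others (e.g. 0600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0


def tls13_client_context():
    """Client context that refuses to negotiate anything older than TLS 1.3."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


def negotiated_tls_version(host, port, timeout=5.0):
    """Connect and return the negotiated protocol; raises if TLS 1.3 is unavailable."""
    context = tls13_client_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

Because the context sets TLS 1.3 as its floor, a server limited to TLS 1.2 fails the handshake with an `ssl.SSLError` instead of silently downgrading.
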
- Firewall rules minimized (least privilege)
- VPN configured (if remote access needed)
- Network segmentation implemented
- Intrusion detection system configured
- DDoS protection configured (if public-facing)
- Rate limiting configured
- Audit logging enabled
- Log retention configured
- SIEM integration configured (if applicable)
- Compliance requirements reviewed (SOC2, GDPR, HIPAA)
- Security scanning completed
- Vulnerability assessment completed
- Penetration testing completed (for production)
- GPU access control configured
- Per-user GPU allocation defined
- VRAM limits configured
- Compute mode set appropriately
- GPU cgroups configured (if applicable)
- MIG mode configured (A100/H100, if applicable)
- VRAM secure clearing enabled
- GPU-to-GPU communication tested
- NVLink verified (`nvidia-smi topo -m`)
- PCIe bandwidth verified
- NCCL performance tested
- Load balancing strategy configured
- Failover mechanism configured
- Prometheus installed and configured
- ThemisDB metrics endpoint accessible
- NVIDIA DCGM exporter installed
- Node exporter installed
- Metrics retention configured
- Remote storage configured (optional)
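
To confirm that the ThemisDB metrics endpoint is accessible before pointing Prometheus at it, the endpoint can be fetched and parsed directly. This sketch handles the common cases of the Prometheus text exposition format. The function names are hypothetical, and the `/metrics` path in the usage note is an assumption; only port 4318 comes from the checklist.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download the raw text exposed by a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Map each series (name plus labels) to its float value, ignoring comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP, or TYPE
        if "}" in line:
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        samples[series] = float(rest[0])  # an optional timestamp may follow
    return samples
```

For example, `parse_metrics(fetch_metrics("http://localhost:4318/metrics"))` should return a non-empty dictionary once the service is up.
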
- Grafana installed and configured
- Grafana data sources added
- GPU dashboards imported
- Training dashboards imported
- Inference dashboards imported
- Dashboard alerts configured
- Alertmanager installed and configured
- Alert rules defined
- Email notifications configured
- Slack/Teams integration configured (optional)
- PagerDuty integration configured (optional)
- On-call schedule defined
- Escalation policies defined
- Application logging configured
- Log rotation configured
- Log aggregation configured (ELK/Loki)
- Log retention policy set
- Log search functionality tested
- Batch size calculated for available VRAM
- Learning rate schedule configured
- Gradient accumulation configured
- Mixed precision (FP16/BF16) enabled
- Gradient checkpointing configured
- Flash Attention enabled
- 8-bit optimizer configured (if applicable)
- KV cache optimized
- Dataloader workers optimized
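
The batch size and gradient accumulation items above come down to simple arithmetic. The sketch below shows one way to estimate them. It is an approximation under stated assumptions: the fixed footprint (weights, optimizer state) and the per-sample activation memory must be measured for the actual model, and the 2 GiB reserve is an arbitrary safety margin, not a ThemisDB recommendation.

```python
import math


def max_batch_size(vram_gib, fixed_gib, per_sample_gib, reserve_gib=2.0):
    """Largest per-GPU batch that fits after the fixed footprint and a reserve."""
    usable = vram_gib - reserve_gib - fixed_gib
    if usable <= 0:
        return 0
    return math.floor(usable / per_sample_gib)


def accumulation_steps(target_global_batch, per_gpu_batch, num_gpus):
    """Gradient accumulation steps needed to reach the target global batch size."""
    return math.ceil(target_global_batch / (per_gpu_batch * num_gpus))
```

With an 80 GiB GPU, a 30 GiB fixed footprint, and 1.5 GiB per sample, `max_batch_size(80, 30, 1.5)` gives 32. Reaching a global batch of 512 on 4 such GPUs then takes `accumulation_steps(512, 32, 4)`, which is 4 steps.
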
- Baseline performance benchmarks completed
- GPU utilization target met (>85%)
- Training throughput acceptable
- Inference latency acceptable
- Memory usage within limits
- I/O bottlenecks identified and resolved
- Architecture diagram documented
- Network topology documented
- Configuration files documented
- Runbooks prepared
- Troubleshooting guide accessible
- Emergency procedures documented
- Contact information documented
- Service dependencies documented
- Team trained on ThemisDB operations
- Admin access granted to team members
- On-call rotation established
- Escalation procedures defined
- Communication channels established
- Incident response plan reviewed
- Disaster recovery plan prepared
- Unit tests passed
- Integration tests passed
- End-to-end tests passed
- Load testing completed
- Stress testing completed
- Failover testing completed
- Backup/restore tested
- Security testing completed
- Full backup procedure tested
- Incremental backup procedure tested
- Checkpoint backup automated
- Backup verification automated
- Recovery time objective (RTO) defined
- Recovery point objective (RPO) defined
- Backup monitoring configured
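
Once the RPO is defined, backup monitoring can compare the age of the most recent verified backup against it. This is a minimal sketch with a hypothetical function name; how the timestamp of the last backup is obtained depends on the backup tooling and is not shown.

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(last_backup_at, rpo, now=None):
    """True when the newest backup is older than the recovery point objective."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_backup_at) > rpo
```

With an RPO of `timedelta(hours=1)`, a backup taken 90 minutes ago breaches the objective, while one taken 30 minutes ago does not.
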
- DR plan documented
- DR site identified (if applicable)
- DR runbooks prepared
- DR testing scheduled
- Data replication configured (if applicable)
- Failover procedure tested
- Compliance requirements identified
- Data classification completed
- Data retention policies defined
- Privacy policies reviewed
- Terms of service reviewed
- Export control requirements reviewed
- Industry-specific compliance verified
- Change management process defined
- Change approval workflow established
- Change rollback procedures defined
- Change log/audit trail enabled
- Maintenance windows scheduled
- All checklist items completed
- Stakeholder approval obtained
- Go-live plan reviewed
- Communication plan prepared
- Rollback plan prepared
- Support team on standby
- Monitoring dashboards actively watched during go-live
- Success criteria defined
- All critical items: ✓
- High-priority items: ✓
- Medium-priority items: Acceptable
- Risk assessment: Acceptable
- Team readiness: Confirmed
- DECISION: GO / NO-GO
Deployment Lead: _________________ Date: _______
Security Lead: _________________ Date: _______
Operations Lead: _________________ Date: _______
Engineering Lead: _________________ Date: _______
Stakeholder Approval: _________________ Date: _______
Use this section to document any deviations, exceptions, or special considerations:
[Notes here]
After completing this checklist:
- Proceed to deployment (DEPLOYMENT.md)
- Complete post-deployment checklist (post_deployment.md)
- Set up monitoring (MONITORING.md)
Checklist Version: 1.0
Last Updated: April 2026