Thank you for your interest in contributing to the AI Systems Performance Engineering repository! This guide will help you get started with contributing code, documentation, examples, and improvements.
We welcome contributions from the community in many forms:
- Code Examples: New CUDA kernels, PyTorch optimizations, performance scripts
- Documentation: Improvements to README files, code comments, tutorials
- Performance Optimizations: Better algorithms, memory optimizations, profiling tools
- Bug Fixes: Issues with existing code, compatibility problems
- Architecture Support: Extend Blackwell workflows or add tooling for new GPU families
- Testing: Unit tests, performance benchmarks, validation scripts
- NVIDIA GPU with CUDA support
- Python 3.8+
- PyTorch with CUDA
- Git
```bash
# Fork and clone the repository
git clone https://github.com/your-username/ai-performance-engineering.git
cd ai-performance-engineering

# Create a new branch for your contribution
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r code/ch1/requirements.txt
```

- Python: Follow PEP 8 style guidelines
- CUDA: Use consistent naming conventions and proper error handling
- Shell Scripts: Use bash with proper error handling (`set -e`)
- Comments: Add clear, descriptive comments for complex logic

- New Examples: Place in the appropriate chapter directory (`code/chX/`)
- Tools: Add to the `tools/` directory
- Scripts: Add to `scripts/` or the relevant chapter directory
- Documentation: Update relevant README files
The main branch targets Blackwell B200/B300 (SM100) exclusively. New examples should default to `ARCH ?= sm_100` and inherit the CUDA 12.9 toolchain. If you prototype support for other GPUs, keep it behind clearly documented flags or submit it as a separate branch.
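For illustration, a minimal Makefile fragment following that convention might look like this (the kernel and file names are hypothetical, not part of the repository):

```make
# Default to Blackwell B200/B300 (SM100); override only behind documented flags.
ARCH ?= sm_100
NVCC ?= nvcc

my_kernel: my_kernel.cu
	$(NVCC) -arch=$(ARCH) -O3 -o $@ $<
```

Because `?=` assigns only when the variable is unset, `make ARCH=sm_90` still works for prototyping without changing the default.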
- Create new CUDA kernels or PyTorch optimizations
- Add performance profiling scripts
- Implement new algorithms or techniques
- Improve README files with better explanations
- Add code comments and docstrings
- Create tutorials or guides
- Optimize existing code for better performance
- Add new profiling tools
- Improve memory usage or compute efficiency
```bash
# Make your changes, then test your code thoroughly

# Run tests (if applicable)
python -m pytest tests/

# Check code style
black code/
flake8 code/

# Run performance benchmarks
./code/build_all.sh

# Profile your changes
python scripts/profile_harness.py --profile nsys --profile pytorch --examples ch6_add_parallel --output-root profiles/test_run

# Compare with baseline
python tools/comprehensive_profiling.py
```

- Confirm runs on Blackwell B200/B300 hardware
- Verify the PyTorch 2.9 nightly/cu129 environment
- Ensure CUDA 12.9 toolkit compatibility
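Those environment checks can be scripted. A minimal sanity check (the version expectations in the comments reflect the main branch's supported toolchain):

```python
import torch

print(torch.__version__)          # expect a 2.9 nightly build
print(torch.version.cuda)         # expect "12.9" on the supported toolchain
print(torch.cuda.is_available())  # True on a working CUDA install
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. an NVIDIA B200/B300
```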
```bash
# Add your changes
git add .

# Commit with a descriptive message
git commit -m "Add new CUDA kernel for memory optimization

- Implements coalesced memory access pattern
- Targets NVIDIA Blackwell B200/B300
- Includes performance benchmarks
- Adds comprehensive documentation"

# Push to your fork
git push origin feature/your-feature-name
```

- Test thoroughly on Blackwell hardware (or simulator)
- Update documentation if needed
- Add comments for complex code
- Include performance benchmarks for optimizations
- Follow naming conventions and code style
- Update relevant README files
## Description
Brief description of your changes
## Type of Change
- [ ] New feature (code example, optimization)
- [ ] Bug fix
- [ ] Documentation update
- [ ] Performance improvement
- [ ] Blackwell workflow improvement
## Testing
- [ ] Tested on Blackwell B200/B300 (sm_100)
- [ ] Performance benchmarks included
- [ ] Documentation updated
## Performance Impact
- **Before**: [baseline metrics]
- **After**: [improved metrics]
- **Improvement**: [percentage/description]
## Additional Notes
Any additional context or considerations

When adding support for new GPU architectures:
- Update architecture detection scripts
- Add new architecture constants
- Test on target hardware
- Update documentation
If you experiment with additional architectures, document the changes clearly and avoid regressing the default Blackwell workflow. Consider maintaining separate branches for architecture-specific divergences to keep main lean.
- Baseline: Always include baseline performance
- Multiple Runs: Run benchmarks multiple times
- Hardware Specs: Document test hardware
- Environment: Specify CUDA/PyTorch versions
```python
# Performance benchmark example
import time
import torch

def benchmark_kernel():
    # Setup
    device = torch.device('cuda')
    size = 1024 * 1024

    # Warmup
    for _ in range(10):
        # Your kernel here
        pass

    # Benchmark
    torch.cuda.synchronize()  # start timing only after queued GPU work finishes
    start = time.time()
    for _ in range(100):
        # Your kernel here
        pass
    torch.cuda.synchronize()  # CUDA launches are asynchronous; wait before stopping the clock
    end = time.time()

    # Report
    avg_time = (end - start) / 100
    throughput = size / avg_time
    print(f"Average time: {avg_time:.6f}s")
    print(f"Throughput: {throughput:.2f} ops/s")
```

When reporting bugs, please include:
- Hardware: GPU model, driver version
- Software: CUDA version, PyTorch version
- Steps: Clear reproduction steps
- Expected vs Actual: What you expected vs what happened
- Logs: Error messages and logs
## Bug Description
Clear description of the issue
## Steps to Reproduce
1. Step 1
2. Step 2
3. Step 3
## Expected Behavior
What you expected to happen
## Actual Behavior
What actually happened
## Environment
- GPU: [Model]
- CUDA: [Version]
- PyTorch: [Version]
- OS: [Version]
## Additional Context
Any other relevant information

When updating documentation:
- Clarity: Make explanations clear and concise
- Examples: Include practical code examples
- Links: Add relevant links and references
- Structure: Maintain consistent formatting
- Purpose: Explain what the code does
- Parameters: Document function parameters
- Returns: Document return values
- Complexity: Explain complex algorithms
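Applied to a small Python helper, those conventions might read as follows (the function is a made-up example, not part of the repository):

```python
def scaled_sum(values, factor=1.0):
    """Sum the inputs after scaling each element.

    Purpose: demonstrates the docstring conventions above.

    Parameters:
        values (iterable of float): Numbers to accumulate.
        factor (float): Multiplier applied to each element before summing.

    Returns:
        float: The scaled total, i.e. factor * sum(values).
    """
    return factor * sum(values)
```

For example, `scaled_sum([1.0, 2.0, 3.0], factor=2.0)` returns `12.0`.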
- New CUDA Kernels: Optimized implementations
- PyTorch Optimizations: Framework-specific improvements
- Profiling Tools: Better performance analysis
- Architecture Support: New GPU compatibility
- Documentation: Tutorials and guides
- Memory Optimization: New memory access patterns
- Kernel Fusion: Combining multiple operations
- Tensor Core Usage: Optimized matrix operations
- Stream Management: Better asynchronous execution
- Distributed Training: Multi-GPU optimizations
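As a rough, CPU-only sketch of the kernel-fusion idea: fusing two elementwise passes into one loop removes a full traversal and the intermediate buffer, which is the same principle a fused GPU kernel exploits with global memory traffic.

```python
def two_pass(xs):
    # Unfused: two traversals plus a temporary list between them.
    scaled = [x * 2.0 for x in xs]     # pass 1: scale
    return [s + 1.0 for s in scaled]   # pass 2: shift

def fused(xs):
    # Fused: one traversal, no intermediate buffer.
    return [x * 2.0 + 1.0 for x in xs]

assert two_pass([1.0, 2.0]) == fused([1.0, 2.0])
```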
- Issues: Use GitHub issues for questions
- Discussions: Start discussions for ideas
- Meetups: Join our monthly meetups
- YouTube: Check our video tutorials
- GitHub Issues: For bugs and feature requests
- Discussions: For questions and ideas
- Email: For private or sensitive matters
By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).
Contributors will be recognized in:
- README.md: For significant contributions
- Release Notes: For each release
- Documentation: In relevant sections
- Community: In meetups and presentations
Thank you for contributing to the AI Performance Engineering community.