Cut PyTorch model loading time from 15s to 0.2s with zero-copy shared memory caching.
Overmind is a non-intrusive caching library that dramatically speeds up PyTorch model loading by storing serialized models in shared memory. Once a model is loaded, subsequent loads from any process take milliseconds instead of seconds.
Named after the Overmind from StarCraft, it coordinates model caching across processes like the Overmind coordinates the Zerg Swarm.
Note that the package name on PyPI is overmind-cache, since overmind is taken.
- Fast model loading - First load caches to shared memory; subsequent loads are ~5x faster
- Process-agnostic - Cache persists across process restarts via a background server
- Non-intrusive - Just add one line of code; no changes to model loading logic
- Memory efficient - Multiple processes share the same cached tensors in memory
- Broad compatibility - Works with diffusers, transformers, bitsandbytes quantization, and vanilla torch.load
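The core idea is that cached tensor data lives in a named shared-memory block that any process can map without copying. The snippet below is a minimal sketch of that idea using Python's standard multiprocessing.shared_memory; it is purely illustrative and not Overmind's actual implementation:

```python
import numpy as np
import torch
from multiprocessing import shared_memory

# Producer: copy a tensor's data into a named shared-memory block once.
tensor = torch.randn(4, 4)
arr = tensor.numpy()
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes, name="demo_weights")
shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
shared_arr[:] = arr

# Consumer (could be another process): map the same block by name and rebuild
# the tensor as a view over the shared buffer, with no extra copy. In a real
# setup the shape/dtype metadata would be shared alongside the block.
shm2 = shared_memory.SharedMemory(name="demo_weights")
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm2.buf)
restored = torch.from_numpy(view)
print(torch.equal(tensor, restored))  # True

# Cleanup; in real use a cache server would own the block's lifetime.
del restored, view, shared_arr
shm2.close()
shm.close()
shm.unlink()
```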
```bash
pip install overmind-cache
```

Or install from source:

```bash
git clone https://github.com/taichi-dev/overmind.git
cd overmind
pip install -e .
```

Add a single line at the top of your script to automatically accelerate all supported model loading:
```python
import overmind.api
overmind.api.monkey_patch_all()

# Your existing code works unchanged!
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipeline.to('cuda')

# First run: ~24s
# Subsequent runs: ~1s (mostly spent in .to('cuda'))
```

If you prefer not to monkey-patch, use the load function directly:
```python
import torch
from overmind.api import load
from diffusers import DiffusionPipeline

pipeline = load(
    DiffusionPipeline.from_pretrained,
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
```

Overmind automatically patches these loading functions:
| Library | Functions |
|---|---|
| Diffusers | DiffusionPipeline.from_pretrained, ModelMixin.from_pretrained, SchedulerMixin.from_pretrained, FromSingleFileMixin.from_single_file |
| Transformers | PreTrainedModel.from_pretrained, PreTrainedTokenizerBase.from_pretrained, AutoProcessor.from_pretrained, pipeline |
| PyTorch | torch.load, torch.jit.load |
| Safetensors | safetensors.torch.load_file |
| TorchVision | vgg16, vgg19 |
| OpenCLIP | create_model_and_transforms |
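For example, once monkey_patch_all() has been called, an ordinary torch.load call is served from the cache as well. The checkpoint path below is a placeholder used only for illustration:

```python
import overmind.api
overmind.api.monkey_patch_all()

import torch

# "checkpoint.pt" is a placeholder path; any torch.load call is now cached.
state_dict = torch.load("checkpoint.pt", map_location="cpu")
```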
Create an overmind.cfg file in your package root to add custom patch points:
```
# overmind.cfg
mylib.models::MyModel.from_pretrained
mylib.utils::load_checkpoint
```
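Each entry appears to take the form module.path::attribute, i.e. an importable module followed by a callable inside it. A hypothetical mylib matching the entries above might look like this:

```python
# mylib/models.py  (hypothetical package referenced by the cfg above)
class MyModel:
    @classmethod
    def from_pretrained(cls, name, **kwargs):
        # Existing loading logic; Overmind wraps this classmethod via the
        # mylib.models::MyModel.from_pretrained entry.
        ...

# mylib/utils.py
def load_checkpoint(path):
    # Existing checkpoint loader; wrapped via mylib.utils::load_checkpoint.
    ...
```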
```bash
# Start the server manually (usually auto-started)
overmind-server

# Start as a daemon
overmind-server --daemon

# List cached models
overmind-list

# Shut down the server (clears the cache)
overmind-shutdown
```

The following environment variables control Overmind's behavior:

| Variable | Description |
|---|---|
| OVERMIND_DISABLE | Set to any value to disable Overmind, falling back to a local cache |
| OVERMIND_NO_LOCAL_CACHE | Disable local caching as well |
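For example, to disable Overmind for a single script, set the variable before anything from Overmind is imported (this assumes the variable is read when overmind.api initializes):

```python
import os

# Assumption: OVERMIND_DISABLE is read when Overmind initializes, so set it
# before importing overmind.api.
os.environ["OVERMIND_DISABLE"] = "1"

import overmind.api
overmind.api.monkey_patch_all()  # falls back to the local cache
```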
Loading a Stable Diffusion ControlNet pipeline with VAE on Linux (Intel i9-11900K, RTX 4090), using demo-vae.py as an example:

| Run | vae | depth | edge | pipeline | to('cuda') | Total |
|---|---|---|---|---|---|---|
| w/o Overmind (2nd+) | 1.18s | 0.98s | 1.41s | 1.65s | 0.91s | 6.16s |
| w/ Overmind (1st) | 5.44s | 5.17s | 5.41s | 7.29s | 0.86s | 24.20s |
| w/ Overmind (2nd+) | 0.00s | 0.01s | 0.01s | 0.20s | 0.87s | 1.12s |
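A simple way to reproduce this kind of measurement is to time each step. The sketch below is a generic example, not the actual demo-vae.py (which additionally loads VAE and ControlNet models):

```python
import time

import torch
import overmind.api
overmind.api.monkey_patch_all()

from diffusers import DiffusionPipeline

t0 = time.perf_counter()
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
t1 = time.perf_counter()
pipeline.to("cuda")
t2 = time.perf_counter()

print(f"load: {t1 - t0:.2f}s  to('cuda'): {t2 - t1:.2f}s")
```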
The first load with Overmind is slower due to pickling overhead. Subsequent loads are 5-6x faster than without Overmind, with the only remaining cost being the to('cuda') transfer.
Apache 2.0
Contributions are welcome! Please feel free to submit a Pull Request.
Developed by Taichi Graphics for production AI inference workloads.