The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Fine-tune → Export — All in One Line
Quick Start • Features • Export Formats • Examples • Documentation
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
quantization_config=bnb_config,
device_map="auto",
)
# Then llama.cpp compilation for GGUF...
# Then manual tensor conversion...from quantllm import turbo
model = turbo("meta-llama/Llama-3-8B") # Auto-quantizes
model.generate("Hello!") # Generate text
model.export("gguf", quantization="Q4_K_M") # Export to GGUF# Recommended
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With all export formats
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"from quantllm import turbo
# Load with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")
# Generate text
response = model.generate("Explain quantum computing simply")
print(response)
# Export to GGUF
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")QuantLLM automatically:
- ✅ Detects your GPU and available memory
- ✅ Applies optimal 4-bit quantization
- ✅ Enables Flash Attention 2 when available
- ✅ Configures memory management
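This kind of detection can be approximated with plain PyTorch. The snippet below is an illustrative sketch, not QuantLLM's actual implementation: it probes the GPU, measures VRAM, checks whether the flash-attn package is importable, and picks a hypothetical 4-bit threshold.

```python
# Illustrative sketch of hardware probing (not QuantLLM's internal code).
import importlib.util
import torch

def probe_environment():
    """Return a rough picture of the available accelerator."""
    info = {"device": "cpu", "total_vram_gb": 0.0, "flash_attn": False}
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["device"] = props.name
        info["total_vram_gb"] = props.total_memory / 1024**3
        # Flash Attention 2 needs the flash-attn package and an Ampere (SM 8.0) or newer GPU.
        info["flash_attn"] = (
            importlib.util.find_spec("flash_attn") is not None and props.major >= 8
        )
    return info

env = probe_environment()
use_4bit = env["total_vram_gb"] < 24  # hypothetical threshold for this sketch
print(env, "-> use 4-bit:", use_4bit)
```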
One unified interface for everything:
model = turbo("mistralai/Mistral-7B")
model.generate("Hello!")
model.finetune(data, epochs=3)
model.export("gguf", quantization="Q4_K_M")
model.push("user/repo", format="gguf")- Flash Attention 2 — Auto-enabled for speed
- torch.compile — 2x faster training
- Dynamic Padding — 50% less VRAM
- Triton Kernels — Fused operations
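These optimizations correspond to standard PyTorch and Transformers features. The sketch below shows what torch.compile and dynamic padding mean in practice; it is not QuantLLM's internal code, and the small gpt2 model is used only for illustration.

```python
# Sketch: applying torch.compile and per-batch (dynamic) padding manually.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Flash Attention 2 would be requested at load time via attn_implementation="flash_attention_2"
# on supported models and GPUs.

# torch.compile (PyTorch >= 2.0) traces and fuses kernels for faster steps.
model = torch.compile(model)

# Dynamic padding: pad each batch only to the longest sequence in that batch
# instead of a fixed max length, which avoids wasting VRAM on short samples.
collator = DataCollatorWithPadding(tokenizer)
batch = collator([tokenizer("short"), tokenizer("a somewhat longer example sentence")])
print(batch["input_ids"].shape)  # padded only to the longest item in this batch
```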
Supported architectures: Llama 2/3, Mistral, Mixtral, Qwen 1/2, Phi 1/2/3, Gemma, Falcon, DeepSeek, Yi, StarCoder, ChatGLM, InternLM, Baichuan, StableLM, BLOOM, OPT, MPT, GPT-NeoX...
| Format | Use Case | Command |
|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio | model.export("gguf") |
| ONNX | ONNX Runtime, TensorRT | model.export("onnx") |
| MLX | Apple Silicon (M1/M2/M3/M4) | model.export("mlx") |
| SafeTensors | HuggingFace | model.export("safetensors") |
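Once exported, the artifacts are consumed by the target runtime rather than by QuantLLM itself. For example, a GGUF file can be loaded with the llama-cpp-python package; the file name below is just a placeholder for whatever you exported.

```python
# Sketch: running an exported GGUF file with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q4_K_M.gguf",  # path produced by model.export("gguf", ...)
    n_ctx=4096,                      # context window
)
out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```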
Loading a model prints a summary like this:

╔════════════════════════════════════════════════════════════╗
║ 🚀 QuantLLM v2.0.0 ║
║ Ultra-fast LLM Quantization & Export ║
║ ✓ GGUF ✓ ONNX ✓ MLX ✓ SafeTensors ║
╚════════════════════════════════════════════════════════════╝
📊 Model: meta-llama/Llama-3.2-3B
Parameters: 3.21B
Memory: 6.4 GB → 1.9 GB (70% saved)
Auto-generates model cards with YAML frontmatter, usage examples, and a "Use this model" button:
model.push("user/my-model", format="gguf", quantization="Q4_K_M")

Export to any deployment target with a single line:
from quantllm import turbo
model = turbo("microsoft/phi-3-mini")
# GGUF — For llama.cpp, Ollama, LM Studio
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
# ONNX — For ONNX Runtime, TensorRT
model.export("onnx", "./model-onnx/")
# MLX — For Apple Silicon Macs
model.export("mlx", "./model-mlx/", quantization="4bit")
# SafeTensors — For HuggingFace
model.export("safetensors", "./model-hf/")| Type | Bits | Quality | Use Case |
|---|---|---|---|
Q2_K |
2-bit | 🔴 Low | Minimum size |
Q3_K_M |
3-bit | 🟠 Fair | Very constrained |
Q4_K_M |
4-bit | 🟢 Good | Recommended ⭐ |
Q5_K_M |
5-bit | 🟢 High | Quality-focused |
Q6_K |
6-bit | 🔵 Very High | Near-original |
Q8_0 |
8-bit | 🔵 Excellent | Best quality |
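As a rule of thumb, on-disk size is roughly parameter count times effective bits per weight divided by 8; K-quants land a bit above their nominal width because some tensors stay at higher precision. The bits-per-weight values in this sketch are approximations for estimation only, not format specifications.

```python
# Rough GGUF size estimate: params * effective bits-per-weight / 8.
# The bits-per-weight numbers below are approximations, not exact format specs.
PARAMS = 3.21e9  # e.g. a Llama-3.2-3B class model

approx_bpw = {"Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

for qtype, bpw in approx_bpw.items():
    size_gb = PARAMS * bpw / 8 / 1024**3
    print(f"{qtype:7s} ~{size_gb:.1f} GB")
```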
Text generation and chat:

from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
# Simple generation
response = model.generate(
"Write a Python function for fibonacci",
max_new_tokens=200,
temperature=0.7,
)
print(response)
# Chat format
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)
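The messages list uses the standard role/content schema. For HF models, chat formatting is typically handled by the tokenizer's chat template; the sketch below shows that mechanism directly with plain transformers, as an assumption about how a chat() helper can be built rather than QuantLLM's actual code.

```python
# Sketch: turning role/content messages into a model prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # any chat model's tokenizer
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
# Render the conversation into the prompt string the model was trained on,
# appending the tokens that cue the assistant's reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # pass this string to any generate() call
```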
Load a pre-quantized GGUF model:

from quantllm import TurboModel

model = TurboModel.from_gguf(
"TheBloke/Llama-2-7B-Chat-GGUF",
filename="llama-2-7b-chat.Q4_K_M.gguf"
)
print(model.generate("Hello!"))

Fine-tune on your own data with a single call:

from quantllm import turbo
model = turbo("mistralai/Mistral-7B")
# Simple training
model.finetune("training_data.json", epochs=3)
# Advanced configuration
model.finetune(
"training_data.json",
epochs=5,
learning_rate=2e-4,
lora_r=32,
lora_alpha=64,
batch_size=4,
)
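The lora_r and lora_alpha arguments correspond to standard PEFT LoRA hyperparameters. For reference, the equivalent configuration expressed directly with the peft library looks roughly like this; it is a sketch of the underlying mechanism, not necessarily how QuantLLM wires it internally.

```python
# Sketch: the PEFT LoRA configuration implied by lora_r=32, lora_alpha=64.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_cfg = LoraConfig(
    r=32,                # rank of the low-rank update matrices
    lora_alpha=64,       # scaling factor (alpha / r scales the update)
    lora_dropout=0.05,   # a common default, assumed here
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```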
Supported data formats:

[
{"instruction": "What is Python?", "output": "Python is..."},
{"text": "Full text for language modeling"},
{"prompt": "Question", "completion": "Answer"}
]
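Whichever shape the records take, they are ultimately rendered into plain training text. The snippet below is a hypothetical illustration of how an instruction/output record could be flattened into a prompt string; the template is an assumption, not QuantLLM's exact format.

```python
# Hypothetical flattening of an instruction-style record into training text.
import json

records = json.loads("""[
  {"instruction": "What is Python?", "output": "Python is a programming language."}
]""")

def to_text(rec: dict) -> str:
    # Assumed template; real libraries usually make this configurable.
    return f"### Instruction:\n{rec['instruction']}\n\n### Response:\n{rec['output']}"

print(to_text(records[0]))
```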
model = turbo("meta-llama/Llama-3.2-3B")
# Push with auto-generated model card
model.push(
"your-username/my-model",
format="gguf",
quantization="Q4_K_M",
license="apache-2.0"
)
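The auto-generated model card is ordinary Hub metadata: YAML frontmatter followed by markdown. As an illustration of that metadata layer, the sketch below builds comparable frontmatter with the standard huggingface_hub helpers; the field values are assumptions for this example, not QuantLLM's exact output.

```python
# Sketch: the kind of YAML frontmatter a generated model card carries,
# built with the standard huggingface_hub helper (values are illustrative).
from huggingface_hub import ModelCardData

card_data = ModelCardData(
    license="apache-2.0",
    base_model="meta-llama/Llama-3.2-3B",
    tags=["gguf", "quantized", "q4_k_m"],
)
print(card_data.to_yaml())  # this block sits at the top of README.md on the Hub
```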
Hardware requirements:

| Configuration | GPU VRAM | Recommended Models |
|---|---|---|
| 🟢 Entry | 6-8 GB | 1-7B (4-bit) |
| 🟡 Mid-Range | 12-24 GB | 7-30B (4-bit) |
| 🔴 High-End | 24-80 GB | 70B+ |
Tested GPUs: RTX 3060/3070/3080/3090/4070/4080/4090, A100, H100, Apple M1/M2/M3/M4
# Basic
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With specific features
pip install "quantllm[gguf]" # GGUF export
pip install "quantllm[onnx]" # ONNX export
pip install "quantllm[mlx]" # MLX export (Apple Silicon)
pip install "quantllm[triton]" # Triton kernels
pip install "quantllm[full]" # Everythingquantllm/
├── core/ # Core API
│ ├── turbo_model.py # TurboModel unified API
│ └── smart_config.py # Auto-configuration
├── quant/ # Quantization
│ └── llama_cpp.py # GGUF conversion
├── hub/ # HuggingFace
│ ├── hub_manager.py # Push/pull models
│ └── model_card.py # Auto model cards
├── kernels/ # Custom kernels
│ └── triton/ # Fused operations
└── utils/ # Utilities
└── progress.py # Beautiful UI
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
pip install -e ".[dev]"
pytest

Areas for contribution:
- 🆕 New model architectures
- 🔧 Performance optimizations
- 📚 Documentation
- 🐛 Bug fixes
MIT License — see LICENSE for details.