A PyTorch implementation for training transformer language models on large text datasets like SlimPajama.
- Transformer Model: Transformer architecture with configurable parameters
- Training Configuration: Clean configuration object for all hyperparameters
- SlimPajama Dataset: Support for the SlimPajama-627B dataset with efficient data loading
- Training Loop: Complete training implementation with loss tracking and evaluation
- Multi-process Data Loading: Efficient data loading with separate processes
- Neptune Integration: Optional experiment tracking with Neptune
- Chinchilla Scaling: Automatic computation of the optimal number of training steps
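The Chinchilla step computation can be sketched as below. The roughly 20-tokens-per-parameter rule of thumb comes from the Chinchilla scaling analysis (Hoffmann et al., 2022); the function name and signature here are illustrative, not the repository's actual API.

```python
def chinchilla_optimal_steps(n_params, batch_size, sequence_length,
                             tokens_per_param=20):
    """Estimate the Chinchilla-optimal number of training steps.

    The Chinchilla analysis suggests training on ~20 tokens per model
    parameter. Dividing that token budget by the tokens consumed per
    optimizer step gives a step count.
    """
    optimal_tokens = tokens_per_param * n_params
    tokens_per_step = batch_size * sequence_length
    return optimal_tokens // tokens_per_step

# Example: a 44M-parameter model, batch size 64, sequence length 512
steps = chinchilla_optimal_steps(44_000_000, 64, 512)
print(steps)  # 26855
```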
Install the required dependencies:

```bash
pip install -r requirements.txt
```

First, download and prepare the SlimPajama dataset:
```bash
# Download training data
python download_data.py --dataset slimpajama --split train

# Download validation data
python download_data.py --dataset slimpajama --split validation
```

This will create `data/slimpajama_train/` and `data/slimpajama_validation/` directories containing the processed JSONL files.
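A minimal sketch of reading the downloaded shards back, assuming each JSONL line is a JSON object with a `"text"` field as in the SlimPajama release (the helper name is illustrative, not part of the repository):

```python
import json
from pathlib import Path

def iter_documents(data_dir):
    """Yield the raw text of every document in a directory of JSONL shards.

    Assumes the download step wrote *.jsonl files where each line is a
    JSON object with a "text" field.
    """
    for shard in sorted(Path(data_dir).glob("*.jsonl")):
        with shard.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)["text"]
```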
Run the language model training:
```bash
# Training without Neptune logging
python language_model_training.py --no_neptune --description "Local training"

# Use a different model configuration
python language_model_training.py --model_config chinchilla-44m --description "Small model test"

# Profile mode (short run for testing)
python language_model_training.py --profile_only
```

The training run proceeds as follows:

- Data Loading: Loads the SlimPajama dataset with tokenization and batching
- Model Creation: Initializes a transformer model with specified configuration
- Training: Runs training loop with AdamW optimizer and learning rate scheduling
- Monitoring: Tracks loss, learning rate, and performance metrics during training
- Evaluation: Periodic evaluation on validation data
- Experiment Tracking: Optional Neptune integration for experiment management
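The optimizer and scheduling steps above can be sketched as follows. The README does not name the exact learning-rate schedule, so linear warmup followed by cosine decay is assumed here; `make_scheduler` and `training_step` are illustrative names, not the repository's API.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, warmup_steps, total_steps):
    """Linear warmup then cosine decay to zero (assumed schedule)."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return LambdaLR(optimizer, lr_lambda)

def training_step(model, batch, optimizer, scheduler):
    """One optimization step: forward pass, cross-entropy loss,
    backward pass, parameter update, and learning-rate update."""
    inputs, targets = batch
    logits = model(inputs)  # (batch, seq_len, vocab_size)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```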
You can modify the hyperparameters in the `run()` function in `language_model_training.py`:

- batch_size: Training batch size
- sequence_length: Maximum sequence length
- learning_rate: Learning rate for the AdamW optimizer
- warmup_steps: Number of warmup steps for the learning rate schedule
- model_config: Transformer architecture configuration
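A minimal sketch of such a configuration object, using the hyperparameter names listed above; the default values are illustrative and the repository's actual configuration object may differ.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Field names mirror the hyperparameters listed above;
    # defaults here are placeholders, not the repository's values.
    batch_size: int = 64
    sequence_length: int = 512
    learning_rate: float = 3e-4
    warmup_steps: int = 1000
    model_config: str = "chinchilla-44m"

# Override any field at construction time
config = TrainingConfig(batch_size=32)
print(config.learning_rate)  # 0.0003
```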