Intelligent network traffic forecasting using state-of-the-art time series analysis on the CESNET-TimeSeries-2023-2024 dataset
Features • Quick Start • Documentation • Dataset • Examples • Contributing
- Overview
- Key Features
- Architecture
- Dataset Structure
- Installation
- Quick Start
- Usage Examples
- Model Configuration
- Evaluation Metrics
- Project Structure
- Advanced Usage
- Performance & Benchmarks
- Contributing
- License
- Acknowledgments
- Citation
TimeSeries-NetTraffic-Engine is a framework for forecasting and analyzing network traffic with time series models. It is built on the CESNET-TimeSeries-2023-2024 dataset and uses SARIMA (Seasonal AutoRegressive Integrated Moving Average) models to predict network behavior patterns.
This framework provides researchers, network engineers, and data scientists with powerful tools to:
- Forecast Network Traffic: Predict future network behavior using historical patterns
- Analyze Time Series: Understand temporal patterns in network metrics across different aggregation levels
- Evaluate Performance: Comprehensive evaluation using RMSE, SMAPE, and R² metrics
- Scale Analysis: Process multiple IP addresses, institutions, and subnets simultaneously
- Automated Retraining: Implement sliding window approaches for continuous model improvement

Typical use cases include:
- Network Capacity Planning: Predict bandwidth requirements and optimize infrastructure
- Anomaly Detection: Identify unusual traffic patterns by comparing predictions with actual values
- Resource Optimization: Allocate network resources efficiently based on forecasted demand
- Security Analytics: Detect potential DDoS attacks or unusual traffic patterns
- SLA Management: Ensure service level agreements through predictive maintenance
- Multi-Scale Analysis: Support for 10-minute, 1-hour, and 1-day aggregation intervals
- Automated Retraining: Sliding window approach with configurable training and testing periods
- 18 Network Metrics: Comprehensive coverage including flows, packets, bytes, ASN diversity, port diversity, and TCP/UDP ratios
- High-Performance Forecasting: SARIMA model with optimized hyperparameters
- Missing Value Handling: Intelligent gap-filling strategies for time series continuity
- Multi-Dataset Support: Works with IP addresses, institutions, and institution subnets
- Visualization Suite: Rich plotting capabilities for exploratory data analysis
- Parallel Processing: Metacentrum scripts for large-scale batch processing
- Reproducible Research: Clear documentation of all preprocessing and modeling steps
- Scalable Architecture: Designed for processing thousands of time series
- Flexible Configuration: Easy customization of model parameters and evaluation settings
- Production-Ready Code: Clean, well-documented, and maintainable codebase
- Comprehensive Evaluation: Multiple statistical metrics for robust performance assessment
┌─────────────────────────────────────────────────────────────┐
│                 CESNET Time Series Dataset                  │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────────────┐  │
│  │IP Addresses│  │ Institutions│  │ Institution Subnets  │  │
│  └────────────┘  └─────────────┘  └──────────────────────┘  │
└────────────┬────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│              Data Preprocessing & Gap Filling               │
│  • Missing value imputation                                 │
│  • Ratio metrics normalization (0.5)                        │
│  • Temporal alignment                                       │
└────────────┬────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│                   SARIMA Modeling Engine                    │
│  Order: (p=1, d=1, q=1)                                     │
│  Seasonal Order: (P=1, D=1, Q=1, M=168)                     │
│  Training: 744 hours (31 days)                              │
│  Testing: 168 hours (7 days)                                │
└────────────┬────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│               Sliding Window Retraining Loop                │
│  • Train on historical window                               │
│  • Forecast next period                                     │
│  • Slide window forward                                     │
│  • Repeat until dataset exhausted                           │
└────────────┬────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│                    Evaluation & Analysis                    │
│  • RMSE (Root Mean Squared Error)                           │
│  • SMAPE (Symmetric Mean Absolute Percentage Error)         │
│  • R² Score (Coefficient of Determination)                  │
│  • Statistical distributions                                │
│  • Aggregate statistics                                     │
└─────────────────────────────────────────────────────────────┘
The framework works with the comprehensive CESNET network traffic dataset containing:
| Dataset Part | Description | Granularity |
|---|---|---|
| IP Addresses (Sample) | Representative sample of individual IP addresses | Individual hosts |
| IP Addresses (Full) | Complete set of monitored IP addresses | Individual hosts |
| Institutions | Aggregated traffic per institution | Organizational level |
| Institution Subnets | Traffic per institution subnet | Network segment level |
- 10 Minutes: High-resolution, short-term pattern analysis
- 1 Hour: Medium-resolution, ideal for daily pattern detection
- 1 Day: Low-resolution, long-term trend analysis
| Category | Metrics |
|---|---|
| Volume | n_flows, n_packets, n_bytes |
| ASN Diversity | sum_n_dest_asn, average_n_dest_asn, std_n_dest_asn |
| Port Diversity | sum_n_dest_ports, average_n_dest_ports, std_n_dest_ports |
| IP Diversity | sum_n_dest_ip, average_n_dest_ip, std_n_dest_ip |
| Protocol Ratios | tcp_udp_ratio_packets, tcp_udp_ratio_bytes |
| Direction Ratios | dir_ratio_packets, dir_ratio_bytes |
| Flow Characteristics | avg_duration, avg_ttl |
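The preprocessing stage in the architecture lists a value of 0.5 for ratio metrics. One plausible reading, sketched below, is that gaps in the ratio columns are filled with the neutral value 0.5 while all other columns are filled with 0. This is an assumption for illustration, not the repository's confirmed behavior: `RATIO_COLUMNS` and `fill_gaps` are hypothetical names, and `id_time` is assumed to be an integer time index.

```python
import pandas as pd

# Assumption: these four columns are ratios in [0, 1], so 0.5 is a neutral fill.
RATIO_COLUMNS = [
    "tcp_udp_ratio_packets", "tcp_udp_ratio_bytes",
    "dir_ratio_packets", "dir_ratio_bytes",
]

def fill_gaps(df: pd.DataFrame, expected_ids: range) -> pd.DataFrame:
    """Reindex on id_time and fill the rows that were missing (hypothetical helper)."""
    out = df.set_index("id_time").reindex(expected_ids)
    for column in out.columns:
        fill_value = 0.5 if column in RATIO_COLUMNS else 0
        out[column] = out[column].fillna(fill_value)
    return out.rename_axis("id_time").reset_index()

# Synthetic series with id_time 2 and 4 missing.
sample = pd.DataFrame({
    "id_time": [0, 1, 3],
    "n_flows": [10, 12, 9],
    "tcp_udp_ratio_packets": [0.8, 0.7, 0.9],
})
filled = fill_gaps(sample, range(5))
print(filled)
```

Filling counts with 0 treats a missing interval as "no traffic observed"; a mean or median fill may suit metrics such as `avg_ttl` better.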
- Python: 3.10.12 or higher
- pip: Latest version
- Operating System: Linux, macOS, or Windows
- RAM: Minimum 8GB (16GB recommended for large-scale analysis)
# Clone the repository
git clone https://github.com/KUNALSHAWW/TimeSeries-NetTraffic-Engine.git
cd TimeSeries-NetTraffic-Engine
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required dependencies
pip install pandas==2.2.2 \
numpy==1.24.4 \
matplotlib==3.8.0 \
scikit-learn==1.5.0 \
statsmodels==0.14.1 \
seaborn==0.13.0

# Create requirements.txt
cat > requirements.txt << EOF
pandas==2.2.2
numpy==1.24.4
matplotlib==3.8.0
scikit-learn==1.5.0
statsmodels==0.14.1
seaborn==0.13.0
EOF
# Install from requirements
pip install -r requirements.txt

import pandas as pd
import matplotlib.pyplot as plt
# Load time series data
df_times = pd.read_csv('cesnet-time-series-2023-2024/times/times_1_hour.csv')
df_times['time'] = pd.to_datetime(df_times['time'])
# Load network traffic data
df = pd.read_csv('cesnet-time-series-2023-2024/ip_addresses_sample/agg_1_hour/1/103.csv')
# Visualize n_flows metric
plt.figure(figsize=(15, 5))
plt.plot(df_times['time'], df['n_flows'])
plt.title('Network Flows Over Time')
plt.xlabel('Time')
plt.ylabel('Number of Flows')
plt.show()

Launch the interactive example notebook:
jupyter notebook example.ipynb

This notebook provides:
- Dataset loading and preprocessing
- Time series visualization
- SARIMA model training
- Forecasting and evaluation
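When a series has gaps, its data file can hold fewer rows than the timestamp file, so plotting one against the other by position may fail or misalign. Below is a minimal sketch of joining the two on a shared key. It assumes both files carry an `id_time` column: the advanced examples in this README use that name for the data files, but its presence in the timestamp file is an assumption. The frames are synthetic stand-ins for the CSV files.

```python
import pandas as pd

# Synthetic stand-ins for times_1_hour.csv and a per-IP data file.
df_times = pd.DataFrame({
    "id_time": [0, 1, 2, 3],
    "time": pd.to_datetime([
        "2023-10-09 00:00", "2023-10-09 01:00",
        "2023-10-09 02:00", "2023-10-09 03:00",
    ]),
})
df = pd.DataFrame({"id_time": [0, 1, 3], "n_flows": [120, 95, 143]})  # hour 2 is missing

# A left join keeps every timestamp; hours without data get NaN in n_flows.
aligned = df_times.merge(df, on="id_time", how="left")
print(aligned)
```

Plotting `aligned['time']` against `aligned['n_flows']` then shows a visible gap at the missing hour instead of shifting later points.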
python sarima_retraining.py \
-p 1 -d 1 -q 1 \
-P 1 -D 1 -Q 1 -M 168 \
-t 744 -T 168 \
--dataset ip_addresses_sample \
--aggregation agg_1_hour \
--metric n_flows \
--id_ip 1/103.csv

Parameters Explained:
- `-p, -d, -q`: ARIMA order (p=AR order, d=differencing, q=MA order)
- `-P, -D, -Q, -M`: Seasonal ARIMA order (M=seasonal period)
- `-t`: Training period (744 hours = 31 days)
- `-T`: Testing period (168 hours = 7 days)
- `--dataset`: Dataset part to use
- `--aggregation`: Temporal aggregation level
- `--metric`: Network metric to forecast
- `--id_ip`: Specific IP/entity identifier
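To make the training (`-t`) and testing (`-T`) periods concrete, the sketch below enumerates the index ranges a sliding-window retraining loop would visit. It is an illustration, not the logic of `sarima_retraining.py`: `sliding_windows` is a hypothetical helper, the stride of 168 is borrowed from the default configuration, and dropping a final partial window is an assumption.

```python
def sliding_windows(n_points, train_len=744, test_len=168, stride=168):
    """Yield (train_start, train_end, test_end) index triples, end-exclusive."""
    start = 0
    # Stop once a full train+test window no longer fits in the series.
    while start + train_len + test_len <= n_points:
        train_end = start + train_len
        yield start, train_end, train_end + test_len
        start += stride

# 40 weeks of hourly data, a length chosen only for illustration.
windows = list(sliding_windows(n_points=40 * 168))
print(len(windows))   # 35
print(windows[0])     # (0, 744, 912)
print(windows[-1])    # (5712, 6456, 6624)
```

Each triple maps to slices `series[train_start:train_end]` for fitting and `series[train_end:test_end]` for evaluation.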
For processing multiple time series in parallel on HPC clusters:
# Process all IP addresses in sample dataset
./metacentrum_scripts/sarima_retraining_ip_addresses_sample.sh
# Process all institutions
./metacentrum_scripts/sarima_retraining_institutions.sh
# Process all institution subnets
./metacentrum_scripts/sarima_retraining_institution_subnets.sh

from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# Configuration
ORDER = (1, 1, 1)
SEASONAL_ORDER = (1, 1, 1, 168)
TRAINING_PERIOD = 744
TESTING_PERIOD = 168
# Load and prepare data
df = pd.read_csv('your_time_series.csv')
train_data = df['n_flows'][:TRAINING_PERIOD]
# Train SARIMA model
model = SARIMAX(train_data, order=ORDER, seasonal_order=SEASONAL_ORDER)
results = model.fit(disp=False)
# Forecast
forecast = results.forecast(steps=TESTING_PERIOD)
# Evaluate
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(df['n_flows'][TRAINING_PERIOD:TRAINING_PERIOD+TESTING_PERIOD], forecast)
print(f'RMSE: {rmse:.2f}')

import matplotlib.pyplot as plt
metrics = ['n_flows', 'n_packets', 'n_bytes']
fig, axes = plt.subplots(len(metrics), 1, figsize=(15, 12))
for idx, metric in enumerate(metrics):
    axes[idx].plot(df['time'], df[metric])
    axes[idx].set_title(f'{metric} Over Time')
    axes[idx].set_ylabel(metric)
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

The default configuration is optimized for hourly network traffic data:
{
    "order": (1, 1, 1),                # (p, d, q) - ARIMA order
    "seasonal_order": (1, 1, 1, 168),  # (P, D, Q, M) - Seasonal ARIMA
    "training_period": 744,            # 31 days in hours
    "testing_period": 168,             # 7 days in hours
    "retraining_stride": 168           # Retrain every 7 days
}

| Parameter | Description | Typical Range | Notes |
|---|---|---|---|
| p | AR order | 0-5 | Number of lag observations |
| d | Differencing order | 0-2 | Number of differences for stationarity |
| q | MA order | 0-5 | Size of moving average window |
| P | Seasonal AR order | 0-2 | Seasonal autoregressive order |
| D | Seasonal differencing | 0-1 | Seasonal differencing degree |
| Q | Seasonal MA order | 0-2 | Seasonal moving average order |
| M | Seasonal period | 24, 168, 8760 | Hours in day/week/year |
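The seasonal period M is expressed in samples, so the same daily or weekly cycle maps to a different M at each aggregation level. A small sketch of that arithmetic follows. The keys `agg_10_minutes` and `agg_1_day` are assumed folder-style names modeled on `agg_1_hour`, and `seasonal_period` is a hypothetical helper.

```python
# Samples per day at each aggregation level.
# Keys other than "agg_1_hour" are assumed names, not confirmed by this README.
SAMPLES_PER_DAY = {"agg_10_minutes": 144, "agg_1_hour": 24, "agg_1_day": 1}

def seasonal_period(aggregation: str, cycle_days: int) -> int:
    """Number of samples in one seasonal cycle lasting `cycle_days` days."""
    return SAMPLES_PER_DAY[aggregation] * cycle_days

print(seasonal_period("agg_1_hour", 1))      # 24   (daily cycle, hourly data)
print(seasonal_period("agg_1_hour", 7))      # 168  (weekly cycle, hourly data)
print(seasonal_period("agg_10_minutes", 7))  # 1008 (weekly cycle, 10-minute data)
print(seasonal_period("agg_1_day", 7))       # 7    (weekly cycle, daily data)
```

Large values of M (for example 1008 or 8760) make SARIMA fitting considerably slower and need a training window spanning several full cycles.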
RMSE (Root Mean Squared Error): Measures the typical magnitude of prediction errors, computed as the square root of the mean squared error.
Lower is better • Sensitive to outliers • Same units as target variable
SMAPE (Symmetric Mean Absolute Percentage Error): Percentage-based metric that treats over- and under-estimation equally.
Range: 0-100% • 0% = perfect • Symmetric • Scale-independent
R² Score (Coefficient of Determination): Proportion of variance in the target variable explained by the model.
Range: -∞ to 1 • 1 = perfect prediction • 0 = no better than predicting the mean • Negative = worse than that baseline
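The three metrics can be written out in a few lines of plain Python. This is a sketch for reference, not the repository's evaluation code. The SMAPE variant shown divides by `|actual| + |predicted|` because that form is bounded by 100%, matching the range stated above; the exact formula used by the project is an assumption.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the units of the target variable."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def smape(actual, predicted):
    """SMAPE in percent, bounded in [0, 100]; 0/0 terms count as zero error."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom > 0:
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)

def r2(actual, predicted):
    """Coefficient of determination relative to a mean-only baseline."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
print(f"RMSE:  {rmse(actual, predicted):.2f}")    # RMSE:  10.00
print(f"SMAPE: {smape(actual, predicted):.2f}%")  # SMAPE: 2.56%
print(f"R2:    {r2(actual, predicted):.3f}")      # R2:    0.992
```

Note that `r2` is undefined for a constant series (`ss_tot` is zero), which can happen for idle IP addresses; such windows need separate handling.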
TimeSeries-NetTraffic-Engine/
│
├── example.ipynb                      # Interactive tutorial notebook
├── analyze-results.ipynb              # Results analysis and visualization
├── sarima_retraining.py               # CLI tool for SARIMA retraining
│
├── metacentrum_scripts/               # HPC batch processing scripts
│   ├── run_ip_addresses_sample.sh
│   ├── run_institutions.sh
│   ├── run_institution_subnets.sh
│   ├── sarima_retraining_ip_addresses_sample.sh
│   ├── sarima_retraining_institutions.sh
│   └── sarima_retraining_institution_subnets.sh
│
├── cesnet-time-series-2023-2024/      # Dataset directory (not included)
│   ├── times/                         # Timestamp files
│   ├── ip_addresses_sample/           # Sample IP dataset
│   ├── ip_addresses_full/             # Full IP dataset
│   ├── institutions/                  # Institution-level data
│   └── institution_subnets/           # Subnet-level data
│
├── results/                           # Output directory for predictions
│   └── sarima-retraining/
│       └── results/
│
├── LICENSE                            # BSD 3-Clause License
└── README.md                          # This file
import pandas as pd

def custom_fill_missing(train_df, train_time_ids, strategy='mean'):
    """
    Custom missing value imputation strategy.

    Args:
        train_df: Training dataframe with an id_time column
        train_time_ids: Expected time IDs (pandas Series)
        strategy: 'mean', 'median', 'zero', or 'forward_fill'
    """
    # Build placeholder rows for the time IDs absent from the training data
    df_missing = pd.DataFrame(columns=train_df.columns)
    df_missing["id_time"] = train_time_ids[~train_time_ids.isin(train_df.id_time)].values
    for column in train_df.columns:
        if column == "id_time":
            continue
        if strategy == 'mean':
            df_missing[column] = train_df[column].mean()
        elif strategy == 'median':
            df_missing[column] = train_df[column].median()
        elif strategy == 'zero':
            df_missing[column] = 0
        # Add more strategies as needed
    filled = pd.concat([train_df, df_missing]).sort_values(by="id_time").reset_index(drop=True)
    if strategy == 'forward_fill':
        # Placeholder rows are still NaN here; carry the last observation forward
        filled = filled.ffill()
    return filled[train_df.columns]

# Forecast all metrics for a single time series
metrics = ['n_flows', 'n_packets', 'n_bytes']
predictions = {}
for metric in metrics:
    model = SARIMAX(df[metric], order=(1,1,1), seasonal_order=(1,1,1,168))
    results = model.fit(disp=False)
    predictions[metric] = results.forecast(steps=168)
# Create prediction dataframe
predictions_df = pd.DataFrame(predictions)

from joblib import Parallel, delayed
def process_time_series(file_path, metric):
    """Process a single time series"""
    df = pd.read_csv(file_path)
    # ... training and prediction logic
    return predictions

# Process multiple files in parallel
results = Parallel(n_jobs=-1)(
    delayed(process_time_series)(file, 'n_flows')
    for file in file_list
)

| Operation | Time (Avg) | Memory | Notes |
|---|---|---|---|
| Load 1-hour dataset | ~2 seconds | 50 MB | Per IP address |
| SARIMA training (744 points) | ~5-10 seconds | 200 MB | Single metric |
| Forecast (168 points) | ~1 second | 50 MB | Using fitted model |
| Complete retraining cycle | ~2-5 minutes | 500 MB | Full year, single metric |
- Single IP Address: ~5 minutes for full analysis (all metrics)
- 100 IP Addresses: ~8 hours (with parallel processing)
- 1000 IP Addresses: ~3 days (recommended: HPC cluster)
| Scale | CPU | RAM | Storage |
|---|---|---|---|
| Small (< 100 time series) | 4 cores | 8 GB | 10 GB |
| Medium (100-1000 time series) | 16 cores | 32 GB | 50 GB |
| Large (1000+ time series) | 32+ cores | 64+ GB | 200+ GB |
We welcome contributions from the community! Here's how you can help:
- Report Bugs: Open an issue with detailed reproduction steps
- Suggest Features: Share your ideas for improvements
- Improve Documentation: Help make our docs clearer
- Submit Pull Requests: Contribute code improvements
# Fork and clone the repository
git clone https://github.com/KUNALSHAWW/TimeSeries-NetTraffic-Engine.git
cd TimeSeries-NetTraffic-Engine
# Create a development branch
git checkout -b feature/your-feature-name
# Make your changes and test thoroughly
# ...
# Commit with clear messages
git commit -m "Add: Description of your changes"
# Push to your fork
git push origin feature/your-feature-name
# Open a Pull Request on GitHub

- Follow PEP 8 for Python code
- Add docstrings to all functions
- Include type hints where applicable
- Write unit tests for new features
- Update documentation for API changes
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.
Copyright (c) 2024, CESNET
All rights reserved.
- pandas: BSD 3-Clause License
- NumPy: BSD License
- scikit-learn: BSD 3-Clause License
- statsmodels: BSD License
- matplotlib: PSF License
Special thanks to CESNET for providing the comprehensive CESNET-TimeSeries-2023-2024 dataset, which makes this research and development possible.
This work builds upon established research in:
- Time series forecasting
- Network traffic analysis
- SARIMA modeling
- Statistical learning
Built with ❤️ using:
- pandas - Data manipulation
- NumPy - Numerical computing
- scikit-learn - Machine learning
- statsmodels - Statistical models
- matplotlib - Visualization
- seaborn - Statistical visualization
If you use this framework in your research, please cite:
@software{timeseries_nettraffic_engine,
title = {TimeSeries-NetTraffic-Engine: Advanced Network Traffic Time Series Forecasting Framework},
author = {Kunal Shaw},
year = {2024},
url = {https://github.com/KUNALSHAWW/TimeSeries-NetTraffic-Engine},
note = {Based on CESNET-TimeSeries-2023-2024 dataset}
}

- Email: kunalshawkol17@gmail.com
- Discussions: GitHub Discussions
- Issues: GitHub Issues
Implemented:
- SARIMA forecasting implementation
- Multi-dataset support
- Comprehensive evaluation metrics
- Jupyter notebooks for exploration
- HPC batch processing scripts

Planned:
- LSTM/GRU deep learning models
- Prophet integration
- Real-time forecasting API
- Web-based visualization dashboard
- Automated hyperparameter tuning
- Anomaly detection module
- Multi-variate forecasting
- Ensemble methods
- Transfer learning across datasets
- Edge deployment capabilities
- Integration with network monitoring tools
If you find this project useful, please consider giving it a star ⭐ on GitHub!
Made with ❤️ by Data Scientists, for Data Scientists