Dictionary-Based Feature Grouping for LLM/AI Pipelines
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Professor Doctor in Mathematics Daniel Rodrigues da Silva
Note
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access Data Mining Main Repository
- Overview
- What is Dictionary-Based Feature Grouping?
- Why Use This Technique?
- Key Concepts
- Installation
- Quick Start
- Basic Examples
- Advanced Usage with LLMs
- Real-World Applications
- Project Structure
- Notebooks
- Dataset Resources
- References
- Contact
This repository demonstrates dictionary-based feature grouping for tabular data preprocessing, specifically designed for integration with Large Language Models (LLMs) and AI/ML pipelines.
The technique allows you to organize related columns (features) in a dataset using dictionaries, enabling:
- Semantic grouping of features
- Efficient preprocessing for LLM-based feature engineering
- Better interpretability of tabular data
- Streamlined data transformation pipelines
Tip
Perfect for: Data Scientists, ML Engineers, AI Researchers, and Students working with tabular data and LLMs!
Dictionary-based feature grouping is a data preprocessing technique where you use Python dictionaries to organize and group related columns (features) in a DataFrame based on their semantic meaning or data type.
Imagine you have a dataset about customers with many columns:
age, income, city, state, country, purchase_date, product_name, price, ...
Instead of processing all columns individually, you can group them by meaning:
feature_groups = {
    'demographics': ['age', 'income'],
    'location': ['city', 'state', 'country'],
    'transaction': ['purchase_date', 'product_name', 'price']
}
This makes it easier to:
- Apply specific transformations to each group
- Feed organized data to LLMs
- Understand your dataset structure
- Create modular and maintainable code
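As a minimal sketch of applying a specific transformation to each group, the snippet below pairs the `feature_groups` dictionary with a second dictionary of functions keyed by the same group names. The sample values and the helper names (`scale_min_max`, `clean_text`, `parse_dates`, `group_transforms`) are assumptions made for illustration and are not part of the repository.

```python
import pandas as pd

# Hypothetical sample data using the column names from the example above.
df = pd.DataFrame({
    "age": [25, 40, 55],
    "income": [30000.0, 60000.0, 90000.0],
    "city": [" sao paulo", "Campinas ", "santos"],
    "state": ["sp", "SP", "sp"],
    "country": ["br", "BR", "Br"],
    "purchase_date": ["2025-01-05", "2025-02-10", "2025-03-15"],
    "product_name": ["Laptop", "Phone", "Tablet"],
    "price": [3500.0, 2000.0, 1500.0],
})

feature_groups = {
    "demographics": ["age", "income"],
    "location": ["city", "state", "country"],
    "transaction": ["purchase_date", "product_name", "price"],
}


def scale_min_max(frame):
    """Rescale every numeric column to the [0, 1] range."""
    return (frame - frame.min()) / (frame.max() - frame.min())


def clean_text(frame):
    """Trim whitespace and upper-case every string column."""
    return frame.apply(lambda col: col.str.strip().str.upper())


def parse_dates(frame):
    """Convert purchase_date to datetime; leave the other columns untouched."""
    return frame.assign(purchase_date=pd.to_datetime(frame["purchase_date"]))


# One transformation per group, keyed by the same group names (illustrative).
group_transforms = {
    "demographics": scale_min_max,
    "location": clean_text,
    "transaction": parse_dates,
}

# Transform each group separately, then stitch the columns back together.
processed = pd.concat(
    [group_transforms[name](df[cols]) for name, cols in feature_groups.items()],
    axis=1,
)
print(processed)
```

Keeping the column lists and the transformations in two separate dictionaries means a column can move between groups without touching the transformation code.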
- Organized Feature Engineering: Group numerical, categorical, and text features separately
- Pipeline Efficiency: Apply different transformers to different feature groups
- Better Understanding: Know which features belong together conceptually
- Semantic Context: LLMs can make better use of features that are presented in semantically related groups
- Prompt Engineering: Create structured prompts with organized feature groups
- Hybrid Models: Combine tabular data with LLM embeddings effectively
- Feature Generation: Use LLMs to create new features from grouped columns
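These benefits depend on the group dictionary actually matching the dataset. The sketch below is one possible sanity check, written in plain Python; the function name `check_feature_groups` and the sample column names are hypothetical and are not taken from the repository.

```python
def check_feature_groups(columns, feature_groups):
    """Report columns missing from the data, left ungrouped, or grouped twice."""
    seen = {}           # column -> first group that listed it
    duplicated = set()  # columns listed in more than one group
    for group_name, group_cols in feature_groups.items():
        for col in group_cols:
            if col in seen:
                duplicated.add(col)
            seen.setdefault(col, group_name)
    return {
        "missing": sorted(set(seen) - set(columns)),
        "ungrouped": sorted(set(columns) - set(seen)),
        "duplicated": sorted(duplicated),
    }


# Hypothetical column list, e.g. taken from list(df.columns).
columns = ["age", "income", "city", "notes"]

feature_groups = {
    "numeric": ["age", "income"],
    "categorical": ["city", "country"],
    "text": ["review", "city"],
}

report = check_feature_groups(columns, feature_groups)
print(report)
```

Running a check like this before any transformation surfaces typos in column names early, instead of as a `KeyError` in the middle of a pipeline.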
Core Python/Pandas functionality for splitting, applying, and combining data:
df.groupby('category').agg({'value': 'mean'})
Using dictionaries to define feature relationships:
column_mapping = {
    'group_name': ['col1', 'col2', 'col3']
}
Leveraging LLMs to:
- Generate text embeddings from grouped text columns
- Create semantic features
- Enrich tabular data with contextual information
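To illustrate the first point, here is a minimal sketch that merges the text columns of each group into a single string per row, which is the usual input shape for an embedding model. The sample rows and the helper `join_text_columns` are assumptions for illustration; the embedding call itself is left as a commented placeholder because the repository does not prescribe a specific model or API.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "product_name": ["Laptop", "Phone"],
    "description": ["Lightweight 14-inch notebook", "Dual-SIM smartphone"],
    "reviews": ["Great battery life", None],
})

text_groups = {
    "product_info": ["product_name", "description"],
    "user_feedback": ["reviews"],
}


def join_text_columns(frame, columns, sep=" | "):
    """Concatenate the given columns row by row, skipping missing values."""
    return frame[columns].apply(
        lambda row: sep.join(str(v) for v in row if pd.notna(v)), axis=1
    )


# One combined text column per group, ready to pass to an embedding model.
text_features = pd.DataFrame(
    {name: join_text_columns(df, cols) for name, cols in text_groups.items()}
)
print(text_features)

# Placeholder only: embed() stands for whichever embedding function you use.
# embeddings = embed(text_features["product_info"].tolist())
```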
- Python 3.8+
- pip or conda
# Clone the repository
git clone https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups.git
cd 16-DataMining_llm-tabular-preprocessing-dict-groups
# Install requirements
pip install -r requirements.txt
# Build Docker image
docker build -t dict-groups-preprocessing .
# Run container
docker run -p 8888:8888 dict-groups-preprocessing

import pandas as pd
# Sample dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NY', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000],
    'department': ['IT', 'HR', 'IT']
}
df = pd.DataFrame(data)
# Define feature groups
feature_dict = {
    'personal': ['name', 'age'],
    'location': ['city'],
    'professional': ['salary', 'department']
}
# Process by group
for group_name, columns in feature_dict.items():
    print(f"\nProcessing {group_name}:")
    print(df[columns].head())

Processing personal:
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35

Processing location:
      city
0       NY
1       LA
2  Chicago

Processing professional:
   salary department
0   50000         IT
1   60000         HR
2   70000         IT

import pandas as pd
import numpy as np
# Create sample data
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Product A', 'Product B', 'Product C'],
    'price': [10.5, 20.0, 15.75],
    'category': ['Electronics', 'Clothing', 'Electronics']
})
# Group columns by data type
type_groups = {
    'numeric': df.select_dtypes(include=[np.number]).columns.tolist(),
    'categorical': ['category'],
    'text': ['name']
}

# Example: Using grouped text features for LLM prompts
text_groups = {
    'product_info': ['product_name', 'description'],
    'user_feedback': ['reviews']
}
# Create structured prompt for LLM
def create_llm_prompt(row, group_dict):
    prompt = ""
    for group_name, cols in group_dict.items():
        prompt += f"{group_name}: {', '.join([str(row[col]) for col in cols])}\n"
    return prompt
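A short usage sketch for the prompt builder follows. It restates `create_llm_prompt` with the same behavior so the snippet runs on its own, and applies it to a list of plain dictionaries; the record values are hypothetical. Because the function only indexes `row[col]`, it should work the same way on DataFrame rows, for example through `df.apply(..., axis=1)`.

```python
def create_llm_prompt(row, group_dict):
    """Render one record as 'group: value, value' lines, one line per group."""
    lines = []
    for group_name, cols in group_dict.items():
        values = ", ".join(str(row[col]) for col in cols)
        lines.append(f"{group_name}: {values}")
    return "\n".join(lines) + "\n"


text_groups = {
    "product_info": ["product_name", "description"],
    "user_feedback": ["reviews"],
}

# Hypothetical records; any mapping that supports row[col] works.
records = [
    {
        "product_name": "Laptop",
        "description": "Lightweight 14-inch notebook",
        "reviews": "Great battery life",
    },
    {
        "product_name": "Phone",
        "description": "Dual-SIM smartphone",
        "reviews": "Camera could be better",
    },
]

prompts = [create_llm_prompt(record, text_groups) for record in records]
print(prompts[0])
```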
- E-commerce: Group product features, pricing, and reviews
- Healthcare: Organize patient demographics, vitals, and medical history
- Finance: Separate transaction data, customer info, and risk factors
- NLP: Combine tabular + text data for hybrid models
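For any of these domains, the same group dictionary can also reshape a flat record into a nested structure, which is convenient when a prompt or API expects JSON. The sketch below uses the finance case; the column names, the values, and the helper `nest_by_group` are hypothetical and serve only as illustration.

```python
import json


def nest_by_group(record, feature_groups):
    """Turn a flat record into {group: {column: value}} using the group dictionary."""
    return {
        group: {col: record[col] for col in cols}
        for group, cols in feature_groups.items()
    }


# Hypothetical grouping for a finance dataset.
finance_groups = {
    "transaction": ["amount", "currency"],
    "customer": ["customer_id", "segment"],
    "risk": ["credit_score"],
}

# Hypothetical flat record, e.g. one row from df.to_dict("records").
record = {
    "amount": 250.0,
    "currency": "BRL",
    "customer_id": 42,
    "segment": "retail",
    "credit_score": 710,
}

nested = nest_by_group(record, finance_groups)
print(json.dumps(nested, indent=2))
```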
16-DataMining_llm-tabular-preprocessing-dict-groups/
│
├── Codes/
│   ├── notebooks_01_basic_example.ipynb
│   └── notebooks_02_llm_preprocessing.ipynb
│
├── requirements.txt
├── requirements-dev.txt
├── requirements-full.txt
├── requirements-lock.txt
├── pyproject.toml
├── .gitignore
├── LICENSE
├── README.md
└── README.pt_BR.md
- Introduction to dictionary-based grouping
- Basic Pandas operations
- Simple examples with sample data
- Advanced LLM integration
- Feature generation using grouped data
- Real-world dataset examples
Open in Colab: Basic Example | LLM Preprocessing
The notebooks use publicly available datasets:
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
- Kaggle Datasets: https://www.kaggle.com/datasets
- Hugging Face Datasets: https://huggingface.co/datasets
1. Chen, X., et al. (2024). LLM-based feature generation from text for interpretable machine learning. arXiv preprint. Retrieved from arxiv.org/html/2409.07132v2
2. DataCamp. (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from datacamp.com/tutorial/pandas-groupby
3. GeeksforGeeks. (2024). Pandas dataframe.groupby() Method. Retrieved from geeksforgeeks.org
4. Machine Learning Mastery. (2024). Feature Engineering with LLM Embeddings: Enhancing Scikit-learn Models. Retrieved from machinelearningmastery.com
5. McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). O'Reilly Media.
6. Pandas Documentation. (2024). Group by: split-apply-combine. Retrieved from pandas.pydata.org/docs/user_guide/groupby.html
7. VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
My Contacts Hub
Copyright 2026 Quantum Software Development. Code released under the MIT License.