16- Data Mining / LLM Tabular Preprocessing with Dictionary Groups

Dictionary-Based Feature Grouping for LLM/AI Pipelines

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)

Note

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.

🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix


📺 For better resolution, watch the video on YouTube.

Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

📚 Table of Contents

Overview
What is Dictionary-Based Feature Grouping?
Why Use This Technique?
Key Concepts
Installation
Quick Start
Basic Examples
Advanced Usage with LLMs
Real-World Applications
Project Structure
Notebooks
Dataset Resources
References
Contact

Overview

This repository demonstrates dictionary-based feature grouping for tabular data preprocessing, specifically designed for integration with Large Language Models (LLMs) and AI/ML pipelines.

The technique allows you to organize related columns (features) in a dataset using dictionaries, enabling:

Semantic grouping of features
Efficient preprocessing for LLM-based feature engineering
Better interpretability of tabular data
Streamlined data transformation pipelines

Tip

Perfect for: Data Scientists, ML Engineers, AI Researchers, and Students working with tabular data and LLMs!

What is Dictionary-Based Feature Grouping?

Dictionary-based feature grouping is a data preprocessing technique where you use Python dictionaries to organize and group related columns (features) in a DataFrame based on their semantic meaning or data type.

💡 Simple Explanation (For Beginners)

Imagine you have a dataset about customers with many columns:

age, income, city, state, country, purchase_date, product_name, price, ...

Instead of processing all columns individually, you can group them by meaning:

feature_groups = {
    'demographics': ['age', 'income'],
    'location': ['city', 'state', 'country'],
    'transaction': ['purchase_date', 'product_name', 'price']
}

This makes it easier to:

Apply specific transformations to each group
Feed organized data to LLMs
Understand your dataset structure
Create modular and maintainable code
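
As a minimal sketch of the first point, the loop below pairs each group with its own transformation. The sample values, the two helper functions, and the `transformers` mapping are illustrative assumptions and are not taken from the repository's notebooks.

```python
import pandas as pd

# Hypothetical sample data reusing some of the column names above.
df = pd.DataFrame({
    "age": [25, 40, 55],
    "income": [30000.0, 60000.0, 90000.0],
    "city": [" são paulo", "Campinas ", "santos"],
    "state": ["sp", "SP", "sp"],
})

feature_groups = {
    "demographics": ["age", "income"],
    "location": ["city", "state"],
}


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each numeric column to the [0, 1] range."""
    return (frame - frame.min()) / (frame.max() - frame.min())


def normalize_text(frame: pd.DataFrame) -> pd.DataFrame:
    """Strip surrounding whitespace and upper-case each text column."""
    return frame.apply(lambda col: col.str.strip().str.upper())


# One transformation per group (an assumed pairing, chosen for illustration).
transformers = {
    "demographics": min_max_scale,
    "location": normalize_text,
}

processed = df.copy()
for group_name, columns in feature_groups.items():
    processed[columns] = transformers[group_name](df[columns])

print(processed)
```

Adding a new group then only requires a new dictionary entry and a matching function; the loop itself does not change.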

Why Use This Technique?

For Traditional ML

📦 Organized Feature Engineering: Group numerical, categorical, and text features separately
⚛️ Pipeline Efficiency: Apply different transformers to different feature groups
🧠 Better Understanding: Know which features belong together conceptually
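
One way to realize the "different transformers per group" idea is scikit-learn's `ColumnTransformer`, which accepts `(name, transformer, columns)` triples. The sketch below builds those triples from a feature dictionary. It assumes scikit-learn is installed (the source does not say whether `requirements.txt` includes it), and the sample data and the `group_transformers` pairing are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample data.
df = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
    "department": ["IT", "HR", "IT", "Sales"],
})

feature_groups = {
    "numeric": ["age", "salary"],
    "categorical": ["department"],
}

# Assumed pairing of group name -> transformer.
group_transformers = {
    "numeric": StandardScaler(),
    "categorical": OneHotEncoder(handle_unknown="ignore"),
}

# Build one (name, transformer, columns) triple per dictionary entry.
preprocessor = ColumnTransformer(
    transformers=[
        (name, group_transformers[name], columns)
        for name, columns in feature_groups.items()
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # 2 scaled numeric columns + 3 one-hot columns
```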

For LLM Integration

🤖 Semantic Context: Presenting related features together can give an LLM clearer context than an unordered list of columns
💬 Prompt Engineering: Create structured prompts with organized feature groups
🔗 Hybrid Models: Combine tabular data with LLM embeddings effectively
🚀 Feature Generation: Use LLMs to create new features from grouped columns

📝 Key Concepts

1. Pandas GroupBy

Core pandas functionality for splitting rows into groups, applying a function to each group, and combining the results:

df.groupby('category').agg({'value': 'mean'})
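
Note that `groupby` groups rows by the values of a column, while the dictionaries used in this repository group columns by meaning. The two can be combined, as in the sketch below; the sample data and the `column_groups` name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "category": ["A", "B", "A", "B"],
    "value": [10, 20, 30, 40],
    "cost": [1.0, 2.0, 3.0, 4.0],
})

# Row-wise grouping: one output row per distinct category.
means = df.groupby("category").agg({"value": "mean"})
print(means)

# Column-wise grouping: the dictionary selects which columns to aggregate.
column_groups = {"amounts": ["value", "cost"]}
group_means = df.groupby("category")[column_groups["amounts"]].mean()
print(group_means)
```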

2. Dictionary Mapping

Using dictionaries to define feature relationships:

column_mapping = {
    'group_name': ['col1', 'col2', 'col3']
}
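
A mapping like this is easy to get out of sync with the actual DataFrame, for example after a column is renamed. The helper below is a hypothetical sketch (`validate_groups` is not part of the repository or of pandas) that checks the mapping against the available columns and returns the inverse lookup from column to group.

```python
from typing import Dict, List


def validate_groups(columns: List[str], mapping: Dict[str, List[str]]) -> Dict[str, str]:
    """Check a group mapping against the available columns.

    Returns the inverse lookup (column -> group name). Raises ValueError if a
    mapped column does not exist or is assigned to more than one group.
    """
    available = set(columns)
    column_to_group: Dict[str, str] = {}
    for group_name, group_columns in mapping.items():
        for col in group_columns:
            if col not in available:
                raise ValueError(f"Column {col!r} in group {group_name!r} not found")
            if col in column_to_group:
                raise ValueError(
                    f"Column {col!r} is in both {column_to_group[col]!r} and {group_name!r}"
                )
            column_to_group[col] = group_name
    return column_to_group


# With a DataFrame, pass list(df.columns) as the first argument.
columns = ["col1", "col2", "col3", "col4"]
column_mapping = {"group_name": ["col1", "col2", "col3"]}

column_to_group = validate_groups(columns, column_mapping)
ungrouped = [c for c in columns if c not in column_to_group]

print(column_to_group)
print(ungrouped)
```

Listing the ungrouped columns is a cheap way to spot features that were forgotten when the dictionary was written.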

3. LLM Feature Engineering

Leveraging LLMs to:

Generate text embeddings from grouped text columns
Create semantic features
Enrich tabular data with contextual information
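
The first item can be sketched as follows: join the text columns of a group into one string per row, embed that string, and append the vector as new numeric columns. The source does not name an embedding model, so `toy_embed` below is a hypothetical stand-in that only hashes the text; it carries no semantic meaning and would be replaced by a real model in practice. The sample data is also assumed.

```python
import hashlib
from typing import List

import pandas as pd


def toy_embed(text: str, dim: int = 4) -> List[float]:
    """Hypothetical placeholder for an embedding model.

    Maps text to `dim` deterministic values in [0, 1] taken from a SHA-256
    digest. The values are NOT semantically meaningful.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


# Hypothetical sample data.
df = pd.DataFrame({
    "product_name": ["Laptop", "T-shirt"],
    "description": ["15-inch screen", "Cotton, size M"],
    "price": [999.0, 19.9],
})

text_groups = {"product_info": ["product_name", "description"]}

enriched = df.copy()
for group_name, columns in text_groups.items():
    # Join the group's text columns into one string per row.
    combined = df[columns].astype(str).agg(" | ".join, axis=1)
    vectors = combined.apply(toy_embed).tolist()
    # One new column per embedding dimension, prefixed with the group name.
    emb_cols = [f"{group_name}_emb_{i}" for i in range(len(vectors[0]))]
    embedding_frame = pd.DataFrame(vectors, columns=emb_cols, index=df.index)
    enriched = pd.concat([enriched, embedding_frame], axis=1)

print(enriched.columns.tolist())
```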

Installation

Prerequisites

Python 3.8+
pip or conda

Install Dependencies

# Clone the repository
git clone https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups.git
cd 16-DataMining_llm-tabular-preprocessing-dict-groups

# Install requirements
pip install -r requirements.txt

Docker Setup (Optional)

# Build Docker image
docker build -t dict-groups-preprocessing .

# Run container
docker run -p 8888:8888 dict-groups-preprocessing

Quick Start

Basic Example

import pandas as pd

# Sample dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NY', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000],
    'department': ['IT', 'HR', 'IT']
}
df = pd.DataFrame(data)

# Define feature groups
feature_dict = {
    'personal': ['name', 'age'],
    'location': ['city'],
    'professional': ['salary', 'department']
}

# Process by group
for group_name, columns in feature_dict.items():
    print(f"\nProcessing {group_name}:")
    print(df[columns].head())

Output:

Processing personal:
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35

Processing location:
      city
0       NY
1       LA
2  Chicago

Processing professional:
   salary department
0   50000         IT
1   60000         HR
2   70000         IT

Basic Examples

Example 1: Grouping by Data Type

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Product A', 'Product B', 'Product C'],
    'price': [10.5, 20.0, 15.75],
    'category': ['Electronics', 'Clothing', 'Electronics']
})

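# Group columns by data type
type_groups = {
    # select_dtypes picks every numeric column, here 'id' and 'price'
    'numeric': df.select_dtypes(include=[np.number]).columns.tolist(),
    # 'category' and 'name' are both object dtype, so they are assigned by hand
    'categorical': ['category'],
    'text': ['name']
}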
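
When the non-numeric columns should not be listed by hand, a simple heuristic can split them. The sketch below treats a non-numeric column as categorical if any value repeats and as free text otherwise. Both the rule and the `infer_type_groups` name are assumptions made for illustration; real datasets usually need a tuned threshold.

```python
import pandas as pd


def infer_type_groups(frame: pd.DataFrame) -> dict:
    """Split columns into numeric, categorical, and text groups.

    Heuristic (an assumption): a non-numeric column whose values repeat
    is categorical; one whose values are all distinct is free text.
    """
    numeric = frame.select_dtypes(include="number").columns.tolist()
    non_numeric = [c for c in frame.columns if c not in numeric]
    categorical = [c for c in non_numeric if frame[c].nunique() < len(frame)]
    text = [c for c in non_numeric if c not in categorical]
    return {"numeric": numeric, "categorical": categorical, "text": text}


df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Product A", "Product B", "Product C"],
    "price": [10.5, 20.0, 15.75],
    "category": ["Electronics", "Clothing", "Electronics"],
})

type_groups = infer_type_groups(df)
print(type_groups)

# Each group can then be summarized in the way that suits its type.
print(df[type_groups["numeric"]].mean())
print(df[type_groups["categorical"]].nunique())
```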

🤖 Advanced Usage with LLMs

LLM-Based Feature Generation

# Example: Using grouped text features for LLM prompts
text_groups = {
    'product_info': ['product_name', 'description'],
    'user_feedback': ['reviews']
}

# Create structured prompt for LLM
def create_llm_prompt(row, group_dict):
    prompt = ""
    for group_name, cols in group_dict.items():
        prompt += f"{group_name}: {', '.join([str(row[col]) for col in cols])}\n"
    return prompt
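
To show how such a function might be applied to a whole DataFrame, the sketch below uses `DataFrame.apply` with `axis=1` to build one prompt per row. It defines a variant, `create_labeled_prompt`, that also prints each column name next to its value; that variant and the sample row are assumptions, not the repository's code.

```python
import pandas as pd

# Hypothetical sample row using the column names from the groups above.
df = pd.DataFrame({
    "product_name": ["Laptop"],
    "description": ["15-inch screen"],
    "reviews": ["Fast and quiet"],
})

text_groups = {
    "product_info": ["product_name", "description"],
    "user_feedback": ["reviews"],
}


def create_labeled_prompt(row, group_dict):
    """Build one prompt line per group, labeling each value with its column."""
    lines = []
    for group_name, cols in group_dict.items():
        fields = ", ".join(f"{col}={row[col]}" for col in cols)
        lines.append(f"{group_name}: {fields}")
    return "\n".join(lines)


# axis=1 passes each row to the function; group_dict is forwarded as a keyword.
prompts = df.apply(create_labeled_prompt, axis=1, group_dict=text_groups)
print(prompts.iloc[0])
```

Keeping the column names in the prompt costs a few tokens but makes it clear to the model which value is which.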


🌐 Real-World Applications

E-commerce: Group product features, pricing, and reviews
Healthcare: Organize patient demographics, vitals, and medical history
Finance: Separate transaction data, customer info, and risk factors
NLP: Combine tabular + text data for hybrid models

📂 Project Structure

16-DataMining_llm-tabular-preprocessing-dict-groups/
│
├── Codes/
│   ├── notebooks_01_basic_example.ipynb
│   └── notebooks_02_llm_preprocessing.ipynb
│
├── requirements.txt
├── requirements-dev.txt
├── requirements-full.txt
├── requirements-lock.txt
├── pyproject.toml
├── .gitignore
├── LICENSE
├── README.md
└── README.pt_BR.md


Notebooks

1. `notebooks_01_basic_example.ipynb`

Introduction to dictionary-based grouping
Basic Pandas operations
Simple examples with sample data

2. `notebooks_02_llm_preprocessing.ipynb`

Advanced LLM integration
Feature generation using grouped data
Real-world dataset examples

👉 Open in Colab: Basic Example | LLM Preprocessing

Dataset Resources

The notebooks use publicly available datasets:

UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Kaggle Datasets: https://www.kaggle.com/datasets
Hugging Face Datasets: https://huggingface.co/datasets

References

1. Chen, X., et al. (2024). LLM-based feature generation from text for interpretable machine learning. arXiv preprint. Retrieved from arxiv.org/html/2409.07132v2
2. DataCamp. (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from datacamp.com/tutorial/pandas-groupby
3. GeeksforGeeks. (2024). Pandas dataframe.groupby() Method. Retrieved from geeksforgeeks.org
4. Machine Learning Mastery. (2024). Feature Engineering with LLM Embeddings: Enhancing Scikit-learn Models. Retrieved from machinelearningmastery.com
5. McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). O'Reilly Media.
6. Pandas Documentation. (2024). Group by: split-apply-combine. Retrieved from pandas.pydata.org/docs/user_guide/groupby.html
7. VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.

💌 Let the data flow... Ping Me!

🛸 My Contacts Hub


Copyright 2026 Quantum Software Development. Code released under the MIT License.
