
👩🏻‍🚀 16 – Data Mining: LLM Tabular Preprocessing with Dictionary Groups – Dictionary-Based Feature Grouping for LLM/AI Pipelines


[🇧🇷 Português] [🇬🇧 English]


Dictionary-Based Feature Grouping for LLM/AI Pipelines



Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)



Sponsor Quantum Software Development
















Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository




📚 Table of Contents

  1. Overview
  2. What is Dictionary-Based Feature Grouping?
  3. Why Use This Technique?
  4. Key Concepts
  5. Installation
  6. Quick Start
  7. Basic Examples
  8. Advanced Usage with LLMs
  9. Real-World Applications
  10. Project Structure
  11. Notebooks
  12. Dataset Resources
  13. References
  14. Contact



Overview


This repository demonstrates dictionary-based feature grouping for tabular data preprocessing, specifically designed for integration with Large Language Models (LLMs) and AI/ML pipelines.

The technique allows you to organize related columns (features) in a dataset using dictionaries, enabling:


  • Semantic grouping of features
  • Efficient preprocessing for LLM-based feature engineering
  • Better interpretability of tabular data
  • Streamlined data transformation pipelines



Tip

Perfect for: Data Scientists, ML Engineers, AI Researchers, and Students working with tabular data and LLMs!




What is Dictionary-Based Feature Grouping?

Dictionary-based feature grouping is a data preprocessing technique where you use Python dictionaries to organize and group related columns (features) in a DataFrame based on their semantic meaning or data type.

💡 Simple Explanation (For Beginners)

Imagine you have a dataset about customers with many columns:


age, income, city, state, country, purchase_date, product_name, price, ...

Instead of processing all columns individually, you can group them by meaning:


feature_groups = {
    'demographics': ['age', 'income'],
    'location': ['city', 'state', 'country'],
    'transaction': ['purchase_date', 'product_name', 'price']
}

This makes it easier to:


  1. Apply specific transformations to each group
  2. Feed organized data to LLMs
  3. Understand your dataset structure
  4. Create modular and maintainable code
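The first and third points can be sketched as follows. This is a minimal illustration, assuming a small made-up customer table with the column names used above; the helper name select_group is hypothetical and not part of the repository.

```python
import pandas as pd

# Made-up customer table using the column names from the example above.
df = pd.DataFrame({
    "age": [34, 52],
    "income": [4200.0, 6100.0],
    "city": ["Campinas", "Santos"],
    "state": ["SP", "SP"],
    "country": ["BR", "BR"],
    "purchase_date": ["2025-03-01", "2025-03-04"],
    "product_name": ["Keyboard", "Monitor"],
    "price": [199.9, 899.0],
})

feature_groups = {
    "demographics": ["age", "income"],
    "location": ["city", "state", "country"],
    "transaction": ["purchase_date", "product_name", "price"],
}


def select_group(frame, groups, name):
    """Return the sub-DataFrame for one group, failing early on unknown columns."""
    missing = [col for col in groups[name] if col not in frame.columns]
    if missing:
        raise KeyError(f"group {name!r} refers to missing columns: {missing}")
    return frame[groups[name]]


location = select_group(df, feature_groups, "location")
print(location.columns.tolist())  # ['city', 'state', 'country']
```

Checking the dictionary against the DataFrame up front surfaces typos in column names before any transformation runs.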



Why Use This Technique?


For Traditional ML

  • πŸ“¦ Organized Feature Engineering: Group numerical, categorical, and text features separately
  • βš›οΈ Pipeline Efficiency: Apply different transformers to different feature groups
  • 🧠 Better Understanding: Know which features belong together conceptually

For LLM Integration

  • πŸ€– Semantic Context: LLMs perform better when features are semantically grouped
  • πŸ’¬ Prompt Engineering: Create structured prompts with organized feature groups
  • πŸ”— Hybrid Models: Combine tabular data with LLM embeddings effectively
  • πŸš€ Feature Generation: Use LLMs to create new features from grouped columns



Key Concepts

1. Pandas GroupBy

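Core pandas functionality for splitting data into groups, applying a function to each group, and combining the results. Note that groupby groups rows by value, whereas the dictionaries in this repository group columns by name: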


df.groupby('category').agg({'value': 'mean'})
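A self-contained version of the one-liner above, with made-up data, might look like this:

```python
import pandas as pd

# Made-up data: two categories with two rows each.
df = pd.DataFrame({
    "category": ["A", "B", "A", "B"],
    "value": [10, 20, 30, 40],
})

# Split rows by 'category', take the mean of 'value', combine into one table.
summary = df.groupby("category").agg({"value": "mean"})
print(summary["value"].to_dict())  # {'A': 20.0, 'B': 30.0}
```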

2. Dictionary Mapping

Using dictionaries to define feature relationships:


column_mapping = {
    'group_name': ['col1', 'col2', 'col3']
}
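It is often useful to look up a column's group as well as a group's columns. The sketch below inverts such a mapping and rejects columns that appear in more than one group; the function name invert_mapping is hypothetical, and whether overlapping groups should be an error is a design assumption.

```python
column_mapping = {
    "demographics": ["age", "income"],
    "location": ["city", "state", "country"],
}


def invert_mapping(mapping):
    """Build a column -> group lookup, rejecting columns listed in two groups."""
    lookup = {}
    for group, columns in mapping.items():
        for col in columns:
            if col in lookup:
                raise ValueError(
                    f"column {col!r} is in both {lookup[col]!r} and {group!r}"
                )
            lookup[col] = group
    return lookup


column_to_group = invert_mapping(column_mapping)
print(column_to_group["state"])  # location
```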

3. LLM Feature Engineering

Leveraging LLMs to:

  • Generate text embeddings from grouped text columns
  • Create semantic features
  • Enrich tabular data with contextual information



Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Install Dependencies


# Clone the repository
git clone https://github.com/Quantum-Software-Development/16-DataMining_llm-tabular-preprocessing-dict-groups.git
cd 16-DataMining_llm-tabular-preprocessing-dict-groups

# Install requirements
pip install -r requirements.txt

Docker Setup (Optional)


# Build Docker image
docker build -t dict-groups-preprocessing .

# Run container
docker run -p 8888:8888 dict-groups-preprocessing



Quick Start

Basic Example


import pandas as pd

# Sample dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NY', 'LA', 'Chicago'],
    'salary': [50000, 60000, 70000],
    'department': ['IT', 'HR', 'IT']
}
df = pd.DataFrame(data)

# Define feature groups
feature_dict = {
    'personal': ['name', 'age'],
    'location': ['city'],
    'professional': ['salary', 'department']
}

# Process by group
for group_name, columns in feature_dict.items():
    print(f"\nProcessing {group_name}:")
    print(df[columns].head())

Output:


Processing personal:
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35

Processing location:
      city
0       NY
1       LA
2  Chicago

Processing professional:
   salary department
0   50000         IT
1   60000         HR
2   70000         IT
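The loop above only prints each group. A possible next step, sketched here under the assumption that each group has its own processing function, is a second dictionary that maps group names to transformations. The three lambdas are hypothetical examples, not steps taken from the notebooks.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "Chicago"],
    "salary": [50000, 60000, 70000],
    "department": ["IT", "HR", "IT"],
})

feature_dict = {
    "personal": ["name", "age"],
    "location": ["city"],
    "professional": ["salary", "department"],
}

# Hypothetical per-group steps; each takes and returns a DataFrame.
processors = {
    "personal": lambda g: g.assign(name=g["name"].str.upper()),
    "location": lambda g: g.assign(city=g["city"].str.lower()),
    "professional": lambda g: g.assign(salary=g["salary"] / 1000),
}

# Apply each group's processor to its columns, then stitch the groups together.
processed = pd.concat(
    [processors[group](df[cols]) for group, cols in feature_dict.items()],
    axis=1,
)
print(processed.loc[0, "name"], processed.loc[0, "salary"])  # ALICE 50.0
```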



Basic Examples

Example 1: Grouping by Data Type


import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Product A', 'Product B', 'Product C'],
    'price': [10.5, 20.0, 15.75],
    'category': ['Electronics', 'Clothing', 'Electronics']
})

# Group columns by data type
type_groups = {
    'numeric': df.select_dtypes(include=[np.number]).columns.tolist(),
    'categorical': ['category'],
    'text': ['name']
}
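In the example above, select_dtypes also places id in the numeric group, and the categorical and text groups are hard-coded. The variant below is one way to derive all three groups. The uniqueness heuristic is an assumption and will misclassify some real columns, so treat it as a starting point only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Product A", "Product B", "Product C"],
    "price": [10.5, 20.0, 15.75],
    "category": ["Electronics", "Clothing", "Electronics"],
})

# 'id' is numeric by dtype but is an identifier, so exclude it explicitly.
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop("id").tolist()

# Assumed heuristic: a non-numeric column with repeated values is categorical;
# one where every value is unique is treated as free text.
other_cols = df.select_dtypes(exclude=[np.number]).columns
type_groups = {
    "numeric": numeric_cols,
    "categorical": [c for c in other_cols if df[c].nunique() < len(df)],
    "text": [c for c in other_cols if df[c].nunique() == len(df)],
}
print(type_groups)
# {'numeric': ['price'], 'categorical': ['category'], 'text': ['name']}
```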



🤖 Advanced Usage with LLMs

LLM-Based Feature Generation


# Example: Using grouped text features for LLM prompts
text_groups = {
    'product_info': ['product_name', 'description'],
    'user_feedback': ['reviews']
}

# Create structured prompt for LLM
def create_llm_prompt(row, group_dict):
    prompt = ""
    for group_name, cols in group_dict.items():
        prompt += f"{group_name}: {', '.join([str(row[col]) for col in cols])}\n"
    return prompt
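To show how the function might be used, the sketch below applies an equivalent version to every row of a small made-up DataFrame. The sample data is an assumption, and the resulting strings would still need to be sent to an LLM of your choice.

```python
import pandas as pd

# Made-up single-row dataset with the columns named in text_groups.
df = pd.DataFrame({
    "product_name": ["Desk lamp"],
    "description": ["Warm LED light"],
    "reviews": ["Bright and sturdy"],
})

text_groups = {
    "product_info": ["product_name", "description"],
    "user_feedback": ["reviews"],
}


def create_llm_prompt(row, group_dict):
    """Render one line per group: 'group_name: value1, value2'."""
    lines = [
        f"{name}: {', '.join(str(row[col]) for col in cols)}"
        for name, cols in group_dict.items()
    ]
    return "\n".join(lines) + "\n"


# axis=1 passes each row to the function; extra keyword arguments are forwarded.
prompts = df.apply(create_llm_prompt, axis=1, group_dict=text_groups)
print(prompts.iloc[0])
# product_info: Desk lamp, Warm LED light
# user_feedback: Bright and sturdy
```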


🌐 Real-World Applications


  1. E-commerce: Group product features, pricing, and reviews
  2. Healthcare: Organize patient demographics, vitals, and medical history
  3. Finance: Separate transaction data, customer info, and risk factors
  4. NLP: Combine tabular + text data for hybrid models



📂 Project Structure


16-DataMining_llm-tabular-preprocessing-dict-groups/
│
├── Codes/
│   ├── notebooks_01_basic_example.ipynb
│   └── notebooks_02_llm_preprocessing.ipynb
│
├── requirements.txt
├── requirements-dev.txt
├── requirements-full.txt
├── requirements-lock.txt
├── pyproject.toml
├── .gitignore
├── LICENSE
├── README.md
└── README.pt_BR.md


Notebooks


1. notebooks_01_basic_example.ipynb

  • Introduction to dictionary-based grouping
  • Basic Pandas operations
  • Simple examples with sample data

2. notebooks_02_llm_preprocessing.ipynb

  • Advanced LLM integration
  • Feature generation using grouped data
  • Real-world dataset examples

πŸ‘‰ Open in Colab: Basic Example | LLM Preprocessing



Dataset Resources

The notebooks use publicly available datasets:



References

1. Chen, X., et al. (2024). LLM-based feature generation from text for interpretable machine learning. arXiv preprint. Retrieved from arxiv.org/html/2409.07132v2

2. DataCamp. (2024). Pandas GroupBy Explained: Syntax, Examples, and Tips. Retrieved from datacamp.com/tutorial/pandas-groupby

3. GeeksforGeeks. (2024). Pandas dataframe.groupby() Method. Retrieved from geeksforgeeks.org

4. Machine Learning Mastery. (2024). Feature Engineering with LLM Embeddings: Enhancing Scikit-learn Models. Retrieved from machinelearningmastery.com

5. McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd ed.). O'Reilly Media.

6. Pandas Documentation. (2024). Group by: split-apply-combine. Retrieved from pandas.pydata.org/docs/user_guide/groupby.html

7. VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.





🛸 My Contacts Hub






Copyright 2026 Quantum Software Development. Code released under the MIT License.
