An end-to-end NLP pipeline that automates student support by classifying queries into High, Medium, or Low priority. Built with Scikit-Learn, it features a custom preprocessing engine (stemming, emoji handling, short-form expansion) and a multi-stage ColumnTransformer pipeline for seamless text and categorical data integration.
This project implements an end-to-end NLP pipeline designed to automate student support desk operations. It classifies incoming student queries into High, Medium, or Low priority levels based on the query text and the target department.
The system utilizes a custom preprocessing engine and a nested Scikit-Learn Pipeline architecture to handle text vectorization and categorical encoding simultaneously.
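As a rough illustration of that architecture, the sketch below wires a `TfidfVectorizer` and a `OneHotEncoder` into one `ColumnTransformer` and places a classifier behind it. The column names `query` and `department`, the toy rows, and the choice of `LogisticRegression` are assumptions made for the example, not details taken from the project's notebooks.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names; the real dataset may use different ones.
TEXT_COL = "query"
DEPT_COL = "department"


def build_pipeline() -> Pipeline:
    """Combine text and categorical features, then classify."""
    features = ColumnTransformer(
        transformers=[
            # A scalar column name passes a 1-D series of strings,
            # which is the input shape TfidfVectorizer expects.
            ("text", TfidfVectorizer(), TEXT_COL),
            # A list of column names passes a 2-D frame to OneHotEncoder.
            ("dept", OneHotEncoder(handle_unknown="ignore"), [DEPT_COL]),
        ]
    )
    return Pipeline(
        [("features", features), ("clf", LogisticRegression(max_iter=1000))]
    )


# Toy rows invented purely for illustration.
toy = pd.DataFrame(
    {
        TEXT_COL: [
            "exam portal is down and my test starts in an hour",
            "cannot pay fees before the deadline today",
            "need a copy of my transcript this week",
            "how do i change my elective course",
            "what are the library opening hours",
            "where is the sports office located",
        ],
        DEPT_COL: ["IT", "Finance", "Registrar", "Registrar", "Library", "Sports"],
    }
)
labels = ["High", "High", "Medium", "Medium", "Low", "Low"]

model = build_pipeline().fit(toy, labels)
prediction = model.predict(toy.iloc[[0]])
print(prediction[0])
```

Because the feature step and the classifier live in one object, the same `model` can be tuned, saved, and reloaded as a unit.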
- Custom NLP Preprocessor: Handles lowercasing, punctuation removal, short-form expansion (e.g., "asap" ➔ "as soon as possible"), emoji removal, and Porter Stemming.
- Nested Pipeline Architecture: Uses `ColumnTransformer` to manage text data (`TfidfVectorizer`) and categorical data (`OneHotEncoder`) in a single unified object.
- Automated Model Selection: Includes a benchmarking suite for Logistic Regression, Linear SVC, Random Forest, and Naive Bayes, with hyperparameter tuning via `GridSearchCV`.
- Pickle-Ready: Architecture designed for easy deployment via `joblib`.
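The preprocessing steps named in the first feature can be sketched as a single function. This is a minimal sketch, not the project's actual `Preprocess` class: the short-form table (apart from "asap"), the emoji ranges, and the order of the steps are assumptions. Stemming uses NLTK's `PorterStemmer` when NLTK is installed and is skipped otherwise.

```python
import re
import string

try:
    from nltk.stem import PorterStemmer

    _stem = PorterStemmer().stem
except ImportError:
    # Fallback so the sketch still runs without NLTK (no stemming applied).
    _stem = None

# Hypothetical short-form table; only "asap" appears in this README.
SHORT_FORMS = {"asap": "as soon as possible", "pls": "please"}

# Rough emoji ranges for illustration; not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str, stem: bool = True) -> str:
    """Lowercase, drop emoji and punctuation, expand short forms, then stem."""
    text = text.lower()
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Expand short forms token by token, then re-split the expanded phrases.
    expanded = " ".join(SHORT_FORMS.get(tok, tok) for tok in text.split())
    tokens = expanded.split()
    if stem and _stem is not None:
        tokens = [_stem(tok) for tok in tokens]
    return " ".join(tokens)


print(preprocess("Need my transcript ASAP!! 😭", stem=False))
# → need my transcript as soon as possible
```

Punctuation is removed before short-form expansion here so that tokens such as "asap!!" still match the lookup table.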
├── data/
│ └── University_Query.csv # Dataset
├── models/
│ ├── ModelPipeline.pkl # Trained Pipeline object
│ └── Label_Map.pkl # Numerical to Label mapping
├── notebooks/
│ ├── Pipelining.ipynb # Data analysis & Model training
│ └── TextPreprocessing.ipynb
├── src/
│ └── transformers.py # Custom Preprocess & Flattener classes
├── app.py # Streamlit Web Application
├── requirements.txt # Dependencies
└── README.md
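The two files under `models/` suggest a save-and-reload workflow along the following lines. This sketch trains a tiny stand-in pipeline, writes it and a label map to a temporary directory with `joblib`, and reads both back. The label map's shape (integer class id to priority name) and the toy training texts are assumptions; the real `ModelPipeline.pkl` may be structured differently.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumed shape of the label map: integer class id -> priority name.
label_map = {0: "Low", 1: "Medium", 2: "High"}

# Toy training data invented for illustration.
texts = [
    "what are the library opening hours",
    "please update my mailing address",
    "exam portal is down right now",
]
y = [0, 1, 2]

pipeline = Pipeline(
    [("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())]
).fit(texts, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "ModelPipeline.pkl"
    map_path = Path(tmp) / "Label_Map.pkl"

    # Persist both artifacts, then load them back as a deployment would.
    joblib.dump(pipeline, model_path)
    joblib.dump(label_map, map_path)
    restored = joblib.load(model_path)
    restored_map = joblib.load(map_path)

# Predict a numeric class id and translate it to a readable priority.
class_id = int(restored.predict(["exam portal is down"])[0])
priority = restored_map[class_id]
print(priority)
```

A pipeline that contains custom transformer classes, such as those in `src/transformers.py`, can only be unpickled when those classes are importable under the same module path at load time.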
Open your terminal or command prompt and run:
git clone https://github.com/your-username/university-query-priority.git
cd university-query-priority

It is highly recommended to use a virtual environment to avoid dependency conflicts:
python -m venv venv

Activate it on Windows:
venv\Scripts\activate

Activate it on macOS or Linux:
source venv/bin/activate

Install all required libraries and download the necessary NLTK data:
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

This project uses Streamlit for the frontend. To launch the web interface, run:
streamlit run app.py

If you wish to retrain the model or explore the data analysis, launch the Jupyter Notebook:
jupyter notebook notebooks/Pipelining.ipynb