IMDb Movie Review Sentiment Analysis

A Machine Learning + Natural Language Processing (NLP) project that classifies IMDb movie reviews as Positive or Negative.

It goes beyond a basic ML model by integrating a Human-in-the-Loop (HITL) feedback system, allowing the model to improve over time using real user corrections. The project builds a complete NLP pipeline, compares multiple machine learning models, and deploys the best model using Streamlit.

Dataset Credit

IMDb Dataset of 50K Movie Reviews (Kaggle): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Live Demo

Streamlit version

Project Workflow


Raw Text
↓
NLP Preprocessing
↓
TF-IDF Feature Extraction
↓
Model Training
↓
Model Evaluation
↓
Best Model Selection
↓
Streamlit Deployment
↓
Model Improvement
↓
Feedback from Users
↓
Retraining After a Set Number of User Reviews
↓
About 90% Accuracy with Logistic Regression

NLP Preprocessing Pipeline

The text data was cleaned and normalized using the following steps:

HTML tag removal
Contraction expansion (don't → do not)
Lowercase conversion
Punctuation & special character removal
Tokenization
Stopword removal (while keeping negations like not, no, never)
Lemmatization

Example:


Original:
"I wasn't impressed with this movie!!!"

Processed:
not impress movie
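
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the stopword set is a tiny hand-written stand-in for a full stopword list such as NLTK's, the contraction rules cover only a few common cases, and the `lemmatize` argument is a hypothetical hook where a real lemmatizer would be plugged in.

```python
import re

# Negations are kept because they flip the sentiment of the words that follow.
NEGATIONS = {"not", "no", "never"}

# Tiny illustrative stopword set (assumption); a real pipeline would use a full list.
STOPWORDS = {"i", "was", "with", "this", "the", "a", "an", "is", "it",
             "and", "of", "to", "do"} - NEGATIONS


def expand_contractions(text):
    """Expand a few common English contractions (don't -> do not)."""
    text = re.sub(r"\bwon't\b", "will not", text, flags=re.IGNORECASE)
    text = re.sub(r"\bcan't\b", "can not", text, flags=re.IGNORECASE)
    return re.sub(r"n't\b", " not", text, flags=re.IGNORECASE)


def preprocess(text, lemmatize=lambda token: token):
    """Apply the cleaning steps in the order listed above."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tag removal
    text = expand_contractions(text)           # contraction expansion
    text = text.lower()                        # lowercase conversion
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation and special characters
    tokens = text.split()                      # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return " ".join(lemmatize(t) for t in tokens)        # lemmatization hook


# Toy lemmatizer for the example sentence only (assumption).
toy_lemmatizer = lambda token: {"impressed": "impress"}.get(token, token)
print(preprocess("I wasn't impressed with this movie!!!", toy_lemmatizer))
```

With the default identity lemmatizer the same call returns `not impressed movie`; the reduction to `impress` depends entirely on the lemmatizer that is supplied.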

Feature Engineering

Text data was converted into numerical form using TF-IDF vectorization.

Baseline model: max_features = 10000, giving a 50000 × 10000 feature matrix.

Tuned model: max_features = 45000, min_df = 10, max_df = 0.8, ngram_range = (1, 2), giving a 50000 × 45000 feature matrix.

TF-IDF measures how important a word is in a document relative to the entire dataset and is widely used for traditional NLP models.
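
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF with n-grams and document-frequency filtering. It is an illustration under stated assumptions rather than the project's implementation: it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2-normalized rows, treats `min_df` as an absolute count and `max_df` as a proportion, and does not implement the `max_features` cap.

```python
import math
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams (as space-joined strings) for n in the given range."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, ngram_range=(1, 2), min_df=1, max_df=1.0):
    """Compute L2-normalized TF-IDF weights for already-cleaned documents."""
    term_lists = [ngrams(doc.split(), ngram_range) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in term_lists for term in set(terms))

    # Drop terms that are too rare (min_df) or too common (max_df).
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c / n_docs <= max_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for terms in term_lists:
        counts = Counter(t for t in terms if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows


docs = ["not impress movie", "great movie", "bad movie"]
vocab, rows = tfidf(docs, ngram_range=(1, 2), min_df=1, max_df=0.8)
print(vocab)  # "movie" appears in every document, so max_df=0.8 removes it
```

The bigram range is what lets a phrase such as `not impress` become its own feature, which a unigram-only vocabulary would split into two unrelated words.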

Machine Learning Models Tested

Model	Accuracy
Naive Bayes	~0.85
KNN	~0.76
Decision Tree	~0.70
Random Forest	~0.85
Logistic Regression	0.891
SVM	0.893

Best Model

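Logistic Regression was selected as the best model and is the one deployed. Its accuracy (0.891) is effectively tied with SVM (0.893), the highest score in the table above.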

Model Evaluation

Models were evaluated using:

Accuracy
Precision
Recall
F1 Score
Confusion Matrix
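
For reference, all of these metrics follow from the four confusion-matrix counts. The sketch below computes them by hand for a binary task; the label strings and the sample predictions are made up for illustration and are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "positive", "negative", "negative",
          "negative", "positive", "negative", "positive"]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```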

Word Importance Analysis

Using model coefficients, the project identifies words that strongly influence sentiment predictions.

Positive words


excellent
amazing
great
love
perfect

Negative words


terrible
worst
boring
bad
waste
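
A ranking like the one above can be produced by sorting features by their coefficients. The sketch below assumes a linear model in which a positive coefficient pushes a prediction toward the positive class; with scikit-learn, the two inputs would typically come from `vectorizer.get_feature_names_out()` and `model.coef_[0]`. The coefficient values here are invented for illustration.

```python
def top_words(feature_names, coefficients, k=5):
    """Return the k most positive and k most negative features by coefficient."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    most_negative = [word for word, _ in ranked[:k]]
    most_positive = [word for word, _ in reversed(ranked[-k:])]
    return most_positive, most_negative


# Invented coefficients, not taken from the trained model.
features = ["excellent", "terrible", "movie", "great", "worst"]
coefs = [2.1, -2.4, 0.05, 1.7, -3.0]

positive, negative = top_words(features, coefs, k=2)
print("Positive:", positive)
print("Negative:", negative)
```

Words with coefficients near zero, such as `movie` in this toy input, carry little sentiment signal and do not appear in either list.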

Deployment

The trained model and TF-IDF vectorizer were saved using pickle and deployed with a Streamlit web application.

Run the app:


streamlit run app.py

Users can enter a movie review and get real-time sentiment predictions.
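
The Human-in-the-Loop feedback step could be wired in along the following lines. This is a hedged sketch, not the contents of app.py or retrain.py: the file name comes from the project structure, but the column names, the `record_feedback` and `should_retrain` helpers, and the threshold of 50 corrections are all assumptions.

```python
import csv
from pathlib import Path

FEEDBACK_FILE = Path("feedback_data.csv")  # file name from the project structure
RETRAIN_THRESHOLD = 50                      # hypothetical number of corrections


def record_feedback(review, predicted, corrected, path=FEEDBACK_FILE):
    """Append one user correction to the feedback CSV, adding a header if new."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # Column names are an assumption, not the project's actual schema.
            writer.writerow(["review", "predicted", "corrected"])
        writer.writerow([review, predicted, corrected])


def should_retrain(path=FEEDBACK_FILE, threshold=RETRAIN_THRESHOLD):
    """Return True once enough feedback rows have accumulated."""
    if not path.exists():
        return False
    with path.open(newline="", encoding="utf-8") as f:
        n_rows = sum(1 for _ in csv.DictReader(f))
    return n_rows >= threshold
```

In an app, `record_feedback` would be called when a user marks a prediction as wrong, and `should_retrain` would gate a call to the retraining script so the model is not refit after every single correction.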

Project Structure


Sentiment-Analysis-IMDb
├── app.py                     # Streamlit application
├── retrain.py                 # Script to retrain the model with feedback
├── trained_models/
│   ├── lr_model.pkl
│   └── tfidf_vectorizer.pkl
├── feedback_data.csv          # Stores user feedback
├── Cleaned_Reviews_data.csv   # Cleaned (preprocessed) reviews
├── imbd_Dataset.csv           # Original dataset
├── requirements.txt
└── README.md

Technologies Used

Python
Scikit-learn
NLTK
Pandas
NumPy
Matplotlib / Seaborn
Streamlit

Author

💖 Created with heart by Varun Kumar
B.Tech Computer Science Engineering
DAVIET JALANDHAR

Interested in:

Machine Learning
Natural Language Processing
AI Systems

Helpful for college students who want to understand the NLP pipeline and model evaluation.
