This project demonstrates how to build a sentiment analysis model that classifies customer reviews into positive and negative categories using machine learning.
First, the data is loaded into mdata and then preprocessed. During preprocessing, a corpus array holds the filtered words from the reviews column.
These filtered words are obtained by using the re module to strip out special characters and symbols, then converting the text to lowercase and splitting it into words.
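
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the `clean_review` helper and the sample reviews are hypothetical, and the regular expression assumes that only ASCII letters are kept.

```python
import re


def clean_review(text: str) -> list[str]:
    """Replace every non-letter character with a space, lowercase, and split into words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return letters_only.lower().split()


# Hypothetical sample reviews; the notebook reads them from the reviews column.
reviews = ["Great product!!! 10/10", "Not good :( would NOT buy again."]

# corpus holds one cleaned, space-joined string per review.
corpus = [" ".join(clean_review(review)) for review in reviews]
print(corpus)
```

Joining the words back into one string per review keeps the corpus in the form that scikit-learn vectorizers expect as input.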
The most important step is feature extraction with TF-IDF, which improves the model's ability to distinguish between positive and negative sentiment. A LogisticRegression model is then trained on the resulting vectors.
One point considerably affects the outcome. With CountVectorizer(), the confusion matrix shows about 14 false negatives and 71 false positives, so the model's errors are dominated by false positives. TF-IDF reduces the false positives to 6, which is why TF-IDF is used for vectorization.
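
A comparison of this kind could be set up along the following lines. This is only a sketch under stated assumptions: the tiny in-memory dataset and the `false_counts` helper are hypothetical, and the counts it produces will not match the figures quoted above, which come from the project's own data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical toy data (1 = positive, 0 = negative).
train_texts = [
    "great product works well",
    "love it excellent quality",
    "very happy with this purchase",
    "not good waste of money",
    "terrible quality broke quickly",
    "bad product do not buy",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["excellent product", "not happy terrible", "good quality", "bad purchase"]
test_labels = [1, 0, 1, 0]


def false_counts(vectorizer):
    """Train on vectors from the given vectorizer and count test-set errors."""
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, train_labels)
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(
        test_labels, model.predict(X_test), labels=[0, 1]
    ).ravel()
    return {"false_positives": int(fp), "false_negatives": int(fn)}


results = {
    "CountVectorizer": false_counts(CountVectorizer()),
    "TfidfVectorizer": false_counts(TfidfVectorizer()),
}
for name, counts in results.items():
    print(name, counts)
```

Running both vectorizers through the same model and the same test split keeps the comparison fair, since the vectorizer is the only thing that changes.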
- Sentiment_analysis_Logistic.ipynb: Jupyter notebook containing the full workflow (preprocessing, feature extraction, model training, and evaluation).
- TestReviews.csv: Test dataset with sample reviews and sentiment labels.
- Data Preparation
  - Load training data from text files.
  - Each review has an assigned label: 1 → Positive, 0 → Negative.
- Preprocessing
  - Convert text to lowercase
  - Remove special characters and numbers
  - Remove stopwords (keeping negations like not)
  - Apply stemming
- Feature Extraction
  - Transform reviews into numerical vectors using TF-IDF or CountVectorizer.
- Model Training
  - Use Logistic Regression to classify reviews.
  - Use balanced class weights to handle uneven class distribution.
- Evaluation
  - Tested on TestReviews.csv.
  - Metrics used: Accuracy, Confusion Matrix.
  - Accuracy: ~85–90% (depending on parameter tuning)
  - The model identifies both positive and negative reviews correctly, with errors balanced between the two classes.
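
The training and evaluation steps listed above might look roughly like this. It is a minimal sketch, not the notebook's code: the in-memory reviews stand in for the real training files and for TestReviews.csv (whose column names are not assumed here), and stopword removal and stemming are left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training reviews (1 = positive, 0 = negative).
# The classes are deliberately uneven to show where class_weight matters.
train_texts = [
    "love this phone",
    "great quality and fast delivery",
    "works perfectly very happy",
    "excellent value",
    "not good at all",
    "terrible battery life",
]
train_labels = [1, 1, 1, 1, 0, 0]

# Hypothetical stand-in for the rows of TestReviews.csv.
test_texts = ["very happy with the quality", "battery is terrible", "not great"]
test_labels = [1, 0, 0]

# TF-IDF vectorization followed by logistic regression with balanced class weights.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
# Rows are true labels and columns are predicted labels, both ordered [0, 1].
matrix = confusion_matrix(test_labels, predictions, labels=[0, 1])

print(f"Accuracy: {accuracy:.2f}")
print(matrix)
```

Setting `class_weight="balanced"` makes scikit-learn weight each class inversely to its frequency, so the smaller class is not drowned out during training.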