This project focuses on classifying SMS messages as Spam or Ham (Not Spam) using various machine learning models and text processing techniques in Python. The pipeline includes preprocessing, exploratory data analysis, vectorization, model training, evaluation, and model saving.
- Reading and Understanding the Data
  - Load the dataset using `pandas`
  - Rename columns, clean data, handle duplicates and missing values
- Data Visualization
  - Use `matplotlib`, `seaborn`, and `wordcloud` to explore and understand message patterns
- Text Preprocessing
  - Tokenization, stopword removal, stemming, and punctuation removal using `nltk` and `string`
- Feature Extraction
  - Convert text to numerical features using `CountVectorizer` and `TfidfVectorizer`
- Model Building & Evaluation
  - Train and compare various machine learning models from `scikit-learn`
  - Evaluate using `accuracy_score`, `confusion_matrix`, and `precision_score`
- Model Saving
  - Save the best model and vectorizer using `pickle` for future predictions
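As a rough illustration of the first step in the outline above, the sketch below loads a small table with `pandas`, renames its columns, and drops missing and duplicate rows. The inline CSV, the raw column names `v1` and `v2`, and the helper name `load_and_clean` are assumptions made for this example; the project's actual file and column names may differ.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the real CSV file.
RAW_CSV = """v1,v2
ham,See you at lunch?
spam,WINNER!! Claim your free prize now
ham,See you at lunch?
ham,
"""


def load_and_clean(source):
    """Load messages, rename columns, and drop missing and duplicate rows."""
    df = pd.read_csv(source)
    # Assumed raw column names; adjust to match the real dataset.
    df = df.rename(columns={"v1": "target", "v2": "text"})
    print("missing values per column:")
    print(df.isnull().sum())
    print("duplicate rows:", df.duplicated().sum())
    df = df.dropna(subset=["text"])
    df = df.drop_duplicates(keep="first")
    return df.reset_index(drop=True)


df = load_and_clean(io.StringIO(RAW_CSV))
print(df["target"].value_counts())
```

With a real file, the same helper could be called with a path instead of the `io.StringIO` object.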
- pandas
  - `read_csv`, `drop`, `rename`, `apply`, `value_counts`, `describe`, `isnull`, `duplicated`, `drop_duplicates`
- numpy
  - For numerical operations (imported but not heavily used)
- matplotlib
  - `pyplot`, `figure`, `show`
- seaborn
  - `histplot`, `pairplot`, `heatmap`, `catplot`, `barplot`
- wordcloud
  - `WordCloud` class to visualize the most common words
- nltk (Natural Language Toolkit)
  - `word_tokenize`, `sent_tokenize`, `stopwords`, `PorterStemmer`, `download`
- string
  - Used to remove punctuation
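The `nltk` and `string` entries above could be combined into a single text-cleaning function along the following lines. This is a simplified sketch, not the project's exact code: to stay runnable without downloading NLTK resources, it splits on whitespace instead of calling `word_tokenize`, and it uses a tiny hand-picked stopword set instead of `stopwords.words("english")`. The function name `transform_text` is hypothetical.

```python
import string

from nltk.stem import PorterStemmer

# Assumption: a small illustrative stopword set. The project would more
# likely use nltk.corpus.stopwords after calling nltk.download("stopwords").
STOPWORDS = {"the", "a", "an", "is", "to", "you", "your", "have", "and"}

stemmer = PorterStemmer()


def transform_text(text):
    """Lowercase, strip punctuation, drop stopwords, and stem each token."""
    text = text.lower()
    # Remove every character listed in string.punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Simplification: whitespace split instead of nltk.word_tokenize.
    tokens = text.split()
    tokens = [token for token in tokens if token not in STOPWORDS]
    return " ".join(stemmer.stem(token) for token in tokens)


cleaned = transform_text("You have WON the prize!!! Call now.")
print(cleaned)
```

Applied to a DataFrame column, a function like this would typically be passed to `apply`.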
- Preprocessing
  - `LabelEncoder` for encoding target labels
- Data Splitting
  - `train_test_split` to split data into train and test sets
- Evaluation Metrics
  - `accuracy_score`, `confusion_matrix`, `precision_score`
- Models Used
  - Naive Bayes: `GaussianNB`, `MultinomialNB`, `BernoulliNB`
  - Logistic Regression: `LogisticRegression`
  - Support Vector Machine: `SVC`
  - Tree-Based Models: `DecisionTreeClassifier`, `RandomForestClassifier`, `ExtraTreesClassifier`
  - Ensemble Methods: `AdaBoostClassifier`, `BaggingClassifier`, `GradientBoostingClassifier`
  - Others: `KNeighborsClassifier`, `VotingClassifier`, `StackingClassifier`
- Feature Extraction
  - `CountVectorizer`, `TfidfVectorizer`
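The scikit-learn pieces listed above fit together roughly as in the sketch below: encode the labels, split the data, vectorize the text, train one model, and compute the three metrics. The messages are made up for illustration, and the choice of `TfidfVectorizer` with `MultinomialNB` is only one possible pairing, not necessarily the one the project settled on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Made-up messages for illustration only; not taken from the project's dataset.
texts = [
    "free prize call now", "win cash claim your free entry",
    "urgent free offer call today", "claim your cash prize now",
    "free entry win a prize", "are we meeting for lunch",
    "see you at home tonight", "can you call me later",
    "thanks for the lunch today", "i will be home late",
]
labels = ["spam"] * 5 + ["ham"] * 5

# LabelEncoder sorts classes alphabetically, so ham -> 0 and spam -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the vectorizer on the training split only, then reuse it on the test split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(accuracy, precision)
print(matrix)
```

Precision on the spam class matters here because a false positive means a legitimate message gets flagged as spam.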
- pickle
  - `dump()` to save the model and vectorizer
  - `load()` to load them later for predictions
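A minimal sketch of the save-and-reload round trip with `pickle` might look like this. The file names `vectorizer.pkl` and `model.pkl`, the temporary output directory, and the toy training data are assumptions for the example; the model shown is a stand-in for whichever one performed best.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration only (1 = spam, 0 = ham).
texts = ["free prize call now", "claim your free cash", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Hypothetical file names inside a temporary directory.
out_dir = Path(tempfile.mkdtemp())
with open(out_dir / "vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open(out_dir / "model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another process: load both objects and predict.
with open(out_dir / "vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(out_dir / "model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

new_message = ["free prize waiting, call now"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_message))
print(prediction)
```

The vectorizer has to be saved alongside the model, since new messages must be transformed with the same vocabulary the model was trained on. Pickle files should only be loaded from trusted sources.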
- collections
  - `Counter` class used for word frequency counts
- warnings
  - Used to ignore unnecessary warnings during model training
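For the `Counter` entry above, word frequencies could be computed along these lines. The messages are invented for the example, and splitting on whitespace is a simplification of whatever tokenization the project applies first.

```python
from collections import Counter

# Invented, already-cleaned spam messages for illustration only.
spam_messages = [
    "free prize call now",
    "call now to claim free cash",
    "free entry call",
]

# Flatten all messages into one list of words, then count occurrences.
words = [word for message in spam_messages for word in message.split()]
counts = Counter(words)

top_words = counts.most_common(2)
print(top_words)
```

The resulting `(word, count)` pairs are a convenient input for a bar plot of the most frequent spam or ham words.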
- Multiple models were trained and evaluated
- Accuracy and precision were used as evaluation metrics
- The best-performing model was saved using `pickle`
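The comparison summarized above can be sketched as a loop over candidate classifiers that records accuracy and precision for each. Only three of the listed models are shown, the data is made up, and ranking by precision first with accuracy as a tie-breaker is an assumption for this example, not a rule stated by the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up messages for illustration only (1 = spam, 0 = ham).
texts = [
    "free prize call now", "win cash claim your free entry",
    "urgent free offer call today", "claim your cash prize now",
    "free entry win a prize", "win a free phone today",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
    "i will be home late", "let us meet tomorrow morning",
]
labels = [1] * 6 + [0] * 6

X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

candidates = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
    }

# Assumed ranking rule: highest precision, then highest accuracy.
best_name = max(results, key=lambda n: (results[n]["precision"], results[n]["accuracy"]))
print(results)
print("best:", best_name)
```

On a dataset this small the scores say little; the loop structure is the point.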
Create a requirements.txt with: