Spam Message Detection Classifier Project

Aaditya Bansal

Feb 26, 2023 · 2 min read

Introduction

Spam messages have long been a nuisance and remain a significant problem for individuals, organizations, and governments worldwide. In recent years, machine learning has been widely used to detect spam messages, often with high accuracy.

In this blog post, I walk through a text spam classification project built on the SMS Spam dataset available on Kaggle.

You can also view this project on GitHub: Data-Science-Projects/Spam Detection Classifier.ipynb at main · aadityab7/Data-Science-Projects (github.com)

The SMS Spam dataset is a collection of SMS messages that are labeled as either "spam" or "ham" (non-spam). The dataset contains 5,572 messages, of which 4,825 are labeled as ham and 747 as spam. The dataset is publicly available on Kaggle, a platform that hosts machine learning datasets and competitions.

Overview

The project involved building a classification model that could accurately distinguish between spam and ham messages. The project was divided into the following steps:

Data Extraction

I began by importing the dataset, checking for missing values, and removing duplicate messages.
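As a rough illustration of this step, here is a minimal sketch that uses only the Python standard library. The column names (label, text), the inline sample data and the load_messages helper are assumptions made for the example; they are not taken from the notebook, which may load the Kaggle file differently.

```python
import csv
import io

# Hypothetical stand-in for the Kaggle CSV: a label column and a text column.
SAMPLE_CSV = """label,text
ham,See you at lunch?
spam,WINNER! Claim your free prize now
ham,See you at lunch?
ham,
spam,Free entry in a weekly competition
"""

def load_messages(handle):
    """Read (label, text) rows, skipping rows with missing values and exact duplicates."""
    rows = []
    seen = set()
    n_missing = 0
    n_duplicates = 0
    for record in csv.DictReader(handle):
        label = (record.get("label") or "").strip()
        text = (record.get("text") or "").strip()
        if not label or not text:
            n_missing += 1  # a row without a label or without text is unusable
            continue
        key = (label, text)
        if key in seen:
            n_duplicates += 1  # exact repeat of an earlier row
            continue
        seen.add(key)
        rows.append(key)
    return rows, n_missing, n_duplicates

rows, n_missing, n_duplicates = load_messages(io.StringIO(SAMPLE_CSV))
print(f"kept={len(rows)} missing={n_missing} duplicates={n_duplicates}")
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample.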

EDA (Exploratory Data Analysis)

Exploratory data analysis helps to understand the data and identify patterns before modeling.

I used various visualizations and statistical tests to explore the class distribution, and created new features such as the number of characters, words, and sentences in each message to better understand the text data. One interesting finding from the EDA:

Spam messages are usually longer than ham messages, with more characters and words on average.
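The length features can be sketched roughly as follows. The three sample messages are invented for illustration, and the regex-based word and sentence counts are a simplification chosen to keep the example dependency-free; the notebook may use a proper tokenizer instead.

```python
import re
from statistics import mean

# Hypothetical sample; the real project uses the Kaggle SMS Spam dataset.
messages = [
    ("ham", "Ok lar. Joking with you."),
    ("ham", "See you at 5?"),
    ("spam", "WINNER! You have won a free prize. Call now to claim. Offer ends today!"),
]

def length_features(text):
    """Return (characters, words, sentences) using simple regex rules."""
    n_chars = len(text)
    n_words = len(re.findall(r"\w+", text))
    # Count runs of sentence-ending punctuation; a non-empty text counts as at least one sentence.
    n_sentences = max(1, len(re.findall(r"[.!?]+", text))) if text.strip() else 0
    return n_chars, n_words, n_sentences

def mean_features_by_label(rows):
    """Average each length feature separately for every label."""
    grouped = {}
    for label, text in rows:
        grouped.setdefault(label, []).append(length_features(text))
    return {
        label: tuple(mean(column) for column in zip(*feats))
        for label, feats in grouped.items()
    }

summary = mean_features_by_label(messages)
for label, (chars, words, sentences) in summary.items():
    print(f"{label}: chars={chars:.1f} words={words:.1f} sentences={sentences:.1f}")
```

Comparing the per-label averages is one simple way to check whether spam really tends to be longer than ham in a given sample.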

Data Cleaning and Preprocessing

The next step was to preprocess the data by cleaning it and transforming it into a format that a machine learning model could understand.

In addition to handling null values and removing duplicates, this involved the following steps:

Tokenization
Converting to lowercase
Removing special characters and punctuation
Removing stop words
Stemming
Vectorization (converting text data into numerical features)
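The text-cleaning steps in this list could look something like the sketch below. Two things here are assumptions made to keep the example self-contained: the tiny STOP_WORDS set and the toy_stem function, which is a crude suffix stripper standing in for a real stemmer such as the Porter algorithm. Lowercasing is done before tokenizing, which gives the same tokens as the order listed above. Vectorization is left out of this sketch.

```python
import re

# Small illustrative stop-word list; a real project would use a fuller one (assumption).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "and", "of", "in", "for", "have"}

# Toy suffix rules standing in for a real stemmer such as Porter's algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop punctuation and stop words, then stem."""
    lowered = text.lower()
    # Keeping only alphanumeric runs tokenizes and removes special characters in one pass.
    tokens = re.findall(r"[a-z0-9]+", lowered)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in kept]

print(preprocess("Congratulations!!! You have WON 2 free tickets, claiming is easy."))
# prints ['congratulation', 'won', '2', 'free', 'ticket', 'claim', 'easy']
```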

Feature Engineering

The next step was to extract features from the preprocessed data that could be used to train the machine learning model. This involved trying two techniques: bag of words and term frequency-inverse document frequency (TF-IDF). In this project, TF-IDF performed better than bag of words.
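To make the difference between the two representations concrete, here is a small from-scratch sketch. The three token lists are invented, and the smoothed IDF formula, ln((1 + N) / (1 + df)) + 1, is one common variant that I am assuming here; library implementations often also normalize each vector, which this sketch skips.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized messages (the kind of output a preprocessing step produces).
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["free", "free", "ticket"],
]

def bag_of_words(docs):
    """Return the sorted vocabulary and one raw-count vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tf_idf(docs):
    """Weight raw counts by a smoothed inverse document frequency."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]

vocab, counts = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab)
print(counts[0])
print([round(w, 3) for w in weights[0]])
```

In the first message, "prize" and "call" both have a count of 1, but "prize" gets the larger TF-IDF weight because it appears in fewer messages. That down-weighting of common terms is the main thing bag of words does not capture.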

Model Selection

The next step was to select an appropriate machine learning model that could accurately classify the messages as spam or ham. I experimented with various algorithms, including Naive Bayes, Support Vector Machines (SVM), and Random Forest classifiers.
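Of the algorithms mentioned, Naive Bayes is the easiest to show end to end. The following is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. The class name, the toy training messages and the use of raw token counts are all assumptions for illustration; this is not the notebook's implementation, and it does not cover SVM or Random Forest.

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token lists with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.vocab = {term for doc in docs for term in doc}
        self.term_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.total_terms = {c: sum(self.term_counts[c].values()) for c in self.class_counts}
        return self

    def _log_score(self, doc, label):
        # log P(label) + sum over known tokens of log P(token | label)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        denom = self.total_terms[label] + len(self.vocab)
        for term in doc:
            if term in self.vocab:  # ignore tokens never seen in training
                score += math.log((self.term_counts[label][term] + 1) / denom)
        return score

    def predict(self, doc):
        """Return the label with the highest log score for one tokenized message."""
        return max(self.class_counts, key=lambda label: self._log_score(doc, label))

# Hypothetical toy training data, not taken from the dataset.
train_docs = [
    ["free", "prize", "call", "now"],
    ["win", "free", "ticket"],
    ["see", "you", "at", "lunch"],
    ["call", "me", "when", "you", "arrive"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "ticket", "now"]))  # prints spam
print(model.predict(["see", "you", "soon"]))     # prints ham
```

Working in log space avoids multiplying many small probabilities together, and the add-one smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.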

Model Evaluation

Once the model was trained, the next step was to evaluate its performance on a held-out test set.

I used various evaluation metrics such as accuracy, precision, recall, and F1-score to assess the performance of the model.
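For reference, the four metrics can be computed from true and predicted labels as in this small sketch. The function name and the ten example labels are invented for illustration, and spam is treated as the positive class.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0  # of messages flagged as spam, how many were spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for ten test messages, invented for illustration.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Precision and recall matter here because the classes are imbalanced: when most messages are ham, a model can reach high accuracy while still missing a good share of the spam.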

Conclusion

After completing these steps, I was able to build a highly accurate spam detection model. The final model achieved an accuracy of 98.4%, which means that it correctly classified 98.4% of the test messages as either spam or ham, with a precision of 99.1%, a recall of 88%, and an F1-score of 93.4%.

Overall, this spam detection project on the SMS Spam dataset was an excellent learning experience. It gave me hands-on experience with machine learning algorithms, data preprocessing techniques, and model evaluation methods. I encourage anyone interested in machine learning to try this project, as it is a good way to gain practical experience in the field.