Machine Learning projects coded using Python
Includes Data Cleaning and Feature Engineering files
The 'ProcessedCSV' file contains a spam dataset whose emails were picked manually based on subject line content and sender email address. Features extracted from this dataset can be used to train an ML algorithm to detect spam.
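To illustrate the kind of features that could be extracted from a subject line and a sender address, here is a minimal sketch. It is not the code from this repository: the function name `extract_features`, the `SPAM_WORDS` list, and the feature names are all assumptions made for the example.

```python
import re

# Hypothetical keyword list; the features actually used in the project may differ.
SPAM_WORDS = ("free", "winner", "offer", "urgent")


def extract_features(subject, sender):
    """Turn one email's subject line and sender address into numeric features."""
    subject_lower = subject.lower()
    letters = [c for c in subject if c.isalpha()]
    # Everything after the last "@" is treated as the sender's domain.
    domain = sender.rsplit("@", 1)[-1].lower() if "@" in sender else ""
    return {
        "subject_length": len(subject),
        "exclamation_count": subject.count("!"),
        # Share of alphabetic characters that are uppercase (0.0 if no letters).
        "uppercase_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
        # Whole-word, case-insensitive matches against the keyword list.
        "spam_word_count": sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", subject_lower))
            for word in SPAM_WORDS
        ),
        "sender_is_enron": int(domain == "enron.com"),
    }


features = extract_features("FREE offer!!! Act now", "promo@example.com")
```

Each email becomes one dictionary of numbers, so a list of these dictionaries can be converted into a feature table and passed to a classifier.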
Here is the link to the original Enron Email corpus: https://www.cs.cmu.edu/~enron/
The machine learning algorithms in the 'MLClassification Algorithms' notebook were used for spam detection. This is the code from a group project in my first machine learning course, and all three of us contributed to it. We obtained decent metrics when we tested the machine learning models. The notebook also contains a naive labelling function that searches for keywords and separates emails into different categories.
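A naive keyword-based labelling function of the kind described above might look like the following sketch. The categories, the keywords, and the name `label_email` are hypothetical and are not taken from the notebook.

```python
# Hypothetical categories and keywords, for illustration only.
CATEGORY_KEYWORDS = {
    "spam": ("free", "winner", "click here"),
    "meeting": ("meeting", "agenda", "schedule"),
    "finance": ("invoice", "payment", "trade"),
}


def label_email(text, default="other"):
    """Return the first category with a keyword found in the text."""
    lowered = text.lower()
    # Dictionaries keep insertion order, so earlier categories take priority.
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return default


labels = [label_email(t) for t in ("Agenda for Monday", "Click HERE to claim", "Lunch?")]
```

Because this uses plain substring matching, a keyword such as "free" would also match "freedom". That is part of what makes the approach naive, and it is one reason such labels are best treated as a rough first pass.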
As part of this report, I also ran a network analysis on the most frequent email recipients. The graph and the code can be found in the 'EnronEmployeeNetworkGraph' file.
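The counting step behind such a recipient network can be sketched with the standard library alone. The sample addresses and variable names below are made up for the example and do not reflect the contents of 'EnronEmployeeNetworkGraph'.

```python
from collections import Counter

# Hypothetical (sender, recipients) pairs standing in for parsed emails.
emails = [
    ("alice@enron.com", ["bob@enron.com", "carol@enron.com"]),
    ("bob@enron.com", ["carol@enron.com"]),
    ("alice@enron.com", ["carol@enron.com"]),
]

recipient_counts = Counter()  # how many emails each address received
edge_weights = Counter()      # how many emails went from sender to recipient

for sender, recipients in emails:
    for recipient in recipients:
        recipient_counts[recipient] += 1
        edge_weights[(sender, recipient)] += 1

# The most frequent recipients, most common first.
top_recipients = recipient_counts.most_common(2)
```

The weighted sender-to-recipient pairs in `edge_weights` form an edge list, which could then be loaded into a graph library such as NetworkX for drawing.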
A full report of our findings can be found here: https://fariakh973079136.files.wordpress.com/2021/01/finalml-project-report-1.pdf
The file titled 'Email Thread Processing' contains all the preprocessing functions used to get the email files into a usable state for text analysis. However, the notebook doesn't seem to render, so you cannot see the functions and their outputs unless you download it. If you are only interested in the functions, please open the 'EmailProcessingCode' Python file.
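As a rough idea of what getting a raw email into a usable state can involve, the sketch below splits one message into sender, recipients, subject, and body using Python's standard `email` package. The sample message and the name `parse_email` are assumptions for illustration. The functions in 'EmailProcessingCode' may work differently.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical raw message in the usual header/blank-line/body layout.
RAW_EMAIL = """\
From: alice@enron.com
To: bob@enron.com, carol@enron.com
Subject: Q3 numbers

Please see the attached figures.
"""


def parse_email(raw):
    """Split a raw single-part email into its main fields."""
    message = message_from_string(raw)
    # getaddresses handles comma-separated lists and "Name <addr>" forms.
    recipients = [addr for _, addr in getaddresses(message.get_all("To", []))]
    return {
        "sender": message["From"],
        "recipients": recipients,
        "subject": message["Subject"],
        "body": message.get_payload().strip(),
    }


parsed = parse_email(RAW_EMAIL)
```

This sketch assumes plain, single-part messages. Multipart emails and quoted replies in a thread would need extra handling.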
The file labelled 'NLPCode' contains the code used to preprocess email bodies into structured text. It includes code for tokenization, lemmatization, POS tagging, and POS-tree parsing.