This project uses Python to build a spam filter with Naive Bayes, Support Vector Machine (SVM), and Neural Network classifiers, using the Enron Email Corpus (ham: 16454, spam: 17171; http://www.cs.cmu.edu/~enron/). Useful background information can be found at http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf.
scikit-learn http://scikit-learn.org/stable/
NLTK http://nltk.org/
BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
SciPy http://www.scipy.org/
NumPy http://www.numpy.org/
-
Loading the Enron email corpus into memory
-
Tokenizing files into words and storing them in lists
-
Feature extraction
-
Feature selection based on the words in the corpus
-
Training classifiers with the Naive Bayes, SVM, and AdaBoost algorithms
-
Evaluating the classifiers
-
Using the AdaBoost method to improve the accuracy
-
Checking results and improving the speed using NumPy & SciPy
1). more datasets 2). reduce dimensionality 3). add prior probabilities 4). optimize the program
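The tokenizing, feature selection, training, and evaluation steps above can be sketched end to end on a toy example. This is a minimal illustration, not the project's code: it uses only the Python standard library instead of scikit-learn and NLTK so that it runs without any dependencies, it covers only multinomial Naive Bayes (not SVM or AdaBoost), and the function names (`tokenize`, `select_vocabulary`, `train_naive_bayes`, `predict`, `accuracy`) and the four sample messages are hypothetical, made up for this example.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into a list of word tokens."""
    return TOKEN_RE.findall(text.lower())


def select_vocabulary(token_lists, max_features=1000):
    """Keep the max_features tokens that appear in the most documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {word for word, _ in doc_freq.most_common(max_features)}


def train_naive_bayes(token_lists, labels, vocabulary, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(t for t in tokens if t in vocabulary)

    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_counts.items():
        # Prior probability of the class, estimated from the training set.
        model["log_prior"][label] = math.log(n_docs / len(labels))
        denom = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / denom)
            for w in vocabulary
        }
    return model


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    tokens = tokenize(text)
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # Tokens outside the selected vocabulary are ignored.
        scores[label] = log_prior + sum(
            likelihood[t] for t in tokens if t in likelihood
        )
    return max(scores, key=scores.get)


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    correct = sum(predict(model, t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy messages standing in for the Enron corpus (hypothetical data).
train_texts = [
    "Win a free prize now, click here",
    "Cheap meds, limited offer, buy now",
    "Meeting moved to Monday, see agenda attached",
    "Please review the gas contract before the call",
]
train_labels = ["spam", "spam", "ham", "ham"]

token_lists = [tokenize(t) for t in train_texts]
vocabulary = select_vocabulary(token_lists, max_features=50)
model = train_naive_bayes(token_lists, train_labels, vocabulary)

print(predict(model, "free prize offer now"))
print(accuracy(model, train_texts, train_labels))
```

On real data the accuracy should be measured on held-out messages rather than on the training set as done here, and the dictionary-based likelihood table would normally be replaced by NumPy or SciPy sparse arrays for speed.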
-
Web Data Mining
-
Programming Collective Intelligence
-
Machine Learning in Action
-
SciPy and NumPy
-
Building Machine Learning Systems with Python