Prediction of Toxic Comments, Multinomial Classification

Team Members
	- Sanjay Roberts 
	- Jeff Coady
	- Chris Haddad

Project Submission:
	- EDA.ipynb
	- model.ipynb
	- toxic_cnn.ipynb
	- Presentation.ipynb

Justification	
	- Our problem involves NLP and predictive modeling.  Jeff Coady has a background in linguistics, Chris Haddad has experience with predictive modeling, and Sanjay Roberts has some experience with both NLP and modeling.
	- Further, Sanjay Roberts has experience with Kaggle competitions, and the rest of the team is eager to learn about the process of competing.


Problems/Motivation
	- Detect toxic comments and minimize unintended model bias
	- Our goal is to build a model that recognizes toxicity in comments and minimizes unintended bias with respect to mentions of identities. 
	- https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview




Libraries and Tools (What you already know and what else you need to evaluate)
	- Known Packages:
		+ matplotlib
		+ pandas
		+ tensorflow
		+ scikit-learn
		+ scipy
		+ imbalanced-learn
		+ keras
		+ ipython-autotime
		+ psutil
		+ nltk
		+ gensim
	- Potential Packages:
		+ elmo
		+ bert
		+ spacy
		+ textblob
	
Data Collection
	- We will be using a Kaggle dataset.  The dataset is labeled for identity mentions, and the competition optimizes a metric designed to measure unintended bias.
	- The dataset can be found here:
		+ https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
	
Features:
	+ male
	+ female
	+ transgender
	+ other_gender
	+ heterosexual
	+ homosexual_gay_or_lesbian
	+ bisexual
	+ other_sexual_orientation
	+ christian
	+ jewish
	+ muslim
	+ hindu
	+ buddhist
	+ atheist
	+ other_religion
	+ black
	+ white
	+ asian
	+ latino
	+ other_race_or_ethnicity
	+ physical_disability
	+ intellectual_or_learning_disability
	+ psychiatric_or_mental_illness
	+ other_disability
	
Literature review
	- Our intention is to begin by reviewing the leading kernels here:
		+ https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/kernels
	- And the following papers and articles:
		+ https://www.hergertarian.com/detecting-toxic-comments-with-multitask-deep-learning
			- multitask deep learning
		+ https://becominghuman.ai/my-solution-to-achieve-top-1-in-a-novel-data-science-nlp-competition-db8db2ee356a
			- kaggle and NLP
		+ https://arxiv.org/pdf/1802.09957.pdf
			- ensemble that outperforms individual deep or shallow models
		+ http://demo.clab.cs.cmu.edu/ethical_nlp/
			- ethics 
		+ https://medium.com/@nehabhangale/toxic-comment-classification-models-comparison-and-selection-6c02add9d39f
			- LSTM v CNN, tokenization, embedding
		
Required work before building the model
	- data cleansing
	- Tokenize and Pad
	- FastText Embedding
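
The tokenize-and-pad step above could look roughly like the following. This is a minimal pure-Python sketch, not the code in the notebooks: the helper names (`tokenize`, `build_vocab`, `encode_and_pad`), the reserved ids, and the choice to pad on the right are all assumptions made for illustration. In practice the Keras `Tokenizer` and `pad_sequences` utilities would likely do this work.

```python
import re
from collections import Counter

PAD_ID = 0  # assumed id reserved for padding
OOV_ID = 1  # assumed id reserved for words missing from the vocabulary


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts, max_words=20000):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(text, vocab, max_len):
    """Convert text to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [vocab.get(tok, OOV_ID) for tok in tokenize(text)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical comments, only to show the shapes involved.
comments = ["You are great", "you are not great at all"]
vocab = build_vocab(comments)
padded = [encode_and_pad(c, vocab, max_len=5) for c in comments]
```

Every row of `padded` has the same length, which is what an embedding layer (FastText or otherwise) expects as input.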

Predictive task and model details
	- multinomial classification of toxicity classes
	- seeking to use ensemble methods
	- grid search 
	- k-folds
	- imbalanced-learn
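
Because toxic comments are a small minority, plain k-folds can leave some folds with very few positive examples. One way to combine the k-folds and class-imbalance items above is a stratified split. The sketch below is a pure-Python illustration under assumed names (`stratified_kfold`, the toy `labels` list); a library splitter such as scikit-learn's `StratifiedKFold` would be the more likely choice in the notebooks.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with each class spread evenly over k folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's rows round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, val_idx


# Toy labels (an assumption for illustration): 10 toxic rows, 90 non-toxic rows.
labels = [1] * 10 + [0] * 90
splits = list(stratified_kfold(labels, k=5))
```

With these toy labels, each validation fold holds 20 rows, 2 of them toxic, so every fold sees the same class ratio as the full set.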
	
Model evaluation and selection strategy.
	- k-folds for cross validation
	- Matthews Correlation Coefficient as the evaluation metric
	- start with LSTM and CNN, move to ensemble
	- use kaggle submissions to test against other competitors
		+ retrain, resubmit to kaggle, evaluate rank, retrain, ...
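
To show why the Matthews Correlation Coefficient is a useful companion to plain accuracy on imbalanced data, here is a small worked sketch. It covers only the binary case (toxic vs. non-toxic), which is a simplifying assumption, and the function name `mcc` and the toy labels are hypothetical. The formula is MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and the sketch returns 0 when the denominator is zero.

```python
import math


def mcc(y_true, y_pred):
    """Binary Matthews Correlation Coefficient for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # When any marginal count is zero the coefficient is undefined; use 0 here.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data (an assumption): 90 non-toxic rows and 10 toxic rows.
y_true = [0] * 90 + [1] * 10
always_non_toxic = [0] * 100

baseline_accuracy = accuracy(y_true, always_non_toxic)
baseline_mcc = mcc(y_true, always_non_toxic)
perfect_mcc = mcc(y_true, y_true)
```

A model that never predicts "toxic" reaches 90% accuracy on this toy set but scores an MCC of 0, while a perfect model scores 1.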

How a user is going to test the final model.  Is there a webpage or command line interface?
	- TBD

Tentative timeline of activities.
	- Week of April 22nd
		+ read literature
	- Week of April 29th
		+ make ML ready dataset
	- Week of May 6th
		+ test LSTM using grid search for hyperparameter tuning, k-folds for bias analysis
	- Week of May 13th 
		+ test CNN, grid search, k-folds
	- Week of May 20th
		+ begin testing ensemble learning method
	- Week of May 27th
		+ continue testing ensemble learning method
	- Week of June 3rd
		+ final submission to kaggle with best results
	- Week of June 10th
		+ summarize results

