- python
- python module gensim
- (if not running on a mac) compiled word2vec and word2phrase binaries, placed in bin/ directory.
- Download Memetracker cluster dataset into data/ directory.
- Train the word2vec vectors or use Google's pre-trained model
- Download Google's pre-trained model here into data/ directory.
- OR run `python train_word2vec.py` to create one from memetracker-cluster-dataset
- If using Google's pre-trained model:
- gunzip downloaded file in data/ directory
- open `drive_memecluster_align.py` and uncomment the `w2v_bin_fn` line pointing to the Google filename (line 90)
- run `python drive_memecluster_align.py` to create and print alignments
- Currently aligns against the phrase 'what does not kill us makes us stronger' by default
- This can be changed by commenting out line 70 and uncommenting line 69
- the top aligned phrases for each phrase are printed alongside their alignment scores
- top aligned phrases are stored as pickle files in data/
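Since the top aligned phrases end up as pickle files in data/, a quick way to inspect them is to load each file and look at what it contains. The following is a minimal sketch, not part of the repository: the helper name `load_pickles` and the `*.pkl` file pattern are assumptions, and it makes no assumption about the structure of the pickled objects.

```python
import pickle
from pathlib import Path


def load_pickles(data_dir="data", pattern="*.pkl"):
    """Load every pickle file in data_dir that matches pattern.

    Hypothetical helper: the '*.pkl' pattern is an assumption, adjust it
    to whatever names drive_memecluster_align.py actually writes.
    Returns a dict mapping file name -> unpickled object.
    """
    results = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open("rb") as fh:
            results[path.name] = pickle.load(fh)
    return results
```

Calling `load_pickles()` from the repository root and printing `type(obj)` for each entry shows what was stored. If the files were written under Python 2 and are read under Python 3, `pickle.load` may need `encoding="latin1"`. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.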
1. Memetracker
- Memetracker cluster dataset can be downloaded here.
- Download from http://www.cis.upenn.edu/~ccb/ppdb/
- Available upon request
- python
- python langid module (detects English phrases, used for preprocessing)
- python module gensim
- maven (used for preprocessing)
- java (used for preprocessing)
- (if not running on a mac) word2vec and word2phrase binaries
- Download raw phrase dataset
- copy dataset into data/ directory
- cd to nlp/ directory, run `mvn install`
- run `python preprocess_cornell_quotes.py` (this actually runs a custom implementation of Stanford's CoreNLP)
- This produces the lemmatized file of quotes
- run `python word2vec.py`
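Before running the preprocessing steps above, it can help to confirm that the listed requirements are actually available. The sketch below is an illustration only and not part of the repository: the function names are hypothetical, and it checks only that the gensim and langid modules are importable and that `java` and `mvn` are on the PATH.

```python
import importlib.util
import shutil

# Names taken from the requirements list above.
PYTHON_MODULES = ["gensim", "langid"]
COMMAND_LINE_TOOLS = ["java", "mvn"]


def check_prerequisites(modules=PYTHON_MODULES, tools=COMMAND_LINE_TOOLS):
    """Return {name: bool}: True if a module is importable or a tool is on PATH."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


def missing(status):
    """Return the sorted names of prerequisites that were not found."""
    return sorted(name for name, ok in status.items() if not ok)
```

For example, `missing(check_prerequisites())` returns an empty list when everything was found. It does not check versions or the word2vec and word2phrase binaries.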