
# Computational Genomics Final Project - Document Similarity

## Prerequisites for the word2vec approach

1. python
2. the python module gensim
3. (if not running on a Mac) compiled word2vec and word2phrase binaries, placed in the bin/ directory

## Running a global alignment using a word2vec-based substitution matrix

1. Download the Memetracker cluster dataset into the data/ directory.
2. Train the word2vec vectors or use Google's pre-trained model:
   - download Google's pre-trained model here into the data/ directory, OR
   - run `python train_word2vec.py` to create one from the memetracker-cluster-dataset
3. If using Google's pre-trained model:
   1. gunzip the downloaded file in the data/ directory
   2. open drive_memecluster_align.py and uncomment the w2v_bin_fn assignment pointing to the Google filename on line 90
4. Run `python drive_memecluster_align.py` to create and print alignments (a sketch of the underlying idea follows this list).
   - By default it aligns against the phrase 'what does not kill us makes us stronger'; this can be changed by commenting out line 70 and uncommenting line 69.
   - The top aligned phrases per phrase are printed alongside their alignment scores.
   - The top aligned phrases are also stored as pickle files in data/.
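The idea is to score word substitutions by word2vec cosine similarity and run a global (Needleman-Wunsch style) alignment over word tokens. The sketch below is illustrative only, assuming Google's pre-trained vectors have been unzipped into data/; the model path, `sub_score`, and `global_align` are hypothetical names and not the repository's actual code in drive_memecluster_align.py.

```python
from gensim.models import KeyedVectors

# Assumed location of Google's pre-trained vectors after gunzipping into data/.
model = KeyedVectors.load_word2vec_format(
    "data/GoogleNews-vectors-negative300.bin", binary=True)

def sub_score(w1, w2):
    """Substitution score: word2vec cosine similarity when both words are known."""
    if w1 == w2:
        return 1.0
    if w1 in model and w2 in model:
        return float(model.similarity(w1, w2))
    return -1.0  # out-of-vocabulary words get a flat mismatch penalty

def global_align(a, b, gap=-0.5):
    """Needleman-Wunsch over word tokens; returns the alignment score."""
    a, b = a.split(), b.split()
    # dp[i][j] = best score aligning the first i words of a with the first j words of b
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp[len(a)][len(b)]

print(global_align("what does not kill us makes us stronger",
                   "that which does not kill us makes us stronger"))
```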

## Datasets

1. Memetracker cluster dataset - can be downloaded here.
2. Penn Paraphrase Database
3. Microsoft Research Paraphrases - available upon request

## Other Processing

### Prerequisites for the raw memetracker dataset (not used in the report)

1. python
2. the python module langid (detects English phrases, used for preprocessing)
3. the python module gensim
4. maven (used for preprocessing)
5. java (used for preprocessing)
6. (if not running on a Mac) word2vec and word2phrase binaries

### Preprocessing raw memetracker data (not used in the report)

- Download the raw phrase dataset.
- Copy the dataset into the data/ directory.
- cd into the nlp/ directory and run `mvn install`.
- Run `python preprocess_cornell_quotes.py` (this runs a custom pipeline built on Stanford's CoreNLP); it produces the lemmatized file of quotes.
- Run `python word2vec.py` (a rough sketch of this training step follows this list).
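For reference, training vectors from a lemmatized quotes file can be done entirely in gensim, without the word2vec C binaries. The sketch below assumes a plain-text file with one lemmatized quote per line; the file names and hyperparameters are illustrative and not taken from the repository's word2vec.py.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams one whitespace-tokenized sentence per line from disk.
sentences = LineSentence("data/lemmatized_quotes.txt")

# Hyperparameters here are placeholders, not the project's actual settings.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)

# Persist only the vectors in the standard word2vec binary format so they can
# later be loaded with KeyedVectors.load_word2vec_format(..., binary=True).
model.wv.save_word2vec_format("data/memetracker_vectors.bin", binary=True)
```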
