Twitter Semantic Search Project

Semantic search over tweets using Latent Dirichlet Allocation (LDA)

Setting up the Environment

Install the dependencies with the command below, either inside an activated virtualenv or directly in the terminal. Then update the .boto file with valid AWS credential keys.

pip install -r Tweets_Collect/requirements.txt

Code Help on Tweets Collection

python Tweets_Collect/tweets_collect.py -h
usage: tt3.py [-h] [-s S] [-e E] date
Tweets Collection for TREC 2011
positional arguments:
    date
optional arguments:
    -h, --help  show this help message and exit
    -s Start file number
    -e End file number
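
For readers who want to see how such a command line could be defined, here is a minimal argparse sketch that produces a similar help message. It is not the code from tweets_collect.py; the function name `build_parser`, the integer types, and the default values are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser whose help text resembles the usage message above."""
    parser = argparse.ArgumentParser(
        description="Tweets Collection for TREC 2011")
    # Positional argument: the collection date, e.g. 20110128.
    parser.add_argument("date")
    # Optional file-number bounds. The int type and the None defaults
    # are assumptions; the real script may handle them differently.
    parser.add_argument("-s", type=int, help="Start file number")
    parser.add_argument("-e", type=int, help="End file number")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["20110128", "-s", "20", "-e", "88"])
print(args.date, args.s, args.e)
```

Passing an explicit list to `parse_args` keeps the example independent of how the script is launched.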

Run the Twitter Collection Code

python Tweets_Collect/tweets_collect.py 20110128 -s 20 -e 88

The above command collects the tweets whose TweetIDs are listed for the date 2011-01-28 in files 20 to 88 (inclusive) and writes two output files for each input file: one containing all the data, and one containing only the cleaned text (hashtags and hyperlinks removed, words stemmed with the Porter Stemmer).
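
The cleaning step described above can be sketched roughly as follows. This is an illustration, not the project's code: the function name `clean_tweet`, the regular expressions, the lowercasing, and the choice to drop each hashtag entirely (rather than only the `#` symbol) are all assumptions. The stemmer is passed in as a function so the sketch runs without extra packages.

```python
import re

# Hypothetical patterns; the project's own rules may differ.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text, stem=lambda word: word):
    """Remove hyperlinks and hashtags, then stem each remaining token."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Keep alphanumeric tokens only; lowercasing is an assumption.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stem(token) for token in tokens)


# With the default identity stemmer, words are left unstemmed.
print(clean_tweet("Reading about #LDA at http://example.com today"))
```

To apply Porter stemming, a stemmer such as NLTK's could be supplied, for example `clean_tweet(text, stem=PorterStemmer().stem)` after `from nltk.stem import PorterStemmer`, assuming NLTK is installed.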

Generating Sequence Files for Tweets

hadoop jar SequenceFileWrite.jar com.sarcasm.dpp.SequenceFileWriteDemo <InputFileNameContainingAllTweets> <OutputSeqFileName>

The above command accepts a file containing all the tweets and writes a sequence file in which each key is a TweetID and each value is the corresponding tweet. This sequence file forms the input for the second stage of the Mahout cvb program.
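
The key/value layout can be pictured with a small Python sketch. It does not write a Hadoop sequence file (the jar does that); it only shows how each input line could map to a (TweetID, tweet) pair. The tab-separated input format and the name `read_tweet_pairs` are assumptions, since the actual input layout is not specified here.

```python
def read_tweet_pairs(lines):
    """Yield (tweet_id, tweet_text) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the tweet survive.
        tweet_id, _, text = line.partition("\t")
        yield tweet_id, text


# Hypothetical sample input: one "TweetID<TAB>tweet" record per line.
sample = ["101\tfirst tweet\n", "\n", "102\tsecond tweet\n"]
pairs = list(read_tweet_pairs(sample))
print(pairs)
```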

Apache Lucene Scripts

To index all the files in a given directory with the Lucene indexer, run:

./indexDocuments.sh <command> <InputDirectory>

To run the queries in a queryfile against the Lucene index, use the following script:

./searchDocuments.sh <command/queryFile>

Apache Mahout

(Collapsed Variational Bayes Algorithm - CVB)

hadoop jar mahout-examples-0.8-job.jar org.apache.mahout.text.SequenceFilesFromDirectory -i /se/dataset -o /se/outputseq -xm sequential
./mahout seq2sparse -i /se/outputseq -o /se/outputsparsedvec --namedVector -wt tf
./mahout rowid -i /mahoutlda/outputsparsedvec/tf-vectors -o /mahoutlda/matrix

./mahout cvb -i /mahoutlda/matrix/matrix -o /mahoutlda/lda-output -mt /mahoutlda/ldaoutput/models -dt /mahoutlda/ldaoutput/docTopics -dict /mahoutlda/outputsparsedvec/dictionary.file-0 -k 400 -x 40 -ow

./mahout vectordump -i /mahoutlda/ldaoutput/docTopics -o /mahoutlda/ldaoutput/output-docTopics -p true -d /mahoutlda/outputsparsedvec/dictionary.file-0 -dt sequencefile -sort /mahoutlda/ldaoutput/docTopics 
./mahout vectordump -i /mahoutlda/ldaoutput/models/model-1 -o /mahoutlda/ldaoutput/output-model-1 -p true -d /mahoutlda/outputsparsedvec/dictionary.file-0 -dt sequencefile -sort /mahoutlda/ldaoutput/models/model-1

Post-Processing

The post-processing scripts run on the input folder that holds the term-topic matrix as part-files. The queryfile contains one test query per line; the scripts produce the top 1000 ranked tweets for each query.

python vectorize.py <input-path> <queryfile>
python CDS.py <input-path> <queryfile>
python output.py <input-path> <final-output-path>
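
The idea of ranking tweets for a query and keeping the top 1000 can be illustrated with a short sketch. The scripts themselves are not reproduced here, so their scoring function is unknown; cosine similarity over topic vectors is swapped in as an assumption, and the names `cosine`, `rank_tweets`, and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_tweets(query_vec, doc_topics, top_k=1000):
    """Return the ids of the top_k tweets most similar to the query vector."""
    scored = [(cosine(query_vec, vec), tweet_id)
              for tweet_id, vec in doc_topics.items()]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet_id for _, tweet_id in scored[:top_k]]


# Hypothetical two-topic distributions for three tweets.
doc_topics = {"t1": [0.9, 0.1], "t2": [0.1, 0.9], "t3": [0.5, 0.5]}
print(rank_tweets([1.0, 0.0], doc_topics, top_k=2))
```

With a real run, `doc_topics` would come from the dumped docTopics output and the query vector from projecting each query into the same topic space; both of those steps are outside this sketch.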
