aflisiak/anlp19

Course materials for Applied Natural Language Processing (Spring 2019). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

| Date | Activity | Summary |
|------|----------|---------|
| 1/22 | Follow setup instructions in 0.setup/ | Install Anaconda and set up the class environment with specific Python libraries. |
| 1/24 | Complete 1.words/ExploreTokenization_TODO.ipynb before class | This notebook outlines several methods for tokenizing text into words (and sentences), including whitespace, nltk (Penn Treebank tokenizer), nltk (Twitter-aware), spaCy, and custom regular expressions, highlighting differences between them. |
| 1/24 | Execute 1.words/EvaluateTokenizationForSentiment.ipynb | This notebook evaluates different methods for tokenization and stemming/lemmatization and assesses their impact on binary sentiment classification, using a train/dev dataset of 1000 reviews sampled from the Large Movie Review Dataset. Each tokenization method is evaluated with the same learning algorithm (L2-regularized logistic regression); the only difference is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html |
| 1/24 | Complete 1.words/TokenizePrintedBooks_TODO.ipynb | Design a better tokenizer for printed texts that have been OCR'd (where words are often hyphenated at line breaks). |
| 1/29 | Complete 2.distinctive_terms/CompareCorpora_TODO.ipynb | This notebook explores two methods for comparing textual datasets to identify the terms that are distinctive to each one: difference of proportions (described in Monroe et al. 2009, Fighting Words, section 3.2.2) and the Mann-Whitney rank-sums test (described in Kilgarriff 2001, Comparing Corpora, section 2.3). |
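
To make the difference-of-proportions idea from the 1/29 activity concrete, here is a minimal sketch. It is not taken from the course notebook: the function name `difference_of_proportions` and the two toy corpora are made up for illustration, and tokenization is plain whitespace splitting. For each term it subtracts the term's relative frequency in corpus B from its relative frequency in corpus A, so positive scores lean toward A and negative scores toward B.

```python
from collections import Counter


def difference_of_proportions(tokens_a, tokens_b):
    """Map each term to (relative frequency in A) - (relative frequency in B).

    Hypothetical helper for illustration; assumes both token lists are non-empty.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    # Counter returns 0 for unseen terms, so terms missing from one corpus are fine.
    return {
        term: counts_a[term] / total_a - counts_b[term] / total_b
        for term in vocab
    }


# Toy corpora (invented for this example), tokenized on whitespace.
corpus_a = "the tax cut helps the economy".split()
corpus_b = "the health plan helps the families".split()

scores = difference_of_proportions(corpus_a, corpus_b)

# Print terms from most A-leaning to most B-leaning.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} {score:+.3f}")
```

On real corpora this raw difference tends to be dominated by very frequent words, which is one reason the notebook also looks at a rank-based test.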

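The line-break hyphenation problem mentioned for the printed-books activity can be sketched with two regular expressions. This is one possible approach under simple assumptions, not the notebook's solution: the names `LINE_BREAK_HYPHEN`, `TOKEN`, and `tokenize_ocr` are hypothetical, and the sample text is invented. The first pattern joins a word split by a hyphen at the end of a line; the second extracts words (allowing internal hyphens and apostrophes) and single punctuation marks.

```python
import re

# A hyphen at the end of a line, between two word characters: "remark-\nable".
LINE_BREAK_HYPHEN = re.compile(r"(\w)-\s*\n\s*(\w)")
# A word with optional internal apostrophes/hyphens, or one non-space symbol.
TOKEN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenize_ocr(text):
    """Rejoin words hyphenated across line breaks, then tokenize."""
    joined = LINE_BREAK_HYPHEN.sub(r"\1\2", text)
    return TOKEN.findall(joined)


page = "It was a remark-\nable, well-known book."
tokens = tokenize_ocr(page)
print(tokens)
# ['It', 'was', 'a', 'remarkable', ',', 'well-known', 'book', '.']
```

A known limitation of this sketch: a genuinely hyphenated compound that happens to break at the end of a line (for example "well-" followed by "known") is also joined into one unhyphenated word. Telling the two cases apart would need a dictionary or corpus lookup.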
Languages

  • Jupyter Notebook 82.7%
  • Python 17.3%