ToMuchInfo (NAVER AI Hackathon)
Authors: 이상헌, 조용래, 박성남
Pipeline: Preprocessing - Tokenization - Feature Extraction - Embedding - Model - Ensemble (optional)
Preprocessing
- normalizers.py: corrects bad words and typos
- LSUV.py, ironyer.py: apply LSUV init (https://arxiv.org/pdf/1511.06422.pdf), sketched below
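A minimal LSUV-style sketch in PyTorch, for illustration only: orthogonal pre-initialization followed by rescaling each layer's weights toward unit output variance, as in the paper above. The function name, layer selection, tolerance, and iteration cap are assumptions, not the actual contents of LSUV.py.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, batch, tol=0.1, max_iter=10):  # hypothetical helper, not LSUV.py's API
    """Orthogonal pre-init, then rescale each layer toward unit output variance."""
    # Step 1: orthonormal initialization of all weight matrices.
    for module in model.modules():
        if isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            nn.init.orthogonal_(module.weight)

    # Step 2: layer by layer, scale weights by 1/std(output) until std is near 1.
    for module in model.modules():
        if not isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            continue
        captured = {}
        handle = module.register_forward_hook(
            lambda m, inp, out, c=captured: c.update(out=out))
        for _ in range(max_iter):
            model(batch)  # forward pass on one representative batch
            std = captured['out'].std().item()
            if std == 0.0 or abs(std - 1.0) < tol:
                break
            module.weight.data.div_(std)
        handle.remove()
    return model
```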
Tokenization
- DummyTokenizer: splits a sentence on spaces
- JamoTokenizer: splits text into jamos (sketched below)
- JamoMaskedTokenizer: splits text into jamos and masks movie and actor names
- TwitterTokenizer: tokenizes text using konlpy's Twitter module
- SoyNLPTokenizer: tokenizes text using SoyNLP's MaxScoreTokenizer
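For illustration, a minimal jamo-splitting sketch: a composed Hangul syllable (U+AC00-U+D7A3) decomposes arithmetically into initial/medial/final jamo. The function name and the pass-through of non-Hangul characters are assumptions, not JamoTokenizer's exact behavior.

```python
# Standard Unicode decomposition tables for composed Hangul syllables.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")           # 19 initials
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")      # 21 medials
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + empty

def jamo_tokenize(text):  # hypothetical name, not the repo's exact API
    """Split composed Hangul syllables into jamos; pass other characters through."""
    jamos = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:               # inside the composed syllable block
            cho, rest = divmod(code, 588)   # 588 = 21 medials * 28 finals
            jung, jong = divmod(rest, 28)
            jamos.append(CHOSEONG[cho])
            jamos.append(JUNGSEONG[jung])
            if JONGSEONG[jong]:             # skip the empty final
                jamos.append(JONGSEONG[jong])
        else:
            jamos.append(ch)                # spaces, digits, punctuation as-is
    return jamos
```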
Feature Extraction
- LengthFeatureExtractor: length of the tokens
- ImportantWordFeaturesExtractor: counts of negative words, profanity, and twist words (sketched below)
- MovieActorFeaturesExtractor: finds frequently mentioned actors/movies and one-hot encodes them
- AbnormalWordExtractor: one-hot encodes words that looked meaningful from manual inspection of the data
- SleepnessExtractor: number of expressions saying the movie was sleep-inducing
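A hedged sketch of the counting idea behind ImportantWordFeaturesExtractor. The word lists and function name below are placeholder examples, not the lists curated for the competition.

```python
# Placeholder word lists; the real curated lists are not reproduced here.
NEGATIVE_WORDS = {"최악", "별로", "지루하다"}
PROFANITY_WORDS = {"쓰레기"}
TWIST_WORDS = {"하지만", "그런데"}

def important_word_features(tokens):  # hypothetical name, not the repo's exact API
    """Return [negative, profanity, twist] word counts for one tokenized review."""
    return [
        sum(t in NEGATIVE_WORDS for t in tokens),
        sum(t in PROFANITY_WORDS for t in tokens),
        sum(t in TWIST_WORDS for t in tokens),
    ]
```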
Embedding
- RandomDictionary: simply maps each word to an index and returns it (sketched below)
- FastTextDictionary: loads pretrained FastText embeddings and embeds with them
- FastTextVectorizer: trains FastText on the train set and embeds with it
- Word2VecVectorizer: trains Word2Vec on the train set and embeds with it
- TfidfVectorizer: tf-idf vectorization fit on the train set using sklearn
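A minimal sketch of the RandomDictionary idea. The method names and the <pad>/<unk> conventions are assumptions, not necessarily the repo's.

```python
class RandomDictionary:  # method names and special tokens are assumptions
    """Map tokens to integer indices, reserving 0 for padding and 1 for unknowns."""

    def __init__(self):
        self.token2idx = {"<pad>": 0, "<unk>": 1}

    def build(self, tokenized_corpus):
        # Assign each previously unseen token the next free index.
        for tokens in tokenized_corpus:
            for token in tokens:
                self.token2idx.setdefault(token, len(self.token2idx))

    def indexize(self, tokens):
        unk = self.token2idx["<unk>"]
        return [self.token2idx.get(t, unk) for t in tokens]
```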
Model
- VDCNN: Very Deep Convolutional Networks for Text Classification
- WordCNN: Convolutional Neural Networks for Sentence Classification (sketched below)
- BiLSTM: Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling
- CNNTextInception: Merging Recurrence and Inception-Like Convolution for Sentiment Analysis
- DCNN-LSTM: our team's own architecture
- LSTM_Attention: Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification
- RCNN: Recurrent Convolutional Neural Networks for Text Classification
- TDSM: Character-Based Text Classification using Top Down Semantic Model for Sentence Representation
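As one concrete example, a minimal WordCNN in the spirit of the Kim (2014) paper cited above. The filter sizes, channel counts, and single-score regression head are illustrative assumptions, not the repo's exact model file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordCNN(nn.Module):  # a sketch of the cited architecture, not the repo's file
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 filter_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One 1-D convolution per filter width, max-pooled over time.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in filter_sizes])
        self.fc = nn.Linear(num_filters * len(filter_sizes), 1)  # one rating score

    def forward(self, x):                        # x: (batch, seq_len) token indices
        emb = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1)).squeeze(1)  # (batch,) scores
```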
Ensemble
- Average: run several models together each epoch and checkpoint each one when its validation loss stops improving; average the predictions of only the best-performing models
- XGBRegressor: find the best epoch for each model, run them all at once, and fit xgboost on their outputs to make the final prediction (both strategies are sketched below)
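A minimal sketch of both combination steps, assuming the per-model predictions have already been collected into arrays of shape (n_samples, n_models). The function names and XGBRegressor hyperparameters are illustrative, not tuned competition values.

```python
import numpy as np
from xgboost import XGBRegressor

def average_ensemble(preds):
    """Average strategy: mean of the selected models' predictions."""
    return np.mean(preds, axis=1)          # preds: (n_samples, n_models)

def xgb_ensemble(val_preds, val_targets, test_preds):
    """XGBRegressor strategy: learn to combine model outputs on validation data."""
    stacker = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
    stacker.fit(val_preds, val_targets)    # stack on held-out predictions
    return stacker.predict(test_preds)     # combined prediction for new data
```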