congdaoduy298/Crawl-Data

I. CRAWL DATA

Crawl data from Goodreads using Selenium and Python.

Installation and Run

  1. Install Python 3.

  2. Clone this repository.

$ git clone https://github.com/congdaoduy298/Crawl-Data.git

  3. Install dependencies.

$ cd Crawl-Data/
$ pip3 install -r requirements.txt

  4. Run the crawler from a terminal.

$ python crawl_books.py
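A minimal sketch of the kind of Selenium logic crawl_books.py could contain; the list URL, the `a.bookTitle` selector, and the `parse_rating` helper are illustrative assumptions, not taken from the script itself.

```python
# Sketch of a Goodreads-style crawl with Selenium (assumed approach).
import re


def parse_rating(text):
    """Pull the average rating out of a Goodreads-style string.

    e.g. "4.25 avg rating - 1,234 ratings" -> 4.25
    """
    match = re.search(r"(\d+\.\d+)", text)
    return float(match.group(1)) if match else None


def crawl_titles(list_url):
    # Imported lazily so the pure helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(list_url)
        # "a.bookTitle" is a hypothetical selector for book links.
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, "a.bookTitle")]
    finally:
        driver.quit()
```

In practice the real script also handles pagination and saves the results, which is why a full run takes as long as reported below.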

Result

Total running time: 6181s

II. NAMED ENTITY RECOGNITION

Get Vietnamese NER using VnCoreNLP, and English NER using NLTK + spaCy.

Installation

  1. Python 3.4+ (< 3.8).

  2. Install all required libraries.

$ pip3 install -r requirements.txt

  3. Clone the VnCoreNLP repository and install vncorenlp.

$ git clone https://github.com/vncorenlp/VnCoreNLP

  4. Java 1.8+.

  5. Place the file VnCoreNLP-1.1.1.jar (27 MB) and the models folder (115 MB) in the same working folder.

  6. NLTK library (not needed if using the BERT-based model).

  7. spaCy library (not needed if using the BERT-based model).

$ python3 -m spacy download en_core_web_sm

Run

I. Use NLTK and spaCy

  1. Run the VnCoreNLP server.

$ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"

  2. Open a new terminal and run the script.

$ python3 get_ner.py
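A sketch of English NER in the spirit of get_ner.py; the spaCy call reflects the standard `doc.ents` API, while `merge_bio` is a hypothetical helper for collapsing token-level BIO tags (as produced by NLTK-style taggers) into entity spans.

```python
# Sketch: English NER helpers (spaCy usage is standard; merge_bio is assumed).
def merge_bio(tokens, tags):
    """Merge BIO-tagged tokens into (entity, label) spans.

    e.g. (["Barack", "Obama", "visited", "Hanoi"],
          ["B-PER", "I-PER", "O", "B-LOC"])
         -> [("Barack Obama", "PER"), ("Hanoi", "LOC")]
    """
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


def spacy_ner(text):
    # Requires: python3 -m spacy download en_core_web_sm
    import spacy
    nlp = spacy.load("en_core_web_sm")
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```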

II. Use BERT-base

  1. Get NER for Vietnamese sentences with VnCoreNLP.

$ vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner"
$ python3 get_vn_ner.py

  2. Use a Google Colab GPU and run all the code in Bert_NER.ipynb.
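A sketch of how get_vn_ner.py might query the VnCoreNLP server started above, using the vncorenlp Python wrapper; the exact script internals are assumptions. VnCoreNLP returns word-segmented tokens joined with underscores, which `desegment` undoes.

```python
# Sketch: Vietnamese NER via a running VnCoreNLP server (assumed approach).
def desegment(token):
    """Turn a VnCoreNLP word-segmented token back into plain text.

    e.g. "Hà_Nội" -> "Hà Nội"
    """
    return token.replace("_", " ")


def vn_ner(text):
    # Requires the server from step 1 listening on port 9000.
    from vncorenlp import VnCoreNLP
    annotator = VnCoreNLP(address="http://127.0.0.1", port=9000)
    # ner() returns, per sentence, a list of (word, NER-tag) pairs.
    return [[(desegment(word), tag) for word, tag in sentence]
            for sentence in annotator.ner(text)]
```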

REFERENCES

VnCoreNLP: A Vietnamese Natural Language Processing Toolkit

Named Entity Recognition with NLTK and SpaCy
