jimmy0000/libextract

 
 

libextract

libextract is a library for extracting text from HT/XML documents using a statistical, functionally pure approach. It originated from the eatiht repository.

From a very high-level perspective, the algorithm can be reduced to four steps:

  • Find the text nodes in the page.
  • Build a histogram of their parent nodes and text lengths.
  • Select the highest-scoring parent node.
  • Join the text under that node into a single string and return it as the result of the extraction.
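The four steps can be sketched with the standard library alone. This is a minimal illustration, not libextract's actual implementation: the function name extract_text is hypothetical, the score is assumed to be the total length of the text directly under each parent, and the input is assumed to be well-formed markup that xml.etree.ElementTree can parse.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def extract_text(markup):
    """Hypothetical sketch: return the text of the highest-scoring parent node."""
    root = ET.fromstring(markup)
    scores = Counter()  # the "histogram": parent element -> total text length
    texts = {}          # parent element -> its non-empty text pieces

    for parent in root.iter():
        # Step 1: the text nodes directly under `parent` are its own .text
        # plus the .tail of each of its children.
        pieces = [parent.text] + [child.tail for child in parent]
        pieces = [p.strip() for p in pieces if p and p.strip()]
        if pieces:
            # Step 2: score the parent by the length of its text (assumption).
            scores[parent] = sum(len(p) for p in pieces)
            texts[parent] = pieces

    if not scores:
        return ""

    # Step 3: select the highest-scoring parent.
    best, _ = scores.most_common(1)[0]
    # Step 4: join its text into a single string.
    return " ".join(texts[best])


sample = (
    '<html><body>'
    '<div id="nav">Home <a>About</a></div>'
    '<div id="main">A long paragraph of article text that dominates the page.</div>'
    '</body></html>'
)
print(extract_text(sample))
```

Real-world pages are rarely well-formed XML, so a production version would need a forgiving HTML parser in place of ElementTree.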

At the lowest level, libextract is just a pipelining library. It provides small, composable functions that can be piped together to process an HT/XML document.
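As a rough illustration of the piping idea, the sketch below chains small pure functions so that each one receives the previous one's result. The names pipeline and collect_text are hypothetical and are not taken from libextract's API.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def pipeline(*funcs):
    """Compose funcs left to right into a single callable (hypothetical helper)."""
    def run(data):
        # Feed the output of each function into the next one.
        return reduce(lambda acc, func: func(acc), funcs, data)
    return run


def collect_text(root):
    """Return the non-empty, stripped text fragments under root."""
    return [t.strip() for t in root.itertext() if t.strip()]


# Parse the markup, gather its text fragments, then join them into one string.
to_text = pipeline(ET.fromstring, collect_text, " ".join)
print(to_text("<p>Hello <b>world</b></p>"))
```

Because each stage is an ordinary function with no shared state, stages can be swapped or tested in isolation.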

Usage

from requests import get
from libextract import extract
from libextract.strategies import ARTICLE_NODE

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
text = extract(r.content)

# To get the HT/XML node:
node = extract(r.content, strategy=ARTICLE_NODE)
