libextract

libextract is a library for extracting text out of HT/XML documents using a statistical, functionally pure approach. It originated from the eatihit repository.

From a very high-level perspective, the algorithm can be reduced to four steps:

1. Find the text nodes in the page.
2. Build a histogram of their parent nodes, scored by text length.
3. Select the highest-scoring parent node.
4. Join the text under that node into a single string and return it as the result of the extraction.
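
The steps above can be sketched in a few lines of plain Python. This is only an illustration and not libextract's actual implementation: it uses the standard library's xml.etree.ElementTree (so the input must be well-formed markup), and the function name extract_text_sketch is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def extract_text_sketch(markup):
    """Return the text held directly by the highest-scoring parent node."""
    root = ET.fromstring(markup)
    scores = Counter()  # parent element -> total length of its direct text
    texts = {}          # parent element -> list of its direct text pieces

    for parent in root.iter():
        # Step 1: text directly inside `parent` is its own .text plus the
        # .tail of each child (text that follows a child's closing tag).
        pieces = [parent.text] + [child.tail for child in parent]
        pieces = [p.strip() for p in pieces if p and p.strip()]
        if pieces:
            # Step 2: score the parent by how much text it holds.
            scores[parent] = sum(len(p) for p in pieces)
            texts[parent] = pieces

    if not scores:
        return ""

    # Step 3: pick the highest-scoring parent.
    best, _ = scores.most_common(1)[0]
    # Step 4: join its text into one string.
    return " ".join(texts[best])


page = (
    "<html><body>"
    "<div><a>nav</a></div>"
    "<article>A long paragraph of body text here.</article>"
    "</body></html>"
)
print(extract_text_sketch(page))
```

Here the article element wins because it holds far more text than the navigation link.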

At the lowest level, libextract is just a pipelining library. It provides composable, small functions that can be piped together to process the HT/XML document.
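
To illustrate what piping small functions together can look like, here is a minimal sketch. The helper name pipeline and the example stages are assumptions made for this illustration; they are not taken from libextract's API.

```python
from functools import reduce


def pipeline(data, funcs):
    """Feed `data` through each function in `funcs`, left to right."""
    return reduce(lambda value, func: func(value), funcs, data)


# Each stage is a small, pure function; the output of one is the input
# of the next.
word_count = pipeline("  Hello   World ", [str.strip, str.split, len])
print(word_count)
```

Because every stage takes one value and returns one value, stages can be reordered, swapped, or reused without touching the others.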

Usage

from requests import get
from libextract import extract
from libextract.strategies import ARTICLE_NODE

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
text = extract(r.content)

# To get the HT/XML node:
node = extract(r.content, strategy=ARTICLE_NODE)

# Tabular data extraction
from libextract.strategies import TABULAR
reddit = get("http://reddit.com")
tabs = extract(reddit.content, strategy=TABULAR)

# To view extracted tabular html
from lxml.html import open_in_browser
open_in_browser(tabs[0])
