Process and extract data from HT/XML using small, pipelined functions!

qiwsir/libextract

 
 

libextract

libextract is a library for extracting text out of HT/XML documents using a statistical, functionally pure approach. It originated from the eatiht repository.

From a very high-level perspective, the algorithm can be reduced to four steps:

  • Find the text nodes in the page.
  • Make a histogram of their parents and text lengths.
  • Select the highest-scoring parent node.
  • Join the text in that node into a single string and return it as the result of the extraction.
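
The four steps above can be sketched with the standard library alone. This is a simplified illustration, not libextract's actual implementation: `direct_text` and `extract_text` are hypothetical names, the input is assumed to be well-formed XML/XHTML, and the scoring rule (credit each element's own text length to its parent element) is an assumption about how the histogram might be built.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def direct_text(element):
    """Return the text held directly by `element`, ignoring text inside its children."""
    # element.text is the text before the first child; each child's tail is
    # the text that follows that child but still belongs to `element`.
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return " ".join(" ".join(parts).split())  # collapse runs of whitespace


def extract_text(markup):
    """Hypothetical sketch: return the text under the highest-scoring parent."""
    root = ET.fromstring(markup.strip())

    # Steps 1 and 2: visit every element that may hold text and build a
    # histogram mapping its parent to the accumulated text length.
    scores = Counter()
    for parent in root.iter():
        for child in parent:
            scores[parent] += len(direct_text(child))

    if not scores:
        return ""

    # Step 3: select the highest-scoring parent node.
    best, _ = scores.most_common(1)[0]

    # Step 4: join the text of its children into one string.
    return " ".join(text for text in map(direct_text, best) if text)


SAMPLE = """
<html><body>
  <div id="nav"><a>Home</a><a>About</a></div>
  <div id="article">
    <p>First paragraph of the article body.</p>
    <p>Second paragraph with more text.</p>
  </div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

In the sample, the navigation `div` accumulates only a few characters from its links, while the article `div` accumulates the length of both paragraphs, so the article text is what gets returned.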

At the lowest level, libextract is just a pipelining library. It provides composable, small functions that can be piped together to process the HT/XML document.
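
To illustrate what piping small functions together can look like, here is a minimal sketch. The `pipeline` helper and the three step functions are hypothetical and written for this example only; they are not taken from libextract's API.

```python
from functools import reduce


def pipeline(data, functions):
    """Feed `data` through each function in order, passing each result to the next."""
    return reduce(lambda value, func: func(value), functions, data)


# Three small, pure steps that each take one value and return a new one.
def split_lines(text):
    return text.splitlines()


def drop_blank(lines):
    return [line.strip() for line in lines if line.strip()]


def longest(lines):
    return max(lines, key=len)


if __name__ == "__main__":
    raw = "short\n\n  a much longer line  \nmid"
    print(pipeline(raw, [split_lines, drop_blank, longest]))
```

Because each step is a plain function with no side effects, steps can be reordered, replaced, or tested in isolation without touching the rest of the chain.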

Usage

from requests import get
from libextract import extract
from libextract.strategies import ARTICLE_NODE

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
text = extract(r.content)

# To get the HT/XML node:
node = extract(r.content, strategy=ARTICLE_NODE)

# Tabular data extraction
from libextract.strategies import TABULAR
reddit = get("http://reddit.com")
tabs = extract(reddit.content, strategy=TABULAR)

# To view extracted tabular html
from lxml.html import open_in_browser
open_in_browser(tabs[0])
