
Libextract: elegant text extraction

https://travis-ci.org/datalib/libextract.svg?branch=master
    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_
/_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a statistical extraction library that works on HTML and XML documents, written in Python and originating from eatiht. Its philosophy and aim are to let users describe their extraction algorithms as simple, declaratively composed, pipelined functions.

Overview

libextract.extract(doc)
Extracts text (by default) from a given HT/XML string doc. What is extracted, and how, can be configured through the strategy parameter, which accepts an iterable of functions that are piped into one another (the result of each function is the argument of the next).
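
As a rough sketch of the piping idea (not libextract's actual internals), a strategy can be thought of as a sequence of functions applied in turn, each consuming the previous result:

from functools import reduce

def run_strategy(doc, strategy):
    # Feed the document through each function in sequence;
    # the output of one step becomes the input of the next.
    return reduce(lambda result, step: step(result), strategy, doc)

# Trivial illustration with a two-step "strategy":
run_strategy("<html>Hello</html>", (str.lower, len))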

Installation

pip install libextract

Usage

Extracting the text from a Wikipedia page:

from requests import get
from libextract import extract

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
text = extract(r.content)

Getting the node that (most likely) contains the article's text nodes:

from libextract.strategies import ARTICLE_NODE

node = extract(r.content, strategy=ARTICLE_NODE)

To serialize the node into JSON format:

>>> from libextract.formatters import node_json
>>> node_json(node, depth=1)
{'children': [...],
 'class': ['mw-content-ltr'],
 'id': ['mw-content-text'],
 'tag': 'div',
 'text': None,
 'xpath': '/html/body/div[3]/div[3]/div[4]'}
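
The result above is a plain Python dict; to turn it into an actual JSON string, the standard json module can be used (assuming everything at the chosen depth is JSON-serializable):

>>> import json
>>> json.dumps(node_json(node, depth=1), indent=2)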

Using the TABULAR strategy to get the nodes containing tabular data in an HT/XML document:

from libextract.strategies import TABULAR

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content, strategy=TABULAR))

To convert an HT/XML element to a Python list:

>>> from libextract.formatters import table_list
>>> table_list(tabs[0])
[['Country/Region',
  'Average male height',
  'Average female height',
  'Stature ratio (male to female)',
  'Sample population / age range',
  ...]]

Viewing the table in your browser:

from lxml.html import open_in_browser
open_in_browser(tabs[0])
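
Since table_list returns plain Python lists of rows, the extracted data can also be written out with the standard csv module (a sketch; the filename and column layout here are just for illustration):

import csv

rows = table_list(tabs[0])
with open('heights.csv', 'w', newline='') as f:
    # Each inner list becomes one CSV row.
    csv.writer(f).writerows(rows)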
