
Libextract: elegant text extraction

https://travis-ci.org/datalib/libextract.svg?branch=master
    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_
/_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a statistical extraction library that works on HTML and XML documents, written in Python and originating from eatiht. Its philosophy and aim are to let users describe their extraction algorithms as simple, declaratively composed, pipelined functions.

Overview

libextract.extract(doc)
Extracts text (by default) from a given HT/XML string doc. What is extracted, and how, can be configured through the strategy parameter, which accepts an iterable of functions that are piped into one another (the result of each function is the argument of the next).
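
As a rough sketch of the piping idea (not libextract's actual internals), a strategy can be thought of as a sequence of functions applied in turn, each consuming the previous result:

from functools import reduce

def run_strategy(doc, strategy):
    # Feed the document through each function in sequence;
    # the output of one step becomes the input of the next.
    return reduce(lambda result, step: step(result), strategy, doc)

# Trivial illustration with a two-step "strategy":
run_strategy("<html>Hello</html>", (str.lower, len))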

Installation

pip install libextract

Usage

Extracting the text from a Wikipedia page:

from requests import get
from libextract import extract

r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
text = extract(r.content)

Getting the node that (most likely) contains the article's text nodes:

from libextract.strategies import ARTICLE_NODE

node = extract(r.content, strategy=ARTICLE_NODE)

To serialize the node into JSON format:

>>> from libextract.formatters import node_json
>>> node_json(node, depth=1)
{'children': [...],
 'class': ['mw-content-ltr'],
 'id': ['mw-content-text'],
 'tag': 'div',
 'text': None,
 'xpath': '/html/body/div[3]/div[3]/div[4]'}
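
The result above is a plain Python dict; to turn it into an actual JSON string, the standard json module can be used (assuming everything at the chosen depth is JSON-serializable):

>>> import json
>>> json.dumps(node_json(node, depth=1), indent=2)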

Using the TABULAR strategy to get the nodes containing tabular data in an HT/XML document:

from libextract.strategies import TABULAR

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content, strategy=TABULAR))

To convert an HT/XML element to a Python list:

>>> from libextract.formatters import table_list
>>> table_list(tabs[0])
[['Country/Region',
  'Average male height',
  'Average female height',
  'Stature ratio (male to female)',
  'Sample population / age range',
  ...]]

Viewing the table in your browser:

from lxml.html import open_in_browser
open_in_browser(tabs[0])
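
Since table_list returns plain Python lists of rows, the extracted data can also be written out with the standard csv module (a sketch; the filename and column layout here are just for illustration):

import csv

rows = table_list(tabs[0])
with open('heights.csv', 'w', newline='') as f:
    # Each inner list becomes one CSV row.
    csv.writer(f).writerows(rows)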
