Commit faae04c

Added scenario about web scraping using lxml

1 parent 2a9c732 commit faae04c

1 file changed: docs/scenarios/scrape.rst (82 additions, 0 deletions)

HTML Scraping
=============

Web Scraping
------------

Web sites are written using HTML, which means that each web page is a
structured document. Sometimes it would be great to obtain some data from
them and preserve the structure while we're at it, but web sites don't often
provide their data in comfortable formats such as ``.csv``.

This is where web scraping comes in. Web scraping is the practice of using a
computer program to sift through a web page and gather the data that you need
in a format most useful to you.

lxml
----

`lxml <http://lxml.de/>`_ is a fairly extensive library for parsing XML and
HTML documents, which you can easily install using ``pip``. We will be using
its ``html`` module to get data from this web page:
`econpy <http://econpy.pythonanywhere.com/ex/001.html>`_.

First we shall import the required modules:

.. code-block:: python

    from lxml import html
    from urllib2 import urlopen

We will use ``urllib2.urlopen`` to retrieve the web page with our data and
parse it using the ``html`` module:

.. code-block:: python

    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
    tree = html.fromstring(page.read())
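
As a side note, lxml's ``html`` module can also fetch and parse a page in one
step; this small variant is not part of the example above, just an
alternative worth knowing:

.. code-block:: python

    # Variant (not used in the rest of this page): lxml can read a URL
    # directly; parse() returns an ElementTree, so we ask for its root.
    tree = html.parse('http://econpy.pythonanywhere.com/ex/001.html').getroot()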

``tree`` now contains the whole HTML file in a nice tree structure which we
can go over in many different ways, one of which is XPath. XPath is a way of
locating information in structured documents such as HTML or XML pages. A
good introduction to XPath is available
`here <http://www.w3schools.com/xpath/default.asp>`_.
You can also use various tools to obtain the XPath of an element, such as
FireBug for Firefox; in Chrome you can right-click an element, choose
'Inspect element', highlight the code, right-click again, and choose
'Copy XPath'.
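
To get a feel for how an XPath query maps onto a document before touching the
real page, here is a minimal, self-contained sketch; the HTML snippet in it
is invented purely for illustration:

.. code-block:: python

    from lxml import html  # already imported above

    # Toy document: two <p> elements distinguished by their class attribute.
    doc = html.fromstring('<div><p class="a">one</p><p class="b">two</p></div>')
    # Select the text of the <p> whose class is "b".
    print doc.xpath('//p[@class="b"]/text()')   # prints ['two']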

After a quick analysis, we see that in our page the data is contained in two
kinds of elements: ``div`` elements with the title ``buyer-name`` and ``span``
elements with the class ``item-price``. Knowing this, we can create the
correct XPath queries and use lxml's ``xpath`` function like this:

.. code-block:: python

    # This will create a list of buyers:
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    # This will create a list of prices:
    prices = tree.xpath('//span[@class="item-price"]/text()')

Let's see what we got exactly:

.. code-block:: python

    print 'Buyers: ', buyers
    print 'Prices: ', prices

::

    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']

    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
    '$15.00', '$114.07', '$10.09']

Congratulations! We have successfully scraped all the data we wanted from a
web page using lxml, and we have it stored in memory as two lists. Now we can
either continue working on it, analyzing it in Python, or we can export it to
a file and share it with friends.
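
As a small illustration of those last two options, here is one possible
sketch; it assumes the ``buyers`` and ``prices`` lists built above, and
``sales.csv`` is just an example filename:

.. code-block:: python

    import csv

    # One possible analysis: strip the '$' signs and total the prices.
    total = sum(float(price.lstrip('$')) for price in prices)
    print 'Total: $%.2f' % total

    # One possible export: pair each buyer with a price and write a CSV file.
    # ('sales.csv' is an arbitrary example name; 'wb' mode suits the csv
    # module in the Python 2 style used throughout this page.)
    with open('sales.csv', 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(['buyer', 'price'])
        for buyer, price in zip(buyers, prices):
            writer.writerow([buyer, price])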
