
Commit 7ec689f

Grammar fix, got rid of DOS line endings
1 parent a899f41 commit 7ec689f


docs/scenarios/scrape.rst

Lines changed: 101 additions & 99 deletions

HTML Scraping
=============

Web Scraping
------------

Web sites are written using HTML, which means that each web page is a
structured document. Sometimes it would be great to obtain some data from
them and preserve the structure while we're at it. Web sites don't always
provide their data in comfortable formats such as ``csv`` or ``json``.

This is where web scraping comes in. Web scraping is the practice of using a
computer program to sift through a web page and gather the data that you need
in a format most useful to you while at the same time preserving the structure
of the data.

lxml and Requests
-----------------

`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
XML and HTML documents really fast. It even handles messed up tags. We will
also be using the `Requests <http://docs.python-requests.org/en/latest/>`_
module instead of the built-in urllib2 due to improvements in speed and
readability. You can easily install both using ``pip install lxml`` and
``pip install requests``.
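
For example, from your shell:

.. code-block:: console

    $ pip install lxml
    $ pip install requests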

Let's start with the imports:

.. code-block:: python

    from lxml import html
    import requests

Next we will use ``requests.get`` to retrieve the web page with our data,
parse it using the ``html`` module, and save the results in ``tree``:

.. code-block:: python

    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
    tree = html.fromstring(page.text)
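
If you want to be a bit more defensive, a small variant of the same step
checks the response status and lets lxml work out the document's encoding
from the raw bytes:

.. code-block:: python

    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
    # Raise an exception on 4xx/5xx responses instead of silently
    # parsing an error page.
    page.raise_for_status()
    # Passing the raw bytes lets lxml detect the encoding from the
    # document itself.
    tree = html.fromstring(page.content)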

``tree`` now contains the whole HTML file in a nice tree structure which
we can go over in two different ways: XPath and CSSSelect. In this example,
we will focus on the former.

XPath is a way of locating information in structured documents such as
HTML or XML documents. A good introduction to XPath is on
`W3Schools <http://www.w3schools.com/xpath/default.asp>`_.

There are also various tools for obtaining the XPath of elements, such as
FireBug for Firefox or the Chrome Inspector. If you're using Chrome, you
can right click an element, choose 'Inspect element', highlight the code,
right click again, and choose 'Copy XPath'.

After a quick analysis, we see that in our page the data is contained in
two elements - one is a div with title 'buyer-name' and the other is a
span with class 'item-price':

::

    <div title="buyer-name">Carson Busses</div>
    <span class="item-price">$29.95</span>

Knowing this we can create the correct XPath query and use the lxml
``xpath`` function like this:

.. code-block:: python

    # This will create a list of buyers:
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    # This will create a list of prices:
    prices = tree.xpath('//span[@class="item-price"]/text()')
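
If you prefer CSS selectors, the same queries can be written with lxml's
``cssselect`` method - a quick sketch, assuming the optional ``cssselect``
package is installed (``pip install cssselect``):

.. code-block:: python

    # Equivalent queries in CSS selector syntax; .text holds each
    # element's text content.
    buyers = [div.text for div in tree.cssselect('div[title="buyer-name"]')]
    prices = [span.text for span in tree.cssselect('span.item-price')]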

Let's see what we got exactly:

.. code-block:: python

    print 'Buyers: ', buyers
    print 'Prices: ', prices

::

    Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']

    Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
    '$15.00', '$114.07', '$10.09']

Congratulations! We have successfully scraped all the data we wanted from
a web page using lxml and Requests. We have it stored in memory as two
lists. Now we can do all sorts of cool stuff with it: we can analyze it
using Python or we can save it to a file and share it with the world.
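
For the saving part, here is a minimal sketch that pairs each buyer with a
price and writes comma-separated lines (the ``buyers.csv`` filename is just
an example):

.. code-block:: python

    # Write the two lists side by side as comma-separated lines.
    with open('buyers.csv', 'w') as f:
        f.write('buyer,price\n')
        for buyer, price in zip(buyers, prices):
            f.write('%s,%s\n' % (buyer, price))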

A cool idea to think about is modifying this script to iterate through
the rest of the pages of this example dataset, or rewriting this
application to use threads for improved speed.
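
Here is a rough sketch of both ideas combined, assuming the remaining pages
follow the same numbering pattern (``002.html``, ``003.html`` and so on);
adjust the range to match the actual dataset:

.. code-block:: python

    from multiprocessing.pool import ThreadPool

    import requests
    from lxml import html

    # Assumed URL pattern; change the range if the dataset has a
    # different number of pages.
    urls = ['http://econpy.pythonanywhere.com/ex/%03d.html' % n
            for n in range(1, 6)]

    def scrape(url):
        tree = html.fromstring(requests.get(url).text)
        return (tree.xpath('//div[@title="buyer-name"]/text()'),
                tree.xpath('//span[@class="item-price"]/text()'))

    # Fetch the pages on four worker threads.
    for buyers, prices in ThreadPool(4).map(scrape, urls):
        print(buyers)
        print(prices)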
