@@ -14,27 +14,29 @@ This is where web scraping comes in. Web scraping is the practice of using
 computer program to sift through a web page and gather the data that you need
 in a format most useful to you.

-lxml
-----
+lxml and Requests
+-----------------

 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents, which you can easily install using ``pip``. We will
-be using its ``html`` module to get example data from this web page: `econpy.org <http://econpy.pythonanywhere.com/ex/001.html>`_.
+XML and HTML documents very quickly, even handling malformed tags. We will
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the built-in ``urllib2``,
+due to its improvements in speed and readability. You can easily install both
+with ``pip install lxml`` and ``pip install requests``.

-First we shall import the required modules:
+Let's start with the imports:

 .. code-block:: python

     from lxml import html
-    from urllib2 import urlopen
+    import requests

-We will use ``urllib2.urlopen`` to retrieve the web page with our data and
-parse it using the ``html`` module:
+Next we will use ``requests.get`` to retrieve the web page with our data,
+parse it using the ``html`` module, and save the result in ``tree``:

 .. code-block:: python

-    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
-    tree = html.fromstring(page.read())
+    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
+    tree = html.fromstring(page.text)

 ``tree`` now contains the whole HTML file in a nice tree structure which
 we can go over two different ways: XPath and CSSSelect. In this example, I
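As a rough sketch of where the XPath approach mentioned above leads, the following runs a query against a parsed tree. The HTML snippet and the attribute names in it are made-up stand-ins (not the actual structure of the econpy page), and it parses an inline string so it runs without a network request:

```python
from lxml import html

# Illustrative stand-in document; the real tutorial fetches a live page.
snippet = """
<div>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
</div>
"""

tree = html.fromstring(snippet)

# XPath: take the text of every div carrying the given title attribute.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)   # ['Carson Busses']

# The same element could be reached with CSSSelect via
# tree.cssselect('span.item-price'), but that needs the separate
# ``cssselect`` package, so this sketch sticks to XPath.
prices = tree.xpath('//span[@class="item-price"]/text()')
print(prices)   # ['$29.95']
```

The ``text()`` step in each expression is what turns matched elements into plain strings; without it, ``xpath`` returns the element objects themselves.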