Goutte is a screen scraping and web crawling library for PHP.
Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.
Goutte depends on PHP 5.4+ and Guzzle 4+.
Tip
If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x.
Add fabpot/goutte as a require dependency in your composer.json file:
composer require fabpot/goutte

Tip
You can also download the Goutte.phar file:
require_once '/path/to/goutte.phar';

The phars for Goutte 1.x are also available for download <http://get.sensiolabs.org/goutte-v1.0.7.phar>.
Create a Goutte Client instance (which extends
Symfony\Component\BrowserKit\Client):
use Goutte\Client;
$client = new Client();

Make requests with the request() method:
// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');

The method returns a Crawler object
(Symfony\Component\DomCrawler\Crawler).
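
To get a feel for the returned object without making a network request, the following minimal sketch builds a Crawler directly from an HTML string. The markup, the post titles, and the Composer autoloader path are made up for illustration; it assumes symfony/dom-crawler is installed (it is pulled in as a Goutte dependency).

```php
<?php
// Assumes a Composer install; the autoloader path is an assumption.
require_once __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
<h2><a href="/blog/one">First post</a></h2>
<h2><a href="/blog/two">Second post</a></h2>
</body></html>
HTML;

$crawler = new Crawler($html);

// filterXPath() works with the DomCrawler component alone; filter() with a
// CSS selector additionally needs the CssSelector component.
$links = $crawler->filterXPath('//h2/a');

// each() collects the closure's return value for every matched node.
$titles = $links->each(function ($node) {
    return $node->text();
});

// attr() reads an attribute from the first node of the selection.
$firstHref = $links->first()->attr('href');

print count($links)."\n";
print implode(', ', $titles)."\n";
print $firstHref."\n";
```

The same calls should apply to the Crawler returned by request(), since it is the same class.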
Fine-tune cURL options:
$client->getClient()->setDefaultOption('config/curl/'.CURLOPT_TIMEOUT, 60);

Click on links:
// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);

Extract data:
// Get the latest posts in this category and display their titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});

Submit forms:
$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

Read the documentation of the BrowserKit and DomCrawler Symfony Components for more information about what you can do with Goutte.
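
As a starting point for exploring those components, here is a minimal offline sketch of the Link and Form objects that the click and submit steps above pass to the client. The HTML, the example.com URI, the field names, and the autoloader path are all hypothetical; only the DomCrawler component is assumed to be installed.

```php
<?php
// Assumes a Composer install; the autoloader path is an assumption.
require_once __DIR__.'/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page with one link and one sign-in form.
$html = <<<'HTML'
<html><body>
<a href="/blog/category/security-advisories">Security Advisories</a>
<form action="/session" method="post">
<input type="text" name="login" value="" />
<input type="password" name="password" value="" />
<input type="submit" name="commit" value="Sign in" />
</form>
</body></html>
HTML;

// The second argument is the URI the page is assumed to come from;
// links and forms use it to resolve relative URLs.
$crawler = new Crawler($html, 'http://www.example.com/blog/');

// selectLink() matches on the link text; link() wraps the first match.
$link = $crawler->selectLink('Security Advisories')->link();
$linkUri = $link->getUri();

// selectButton() matches on the button value; form() wraps the enclosing
// form and accepts field values to override.
$form = $crawler->selectButton('Sign in')->form(array(
    'login' => 'fabpot',
    'password' => 'xxxxxx',
));
$values = $form->getValues();

print $linkUri."\n";
print $form->getMethod().' '.$form->getUri()."\n";
print $values['login']."\n";
```

Inspecting getUri(), getMethod(), and getValues() this way can help when a click() or submit() call does not send the request you expected.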
Goutte is a thin wrapper around the following fine PHP libraries:
- Symfony Components: BrowserKit, CssSelector and DomCrawler;
- Guzzle HTTP Component.
Goutte is licensed under the MIT license.