- supports two traversal algorithms: breadth-first and depth-first
- supports depth limiting and queue size limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as Domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
- supports custom request handling logic
- comes with a useful set of persistence handlers (memory and file; Redis soon to follow)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy
- will soon come with many default discoverers: RSS, Atom, RDF, etc.
- will soon support multiple queueing mechanisms (file, memcache, redis)
- will eventually support distributed spidering with a central queue
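To make the traversal options in the list above concrete, here is a minimal sketch of how breadth-first and depth-first crawling differ, and how a depth limit and a queue size limit cut a crawl short. It does not use PHP-Spider's API: the traverse() function, its parameters, and the in-memory $links graph are all hypothetical and exist only for illustration. In particular, counting the seed URI toward the queue size limit is an assumption of this sketch, not a statement about how the library counts.

```php
<?php
// Hypothetical helper, not part of PHP-Spider: walks an in-memory link graph.
function traverse(array $links, string $seed, string $algorithm, int $maxDepth, int $maxQueueSize): array
{
    $queue = [[$seed, 0]];   // pending [uri, depth] pairs
    $seen = [$seed => true]; // every URI that was ever enqueued
    $visited = [];           // URIs in the order they were "fetched"

    while ($queue) {
        // Breadth-first takes the oldest entry, depth-first the newest.
        [$uri, $depth] = $algorithm === 'bfs' ? array_shift($queue) : array_pop($queue);
        $visited[] = $uri;

        if ($depth >= $maxDepth) {
            continue; // depth limit: do not follow links found at this level
        }
        foreach ($links[$uri] ?? [] as $next) {
            if (isset($seen[$next]) || count($seen) >= $maxQueueSize) {
                continue; // already enqueued, or queue size limit reached
            }
            $seen[$next] = true;
            $queue[] = [$next, $depth + 1];
        }
    }

    return $visited;
}

// Made-up site structure: each key links to the URIs in its array.
$links = [
    '/'  => ['/a', '/b'],
    '/a' => ['/a1', '/a2'],
    '/b' => ['/b1'],
];

$bfs = traverse($links, '/', 'bfs', 2, 10);
$dfs = traverse($links, '/', 'dfs', 2, 10);
$shallow = traverse($links, '/', 'bfs', 1, 10); // depth limit of 1
$capped = traverse($links, '/', 'bfs', 2, 3);   // at most 3 URIs enqueued

echo implode(' ', $bfs), "\n";
echo implode(' ', $dfs), "\n";
```

Breadth-first visits every page at one depth before going deeper, which suits "grab everything linked from the start page" crawls. Depth-first follows one branch to the end first, which keeps the pending queue smaller on wide sites.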
The easiest way to install PHP-Spider is with Composer. Find it on Packagist.
Note: if you want to run the examples or unit tests, install the development dependencies as well, so that all dependencies for the examples also get installed:
composer install --dev
This is a very simple example; the code can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.
First create the spider
use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
$spider = new Spider('http://www.dmoz.org');
Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>:
$spider->addDiscoverer(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
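To see which links an expression like the one above would pick up, the following sketch runs the same XPath against a small, made-up HTML page using PHP's built-in DOM extension. This is only an illustration of what the expression selects; it is not how PHP-Spider implements discovery, and the sample markup and the $discovered variable are assumptions.

```php
<?php
// Stand-in for a fetched page; the markup is made up for illustration.
$html = <<<HTML
<html><body>
  <div id="catalogs"><ul>
    <li><a href="/Arts/">Arts</a></li>
    <li><a href="/Science/">Science</a></li>
  </ul></div>
  <div id="footer"><a href="/about">About</a></div>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);

// Same expression as passed to XPathExpressionDiscoverer above:
// every <a> nested anywhere inside the <div> with id "catalogs".
$xpath = new DOMXPath($document);
$discovered = [];
foreach ($xpath->query("//div[@id='catalogs']//a") as $anchor) {
    $discovered[] = $anchor->getAttribute('href');
}

print_r($discovered);
```

Only the two links inside the catalogs <div> are collected; the footer link is ignored because it sits outside the element the expression is scoped to.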
Set some sane options for this example. In this case, we only get the first 10 items from the start page.
$spider->setMaxDepth(1);
$spider->setMaxQueueSize(10);
Execute the crawl
$spider->crawl();
When crawling is done, we could get some info about the crawl
$stats = $spider->getStatsHandler();
echo "\nSPIDER ID: " . $stats->getSpiderId();
echo "\n ENQUEUED: " . count($stats->getQueued());
echo "\n SKIPPED: " . count($stats->getFiltered());
echo "\n FAILED: " . count($stats->getFailed());
Finally we could do some processing on the downloaded resources. In this example, we will echo the title of all resources
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXPath('//title')->text();
}
Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.
There are a few requirements for a Pull Request to be accepted:
- Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides;
- Prove that the code works with unit tests.
Note: An easy way to check whether your code conforms to PHP-Spider's coding standards is to run Scrutinizer on your local code. Download scrutinizer.phar and run it on your PHP-Spider repository like so:
php scrutinizer.phar run /path/to/php-spider
For things like reporting bugs and requesting features it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)
If you have other questions, or need some tips, you can send a tweet to @phpspider or send an email to [email protected].
PHP-Spider is licensed under the MIT license.