PHPCrawl webcrawler library/framework

Installation & Quickstart


The following steps show how to use PHPCrawl:
  1. Unpack the PHPCrawl package anywhere you like. That is all the installation requires.
  2. Include the PHPCrawl main class in your script or project. It is located in the "libs" directory of the package.

    include("libs/PHPCrawler.class.php");

    There are no other includes needed.
  3. Extend the PHPCrawler class and override its handleDocumentInfo() method with your own code to process the information about every document the crawler finds.

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
      {
        // Your code comes here!
        // Do something with the $PageInfo-object that
        // contains all information about the currently
        // received document.

        // As example we just print out the URL of the document
        echo $PageInfo->url."\n";
      }
    }

    For a list of all information available about a page or file inside handleDocumentInfo(), see the PHPCrawlerDocumentInfo reference.

    Note to users of PHPCrawl 0.7x or earlier: The old, overridable method "handlePageData()", which receives the document information as an array, is still present and still gets called. PHPCrawl 0.8 is fully compatible with scripts written for earlier versions.
  4. Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling process.

    $crawler = new MyCrawler();
    $crawler->setURL("www.foo.com");
    $crawler->addContentTypeReceiveRule("#text/html#");
    // ...

    $crawler->go(); 

    For a list of all available setup options and methods of the crawler, see the PHPCrawler class reference.
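
The argument passed to addContentTypeReceiveRule() in step 4, "#text/html#", is a regular expression that uses "#" as its delimiter. The following standalone sketch illustrates how such a pattern behaves when tested against typical Content-Type header values. It assumes that a rule is applied with an unanchored preg_match() against the header value; the helper matchesAnyRule() is hypothetical, written only for this illustration, and is not part of PHPCrawl.

```php
<?php
// Hypothetical helper, NOT part of PHPCrawl: returns true if the given
// Content-Type value matches at least one of the regex rules.
// Assumption: rules are PCRE patterns tested with an unanchored preg_match().
function matchesAnyRule(string $contentType, array $rules): bool
{
    foreach ($rules as $rule) {
        if (preg_match($rule, $contentType) === 1) {
            return true;
        }
    }
    return false;
}

// The same pattern as in step 4; "#" is the pattern delimiter.
$rules = array("#text/html#");

// Sample header values chosen for illustration only.
$samples = array(
    "text/html; charset=UTF-8",
    "image/png",
    "application/pdf",
);

foreach ($samples as $contentType) {
    $verdict = matchesAnyRule($contentType, $rules) ? "match" : "no match";
    echo $contentType . " => " . $verdict . "\n";
}
```

Because the pattern is not anchored, a header such as "text/html; charset=UTF-8" still matches. To accept several content types, one option is an alternation inside a single pattern, for example "#text/(html|plain)#".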