PHPCrawl webcrawler library for PHP

PHPCrawl FAQ

Sometimes (almost) no information about a document is passed to the user function handleDocumentInfo(); most of the properties of the corresponding PHPCrawlerDocumentInfo object are empty.

Usually the reason is an error that occurred while requesting the document. In this case, the PHPCrawlerDocumentInfo property "error_occured" will be true, and the "error_string" property contains the error report as a human-readable string. For timeout errors (like "Socket-stream timed out"), try increasing the connection timeout and/or the stream timeout.
$crawler->setStreamTimeout(5); // defaults to 2 seconds
$crawler->setConnectionTimeout(10); // defaults to 5 seconds
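The check described above can be sketched as a small helper. This is only an illustration: the function name describeRequestError() and the stdClass stand-in object are assumptions made so the example runs without the library; only the "error_occured" and "error_string" properties come from the text above.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): builds a log line from the two
// error properties described above. $docInfo is expected to expose
// "error_occured" and "error_string", as a PHPCrawlerDocumentInfo object does.
function describeRequestError($docInfo)
{
    if (empty($docInfo->error_occured)) {
        return null; // no error flagged for this request, nothing to report
    }
    return "Request failed: " . $docInfo->error_string;
}

// Stand-in object so the sketch runs without the library installed.
$fake = new stdClass();
$fake->error_occured = true;
$fake->error_string = "Socket-stream timed out";

echo describeRequestError($fake), PHP_EOL;
```

Inside a real crawler, the same helper could be called from handleDocumentInfo() with the object passed to it, and the result logged whenever it is not null.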

When trying to start the crawler in multi-process mode, a lot of warnings like "sem_get() [function.sem-get]: failed for key 0x5202e59f: No space left on device" are thrown.

PHPCrawl uses semaphores for inter-process communication. When crawling processes are aborted, the semaphores they used are not removed. If this happens too often, no space is left for new semaphores and the error above occurs. To remove "dead" semaphores, use the following Unix command:

for i in $(ipcs -s | awk '/phpcrawl_user/ {print $2}'); do ipcrm -s "$i"; done

Here "phpcrawl_user" is the user who is running the crawler.
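Before removing anything, it can help to see which semaphore IDs the command would affect. The following is a minimal sketch of a stricter filter that matches the owner column exactly instead of matching the user name anywhere on the line. It assumes the Linux "ipcs -s" column order (key, semid, owner, perms, nsems); the function name semaphoreIdsForUser() and the sample output are hypothetical.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the semaphore IDs owned
// by $user, given the text printed by "ipcs -s". Assumes the Linux column
// order: key, semid, owner, perms, nsems.
function semaphoreIdsForUser($ipcsOutput, $user)
{
    $ids = array();
    foreach (explode("\n", $ipcsOutput) as $line) {
        $fields = preg_split('/\s+/', trim($line));
        // Data rows have a numeric semid in column 2 and the owner in column 3;
        // header and separator lines fail the numeric check and are skipped.
        if (count($fields) >= 3 && ctype_digit($fields[1]) && $fields[2] === $user) {
            $ids[] = (int) $fields[1];
        }
    }
    return $ids;
}

// Made-up sample output so the sketch runs without calling ipcs.
$sample = "------ Semaphore Arrays --------\n"
        . "key        semid      owner      perms      nsems\n"
        . "0x5202e59f 32768      phpcrawl_user 600     1\n"
        . "0x00000000 65537      www-data   600        1\n";

print_r(semaphoreIdsForUser($sample, "phpcrawl_user"));
```

In practice the input would come from running "ipcs -s" on the machine in question, and each returned ID would then be passed to "ipcrm -s".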

When trying to use PHPCrawl on a CentOS or Red Hat system, a lot of "Warning: preg_match_all(): Compilation failed: unrecognized character after ..." errors are thrown.

The reason is an old PCRE library installed on some CentOS/Red Hat systems (mostly version 6.xx). Updating this library to a current version (8.xx) should resolve the problem.
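To see which PCRE version a given PHP installation is using, the built-in PCRE_VERSION constant can be inspected. The sketch below is an illustration only; the helper name pcreMajorVersion() is an assumption, and the threshold of 8 simply mirrors the version mentioned above.

```php
<?php
// PCRE_VERSION is a built-in PHP constant holding a string that starts with
// the version number, followed by a release date.
// Hypothetical helper: extracts the major version number from such a string.
function pcreMajorVersion($versionString)
{
    $parts = explode('.', $versionString);
    return (int) $parts[0];
}

$major = pcreMajorVersion(PCRE_VERSION);
echo "PCRE version: " . PCRE_VERSION . PHP_EOL;

if ($major < 8) {
    echo "PCRE looks older than 8.x; updating it is recommended." . PHP_EOL;
}
```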

The crawler finds and follows some strange links/URLs like "http://mysite.com/(+" that don't exist.

By default, the crawler tries to find as many links as possible in documents. By setting some options, you can adjust the internal link-finding algorithm to prevent it from finding (most of) these phantom links:
// Disable aggressive link search
$crawler->enableAggressiveLinkSearch(false);

// Don't let the crawler look for links in script parts,
// HTML comments etc. of documents.
$crawler->excludeLinkSearchDocumentSections(PHPCrawlerLinkSearchDocumentSections::ALL_SPECIAL_SECTIONS);

// Make sure the crawler only looks for links in HTML documents
// (this is the default)
$crawler->addLinkSearchContentType("#text/html# i");
