PHPCrawl webcrawler library/framework

Multiprocessing modes


PHPCrawl supports two different types of multiprocessing.

The first and default mode is "MPMODE_PARENT_EXECUTES_USERCODE". In this mode the overridable function handleDocumentInfo() containing usercode always gets executed by the main process of the crawler. This means that the code provided by the user never gets executed simultaneously, so you don't have to worry about concurrent file/database/handle accesses or similar things. This is the recommended multiprocessing mode.

Example of starting the crawler in this mode:

$crawler->goMultiProcessed(5,
PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);

or simply

$crawler->goMultiProcessed(5);


The second mode is "MPMODE_CHILDS_EXECUTES_USERCODE".
In this mode all child processes call the function handleDocumentInfo() directly from their own process context, so the code you provided in the overridden method handleDocumentInfo() will probably be executed simultaneously by different child processes. This may result in better performance, but you always have to take care of concurrent file/database/handle accesses and all the other typical concerns of parallel computing.

When using the "MPMODE_CHILDS_EXECUTES_USERCODE" mode and your extended crawler class uses any handles like database connections or filestreams, you should open them within the overridden method initChildProcess() instead of opening them in the constructor. For more details see the documentation of the initChildProcess()-method.

Example of starting the crawler in this mode:

$crawler->goMultiProcessed(5,
PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);
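To illustrate the advice about initChildProcess() above, here is a minimal sketch of an extended crawler class that opens its database connection per child process. The class name "MyCrawler", the table layout and the PDO connection details are placeholders, not part of PHPCrawl itself:

```php
<?php
// Sketch only: assumes PHPCrawl is installed and the DSN/credentials
// below are replaced with real values.
require_once("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  protected $db;

  // Gets called once inside every child process, so each child
  // opens its own database connection instead of all children
  // sharing one handle inherited from the parent's constructor.
  public function initChildProcess()
  {
    $this->db = new PDO("mysql:host=localhost;dbname=crawl", "user", "pass");
  }

  public function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    // Runs concurrently in this mode; each child uses its own $this->db,
    // so the inserts don't clash on a shared connection.
    $stmt = $this->db->prepare("INSERT INTO pages (url, http_code) VALUES (?, ?)");
    $stmt->execute(array($DocInfo->url, $DocInfo->http_status_code));
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com");
$crawler->goMultiProcessed(5,
                           PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);
```

Opening the handle in initChildProcess() instead of the constructor matters because child processes are created by forking: a connection opened in the parent would be duplicated into every child, and concurrent use of that single duplicated handle typically breaks.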