Method: PHPCrawler::goMultiProcessed()



Starts the crawler using multiple processes.
Signature:

public goMultiProcessed($process_count = 3, $multiprocess_mode = 1)

Parameters:

$process_count int Number of processes to use
$multiprocess_mode int The multiprocess-mode to use.
One of the PHPCrawlerMultiProcessModes-constants

Returns:

No information

Description:

When using this method instead of the go()-method to start the crawler, phpcrawl will use the given
number of processes simultaneously for spidering the target-url.
Using multiple processes will speed up the crawling process dramatically in most cases.

There are some requirements though to successfully run the crawler in multi-process mode:


  • The multi-process mode only works on unix-based systems (Linux)

  • Scripts using the crawler have to be run from the commandline (cli)

  • The PCNTL-extension for php (process control) has to be installed and activated.

  • The SEMAPHORE-extension for php has to be installed and activated.

  • The POSIX-extension for php has to be installed and activated.

  • The PDO-extension together with the SQLite-driver (PDO_SQLITE) has to be installed and activated.
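
Whether these requirements are met can be checked from the script itself before starting the crawler,
for example with the following sketch. It uses only standard PHP functions and is not part of phpcrawl;
"sysvsem" is the standard extension-name of the SEMAPHORE-extension:

<?php
// Sketch: verify the multi-process requirements listed above.
if (php_sapi_name() != "cli")
    die("Please run this script from the commandline (cli).\n");

foreach (array("pcntl", "sysvsem", "posix", "pdo_sqlite") as $extension)
{
    if (!extension_loaded($extension))
        die("Required php-extension '".$extension."' is not installed/activated.\n");
}
?>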



PHPCrawl supports two different modes of multiprocessing:

  1. PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE

    The crawler uses multiple processes simultaneously for spidering the target URL, but the usercode provided to
    the overridable function handleDocumentInfo() always gets executed on the same main-process. This
    means that the usercode never gets executed simultaneously, so you don't have to care about
    concurrent file/database/handle-accesses or similar things.
    On the other hand the usercode may slow down the crawling-procedure, because every child-process has to
    wait until the usercode has been executed on the main-process. This is the recommended multiprocess-mode!

  2. PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE

    The crawler uses multiple processes simultaneously for spidering the target URL, and every child-process executes
    the usercode provided to the overridable function handleDocumentInfo() directly from its own process. This
    means that the usercode gets executed simultaneously by the different child-processes, and you should
    take care of concurrent file/database/handle-accesses properly (if used).

    When using this mode and you use any handles like database-connections or filestreams in your extended
    crawler-class, you should open them within the overridden method initChildProcess() instead of opening
    them from the constructor (see the sketch below). For more details see the documentation of the
    initChildProcess()-method.
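
For the MPMODE_CHILDS_EXECUTES_USERCODE-mode, opening a database-handle within initChildProcess() could
look like the following sketch. The SQLite-file, the table and the property-name are placeholders chosen
for illustration, not fixed parts of the API:

<?php
// Sketch for MPMODE_CHILDS_EXECUTES_USERCODE: every child-process opens
// its own database-handle in initChildProcess() instead of the constructor.
class MyCrawler extends PHPCrawler
{
    protected $PDO; // placeholder property, one handle per child-process

    function initChildProcess()
    {
        // Gets called within every child-process before it starts spidering.
        $this->PDO = new PDO("sqlite:crawlresults.db3"); // placeholder file
    }

    function handleDocumentInfo($DocInfo)
    {
        // In this mode the usercode here gets executed simultaneously
        // by the different child-processes.
        $stmt = $this->PDO->prepare("INSERT INTO urls (url) VALUES (?)");
        $stmt->execute(array($DocInfo->url));
    }
}
?>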



Example for starting the crawler with 5 processes using the recommended MPMODE_PARENT_EXECUTES_USERCODE-mode:

$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
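
In context, a minimal sketch of a complete script could look like this. The include-path, the class-name
MyCrawler and the target-url are placeholders in the style of the usual phpcrawl examples:

<?php
// Minimal sketch: an extended crawler-class started with 5 processes.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Usercode; in MPMODE_PARENT_EXECUTES_USERCODE this always runs
        // on the main-process, never simultaneously.
        echo $DocInfo->url."\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com"); // placeholder URL
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
?>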

Please note that increasing the number of processes to high values doesn't automatically mean that the crawling process
will finish faster! Using 3 to 5 processes should be a good value to start from.