Method: PHPCrawler::addStreamToFileContentType()



Adds a rule to the list of rules that decide which types of content should be streamed directly to a temporary file.
Signature:

public addStreamToFileContentType($regex)

Parameters:

$regex string The rule as a regular-expression

Returns:

bool  TRUE if the rule was added to the list and the regex is valid.

Description:

If the content-type of a page or file matches one of these rules, the content will be streamed directly into a
temporary file without claiming local RAM.

It's recommended to add the content-types of all files that may be large in order to prevent memory overflows.
By default, the crawler receives all content to memory!

The content/source of pages and files that were streamed to file is not directly accessible within the overridden method
handleDocumentInfo(). Instead, you get information about the file the content was stored in
(see properties PHPCrawlerDocumentInfo::received_to_file and PHPCrawlerDocumentInfo::content_tmp_file).
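
The two properties mentioned above could be used roughly as in the following sketch. The helper name readDocumentContent() is hypothetical, the "content" property is assumed to hold the in-memory source of documents that were not streamed, and a plain stdClass object stands in for PHPCrawlerDocumentInfo so that the snippet runs without the library:

```php
<?php
// Hypothetical helper (not part of the library): returns the document source
// regardless of whether it was received to memory or streamed to a file.
function readDocumentContent($DocInfo)
{
    if ($DocInfo->received_to_file) {
        // Content was streamed to disk; read it back from the temporary file.
        $data = file_get_contents($DocInfo->content_tmp_file);
        return $data === false ? null : $data;
    }

    // Content was received to memory (assumed to live in the "content" property).
    return $DocInfo->content;
}

// Stand-in object for illustration only (not the real PHPCrawlerDocumentInfo).
$tmp = tempnam(sys_get_temp_dir(), "crawl");
file_put_contents($tmp, "%PDF-1.4 example");

$streamed = new stdClass();
$streamed->received_to_file = true;
$streamed->content_tmp_file = $tmp;
$streamed->content = null;

echo readDocumentContent($streamed), "\n";
unlink($tmp);
```

Inside a real handleDocumentInfo() the same check would be applied to the object the crawler passes in. For large files it is probably better to open the temporary file with fopen() and process it in chunks, since reading it back in one piece claims the memory that streaming was meant to save.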

Please note that this setting doesn't affect the link-finding results; file-streams will also be checked for links.

A common setup may look like this example:

// Basically let the crawler receive every content (default-setting)
$crawler->addReceiveContentType("##");

// Tell the crawler to stream everything but "text/html"-documents to a tmp-file
$crawler->addStreamToFileContentType("#^((?!text/html).)*$#");
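
To see which content-types the negative-lookahead rule from the example above selects, it can be tried on its own with preg_match(). This is only a sketch: the sample content-type strings are illustrative, and it assumes the crawler applies each rule to the content-type value in a comparable way:

```php
<?php
// Rule from the example above: matches any string that does not contain "text/html".
$streamRule = "#^((?!text/html).)*$#";

// Sample content-type values (illustrative only).
$contentTypes = array(
    "text/html",
    "text/html; charset=UTF-8",
    "application/pdf",
    "image/jpeg",
);

$results = array();
foreach ($contentTypes as $type) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$type] = preg_match($streamRule, $type) === 1;
    echo $type, " => ", ($results[$type] ? "stream to file" : "receive to memory"), "\n";
}
```

With this rule only the "text/html" variants fail to match, so everything else would be a candidate for streaming to a temporary file.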