| Public Methods | |
|---|---|
| Basic settings | |
| getProcessReport | Returns summarizing report-information about the crawling-process after it has finished. |
| go | Starts the crawling process in single-process-mode. |
| goMultiProcessed | Starts the crawler by using multiple processes. |
| setFollowMode | Sets the basic follow-mode of the crawler. |
| setHTTPProtocolVersion | Sets the HTTP protocol version the crawler should use for requests. |
| setPort | Sets the port to connect to for crawling the starting-URL set in setURL(). |
| setURL | Sets the URL of the first page the crawler should crawl (root-page). |
| setUrlCacheType | Defines what type of cache will be internally used for caching URLs. |
| setWorkingDirectory | Sets the working-directory the crawler should use for storing temporary data. |
| Filter-settings | |
| addContentTypeReceiveRule | Adds a rule to the list of rules that decide which pages or files - regarding their content-type - should be received. |
| addURLFilterRule | Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler. |
| addURLFollowRule | Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly. |
| obeyNoFollowTags | Decides whether the crawler should obey "nofollow"-tags. |
| obeyRobotsTxt | Defines whether the crawler should parse and obey robots.txt-files. |
| Overridable methods / User data-processing | |
| handleDocumentInfo | Override this method to get access to all information about a page or file the crawler found and received. |
| handleHeaderInfo | Overridable method that will be called after the header of a document was received and BEFORE the content will be received. |
| initChildProcess | Overridable method that will be called by every used child-process just before it starts the crawling-procedure. |
| Limit-settings | |
| setContentSizeLimit | Sets the content-size-limit for content the crawler should receive from documents. |
| setCrawlingDepthLimit | Sets the maximum crawling depth. |
| setRequestDelay | Sets a delay for every HTTP-request the crawler executes. |
| setRequestLimit | Sets a limit to the total number of requests the crawler should execute. |
| setTrafficLimit | Sets a limit to the number of bytes the crawler should receive altogether during the crawling-process. |
| Linkfinding settings | |
| addLinkSearchContentType | Adds a rule to the list of rules that decide in what kind of documents the crawler should search for links (regarding their content-type). |
| enableAggressiveLinkSearch | Enables or disables aggressive link-searching. |
| excludeLinkSearchDocumentSections | Defines the sections of HTML-documents that will get ignored by the link-finding algorithm. |
| setLinkExtractionTags | Sets the list of html-tags the crawler should search for links in. |
| Process resumption | |
| enableResumption | Prepares the crawler for process-resumption. |
| getCrawlerId | Returns the unique ID of the instance of the crawler. |
| resume | Resumes the crawling-process with the given crawler-ID. |
| Other settings | |
| addBasicAuthentication | Adds a basic-authentication (username and password) to the list of basic authentications that will be sent with requests. |
| addLinkPriority | Adds a regular expression together with a priority-level to the list of rules that decide what links should be preferred. |
| addPostData | Adds post-data together with a URL-rule to the list of post-data to send with requests. |
| addStreamToFileContentType | Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file. |
| enableCookieHandling | Enables or disables cookie-handling. |
| requestGzipContent | Enables support/requests for gzip-encoded content. |
| setConnectionTimeout | Sets the timeout in seconds for connection attempts to hosting webservers. |
| setFollowRedirects | Defines whether the crawler should follow redirects sent with headers by a webserver or not. |
| setFollowRedirectsTillContent | Defines whether the crawler should follow HTTP-redirects until first content was found, regardless of defined filter-rules and follow-modes. |
| setProxy | Assigns a proxy-server the crawler should use for all HTTP-requests. |
| setStreamTimeout | Sets the timeout in seconds for waiting for data on an established server-connection. |
| setUserAgentString | Sets the "User-Agent" identification-string that will be sent with HTTP-requests. |
| Deprecated | |
| addFollowMatch | Alias for addURLFollowRule(). (deprecated!) |
| addLinkExtractionTags | Sets the list of html-tags from which links should be extracted. (deprecated!) |
| addNonFollowMatch | Alias for addURLFilterRule(). (deprecated!) |
| addReceiveContentType | Alias for addContentTypeReceiveRule(). (deprecated!) |
| addReceiveToMemoryMatch | Has no function anymore! (deprecated!) |
| addReceiveToTmpFileMatch | Alias for addStreamToFileContentType(). (deprecated!) |
| disableExtendedLinkInfo | Has no function anymore. (deprecated!) |
| getReport | Returns an array with summarizing report-information after the crawling-process has finished. (deprecated!) |
| setAggressiveLinkExtraction | Alias for enableAggressiveLinkSearch(). (deprecated!) |
| setCookieHandling | Alias for enableCookieHandling(). (deprecated!) |
| setPageLimit | Alias for setRequestLimit() method. (deprecated!) |
| setTmpFile | Has no function anymore. (deprecated!) |
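
Putting several of the methods listed above together, the sketch below shows one typical minimal setup: extend the crawler class, override handleDocumentInfo() to process every received document, add a few filter and limit rules, start the process with go(), and read the summary from getProcessReport(). The class name (PHPCrawler), the include path, and the property names used on the document-info and report objects are illustrative assumptions and may differ in the actual library version.

```php
<?php
// Minimal usage sketch. Assumed: the library class is named "PHPCrawler",
// the include path below, and the $DocInfo/$report property names.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // Called for every page or file the crawler found and received.
    function handleDocumentInfo($DocInfo)
    {
        // Property names are assumptions used for illustration.
        echo "Page requested: " . $DocInfo->url .
             " (" . $DocInfo->http_status_code . ")\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("www.example.com");                     // root-page to start from
$crawler->addContentTypeReceiveRule("#text/html#");      // only receive HTML documents
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$#i");  // ignore links to images
$crawler->enableCookieHandling(true);                    // handle cookies during the crawl
$crawler->setTrafficLimit(1000 * 1024);                  // stop after roughly 1 MB of received data
$crawler->go();                                          // single-process mode

// Summarizing report after the process has finished
// (report property names are assumptions as well).
$report = $crawler->getProcessReport();
echo "Links followed:     " . $report->links_followed . "\n";
echo "Documents received: " . $report->files_received . "\n";
```

For larger crawls, go() can be replaced by goMultiProcessed(), and an interrupted run can be continued with enableResumption(), getCrawlerId() and resume(), as described in the table above.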