Web crawler FAQ
Looking for the Elastic web crawler? See the Elastic web crawler documentation.
View frequently asked questions about the App Search web crawler below. See the Web crawler reference for detailed technical information about the web crawler.
What functionality is supported?
- Crawling HTTP/HTTPS websites: Includes support for both publicly accessible and private/intranet websites. Self-signed SSL certificates and custom Certificate Authorities are supported.
- Support for crawling multiple domains per engine
- Robots meta tag support
- Robots "nofollow" support: Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes (see the HTML example after this list).
- Robots.txt support: The web crawler honors directives within robots.txt files (see the robots.txt example after this list).
- Sitemap support: The web crawler honors XML sitemaps and fetches sitemaps identified within robots.txt files. Additional sitemaps can also be managed for the domain through the domain dashboard (see the sitemap example after this list).
- Configurable content extraction: The web crawler extracts a predefined set of fields (URL, body content, etc.) from each page it visits. In addition, the crawler supports extracting dynamic fields from meta tags (see the meta tag example after this list).
- Entry points: Entry points allow customers to specify where the web crawler begins crawling each domain.
- Crawl rules: Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.
- Logging of each crawl: Logs are representative of an entire crawl, which encompasses all domains in an engine.
- Automatic crawling: Configure the cadence for new crawls to start automatically if there isn’t an active crawl.
- User interfaces and APIs for managing domains, entry points, and crawl rules: Crawler configuration can be managed via the App Search dashboard UIs or via a set of public APIs provided by the product (see the API example after this list).
- Crawl persistence: The crawler uses Elasticsearch to maintain its state during an active crawl, allowing crawls to be migrated between instances in case of an instance failure or a restart of the Enterprise Search instance running the crawl. Each unique URL is visited only once, thanks to the seen URLs list persisted in Elasticsearch. Crawl-specific indexes are automatically cleaned up after a crawl is finished.
- Crawling websites behind authentication: The web crawler can crawl content protected by HTTP authentication or content sitting behind an HTTP proxy (with or without authentication). See the authentication example after this list and the related sections of the Web crawler reference.
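
For illustration, here is a minimal HTML fragment showing the "nofollow" signals mentioned above. The page and link URLs are placeholders.

```html
<!-- Page-level directive: do not follow any links found on this page -->
<meta name="robots" content="nofollow">

<!-- Link-level directive: do not follow this specific link -->
<a href="https://example.com/private" rel="nofollow">Private area</a>
```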
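A minimal robots.txt sketch showing the kinds of directives the crawler honors. The paths and sitemap URL are illustrative placeholders.

```txt
User-agent: *
Disallow: /admin/
Allow: /

# Sitemaps listed here can be discovered and fetched by the crawler
Sitemap: https://example.com/sitemap.xml
```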
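A minimal XML sitemap sketch of the sort the crawler consumes. The URLs are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/blog/first-post</loc>
  </url>
</urlset>
```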
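As a sketch of dynamic field extraction from meta tags, a page might expose a custom field as shown below. The class name, attribute names, and field name here are assumptions made for illustration; consult the Web crawler reference for the exact tag syntax the crawler recognizes.

```html
<head>
  <!-- Hypothetical example: expose a custom "product_price" field to the crawler.
       The class/attribute syntax is an assumption; check the Web crawler reference. -->
  <meta class="elastic" name="product_price" content="99.99">
</head>
```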
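As a sketch of the public APIs, listing the domains configured for an engine might look like the request below. The endpoint path, port, engine name, and key are assumptions for illustration; see the App Search web crawler API reference for the exact endpoints.

```shell
# Hypothetical request: list crawler domains for an engine (path, port, and key are placeholders)
curl -X GET "http://localhost:3002/api/as/v1/engines/my-engine/crawler/domains" \
  -H "Authorization: Bearer private-xxxxxxxxxxxx" \
  -H "Content-Type: application/json"
```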
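For crawling websites behind authentication (and the supported methods noted in the next section), the crawler authenticates with standard HTTP headers. The sketch below shows what those headers look like when fetching a protected page; the credentials and URLs are placeholders, and how you supply credentials to the crawler is covered in the Web crawler reference.

```shell
# HTTP Basic authentication: "user:password" base64-encoded in the Authorization header
curl -H "Authorization: Basic dXNlcjpwYXNzd29yZA==" https://intranet.example.com/page

# Authentication header method (for example, a bearer token)
curl -H "Authorization: Bearer my-placeholder-token" https://intranet.example.com/page
```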
 
What functionality is not supported?
- Single-page app (SPA) support: The crawler cannot currently crawl pages that are pure JavaScript single-page apps. We recommend looking at dynamic rendering to help the crawler properly index your JavaScript websites (see the sketch after this list).
- Form-based authentication: The crawler does not support form-based authentication. It supports only basic authentication and authentication header (e.g. bearer token) methods, as illustrated in the authentication example above.
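
As a rough sketch of dynamic rendering, a web server can detect crawler user agents and serve them pre-rendered HTML while regular browsers still receive the single-page app. Everything below is an assumption made for illustration (the user agent pattern, the prerender service address, and the nginx layout); it is not part of the crawler itself.

```nginx
# Hypothetical nginx fragment (belongs inside the http {} context).
# Requests from crawler user agents are proxied to a prerendering service;
# the user agent pattern and upstream address are placeholders.
map $http_user_agent $is_crawler {
    default 0;
    "~*Elastic" 1;
}

server {
    listen 80;
    root /var/www/my-spa;

    location / {
        if ($is_crawler) {
            proxy_pass http://127.0.0.1:3000;  # prerender service (assumption)
        }
        try_files $uri /index.html;  # normal SPA fallback for browsers
    }
}
```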