Advanced Crawler Configuration
Overview
This guide explains advanced configuration for the Fess crawler. For basic crawler configuration, refer to Basic Crawler Configuration.
Warning
The settings on this page can affect the entire system. Thoroughly test any changes before applying them to production environments.
General Settings
Configuration File Locations
Detailed crawler settings are configured in the following files:
Main configuration:
/etc/fess/fess_config.properties (or app/WEB-INF/classes/fess_config.properties)
Content length configuration:
app/WEB-INF/classes/crawler/contentlength.xml
Component configuration:
app/WEB-INF/classes/crawler/container.xml
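As a minimal sketch, assuming a package-based install where Fess runs as a systemd service, a typical edit cycle looks like this:
# Edit the main configuration (path differs for ZIP installs)
sudo vi /etc/fess/fess_config.properties
# Most fess_config.properties changes take effect only after a restart
sudo systemctl restart fess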
Default Script
Configure the default script language for the crawler.
| Property | Description | Default |
|---|---|---|
| crawler.default.script | Crawler script language | groovy |
crawler.default.script=groovy
HTTP Thread Pool
HTTP crawler thread pool settings.
| Property | Description | Default |
|---|---|---|
| crawler.http.thread_pool.size | HTTP thread pool size | 0 |
# 0 means auto-configuration
crawler.http.thread_pool.size=0
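If auto-configuration creates more threads than a small host should run, the pool can also be pinned to a fixed size (the value 8 below is only an illustration):
# Pin the HTTP thread pool to a fixed size instead of auto-configuring
crawler.http.thread_pool.size=8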
Document Processing Settings
Basic Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.max.site.length | Maximum length of the document site field (characters) | 100 |
| crawler.document.site.encoding | Document site encoding | UTF-8 |
| crawler.document.unknown.hostname | Alternative value for unknown hostname | unknown |
| crawler.document.use.site.encoding.on.english | Use site encoding for English documents | false |
| crawler.document.append.data | Append data to document | true |
| crawler.document.append.filename | Append filename to document | false |
Configuration Example
crawler.document.max.site.length=100
crawler.document.site.encoding=UTF-8
crawler.document.unknown.hostname=unknown
crawler.document.use.site.encoding.on.english=false
crawler.document.append.data=true
crawler.document.append.filename=false
Word Processing Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.max.alphanum.term.size | Maximum alphanumeric word length | 20 |
| crawler.document.max.symbol.term.size | Maximum symbol word length | 10 |
| crawler.document.duplicate.term.removed | Remove duplicate words | false |
Configuration Example
# Change maximum alphanumeric length to 50 characters
crawler.document.max.alphanum.term.size=50
# Change maximum symbol length to 20 characters
crawler.document.max.symbol.term.size=20
# Remove duplicate words
crawler.document.duplicate.term.removed=true
Note
Increasing max.alphanum.term.size allows indexing long IDs, tokens, URLs, etc. in their complete form, but increases index size.
Character Processing Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.space.chars | Whitespace character definition | \u0009\u000A... |
| crawler.document.fullstop.chars | Period character definition | \u002e\u06d4... |
Configuration Example
# Default values (includes Unicode characters)
crawler.document.space.chars=\u0009\u000A\u000B\u000C\u000D\u001C\u001D\u001E\u001F\u0020\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u202F\u205F\u3000\uFEFF\uFFFD\u00B6
crawler.document.fullstop.chars=\u002e\u06d4\u2e3c\u3002
Protocol Settings
Supported Protocols
| Property | Description | Default |
|---|---|---|
| crawler.web.protocols | Web crawl protocols | http,https |
| crawler.file.protocols | File crawl protocols | file,smb,smb1,ftp,storage |
Configuration Example
crawler.web.protocols=http,https
crawler.file.protocols=file,smb,smb1,ftp,storage
Environment Variable Parameters
| Property | Description | Default |
|---|---|---|
| crawler.data.env.param.key.pattern | Environment variable parameter key pattern | ^FESS_ENV_.* |
# Environment variables starting with FESS_ENV_ can be used in crawl configuration
crawler.data.env.param.key.pattern=^FESS_ENV_.*
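For example, a variable exported in the crawler's environment before startup (the name below is hypothetical) matches the default pattern and becomes available to crawl configurations:
# Hypothetical variable name; any name matching ^FESS_ENV_.* qualifies
export FESS_ENV_API_TOKEN=secret-token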
robots.txt Settings
| Property | Description | Default |
|---|---|---|
| crawler.ignore.robots.txt | Ignore robots.txt | false |
| crawler.ignore.robots.tags | Robots tags to ignore | (empty) |
| crawler.ignore.content.exception | Ignore content exceptions | true |
# Ignore robots.txt (not recommended)
crawler.ignore.robots.txt=false
# Ignore specific robots tags
crawler.ignore.robots.tags=
# Ignore content exceptions
crawler.ignore.content.exception=true
Warning
Setting crawler.ignore.robots.txt=true may violate site terms of service. Exercise caution when crawling external sites.
Error Handling Settings
| Property | Description | Default |
|---|---|---|
| crawler.failure.url.status.codes | HTTP status codes considered failures | 404 |
# Treat 403 as error in addition to 404
crawler.failure.url.status.codes=404,403
System Monitoring Settings
| Property | Description | Default |
|---|---|---|
| crawler.system.monitor.interval | System monitoring interval (seconds) | 60 |
# Monitor system every 30 seconds
crawler.system.monitor.interval=30
Hot Thread Settings
| Property | Description | Default |
|---|---|---|
| crawler.hotthread.ignore_idle_threads | Ignore idle threads | true |
| crawler.hotthread.interval | Snapshot interval | 500ms |
| crawler.hotthread.snapshots | Number of snapshots | 10 |
| crawler.hotthread.threads | Number of threads to monitor | 3 |
| crawler.hotthread.timeout | Timeout | 30s |
| crawler.hotthread.type | Monitoring type | cpu |
Configuration Example
crawler.hotthread.ignore_idle_threads=true
crawler.hotthread.interval=500ms
crawler.hotthread.snapshots=10
crawler.hotthread.threads=3
crawler.hotthread.timeout=30s
crawler.hotthread.type=cpu
Metadata Settings
| Property | Description | Default |
|---|---|---|
| crawler.metadata.content.excludes | Metadata to exclude | resourceName,X-Parsed-By... |
| crawler.metadata.name.mapping | Metadata name mapping | title=title:string... |
# Metadata to exclude
crawler.metadata.content.excludes=resourceName,X-Parsed-By,Content-Encoding.*,Content-Type.*,X-TIKA.*,X-FESS.*
# Metadata name mapping
crawler.metadata.name.mapping=\
title=title:string\n\
Title=title:string\n\
dc:title=title:string
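Each mapping entry takes the form metadataName=fieldName:fieldType, with entries separated by \n, as in the example above. As a sketch, an additional metadata name found in crawled documents (dc:description below is an assumption about the source content) could be mapped onto the existing digest field:
# Hypothetical: map a dc:description meta tag into the digest field
crawler.metadata.name.mapping=\
dc:description=digest:string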
HTML Crawler Settings
XPath Settings
XPath settings for extracting HTML elements.
| Property | Description | Default |
|---|---|---|
| crawler.document.html.content.xpath | Content XPath | //BODY |
| crawler.document.html.lang.xpath | Language XPath | //HTML/@lang |
| crawler.document.html.digest.xpath | Digest XPath | //META[@name='description']/@content |
| crawler.document.html.canonical.xpath | Canonical URL XPath | //LINK[@rel='canonical'][1]/@href |
Configuration Example
# Default settings
crawler.document.html.content.xpath=//BODY
crawler.document.html.lang.xpath=//HTML/@lang
crawler.document.html.digest.xpath=//META[@name='description']/@content
crawler.document.html.canonical.xpath=//LINK[@rel='canonical'][1]/@href
Custom XPath Examples
# Extract only specific div element as content
crawler.document.html.content.xpath=//DIV[@id='main-content']
# Include meta keywords in digest
crawler.document.html.digest.xpath=//META[@name='description']/@content|//META[@name='keywords']/@content
HTML Tag Processing
| Property | Description | Default |
|---|---|---|
| crawler.document.html.pruned.tags | HTML tags to remove | noscript,script,style,header,footer,aside,nav,a[rel=nofollow] |
| crawler.document.html.max.digest.length | Maximum digest length | 120 |
| crawler.document.html.default.lang | Default language | (empty) |
Configuration Example
# Add tags to remove
crawler.document.html.pruned.tags=noscript,script,style,header,footer,aside,nav,a[rel=nofollow],form
# Set digest length to 200 characters
crawler.document.html.max.digest.length=200
# Set default language to Japanese
crawler.document.html.default.lang=ja
URL Pattern Filters
| Property | Description | Default |
|---|---|---|
| crawler.document.html.default.include.index.patterns | URL patterns to include in index | (empty) |
| crawler.document.html.default.exclude.index.patterns | URL patterns to exclude from index | (?i).*(css\|js\|jpeg...) |
| crawler.document.html.default.include.search.patterns | URL patterns to include in search results | (empty) |
| crawler.document.html.default.exclude.search.patterns | URL patterns to exclude from search results | (empty) |
Configuration Example
# Default exclusion patterns
crawler.document.html.default.exclude.index.patterns=(?i).*(css|js|jpeg|jpg|gif|png|bmp|wmv|xml|ico|exe)
# Index only specific paths
crawler.document.html.default.include.index.patterns=https://example\\.com/docs/.*
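The search-pattern properties accept the same regular expressions but are applied at query time rather than at indexing time, so matching pages remain in the index while being hidden from results. For example (the URL below is illustrative):
# Illustrative: keep already-indexed /internal/ pages out of search results
crawler.document.html.default.exclude.search.patterns=https://example\\.com/internal/.*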
File Crawler Settings
Basic Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.file.name.encoding | Filename encoding | (empty) |
| crawler.document.file.no.title.label | Label for files without title | No title. |
| crawler.document.file.ignore.empty.content | Ignore empty content | false |
| crawler.document.file.max.title.length | Maximum title length | 100 |
| crawler.document.file.max.digest.length | Maximum digest length | 200 |
Configuration Example
# Process Windows-31J filenames
crawler.document.file.name.encoding=Windows-31J
# Label for files without title
crawler.document.file.no.title.label=No Title
# Ignore empty files
crawler.document.file.ignore.empty.content=true
# Title and digest lengths
crawler.document.file.max.title.length=200
crawler.document.file.max.digest.length=500
Content Processing
| Property | Description | Default |
|---|---|---|
| crawler.document.file.append.meta.content | Append metadata to content | true |
| crawler.document.file.append.body.content | Append body to content | true |
| crawler.document.file.default.lang | Default language | (empty) |
Configuration Example
crawler.document.file.append.meta.content=true
crawler.document.file.append.body.content=true
crawler.document.file.default.lang=ja
File URL Pattern Filters
| Property | Description | Default |
|---|---|---|
| crawler.document.file.default.include.index.patterns | Patterns to include in index | (empty) |
| crawler.document.file.default.exclude.index.patterns | Patterns to exclude from index | (empty) |
| crawler.document.file.default.include.search.patterns | Patterns to include in search results | (empty) |
| crawler.document.file.default.exclude.search.patterns | Patterns to exclude from search results | (empty) |
Configuration Example
# Index only specific extensions
crawler.document.file.default.include.index.patterns=.*\\.(pdf|docx|xlsx|pptx)$
# Exclude temp folders
crawler.document.file.default.exclude.index.patterns=.*/temp/.*
Cache Settings
Document Cache
| Property | Description | Default |
|---|---|---|
| crawler.document.cache.enabled | Enable document cache | true |
| crawler.document.cache.max.size | Maximum cache size (bytes) | 2621440 (2.5MB) |
| crawler.document.cache.supported.mimetypes | MIME types to cache | text/html |
| crawler.document.cache.html.mimetypes | MIME types to treat as HTML | text/html |
Configuration Example
# Enable document cache
crawler.document.cache.enabled=true
# Set cache size to 5MB
crawler.document.cache.max.size=5242880
# MIME types to cache
crawler.document.cache.supported.mimetypes=text/html,application/xhtml+xml
# MIME types to treat as HTML
crawler.document.cache.html.mimetypes=text/html,application/xhtml+xml
Note
Enabling the document cache adds a cache link to each search result, allowing users to view the content as it was at crawl time.
JVM Options
You can configure JVM options for the crawler process.
| Property | Description | Default |
|---|---|---|
| jvm.crawler.options | Crawler JVM options | -Xms128m -Xmx512m... |
Default Settings
jvm.crawler.options=-Xms128m -Xmx512m \
-XX:MaxMetaspaceSize=128m \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=60000 \
-XX:-HeapDumpOnOutOfMemoryError
Key Options Explained
| Option | Description |
|---|---|
| -Xms128m | Initial heap size (128MB) |
| -Xmx512m | Maximum heap size (512MB) |
| -XX:MaxMetaspaceSize=128m | Maximum Metaspace size (128MB) |
| -XX:+UseG1GC | Use G1 garbage collector |
| -XX:MaxGCPauseMillis=60000 | GC pause time goal (60 seconds) |
| -XX:-HeapDumpOnOutOfMemoryError | Disable heap dump on OutOfMemoryError |
Custom Configuration Examples
For crawling large files:
jvm.crawler.options=-Xms256m -Xmx2g \
-XX:MaxMetaspaceSize=256m \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=60000
For debugging:
jvm.crawler.options=-Xms128m -Xmx512m \
-XX:MaxMetaspaceSize=128m \
-XX:+UseG1GC \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/tmp/crawler_dump.hprof
For details, see Memory Configuration.
Performance Tuning
Optimizing Crawl Speed
1. Adjust Thread Count
Increase parallel crawl count to improve crawl speed.
# Adjust thread count in crawl configuration on administration screen
Thread Count: 10
However, be mindful of load on target servers.
2. Adjust Timeouts
For slow-responding sites, increase the timeouts (values are in milliseconds).
# Add to "Configuration Parameters" in crawl configuration
client.connectionTimeout=10000
client.socketTimeout=30000
3. Exclude Unnecessary Content
Excluding images, CSS, JavaScript files, etc. improves crawl speed.
# Exclude URL patterns
.*\.(jpg|jpeg|png|gif|css|js|ico)$
4. Retry Settings
Adjust the retry count and the retry interval (in milliseconds) used when requests fail.
# Add to "Configuration Parameters" in crawl configuration
client.maxRetry=3
client.retryInterval=1000
Optimizing Memory Usage
1. Adjust Heap Size
jvm.crawler.options=-Xms256m -Xmx1g
2. Adjust Cache Size
# 1MB (inline comments after a value are not supported in properties files)
crawler.document.cache.max.size=1048576
3. Exclude Large Files
# Add to "Configuration Parameters" in crawl configuration
# 10MB
client.maxContentLength=10485760
For details, see Memory Configuration.
Improving Index Quality
1. Optimize XPath
Exclude unnecessary elements (navigation, ads, etc.).
crawler.document.html.content.xpath=//DIV[@id='main-content']
crawler.document.html.pruned.tags=noscript,script,style,header,footer,aside,nav,form,iframe
2. Optimize Digest
crawler.document.html.max.digest.length=200
3. Metadata Mapping
crawler.metadata.name.mapping=\
title=title:string\n\
description=digest:string\n\
keywords=label:string
Troubleshooting
Memory Shortage
Symptoms:
OutOfMemoryError recorded in fess_crawler.log
Crawling stops midway
Solutions:
Increase crawler heap size
jvm.crawler.options=-Xms256m -Xmx2g
Reduce parallel thread count
Exclude large files
For details, see Memory Configuration.
Slow Crawling
Symptoms:
Crawling takes too long
Frequent timeouts
Solutions:
Increase thread count (be mindful of target server load)
Adjust timeouts
client.connectionTimeout=5000
client.socketTimeout=10000
Exclude unnecessary URLs
Specific Content Cannot Be Extracted
Symptoms:
Page text not extracted correctly
Important information not included in search results
Solutions:
Check and adjust XPath
crawler.document.html.content.xpath=//DIV[@class='content']
Check pruned tags
crawler.document.html.pruned.tags=script,style
For content dynamically generated by JavaScript, consider alternative methods (API crawling, etc.)
Character Encoding Issues
Symptoms:
Character encoding issues in search results
Specific languages not displayed correctly
Solutions:
Check encoding settings
crawler.document.site.encoding=UTF-8
crawler.crawling.data.encoding=UTF-8
Configure filename encoding
crawler.document.file.name.encoding=Windows-31J
Check logs for encoding errors
grep -i "encoding" /var/log/fess/fess_crawler.log
Best Practices
Verify in Test Environment
Thoroughly test in a test environment before applying to production.
Gradual Adjustments
Don’t change settings drastically at once; adjust gradually and verify effectiveness.
Monitor Logs
After changing settings, monitor logs to check for errors or performance issues.
tail -f /var/log/fess/fess_crawler.log
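To focus on problems only, filter the same log for warning and error levels:
grep -E "WARN|ERROR" /var/log/fess/fess_crawler.log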
Backups
Always back up configuration files before making changes.
cp /etc/fess/fess_config.properties /etc/fess/fess_config.properties.bak
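A dated copy makes it easy to keep several generations of backups side by side:
cp /etc/fess/fess_config.properties /etc/fess/fess_config.properties.$(date +%Y%m%d)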
Documentation
Document the settings you changed and the reasons why.
References
Basic Crawler Configuration
Thumbnail Configuration
Memory Configuration
Log Configuration
Advanced Search Settings