Basic Crawler Configuration
Overview
The Fess crawler is a feature that automatically collects content from websites, file systems, and other sources and registers it in the search index. This guide explains the basic concepts and configuration methods for the crawler.
Basic Crawler Concepts
What is a Crawler
A crawler is a program that automatically collects content by following links starting from specified URLs or file paths.
The Fess crawler has the following features:
Multi-protocol support: HTTP/HTTPS, file systems, SMB, FTP, etc.
Scheduled execution: Periodic automatic crawling
Incremental crawling: Updates only changed content
Parallel processing: Simultaneous crawling of multiple URLs
Robots.txt compliance: Respects robots.txt
Crawler Types
Fess provides the following crawler types depending on the target:
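Web Crawler: Collects content from websites over HTTP/HTTPS
File System Crawler: Collects files from local file systems and shared folders (SMB, FTP)
Data Store Crawler: Collects records from databases, CSV files, and other structured data sources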
Creating Crawl Configurations
Adding Basic Crawl Configuration
Access the Administration Screen
Access http://localhost:8080/admin in your browser and log in as administrator.
Open Crawler Configuration Screen
Select “Crawler” → “Web” or “File System” from the left menu.
Create New Configuration
Click the “New” button.
Enter Basic Information
Name: Identifier for the crawl configuration (e.g., Corporate Wiki)
URL: Crawl start URL (e.g., https://wiki.example.com/)
Crawl Interval: Crawl execution frequency (e.g., every hour)
Thread Count: Number of parallel crawls (e.g., 5)
Depth: Link traversal depth (e.g., 3)
Save
Click the “Create” button to save the configuration.
Web Crawler Configuration Examples
Crawling Internal Intranet Site
Name: Corporate Portal
URL: http://intranet.example.com/
Crawl Interval: Once per day
Thread Count: 10
Depth: Unlimited (-1)
Maximum Access Count: 10000
Crawling Public Website
Name: Product Site
URL: https://www.example.com/products/
Crawl Interval: Once per week
Thread Count: 5
Depth: 5
Maximum Access Count: 1000
File Crawler Configuration Examples
Local File System
Name: Documents Folder
URL: file:///home/share/documents/
Crawl Interval: Once per day
Thread Count: 3
Depth: Unlimited (-1)
Authentication Configuration
To access sites or file servers that require authentication, configure authentication credentials.
Select “Crawler” → “Authentication” in the administration screen
Click “New”
Enter authentication information:
Hostname: wiki.example.com
Port: 443
Authentication Method: Basic Authentication
Username: crawler_user
Password: ********
Click “Create”
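After saving, you can confirm that the credentials work from the machine running Fess; a quick check against the example host above (the password here is a placeholder):
curl -s -o /dev/null -w "%{http_code}\n" -u crawler_user:PASSWORD https://wiki.example.com/
A 200 response means Basic authentication succeeded; 401 means the credentials were rejected.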
Running Crawls
Manual Execution
To run a configured crawl immediately:
Select the target configuration in the crawl configuration list
Click the “Start” button
Check job execution status in the “Scheduler” menu
Scheduled Execution
To run crawls periodically:
Open the “Scheduler” menu
Select the “Default Crawler” job
Set the schedule expression (Cron format)
# Run daily at 2 AM
0 0 2 * * ?
# Run every hour at 0 minutes
0 0 * * * ?
# Run at 6 PM Monday through Friday
0 0 18 ? * MON-FRI
Click “Update”
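The expressions above use six fields: seconds, minutes, hours, day of month, month, and day of week. The ? means "no specific value" and is used because day of month and day of week cannot both be specified.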
Checking Crawl Status
To check running crawl status:
Open the “Scheduler” menu
Check running jobs
Check details in logs:
tail -f /var/log/fess/fess_crawler.log
Basic Configuration Items
Restricting Crawl Targets
Restrictions by URL Pattern
You can restrict crawling to specific URL patterns or exclude them.
Include URL patterns (regular expressions):
# Crawl only under /docs/
https://example\.com/docs/.*
Exclude URL patterns (regular expressions):
# Exclude specific directories
.*/admin/.*
.*/private/.*
# Exclude specific file extensions
.*\.(jpg|png|gif|css|js)$
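Since these patterns are regular expressions, it can help to sanity-check one from a shell before saving it; a minimal check with grep -E (POSIX ERE, which treats these simple patterns the same way):
# Matches, so the URL would be excluded
echo "https://example.com/logo.png" | grep -E '\.(jpg|png|gif|css|js)$'
# No match, so the URL would be crawled
echo "https://example.com/docs/index.html" | grep -E '\.(jpg|png|gif|css|js)$'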
Depth Restriction
Restrict the depth of link traversal:
0: Start URL only
1: Start URL and pages linked from it
-1: Unlimited (follow all links)
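For example, with a start URL of https://example.com/ and a depth of 1, the crawler fetches the start page and the pages it links to directly, but not pages reachable only through those linked pages.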
Maximum Access Count
Upper limit on the number of pages to crawl:
Maximum Access Count: 1000
Stops after crawling 1000 pages.
Parallel Crawl Count (Thread Count)
Specifies the number of URLs to crawl simultaneously.
Warning
Setting the thread count too high places excessive load on the crawl target server. Choose an appropriate value.
Crawl Interval
Specifies the frequency of crawl execution.
# Time specification
Crawl Interval: 3600000 # Milliseconds (1 hour)
# Or set in scheduler
0 0 2 * * ? # Daily at 2 AM
File Size Configuration
You can set upper limits for crawled file sizes.
Maximum File Size to Retrieve
Add the following to “Configuration Parameters” in crawler configuration:
client.maxContentLength=10485760
Retrieves files up to 10 MB (10485760 bytes = 10 × 1024 × 1024). The default is unlimited.
Note
When crawling large files, also adjust memory settings. See Memory Configuration for details.
Maximum File Size to Index
You can set upper limits for indexing sizes by file type.
Default values:
HTML files: 2.5MB
Other files: 10MB
Configuration file: app/WEB-INF/classes/crawler/contentlength.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE components PUBLIC "-//DBFLUTE//DTD LastaDi 1.0//EN"
    "http://dbflute.org/meta/lastadi10.dtd">
<components namespace="fessCrawler">
    <include path="crawler/container.xml" />
    <component name="contentLengthHelper"
        class="org.codelibs.fess.crawler.helper.ContentLengthHelper" instance="singleton">
        <property name="defaultMaxLength">10485760</property><!-- 10M -->
        <postConstruct name="addMaxLength">
            <arg>"text/html"</arg>
            <arg>2621440</arg><!-- 2.5M -->
        </postConstruct>
        <postConstruct name="addMaxLength">
            <arg>"application/pdf"</arg>
            <arg>5242880</arg><!-- 5M -->
        </postConstruct>
    </component>
</components>
This adds a configuration to process PDF files up to 5MB.
Warning
When raising the maximum file sizes, also increase the crawler memory settings.
Word Length Restrictions
Overview
Long alphanumeric strings and long runs of symbols inflate the index and degrade performance. Therefore, Fess applies the following limits by default:
Consecutive alphanumeric characters: Up to 20 characters
Consecutive symbols: Up to 10 characters
Configuration Method
Edit fess_config.properties.
Default settings:
crawler.document.max.alphanum.term.size=20
crawler.document.max.symbol.term.size=10
Example: Relaxing restrictions
crawler.document.max.alphanum.term.size=50
crawler.document.max.symbol.term.size=20
Note
If you need to search for long alphanumeric strings (e.g., serial numbers, tokens), increase these values. Note, however, that the index size will grow accordingly.
Proxy Configuration
Overview
When crawling external sites from within an intranet, requests may be blocked by a firewall. In such cases, crawl via a proxy server.
Configuration Method
Add the following to “Configuration Parameters” in the crawl configuration on the administration screen.
Basic proxy configuration:
client.proxyHost=proxy.example.com
client.proxyPort=8080
Authenticated proxy:
client.proxyHost=proxy.example.com
client.proxyPort=8080
client.proxyUsername=proxyuser
client.proxyPassword=proxypass
Exclude specific hosts from proxy:
client.nonProxyHosts=localhost|127.0.0.1|*.example.com
System-Wide Proxy Configuration
To use the same proxy for all crawl configurations, configure via environment variables.
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
export no_proxy=localhost,127.0.0.1,.example.com
robots.txt Configuration
Overview
robots.txt is a file that instructs crawlers whether crawling is allowed. Fess respects robots.txt by default.
Configuration Method
To ignore robots.txt, edit fess_config.properties.
crawler.ignore.robots.txt=true
Warning
When crawling external sites, respect robots.txt. Ignoring it may place excessive load on servers or violate terms of service.
User-Agent Configuration
You can change the crawler’s User-Agent.
Configuration in Administration Screen
Add to “Configuration Parameters” in crawl configuration:
client.userAgent=MyCompanyCrawler/1.0
System-Wide Configuration
Configure in fess_config.properties:
crawler.user.agent=MyCompanyCrawler/1.0
Encoding Configuration
Crawl Data Encoding
Configure in fess_config.properties:
crawler.crawling.data.encoding=UTF-8
Filename Encoding
Filename encoding for file systems:
crawler.document.file.name.encoding=UTF-8
Crawl Troubleshooting
Crawl Does Not Start
Checks:
Verify scheduler is enabled
Check if “Default Crawler” job is enabled in “Scheduler” menu
Verify crawl configuration is enabled
Check if target configuration is enabled in crawl configuration list
Check logs
tail -f /var/log/fess/fess.log
tail -f /var/log/fess/fess_crawler.log
Crawl Stops Midway
Possible causes:
Memory shortage
Check for OutOfMemoryError in fess_crawler.log
Increase crawler memory (see Memory Configuration)
Network errors
Adjust timeout settings
Check retry settings
Crawl target errors
Check if 404 errors are occurring frequently
Check error details in logs
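To narrow down the cause, you can search the crawler log directly; for example:
# Count out-of-memory occurrences
grep -c "OutOfMemoryError" /var/log/fess/fess_crawler.log
# Show the 20 most recent error lines
grep -i "error" /var/log/fess/fess_crawler.log | tail -n 20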
Specific Page Not Crawled
Checks:
Check URL patterns
Verify page is not matched by exclude URL patterns
Check robots.txt
Check the target site's /robots.txt (a quick way to fetch it is shown after this list)
Check authentication
For pages requiring authentication, verify authentication settings
Depth restriction
Verify link depth does not exceed depth restriction
Maximum access count
Verify maximum access count has not been reached
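For the robots.txt check above, fetching the file directly shows which paths the crawler is allowed to access; using the example host from earlier:
curl -s https://wiki.example.com/robots.txt
Look for Disallow rules that match the path of the missing page.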
Slow Crawling
Countermeasures:
Increase thread count
Increase parallel crawl count (but be mindful of target server load)
Exclude unnecessary URLs
Add images and CSS files to exclude URL patterns
Adjust timeout settings
For slow-responding sites, shorten the timeout so stalled requests are abandoned sooner
Increase crawler memory
Best Practices
Crawl Configuration Recommendations
Set appropriate thread count
Set an appropriate thread count to avoid placing excessive load on target servers.
Optimize URL patterns
Exclude unnecessary files (images, CSS, JavaScript, etc.) to reduce crawl time and improve index quality.
Set depth restrictions
Set appropriate depth based on site structure. Use unlimited (-1) only when crawling the entire site.
Set maximum access count
Set an upper limit to avoid crawling unexpectedly large numbers of pages.
Adjust crawl interval
Set appropriate intervals based on update frequency.
Frequently updated sites: Every 1 hour to several hours
Infrequently updated sites: Once per day to once per week
Schedule Configuration Recommendations
Night execution
Execute during low server load times (e.g., 2 AM).
Avoid duplicate execution
Configure the schedule so that the next crawl starts only after the previous one has completed.
Error notifications
Configure email notifications for crawl failures.
References
Advanced Crawler Configuration
Thumbnail Configuration
Memory Configuration
Log Configuration