Data Store Crawling

Overview

Fess can crawl data sources such as databases and CSV files. This section explains the configuration required for data store crawling.

Management Operations

Display Configurations

To access the list page for configuring data stores, click on [Crawler > Data Store] in the left menu.

To edit a configuration, click on the configuration name.

Create Configuration

To open the configuration page for data stores, click on the “Create New” button.

Configurations

Name

Specify the name of the crawl configuration.

Handler Name

Specify the name of the handler used to process the data store.

DatabaseDataStore: Crawl a database
CsvDataStore: Crawl CSV/TSV files
CsvListDataStore: Crawl the files whose paths are listed in a CSV file

Parameter

Specify parameters related to the data store.

Script

Specify how values retrieved from the data store should be assigned to fields. Expressions can be written in Groovy.

Boost

Specify the boost value for documents crawled with this configuration.

Permissions

Specify the permissions for this configuration. Each entry is a prefix followed by a name: {user}username for a user, {role}rolename for a role, and {group}groupname for a group. For example, to display search results to users belonging to the “developer” group, specify {group}developer.
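The prefix notation above can be illustrated with a small sketch. The helper name permission_entry and the sample names “taro” and “admin” are hypothetical and only show how the three kinds of entries are composed; they are not part of Fess.

```python
# Hypothetical helper (not part of Fess): compose permission entries
# in the {user}/{group}/{role} prefix notation described above.
VALID_KINDS = ("user", "group", "role")


def permission_entry(kind: str, name: str) -> str:
    """Return a permission string such as '{group}developer'."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown permission kind: {kind!r}")
    if not name:
        raise ValueError("name must not be empty")
    return "{" + kind + "}" + name


if __name__ == "__main__":
    # "taro" and "admin" are made-up sample names.
    for kind, name in [("group", "developer"), ("user", "taro"), ("role", "admin")]:
        print(permission_entry(kind, name))
```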

Virtual Hosts

Specify the hostname of the virtual host. Refer to the Virtual Host section of the Configuration Guide for more details.

Status

Specify whether to enable or disable this crawl configuration.

Description

Enter a description.

Delete Configuration

To delete a configuration, click the configuration name on the list page, and then click the delete button. A confirmation dialog appears; press the delete button in the dialog to delete the configuration.

Example

DatabaseDataStore

This section describes database crawling.

As an example, assume that the following table exists in a MySQL database named “testdb”, and that the database can be accessed with the username “hoge” and the password “fuga”. The rest of this section is based on this scenario.

CREATE TABLE doc (
    id BIGINT NOT NULL AUTO_INCREMENT,
    title VARCHAR(100) NOT NULL,
    content VARCHAR(255) NOT NULL,
    latitude VARCHAR(20),
    longitude VARCHAR(20),
    versionNo INTEGER NOT NULL,
    PRIMARY KEY (id)
);

Populate the table with the following data:

INSERT INTO doc (title, content, latitude, longitude, versionNo) VALUES ('Title 1', 'This is content 1.', '37.77493', '-122.419416', 1);
INSERT INTO doc (title, content, latitude, longitude, versionNo) VALUES ('Title 2', 'This is content 2.', '34.701909', '135.494977', 1);
INSERT INTO doc (title, content, latitude, longitude, versionNo) VALUES ('Title 3', 'This is content 3.', '-33.868901', '151.207091', 1);
INSERT INTO doc (title, content, latitude, longitude, versionNo) VALUES ('Title 4', 'This is content 4.', '51.500152', '-0.113736', 1);
INSERT INTO doc (title, content, latitude, longitude, versionNo) VALUES ('Title 5', 'This is content 5.', '35.681137', '139.766084', 1);

Parameter

Here is an example of parameter configuration:

driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/testdb?useUnicode=true&characterEncoding=UTF-8
username=hoge
password=fuga
sql=select * from doc

Parameters are written in “key=value” format. The keys are described below:

driver	JDBC driver class
url	JDBC connection URL
username	Username for accessing the database
password	Password for accessing the database
sql	SQL statement that retrieves the data to crawl

Table: Parameter for DB
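To make the “key=value” format concrete, the following is a minimal sketch of how such lines could be read into a dictionary. It is an illustration only and does not show how Fess itself parses parameters; the function name parse_parameters is hypothetical. The point worth noting is that the url value contains “=” characters of its own, so only the first “=” on each line is treated as the separator.

```python
def parse_parameters(text: str) -> dict[str, str]:
    """Parse 'key=value' lines into a dict, splitting on the first '=' only."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        params[key.strip()] = value.strip()
    return params


# The parameter example from this section.
EXAMPLE = """\
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/testdb?useUnicode=true&characterEncoding=UTF-8
username=hoge
password=fuga
sql=select * from doc
"""

if __name__ == "__main__":
    for key, value in parse_parameters(EXAMPLE).items():
        print(f"{key} -> {value}")
```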

Script

Here is an example of script configuration:

url="http://SERVERNAME/" + id
host="SERVERNAME"
site="SERVERNAME"
title=title
content=content
cache=content
digest=content
anchor=
content_length=content.length()
last_modified=new java.util.Date()
location=latitude + "," + longitude
latitude=latitude
longitude=longitude

Each line is written in “key=value” format. The values are written in Groovy; enclose string literals in double quotation marks. Database column names can be used as variables to retrieve the corresponding values. The keys are described below:

url	URL (set a URL that can access the data in your environment)
host	Host name
site	Site path
title	Title
content	Content
cache	Content cache (not a search target)
digest	Description displayed on the search results page
anchor	Links contained in this document (optional)
content_length	Content length
last_modified	Last modified date of this document

Table: Configuration for Script
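To show what the example script produces, the sketch below emulates it in Python for the first sample row. Fess evaluates the script as Groovy, so this is only an approximation of the field mapping. The function name build_document is hypothetical, and the id value 1 assumes that AUTO_INCREMENT started at 1 on a fresh table.

```python
from datetime import datetime, timezone

SERVER = "SERVERNAME"  # placeholder host name, as in the script example


def build_document(row: dict) -> dict:
    """Emulate the example script for one row of the doc table."""
    content = row["content"]
    return {
        "url": f"http://{SERVER}/" + str(row["id"]),
        "host": SERVER,
        "site": SERVER,
        "title": row["title"],
        "content": content,
        "cache": content,
        "digest": content,
        # "anchor" is left empty in the example script, so it is omitted here.
        "content_length": len(content),
        "last_modified": datetime.now(timezone.utc),
        "location": row["latitude"] + "," + row["longitude"],
        "latitude": row["latitude"],
        "longitude": row["longitude"],
    }


# First row of the sample data; id=1 is an assumption (fresh AUTO_INCREMENT).
ROW = {
    "id": 1,
    "title": "Title 1",
    "content": "This is content 1.",
    "latitude": "37.77493",
    "longitude": "-122.419416",
    "versionNo": 1,
}

if __name__ == "__main__":
    for field, value in build_document(ROW).items():
        print(f"{field}: {value}")
```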

JDBC Driver

A JDBC driver is required to connect to the database. Place the driver's JAR file in the app/WEB-INF/lib directory.

CsvDataStore

This section describes CSV file crawling.

For example, create a test.csv file with the following content in the /home/taro/csv directory. Set the file encoding to Shift_JIS.

1,Title 1,This is test 1.
2,Title 2,This is test 2.
3,Title 3,This is test 3.
4,Title 4,This is test 4.
5,Title 5,This is test 5.
6,Title 6,This is test 6.
7,Title 7,This is test 7.
8,Title 8,This is test 8.
9,Title 9,This is test 9.

Parameter

Here’s an example of parameter configuration:

directories=/home/taro/csv
fileEncoding=Shift_JIS

The parameters follow the “key=value” format. The key descriptions are as follows:

directories	Directories that contain CSV files (.csv or .tsv)
files	CSV files (to specify individual files)
fileEncoding	Encoding of the CSV files
separatorCharacter	Separator character used in the CSV content

Table: Parameters for CSV file

Script

Here’s an example of script configuration:

url="http://SERVERNAME/" + cell1
host="SERVERNAME"
site="SERVERNAME"
title=cell2
content=cell3
cache=cell3
digest=cell3
anchor=
content_length=cell3.length()
last_modified=new java.util.Date()

Each line follows the “key=value” format. The keys are the same as those used for database crawling. The columns of each CSV row are available as cell[n] (cell1, cell2, and so on), where n starts from 1. If a cell in the CSV file is empty, its value may be null.
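The cell[n] naming can be sketched as follows. This is an illustration of the 1-based numbering and of empty cells becoming null, not the Fess implementation. The function name rows_as_cells and its separator argument (loosely mirroring separatorCharacter) are hypothetical.

```python
import csv
import io


def rows_as_cells(data: bytes, encoding: str = "Shift_JIS", separator: str = ","):
    """Yield each CSV row as a dict keyed cell1, cell2, ... (1-based)."""
    text = io.StringIO(data.decode(encoding), newline="")
    for row in csv.reader(text, delimiter=separator):
        if not row:
            continue  # skip blank lines
        # Empty cells become None, mirroring the note that they may be null.
        yield {
            f"cell{i}": (value if value != "" else None)
            for i, value in enumerate(row, start=1)
        }


# Two sample rows; the second one has an empty third cell.
SAMPLE = "1,Title 1,This is test 1.\n2,Title 2,\n".encode("shift_jis")

if __name__ == "__main__":
    for cells in rows_as_cells(SAMPLE):
        print(cells)
```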

CsvListDataStore

This is used when crawling a large number of files. By providing a CSV file that contains paths to updated files, you can crawl only the specified paths, reducing the crawl execution time.

The format for specifying paths is as follows:

[Action]<Separator character>[Path]

You can use one of the following actions:

create: When a file is created.
modify: When a file is updated.
delete: When a file is deleted.

The path is specified in the same format as paths for file crawling. Use the notation “file:/[path]” or “smb://[path]”.

For example, create a test.csv file with the following content in the /home/taro/csv directory. Set the file encoding to Shift_JIS.

modify,smb://servername/data/testfile1.txt
modify,smb://servername/data/testfile2.txt
modify,smb://servername/data/testfile3.txt
modify,smb://servername/data/testfile4.txt
modify,smb://servername/data/testfile5.txt
modify,smb://servername/data/testfile6.txt
modify,smb://servername/data/testfile7.txt
modify,smb://servername/data/testfile8.txt
modify,smb://servername/data/testfile9.txt
modify,smb://servername/data/testfile10.txt
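Before handing such a list to the crawler, it can be useful to check that every line follows the [Action]<Separator character>[Path] format. The following is a minimal sketch of such a check. The function name parse_path_list is hypothetical, and the strictness of the checks (exactly two columns, paths starting with “file:/” or “smb://”) is an assumption based on the description above rather than on Fess behavior.

```python
import csv
import io

ACTIONS = {"create", "modify", "delete"}
PATH_PREFIXES = ("file:/", "smb://")


def parse_path_list(text: str, separator: str = ","):
    """Return (action, path) tuples from '[Action]<separator>[Path]' lines."""
    entries = []
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        action, path = row[0].strip(), row[1].strip()
        if action not in ACTIONS:
            raise ValueError(f"line {line_no}: unknown action {action!r}")
        if not path.startswith(PATH_PREFIXES):
            raise ValueError(f"line {line_no}: unsupported path {path!r}")
        entries.append((action, path))
    return entries


if __name__ == "__main__":
    sample = (
        "modify,smb://servername/data/testfile1.txt\n"
        "modify,smb://servername/data/testfile2.txt\n"
    )
    for action, path in parse_path_list(sample):
        print(action, path)
```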

Parameters

Here’s an example of parameter configuration:

directories=/home/taro/csv
fileEncoding=Shift_JIS

The parameters follow the “key=value” format. The key descriptions are as follows:

directories	Directories that contain CSV files (.csv or .tsv)
fileEncoding	Encoding of the CSV files
separatorCharacter	Separator character used in the CSV content

Table: Parameter for CsvListDataStore

Script

Here’s an example of script configuration:

event_type=cell1
url=cell2

The parameters follow the “key=value” format. The keys are the same as those used for database crawling.

If authentication is required for crawling, the following settings are also necessary:

crawler.file.auth=example
crawler.file.auth.example.scheme=SAMBA
crawler.file.auth.example.username=username
crawler.file.auth.example.password=password
