Part 19: CSV file crawl
CSV files turn up in many situations, such as managing data in Excel or exporting data from a service you use. Fess can crawl CSV files as data sources. This article explains how to crawl a CSV file.
Introduction
A CSV file is text data in which fields are separated by commas. In Fess, you can specify the delimiter character, so a tab-delimited TSV file can be handled in the same way as a CSV file. Other settings, beyond the delimiter, let you support a variety of formats.
Fess’s CSV file crawl indexes each record as a single document.
This article covers “CsvDatastore”, which creates an index from the contents of a CSV file, and “CsvListDatastore”, which crawls the paths listed in a CSV file.
The examples use Fess 13.4.3. Get the Fess ZIP file from the download page.
CsvDatastore
First, let’s talk about CsvDatastore.
With CsvDatastore, you can create an index with the contents described in the CSV file, so you can freely set the title and URL of the indexed document.
This time, we crawl a CSV file with the following contents. The CSV file is placed in the “/home/taro/csv” directory and saved with the “Shift_JIS” encoding.
1,title 1,test1
2,title 2,test2
3,title 3,test3
4,title 4,test4
5,title 5,test5
6,title 6,test6
7,title 7,test7
8,title 8,test8
9,title 9,test9
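If you want to reproduce this sample, a small script can generate it. The following is a minimal Python sketch; the file name `sample.csv` and the helper name `write_sample_csv` are assumptions made for illustration and are not part of Fess.

```python
import csv
from pathlib import Path


def write_sample_csv(directory, filename="sample.csv", rows=9):
    """Write the sample records (id, title, body) encoded as Shift_JIS."""
    target = Path(directory) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with target.open("w", encoding="shift_jis", newline="") as f:
        writer = csv.writer(f)
        for i in range(1, rows + 1):
            writer.writerow([i, f"title {i}", f"test{i}"])
    return target
```

Calling `write_sample_csv("/home/taro/csv")` would place the file in the directory used in this article.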
CsvDatastore crawl settings
After starting Fess, log in to the administration screen and open [Crawler] > [Datastore]. Click “New” in the upper right of the screen to open the crawl settings for the datastore. This time, set the following four items and leave the others at their defaults.
Name
Handler name
Parameter
Script
Enter a name for the crawl settings in Name and select “CsvDatastore” for the handler name. The parameters are described as follows.
directories=/home/taro/csv
fileEncoding=Shift_JIS
As shown above, parameters are written in “key=value” format. The available keys are as follows.
Key | Explanation |
---|---|
directories | Directory containing CSV file (.csv or .tsv) |
files | CSV file (when specified directly) |
fileEncoding | CSV file encoding |
separatorCharacter | Delimiter |
quoteDisabled | Disable enclosing characters (true by default) |
skipLines | Number of rows to skip |
If you set a directory in “directories”, all CSV and TSV files in that directory are crawled. If you want to crawl individual files, specify them directly in “files”.
“separatorCharacter” sets the delimiter. If the file uses a delimiter other than a comma, specify it here. For example, if the delimiter is a tab, write “separatorCharacter=\t” in the parameters.
“quoteDisabled” controls whether enclosing (quote) characters are used. If you set “quoteDisabled=false”, enclosing characters found in the CSV data are stripped when crawling. The enclosing character is usually the double quote (").
“skipLines” excludes the specified number of lines from the start of the file. For example, if you specify “skipLines=1”, the first line of the CSV file is skipped.
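To get a feel for what separatorCharacter, quoteDisabled, and skipLines do to a record, here is a rough Python analogue built on the standard `csv` module. It is only a sketch of the behaviour described above, not how Fess itself parses files; the function name `read_records` and its arguments are assumptions for illustration.

```python
import csv
import io


def read_records(text, separator=",", quote_disabled=True, skip_lines=0):
    """Rough Python analogue of separatorCharacter, quoteDisabled and skipLines."""
    if quote_disabled:
        # Quote characters are treated as ordinary data.
        reader = csv.reader(io.StringIO(text), delimiter=separator,
                            quoting=csv.QUOTE_NONE)
    else:
        # Double quotes enclose fields and are stripped from the values.
        reader = csv.reader(io.StringIO(text), delimiter=separator,
                            quotechar='"')
    rows = list(reader)
    # Drop the first skip_lines rows, e.g. a header line.
    return rows[skip_lines:]


sample = 'id\ttitle\tbody\n1\t"title 1"\ttest1\n'
print(read_records(sample, separator="\t", quote_disabled=False, skip_lines=1))
# [['1', 'title 1', 'test1']]
print(read_records(sample, separator="\t", quote_disabled=True, skip_lines=1))
# [['1', '"title 1"', 'test1']]
```

The second call shows why quoted exports usually need “quoteDisabled=false”: otherwise the quote characters end up in the indexed values.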
The script is written as follows.
url="http://localhost/" + cell1
host="localhost"
site="localhost"
title=cell2
content=cell3
cache=cell3
digest=cell3
anchor=
content_length=cell3.length()
last_modified=new java.util.Date()
Like the parameters, the script is written in “key=value” format. The keys that can be set in the script are as follows.
Key | Explanation |
---|---|
url | URL (link displayed in search results) |
host | Host name |
site | Site path |
title | Title |
content | Document contents (index target string) |
cache | Document cache (not indexed) |
digest | Digest part displayed in search results |
anchor | Links included in the document (usually not required) |
content_length | Document length |
last_modified | Date and time the document was last updated |
The values in the script are written in Groovy. Enclose strings in double quotes. The data in each CSV column is available as cell[number] (numbering starts from 1, so cell1 is the first column). Note that a value may be null if the corresponding cell in the CSV file has no data.
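The mapping from cells to index fields, including the null caveat, can be simulated outside Fess. The following Python sketch mimics the script above for a single record; it is an illustration only, and `build_document` and its fallback to empty strings are assumptions rather than Fess behaviour.

```python
from datetime import datetime


def build_document(cells):
    """Map one CSV record to index fields, mimicking cell1, cell2, cell3."""
    def cell(n):
        # Cells are 1-based; a missing or empty cell becomes None,
        # mirroring the null caveat for cells without data.
        value = cells[n - 1] if n <= len(cells) else None
        return value if value else None

    body = cell(3) or ""  # fall back to an empty string so len() is safe
    return {
        "url": "http://localhost/" + (cell(1) or ""),
        "host": "localhost",
        "site": "localhost",
        "title": cell(2) or "",
        "content": body,
        "cache": body,
        "digest": body,
        "content_length": len(body),
        "last_modified": datetime.now(),
    }
```

For the record `["1", "title 1", "test1"]` this yields the URL `http://localhost/1` and a content length of 5. A record with no third cell yields a content length of 0 instead of failing, which is the kind of guard worth considering when a script calls a method such as `cell3.length()`.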
Crawl execution
After registering the crawl settings, click “Start Now” under System > Scheduler > Default Crawler (the crawl takes a while to complete).
After crawling is complete, access http://localhost:8080/ and run a search. The records from the CSV file appear in the search results.
CsvListDatastore
Next, I will explain about CsvListDatastore.
CsvListDatastore is useful when crawling a large number of files. You place a CSV file that lists the paths of updated files, and only those paths are crawled, which shortens the crawl execution time.
The CSV file that lists the paths uses the following format.
[action]<Delimiter>[path]
Specify one of the following for the action.
create: The file was created
modify: The file was updated
delete: The file was deleted
For the path, use the same notation as when specifying a path to crawl in a file crawl. For example, specify “file:/[path]” or “smb://[path]”.
This time, create a CSV file with the following contents to crawl local files.
modify,file:/home/taro/data/testfile1.txt
modify,file:/home/taro/data/testfile2.txt
modify,file:/home/taro/data/testfile3.txt
modify,file:/home/taro/data/testfile4.txt
modify,file:/home/taro/data/testfile5.txt
modify,file:/home/taro/data/testfile6.txt
modify,file:/home/taro/data/testfile7.txt
modify,file:/home/taro/data/testfile8.txt
modify,file:/home/taro/data/testfile9.txt
modify,file:/home/taro/data/testfile10.txt
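A list like this is usually generated rather than written by hand. As a minimal sketch, the Python function below walks a directory and emits a “modify” line for every file changed since a given time. The name `build_csv_list` and the use of modification times are assumptions for illustration; POSIX-style paths are assumed, and detecting created or deleted files would need a different source, such as an update log.

```python
import os


def build_csv_list(root, since_epoch, separator=","):
    """Return 'modify' lines for files under root changed at or after since_epoch."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.abspath(os.path.join(dirpath, name))
            if os.path.getmtime(path) >= since_epoch:
                # Same "file:/[path]" notation as the sample above
                # (POSIX paths assumed).
                lines.append(f"modify{separator}file:{path}")
    return lines
```

For example, `build_csv_list("/home/taro/data", time.time() - 86400)` (with `import time`) would list files modified in the last 24 hours; the resulting lines can then be written to a file in the directory that CsvListDatastore reads.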
CsvListDatastore crawl settings
Log in to the administration screen and open [Crawler] > [Datastore]. Click “New” in the upper right of the screen to open the crawl settings for the datastore. This time, set the following four items and leave the others at their defaults.
Name
Handler name
Parameter
Script
Enter a name for the crawl settings in Name and select “CsvListDatastore” for the handler name.
Specify the parameters as follows. Set “directories” to the path where the CSV file you created is placed.
directories=/opt/fess/csvlist
fileEncoding=Shift_JIS
The script description is as follows.
event_type=cell1
url=cell2
Crawl execution
After registering the crawl settings, click “Start Now” under System > Scheduler > Default Crawler, as you did for CsvDatastore.
After the crawl is complete, go to http://localhost:8080/ and try searching. The files listed in the CSV file appear in the search results.
This article explained how to crawl CSV files. With CsvDatastore, if the system you want to integrate has something like a CSV dump function, you can build a system linked with Fess.
CsvListDatastore supports search for large-scale systems: when there are many crawl targets, you can output a log of updated files from, for example, a NAS and crawl only those files. Depending on the settings, it should be usable for various purposes.