This is a cache of https://github.com/codelibs/fess-testdata. It is a snapshot of the page as it appeared on 2025-12-05T00:00:29.694+0000.
GitHub - codelibs/fess-testdata: Test Data Repository for Crawling/Parsing

codelibs/fess-testdata


Test Data Repository for Search Systems

Overview

A repository of test data for verifying whether search systems can crawl and index various file types. Feel free to submit a pull request if you have files you want to test.

Directory Structure

fess-testdata/
├── files/          # Test data files
├── docker/         # Docker configurations for crawling environments
├── tools/          # Utility scripts
└── build/          # Build-related files

How to Create Test Files

File Naming

Start the file name with the prefix "test" and use the appropriate file extension for the format (for example, test.pdf or test_utf8.txt).

File Content

Include the text "Lorem ipsum. (ロレム・イプサム) 吾輩は猫である。" in the content section of the file. Do not include this text in metadata sections, so that a match on it clearly shows the text was extracted from the content rather than from metadata.
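One way to use the marker when checking a crawler's output is sketched below. The function name `marker_location` and the idea of passing the extracted body text and the metadata values separately are assumptions made for illustration; they are not part of this repository or of any particular search system. The normalization step (NFKC plus whitespace collapsing) is likewise an assumption, added so that width or line-break differences introduced by an extractor do not hide a match.

```python
import unicodedata

# Marker text that every test file carries in its content section.
MARKER = "Lorem ipsum. (ロレム・イプサム) 吾輩は猫である。"


def _normalize(text: str) -> str:
    """Fold width variants (NFKC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def marker_location(body: str, metadata: dict[str, str]) -> str:
    """Report where the marker was found.

    Returns one of 'content', 'metadata', 'both', or 'missing'.
    `body` and `metadata` are hypothetical inputs: whatever your
    extractor produced for the document text and its metadata fields.
    """
    needle = _normalize(MARKER)
    in_body = needle in _normalize(body)
    in_meta = any(needle in _normalize(value) for value in metadata.values())
    if in_body and in_meta:
        return "both"
    if in_body:
        return "content"
    if in_meta:
        return "metadata"
    return "missing"
```

A result of 'content' is the expected outcome for a correctly built test file; 'metadata' or 'both' suggests the marker was placed in (or leaked into) a metadata field.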

Directory

Place files in the appropriate category directory under files/.
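The naming and directory rules above can be checked mechanically before opening a pull request. The following is a minimal sketch, not a tool shipped with this repository; the function name `check_test_file_path` and the exact wording of its messages are hypothetical.

```python
from pathlib import PurePosixPath


def check_test_file_path(path: str) -> list[str]:
    """Return a list of convention problems for a repository-relative path.

    An empty list means the path follows the conventions described above.
    """
    parts = PurePosixPath(path).parts
    problems = []
    # Expect files/<category>/<name>, i.e. at least three path components.
    if len(parts) < 3 or parts[0] != "files":
        problems.append("not inside a category directory under files/")
    name = parts[-1] if parts else ""
    if not name.startswith("test"):
        problems.append('file name does not start with "test"')
    # Require a dot that is neither the first nor the last character.
    if "." not in name.strip("."):
        problems.append("file name has no extension")
    return problems
```

For example, `check_test_file_path("files/pdf/test.pdf")` returns an empty list. Some existing entries, such as files/other/old_style.txt, do not carry the prefix, so treat the result as a hint rather than a hard rule.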

Test Data Files

Documents

Type Location
Text files/text/test_utf8.txt
HTML files/html/test.html
HTML files/html/test_utf8.html
HTML files/html/test_sjis.html
HTML files/html/test_hankaku.html
HTML files/html/test_nocharset.html
XML files/xml/test_utf8.xml
XML files/xml/test_sjis.xml
XML files/xml/test_entity.xml
XML files/xml/test.mm
PDF files/pdf/test.pdf
PDF files/pdf/test.ps
Markdown files/markdown/test.md
AsciiDoc files/markdown/test.adoc
reStructuredText files/markdown/test.rst
LaTeX files/latex/test.tex
EPUB files/ebook/test.epub
CHM files/help/test.chm

Office Documents

Type Location
MS Word files/msoffice/test.doc
MS Word files/msoffice/test.docx
MS Excel files/msoffice/test.xls
MS Excel files/msoffice/test.xlsx
MS PowerPoint files/msoffice/test.ppt
MS PowerPoint files/msoffice/test.pptx
MS Visio files/msoffice/test.vsdx
MS Project files/msoffice/test.mpp
MS Publisher files/msoffice/test.pub
RTF files/msoffice/test.rtf
OpenDocument Text files/opendocument/test.odt
OpenDocument Spreadsheet files/opendocument/test.ods
OpenDocument Presentation files/opendocument/test.odp
Apple Pages files/iwork/test.pages
Apple Numbers files/iwork/test.numbers
Apple Keynote files/iwork/test.key
Lotus 1-2-3 files/lotus/test.123
Hancom files/hancom/test.hwp
Ichitaro files/ichitaro/
DocuWorks files/docuworks/

Database

Type Location
MS Access files/database/test.accdb
MS Access (Legacy) files/database/test.mdb
FileMaker files/database/test.fmp12
dBase files/database/test.dbf

Media & Images

Type Location
PNG files/images/test.png
JPEG files/images/test.jpg
GIF files/images/test.gif
BMP files/images/test.bmp
TIFF files/images/test.tiff
SVG files/images/test.svg
MP3 files/media/test.mp3

Source Code

Type Location
C files/source_code/test.c
C++ files/source_code/test.cpp
Java files/source_code/test.java
JavaScript files/source_code/test.js
TypeScript files/source_code/test.ts
Python files/source_code/test.py
Ruby files/source_code/test.rb
Go files/source_code/test.go
Rust files/source_code/test.rs
Swift files/source_code/test.swift
Kotlin files/source_code/test.kt
PHP files/source_code/test.php
SQL files/source_code/test.sql
CSS files/source_code/test.css
SCSS files/source_code/test.scss

Scripts & Configuration

Type Location
Bash files/scripts/test.bash
Perl files/scripts/test.pl
Lua files/scripts/test.lua
PowerShell files/scripts/test.ps1
JSON files/config/test.json
YAML files/config/test.yaml
TOML files/config/test.toml
INI files/config/test.ini
Properties files/config/test.properties

Archives

Type Location
ZIP files/archive/test.zip
TAR files/archive/test.tar
TAR.GZ files/archive/test.tar.gz
BZ2 files/archive/test.txt.bz2
XZ files/archive/test.txt.xz

Email

Type Location
EML files/email/test.eml
MSG files/email/test.msg

Data

Type Location
CSV files/data/test.csv
TSV files/data/test.tsv
GeoJSON files/geodata/test.geojson
KML files/geodata/test.kml
Jupyter Notebook files/notebooks/test.ipynb
Log files/logs/test.log

Other

Type Location
Adobe Illustrator files/ai/test.ai
AutoCAD files/cad/
Font (TTF) files/fonts/test.ttf
ISO files/disk-images/test.iso
Patch files/patches/test.patch
Diff files/patches/test.diff
Old-style Characters files/other/old_style.txt
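
When comparing what a crawler indexed against what the repository contains, it can help to build an inventory of a local checkout. The sketch below assumes the checkout's files/ directory is laid out as in the tables above (one directory per category); the function name `inventory_by_category` is hypothetical and not part of the repository's tools.

```python
from collections import defaultdict
from pathlib import Path


def inventory_by_category(root: str) -> dict[str, list[str]]:
    """Map each category directory under `root` to its sorted file names.

    Files placed directly in `root` (outside any category) are skipped.
    """
    result = defaultdict(list)
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if len(parts) >= 2:  # <category>/.../<file>
            result[parts[0]].append(path.name)
    return {category: sorted(names) for category, names in sorted(result.items())}
```

Calling `inventory_by_category("files")` from the repository root yields a dictionary keyed by category name, which can then be compared with the documents a search system reports as indexed.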

Docker Environments

The docker/ directory contains Docker Compose configurations for setting up various data source crawling environments.

Environment Description
basic Basic Authentication
digest Digest Authentication
ldap LDAP
ftp FTP
samba Samba
webdav WebDAV
mysql MySQL
postgresql PostgreSQL
mariadb MariaDB
oracle Oracle
mssql SQL Server
db2 DB2
mongodb MongoDB
elasticsearch Elasticsearch
solr Solr
redis Redis
cassandra Cassandra
couchdb CouchDB
minio MinIO (S3 Compatible)
gitlab GitLab
gitea Gitea
redmine Redmine
wordpress WordPress
bugzilla Bugzilla
mantis MantisBT
taiga Taiga
keycloak Keycloak
authentik Authentik

Tools

The tools/ directory contains utility scripts for data store operations.

Script Description
csvdatastore.sh CSV Data Store
csvlistdatastore.sh CSV List Data Store
csvgeodatastore.sh CSV Geo Data Store
esdatastore.sh Elasticsearch Data Store
eslistdatastore.sh Elasticsearch List Data Store
create_roledata.sh Role Data Creation
encrypt_roles.sh Role Encryption
thumbnail_check.sh Thumbnail Check
