Part 7: Crawling Sites with Authentication
It has been a while since the previous installment. This time, we explain how to crawl sites that require authentication.
Many websites have restricted areas that are available only after logging in. Websites can authenticate users in several ways, and Fess can crawl sites protected by the following methods: Basic, Digest, NTLM, and Form authentication.
What is web authentication?
In Fess, web authentication refers to the authentication performed by a website that requires a login. A website can enable web authentication so that only certain users can access it.
There are various types of web authentication. Here we briefly explain the methods supported by Fess.
Basic authentication is one of the authentication methods defined in HTTP. The client gains access to the site by sending the credentials in the Authorization field of the HTTP request header. Like Basic authentication, Digest authentication and NTLM authentication are performed at the HTTP level.
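As a rough illustration of what "writing the authentication information in the Authorization field" means for Basic authentication, the following Python sketch builds the header value from a user name and password. The function name basic_auth_header is hypothetical and is not part of Fess. Fess builds this header itself when crawling.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value used by Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded.
    # Base64 is an encoding, not encryption, so Basic authentication
    # should only be used over HTTPS.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Sample credentials used throughout this article.
header = basic_auth_header("testuser", "testpass")
print("Authorization:", header)
```

Because the value can be decoded by anyone who sees the request, the password is effectively sent in the clear unless the connection is encrypted.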
Form authentication differs from the methods above: the user logs in through a login form rather than through an HTTP authentication mechanism, and the system then identifies the user by cookies and similar information. Many web applications use Form authentication.
How to set up web authentication
This section shows how to crawl sites that use web authentication. We use Fess 12.3.1 here. The Fess ZIP file can be obtained from the download page. Extract the ZIP file and run bin/fess.[sh|bat] to start Fess.
First, open the Fess administration screen in your browser and create a crawl configuration under “Crawl” > “Web”. Create this web crawl configuration just as you would for a normal site crawl.
Select “Crawl” > “Web Authentication” from the menu on the left to display the web authentication setting list screen.
Press the “New” button on the upper right to display the screen for creating a web authentication setting. The main setting items are as follows.
Item | Description |
---|---|
Hostname | Host name of the target site (matches any host name if omitted) |
Port | Port number of the target site (matches any port number if omitted) |
Realm | Realm of the target site (matches any realm name if omitted) |
Scheme | Authentication method |
Username | User name for logging in to the target site |
Password | Password for logging in to the target site |
Parameters | Additional settings required to log in to the target site, if any |
Web Settings | Name of the web crawl configuration that uses this authentication setting |
The following are examples of settings for crawling sites with Basic, Digest, NTLM, and Form authentication.
Basic authentication
Consider crawling a site protected by Basic authentication with the following settings.
Item | Value |
---|---|
URL | https://basic.codelibs.org/ |
Username | testuser |
Password | testpass |
If you create a crawl configuration with the name BasicAuth Example, configure the following for web authentication.
Item | Value |
---|---|
Hostname | basic.codelibs.org |
Port | (omitted) |
Realm | (omitted) |
Scheme | Basic |
Username | testuser |
Password | testpass |
Parameters | (not entered) |
Web Settings | BasicAuth Example |
The host name can be omitted. If a single crawl configuration needs several web authentication settings, specify the host name in each one so that the correct credentials are used for each site.
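The effect of omitting the host name, port, or realm can be pictured with the following Python sketch. It is only an illustrative model of "an omitted item matches anything". The class AuthSetting and the function select_setting are hypothetical names, and this is not how Fess is implemented internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthSetting:
    scheme: str
    username: str
    hostname: Optional[str] = None  # None stands for "omitted"
    port: Optional[int] = None
    realm: Optional[str] = None

    def matches(self, host: str, port: int, realm: str) -> bool:
        # An omitted item matches any value; a specified item must be equal.
        return (
            self.hostname in (None, host)
            and self.port in (None, port)
            and self.realm in (None, realm)
        )


def select_setting(settings, host, port, realm):
    """Return the first setting that matches the target, or None."""
    return next((s for s in settings if s.matches(host, port, realm)), None)


# Two settings attached to the same crawl configuration,
# told apart by their host names.
settings = [
    AuthSetting("Basic", "testuser", hostname="basic.codelibs.org"),
    AuthSetting("Digest", "testuser", hostname="digest.codelibs.org"),
]
chosen = select_setting(settings, "digest.codelibs.org", 443, "example realm")
print(chosen.scheme)
```

With both host names omitted, the first setting would match every site, which is why the host name matters once several settings share one crawl configuration.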
Digest authentication
Consider crawling a site protected by Digest authentication with the following settings.
Item | Value |
---|---|
URL | https://digest.codelibs.org/ |
Username | testuser |
Password | testpass |
If you create a crawl configuration with the name DigestAuth Example, configure the following for web authentication.
Item | Value |
---|---|
Scheme | Digest |
Username | testuser |
Password | testpass |
Parameters | (not entered) |
Web Settings | DigestAuth Example |
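For readers curious about what the Digest scheme involves, the following Python sketch computes the response value a client sends, following the MD5 and qop=auth case of RFC 2617. Fess performs this exchange itself, so nothing like this needs to be configured. The function names are hypothetical, and the realm, nonce, and cnonce values below are made-up placeholders. In practice the server supplies the realm and nonce in its challenge.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the Digest 'response' value (RFC 2617, MD5, qop=auth)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # who is asking
    ha2 = md5_hex(f"{method}:{uri}")                 # what is requested
    # The password never travels over the wire; only this hash does.
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


# Placeholder challenge values, for illustration only.
response = digest_response(
    "testuser", "example realm", "testpass",
    "GET", "/", nonce="abcdef0123456789", nc="00000001", cnonce="0a4f113b",
)
print(response)
```

Unlike Basic authentication, only a hash derived from the password and the server's nonce is sent, not the password itself.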
NTLM authentication
Crawl a site with NTLM authentication configured with the following settings:
Item | Value |
---|---|
URL | https://ntlm.codelibs.org/ |
Username | testuser |
Password | testpass |
If you create a crawl configuration with the name NTLMAuth Example, configure the following for web authentication.
Item | Value |
---|---|
Scheme | NTLM |
Username | testuser |
Password | testpass |
Parameters | Fill in as needed |
Web Settings | NTLMAuth Example |
For NTLM authentication, the workstation name and domain name can be set with the workstation and domain parameters, respectively. Set these values to match the target environment, and enter them in the Parameters field as follows.
workstation = HOGE
domain = FUGA
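The Parameters field holds one key and value pair per line. As a minimal sketch of how such text can be read into a dictionary, assuming that each line is split at the first "=" and that surrounding whitespace is ignored, consider the following Python function. The name parse_parameters is hypothetical and this is not Fess's own parser.

```python
def parse_parameters(text: str) -> dict:
    """Parse 'key = value' lines into a dictionary (illustrative only)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Split only at the first "=" so that values may contain "=".
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


ntlm_params = parse_parameters("workstation = HOGE\ndomain = FUGA")
print(ntlm_params)
```

Splitting only at the first "=" matters for values that themselves contain "=", such as the login_parameters value used for Form authentication.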
Form authentication
Sites implement Form authentication in many different ways. As an example, we explain how to crawl Redmine, a web application for project management. Assume that Redmine is available with the following settings.
Item | Value |
---|---|
URL | https://redmine.codelibs.org/ |
Username | testuser |
Password | testpass |
If you create a crawl configuration with the name Redmine Example, configure the following for web authentication.
Item | Value |
---|---|
Scheme | Form |
Username | testuser |
Password | testpass |
Parameters | See the lines listed after this table |
Web Settings | Redmine Example |
Enter the following in the Parameters field, one setting per line.
encoding=UTF-8
token_method=GET
token_url=https://redmine.codelibs.org/login
token_pattern=name="authenticity_token" +value="([^"]+)"
token_name=authenticity_token
login_method=POST
login_url=https://redmine.codelibs.org/login
login_parameters=username=${username}&password=${password}
Redmine uses authenticity_token as a transaction token, so it must be sent along with the login information when logging in. The value of authenticity_token can be obtained from the Redmine login screen. The parameters beginning with token_ tell Fess how to retrieve the token, and Fess uses them to get the value of authenticity_token.
The parameters beginning with login_ specify the information required to log in to the site. In login_url, specify the URL that processes the login request, and in login_parameters, specify the request parameters required for login. ${username} and ${password} are replaced with the username and password values of the web authentication setting.
Using the above information, Fess will automatically log in to the site when crawling and crawl the site with Form authentication.
Form authentication methods vary from website to website. When crawling a site with Form authentication, you need to check the HTML and HTTP headers on the login page and set the appropriate parameters.
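The token and login steps described above can be sketched in Python as follows. This is only an illustration of the idea under stated assumptions, not Fess's implementation. The function names are hypothetical, the sample HTML is made up, and no request is actually sent. The pattern and the parameter template are the token_pattern and login_parameters values from the Redmine example.

```python
import re
from string import Template
from urllib.parse import parse_qsl, urlencode

TOKEN_PATTERN = r'name="authenticity_token" +value="([^"]+)"'  # token_pattern
TOKEN_NAME = "authenticity_token"                               # token_name
LOGIN_PARAMETERS = "username=${username}&password=${password}"  # login_parameters


def extract_token(html: str, pattern: str = TOKEN_PATTERN):
    """Return the first capture group of the pattern, or None if absent."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


def build_login_body(username: str, password: str, token) -> str:
    """Fill in the placeholders and append the token as a form field."""
    fields = []
    for name, value in parse_qsl(LOGIN_PARAMETERS, keep_blank_values=True):
        filled = Template(value).substitute(username=username, password=password)
        fields.append((name, filled))
    if token is not None:
        fields.append((TOKEN_NAME, token))
    # urlencode escapes special characters in the credentials and token.
    return urlencode(fields)


# Made-up fragment of a login page, standing in for the page at token_url.
sample_html = '<input type="hidden" name="authenticity_token" value="abc123" />'
token = extract_token(sample_html)
body = build_login_body("testuser", "testpass", token)
print(body)
```

If the attribute order or quoting in the real login page differs from the sample, the pattern no longer matches, which is why the HTML of each target site has to be checked before writing token_pattern.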
Summary
This time, we introduced how Fess crawls sites that use various kinds of web authentication. Many sites require authentication, such as internal company sites and membership sites, and you will often want to make these sites searchable as well. Because Fess also supports Form authentication, you can build a search environment for a wide range of situations.