Open Crawler does not have official Windows support, but that doesn’t mean it won’t run on Windows! In this blog, we will explore using Docker to get Open Crawler up and running in your Windows environment.
We are going to explore two different ways of downloading and running Open Crawler on your system. Both methods will rely on Docker, and the instructions will be quite similar to what can be found in Open Crawler’s existing documentation. However, we will be sure to point out the (very minor!) modifications you must make to any commands or files to make standing up Open Crawler a smooth experience!
Prerequisites
Before getting started, make sure you have the following installed on your Windows machine:
- git
- Docker Desktop
- Docker Desktop CLI (included with Docker Desktop)
- Docker Compose (included with Docker Desktop)
You can learn more about installing Docker Desktop here.
Furthermore, this blog assumes version `0.3.0` or newer of Open Crawler. Using the `:latest` tagged Docker image should result in at least version `0.3.0` as of the time of writing.
Creating a configuration YAML
Before getting into the different ways of running Open Crawler, you need to create a basic configuration file for it to use.
Using a text editor of your choice, create a new file called `crawl-config.yml` with the following content and save it somewhere accessible.
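Here is a minimal sketch of such a configuration, based on the examples in Open Crawler’s documentation; the domain URL is a placeholder you should swap for the site you want to crawl, and `output_sink: console` prints crawl results straight to the terminal:

```yaml
# Crawl a single (placeholder) domain.
domains:
  - url: https://example.com

# Print crawl results to the terminal rather than indexing them.
output_sink: console
```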
Running Open Crawler directly via Docker image
Step 1: Pull the Open Crawler Docker image
First, you must download the Open Crawler Docker image onto your local machine. The `docker pull` command can automatically download the latest Docker image. Run the following command in your command-line terminal:
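The image name below is the one published on Elastic’s Docker integrations page at the time of writing:

```shell
docker pull docker.elastic.co/integrations/crawler:latest
```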
If you are curious about all of the versions of Open Crawler that are available, or want to experience a snapshot build of Open Crawler, check out the Elastic Docker integrations page to see all of the available images.
After the command executes, you can run the `docker images` command to confirm the image is now in your local image list:
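```shell
docker images
```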
Step 2: Execute a crawl
Now that a configuration YAML has been made, you can use it to execute a crawl!
From the directory where your `crawl-config.yml` is saved, run the following command:
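A sketch of that command, adapting the `docker run` invocation from Open Crawler’s documentation to use a Windows-style local path:

```shell
docker run -v .\crawl-config.yml:/crawl-config.yml -it docker.elastic.co/integrations/crawler:latest bin/crawler crawl /crawl-config.yml
```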
Please be mindful of the use of Windows-style backslashes and Unix-style forward slashes in the command’s volume (`-v`) argument. The left-hand side of the colon is a Windows-style path (with a backslash), and the right-hand side has a forward slash. The `-v` argument maps a local file (`.\crawl-config.yml`) to a path inside the container (`/crawl-config.yml`).
Running Open Crawler with docker-compose
Step 1: Clone the repository
Use `git` to clone the Open Crawler repository into a directory of your choosing:
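```shell
git clone https://github.com/elastic/crawler.git
cd crawler
```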
Tip: Don’t forget, you can always fork the repository as well!
Step 2: Copy your configuration file into the config folder
At the top level of the `crawler` repository, you will see a directory called `config`. Copy the configuration YAML you created, `crawl-config.yml`, into this directory.
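For example, from the root of the cloned repository (assuming you saved `crawl-config.yml` one directory up):

```shell
copy ..\crawl-config.yml .\config\
```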
Step 3: Modify the docker-compose file
At the very top level of the `crawler` repository, you will find a file called `docker-compose.yml`. You will need to ensure the local configuration directory path under `volumes` is Windows-compliant. Using your favorite text editor, open `docker-compose.yml` and change `./config` to `.\config`:
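After the edit, the entry should look something like the snippet below. The container-side path on the right of the colon is illustrative here, so keep whatever your copy of the file already uses:

```yaml
volumes:
  # Left-hand side: Windows-compliant local path (changed).
  # Right-hand side: container path (keep as-is in your file).
  - .\config:/app/config
```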
This `volumes` configuration allows Docker to mount your local repository’s `config` folder to the Docker container, which will allow the container to see and use your configuration YAML.
The left-hand side of the colon is the local path to be mounted (hence why it must be Windows-compliant), and the right-hand side is the destination path in the container, which must be Unix-compliant.
Step 4: Spin up the container
Run the following command to bring up an Open Crawler container:
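```shell
docker-compose up -d
```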
You can verify the container is indeed running either in Docker Desktop (on the Containers page) or with the following command:
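```shell
docker ps
```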
Step 5: Execute a crawl command
Finally, you can execute a crawl! The following command will initiate a crawl in the running container that was just spun up:
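A sketch of that command; the container name `crawler` is an assumption based on the repository’s `docker-compose.yml`, so check the name shown by `docker ps` if it differs on your machine:

```shell
docker exec -it crawler bin/crawler crawl config/crawl-config.yml
```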
Here, the command uses only Unix-style forward slashes, because it is calling the Open Crawler CLI that resides inside the container.
Once the command begins running, you should see the output of a successful crawl! 🎉
The above console output has been shortened for brevity, but the main log lines you should look out for are here!
Conclusion
As you can see, it only takes a little mindfulness around Windows-style paths to make the Open Crawler Docker workflow compatible with Windows! As long as Windows paths use backslashes and Unix paths use forward slashes, you will be able to get Open Crawler working as well as it would in a Unix environment.
Now that you have Open Crawler running, check out the documentation in the repository to learn more about how to configure Open Crawler for your needs!