With Elastic Open Web Crawler and its CLI-driven architecture, having versioned crawler configurations and a CI/CD pipeline with local testing is now pretty straightforward to achieve.
Traditionally, managing crawlers was a manual, error-prone process. It involved editing configurations directly in the UI and struggling with cloning crawl configurations, rolling back, versioning, and more. Treating crawler configurations as code resolves this by providing the same benefits we expect in software development: repeatability, traceability, and automation.
This workflow makes it easier to bring the Open Web Crawler into your CI/CD pipeline for rollbacks, backups, and migrations—tasks that were much trickier with earlier Elastic Crawlers, such as the Elastic Web Crawler or App Search Crawler.
In this article, we are going to learn how to:
- Manage our crawl configs using GitHub
- Have a local setup to test pipelines before deploying
- Create a production setup to run the web crawler with new settings every time we push changes to our main branch
You can find the project repository here. As of this writing, I’m using Elasticsearch 9.1.3 and Open Web Crawler 0.4.2.
Prerequisites
- Docker Desktop
- Elasticsearch instance
- Virtual machine with SSH access (e.g., AWS EC2) and Docker installed
Steps
- Folder structure
- Crawler configuration
- Docker-compose file (local environment)
- GitHub Actions
- Testing locally
- Deploying to prod
- Making changes and re-deploying
Folder structure
For this project, we will have the following file structure:
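Based on the files we will create throughout the article, the layout looks roughly like this:

```
.
├── .github
│   └── workflows
│       └── deploy.yml      # GitHub Action that deploys and runs the crawler
├── config
│   └── crawler-config.yml  # Open Web Crawler configuration
├── docker-compose.yml      # local environment (Elasticsearch + Kibana + crawler)
└── local.sh                # helper script to test the crawl locally
```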
Crawler configuration
Under crawler-config.yml, we will put the following:
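Here is a minimal sketch of what that file can contain. The seed URLs and index name are illustrative, and the CRAWLER_ES_HOST / CRAWLER_ES_API_KEY placeholders are stand-ins that our scripts will replace at run time; double-check the option names against the Open Web Crawler example config for your version:

```yaml
# config/crawler-config.yml (sketch; verify option names against the
# Open Web Crawler example config for your version)
domains:
  - url: https://web-scraping.dev
    seed_urls:
      # Illustrative: the first three product pages
      - https://web-scraping.dev/product/1
      - https://web-scraping.dev/product/2
      - https://web-scraping.dev/product/3

# Crawl only the seed pages; don't follow the links found inside them
max_crawl_depth: 1

# Send the crawl results to Elasticsearch
output_sink: elasticsearch
output_index: web-crawler-index

elasticsearch:
  host: CRAWLER_ES_HOST        # placeholder, replaced per environment
  api_key: CRAWLER_ES_API_KEY  # placeholder, replaced per environment
  # add a separate port setting here if your crawler version expects one
```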
This will crawl from https://web-scraping.dev/products, a mock site for products. We will only crawl the first three product pages. The max_crawl_depth setting will prevent the crawler from discovering more pages than the ones defined as seed_urls by not opening the links within them. The Elasticsearch host and api_key will be populated dynamically depending on the environment in which we are running the script.

Docker-compose file (local environment)
For the local docker-compose.yml, we will deploy the crawler and a single-node Elasticsearch cluster plus Kibana, so we can easily visualize our crawling results before deploying to production.
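Here is a sketch of what that file can look like. Image tags, ports, and the crawler's mount path and CLI invocation are assumptions based on the Open Web Crawler's Docker instructions, so adjust them to your setup:

```yaml
# docker-compose.yml (local environment, sketch)
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:9.1.3
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # local only; keep security on in prod
    ports:
      - "9200:9200"
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200 >/dev/null || exit 1"]
      interval: 10s
      retries: 30

  kibana:
    image: docker.elastic.co/kibana/kibana:9.1.3
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

  crawler:
    image: docker.elastic.co/integrations/crawler:0.4.2
    volumes:
      - ./config:/app/config
    # Only start crawling once Elasticsearch reports healthy
    depends_on:
      elasticsearch:
        condition: service_healthy
    command: ["bin/crawler", "crawl", "config/crawler-config.yml"]
```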
Note how the crawler waits until Elasticsearch is ready to run.
GitHub Actions
Now we need to create a GitHub Action that will copy the new settings and run the crawler on our virtual machine on every push to main. This ensures we always have the latest configuration deployed, without having to manually log into the virtual machine to update files and run the crawler. We are going to use AWS EC2 as the virtual machine provider.
The first step is to add the host (VM_HOST), machine user (VM_USER), SSH RSA key (VM_KEY), Elasticsearch host (ES_HOST), and Elasticsearch API key (ES_API_KEY) to the GitHub Actions secrets:
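You can add them through the repository's Settings → Secrets and variables → Actions page, or with the GitHub CLI if you prefer (values below are placeholders):

```bash
# Placeholder values; VM_KEY is the private key the action will use for SSH
gh secret set VM_HOST --body "203.0.113.10"
gh secret set VM_USER --body "ubuntu"
gh secret set VM_KEY < ~/.ssh/crawler_vm_key
gh secret set ES_HOST --body "https://my-deployment.es.example.com:443"
gh secret set ES_API_KEY --body "<your-api-key>"
```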

This way, the action will be able to access our server to copy the new files over and run the crawl.
Now, let’s create our .github/workflows/deploy.yml file:
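Here is a sketch of what it can look like. It uses the community appleboy scp/ssh actions to reach the VM; the action versions, paths, placeholder names, and the docker invocation on the VM are assumptions you may need to adapt:

```yaml
# .github/workflows/deploy.yml (sketch)
name: Deploy crawler config

on:
  push:
    branches: [main]
    paths:
      - "config/**"

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # 1. Populate the Elasticsearch host and API key in the YAML config
      - name: Inject Elasticsearch credentials
        run: |
          sed -i "s|CRAWLER_ES_HOST|${{ secrets.ES_HOST }}|" config/crawler-config.yml
          sed -i "s|CRAWLER_ES_API_KEY|${{ secrets.ES_API_KEY }}|" config/crawler-config.yml

      # 2. Copy the config folder to the VM
      - name: Copy config to VM
        uses: appleboy/scp-action@v0.1.7
        with:
          host: ${{ secrets.VM_HOST }}
          username: ${{ secrets.VM_USER }}
          key: ${{ secrets.VM_KEY }}
          source: "config"
          target: "crawler"

      # 3. Connect via SSH and run the crawl with the config we just copied
      - name: Run the crawl
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.VM_HOST }}
          username: ${{ secrets.VM_USER }}
          key: ${{ secrets.VM_KEY }}
          script: |
            docker run --rm \
              -v ~/crawler/config:/app/config \
              docker.elastic.co/integrations/crawler:0.4.2 \
              bin/crawler crawl config/crawler-config.yml
```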
This action will execute the following steps every time we push changes to the crawler configuration file:
- Populate the Elasticsearch host and API key in the YAML config
- Copy the config folder to our VM
- Connect via SSH to our VM
- Run the crawl with the config we just copied from the repo
Testing locally
To test our crawler locally, we created a bash script that populates the Elasticsearch host with the local one from Docker and starts a crawl. You can run ./local.sh to execute it.
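Here is a sketch of what local.sh can look like, assuming the placeholders from the config above and the compose services we defined earlier:

```bash
#!/usr/bin/env bash
# local.sh (sketch): point the config at the local Docker Elasticsearch and start a crawl
set -euo pipefail

# Use the compose Elasticsearch service; security is disabled locally,
# so the API key line is simply dropped
sed -i.bak \
  -e "s|CRAWLER_ES_HOST|http://elasticsearch:9200|" \
  -e "/CRAWLER_ES_API_KEY/d" \
  config/crawler-config.yml

# Bring up Elasticsearch, Kibana, and the crawler (which waits for Elasticsearch)
docker compose up
```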
Let’s look at Kibana DevTools to confirm the web-crawler-index was populated correctly:
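For example, a couple of quick requests should show documents for the three seed pages:

```
GET web-crawler-index/_count

GET web-crawler-index/_search
```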

Deploying to prod
Now we are ready to push to the main branch, which will deploy the crawler on your virtual machine and start sending the crawled documents to your Serverless Elasticsearch instance.
This will trigger the GitHub Action, which will execute the deploy script within the virtual machine and start crawling.
You can confirm the action was executed by going to the GitHub repository and visiting the “Actions” tab:

Making changes and re-deploying
Something you may have noticed is that the price of each product is part of the document’s body field. It would be ideal to store the price in a separate field so we can run filters against it.
Let’s add this change to the crawler-config.yml file, using extraction rules to extract the price from the product-price CSS class:
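Something along these lines should work; the exact ruleset keys are worth verifying against the Open Web Crawler extraction rules docs for your version, and strip-price-dollar-sign is a hypothetical pipeline name we create in the next step:

```yaml
# Added to config/crawler-config.yml (sketch; verify the ruleset syntax
# against the Open Web Crawler docs for your version)
extraction_rulesets:
  - url_filters:
      - type: begins
        pattern: /product
    rules:
      - action: extract
        field_name: price
        selector: .product-price   # CSS class holding the price on the page
        join_as: string
        source: html

elasticsearch:
  # ...host and api_key as before, plus the ingest pipeline reference...
  pipeline: strip-price-dollar-sign   # hypothetical name; created below
```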
We also see that the price includes a dollar sign ($), which we must remove if we want to run range queries. We can use an ingest pipeline for that. Note that we are referencing it in our new crawler config file above:
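A pipeline along these lines does the trick: a gsub processor strips the dollar sign and a convert processor stores the result as a number (the name matches the one referenced in the config sketch above):

```
PUT _ingest/pipeline/strip-price-dollar-sign
{
  "description": "Remove the $ sign from price and store it as a number",
  "processors": [
    {
      "gsub": {
        "field": "price",
        "pattern": "\\$",
        "replacement": "",
        "ignore_missing": true
      }
    },
    {
      "convert": {
        "field": "price",
        "type": "float",
        "ignore_missing": true
      }
    }
  ]
}
```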
We can run that command in our production Elasticsearch cluster. For the development one, as it is ephemeral, we can make the pipeline creation part of the docker-compose.yml file by adding the following service. Note that we also added a depends_on to the crawler service so it starts after the pipeline is successfully created.
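Here is a sketch of that service, assuming we keep the same pipeline body in a config/pipeline.json file so it can be mounted into a small curl container (service, image, and file names are illustrative):

```yaml
# Added to docker-compose.yml (sketch)
services:
  create-pipeline:
    image: curlimages/curl:latest   # pin a specific tag in practice
    entrypoint: ["curl"]
    # PUT the ingest pipeline once Elasticsearch is healthy, then exit;
    # --fail makes the service fail if Elasticsearch rejects the request
    command: >
      -sS --fail -X PUT http://elasticsearch:9200/_ingest/pipeline/strip-price-dollar-sign
      -H "Content-Type: application/json"
      -d @/pipeline.json
    volumes:
      - ./config/pipeline.json:/pipeline.json:ro
    depends_on:
      elasticsearch:
        condition: service_healthy

  crawler:
    # ...same as before, plus:
    depends_on:
      create-pipeline:
        condition: service_completed_successfully
```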
Now let’s run `./local.sh` to see the change locally:

Great! Let’s now push the change:
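For example (the commit message is up to you):

```bash
git add config/ docker-compose.yml
git commit -m "Extract product price and strip the dollar sign"
git push origin main
```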
To confirm everything works, you can check your production Kibana, which should reflect the changes and show price as a new field without the dollar sign.
Conclusion
The Elastic Open Web Crawler allows you to manage your crawler as code, meaning you can automate the full pipeline—from development to deployment—and add ephemeral local environments and programmatic testing against the crawled data, to name a few examples.
You are invited to clone the official repository and start indexing your own data using this workflow. You can also read this article to learn how to run semantic search on indices produced by the crawler.