With Elastic Open Web Crawler and its CLI-driven architecture, having versioned crawler configurations and a CI/CD pipeline with local testing is now pretty straightforward to achieve.
Traditionally, managing crawlers was a manual, error-prone process. It involved editing configurations directly in the UI and struggling with cloning crawl configurations, rolling back, versioning, and more. Treating crawler configurations as code resolves this by providing the same benefits we expect in software development: repeatability, traceability, and automation.
This workflow makes it easier to bring the Open Web Crawler into your CI/CD pipeline for rollbacks, backups, and migrations—tasks that were much trickier with earlier Elastic Crawlers, such as the Elastic Web Crawler or App Search Crawler.
In this article, we are going to learn how to:
- Manage our crawl configs using GitHub
- Have a local setup to test pipelines before deploying
- Create a production setup to run the web crawler with new settings every time we push changes to our main branch
You can find the project repository here. As of this writing, I’m using Elasticsearch 9.1.3 and Open Web Crawler 0.4.2.
Prerequisites
- Docker Desktop
- Elasticsearch instance
- Virtual machine with SSH access (e.g., AWS EC2) and Docker installed
Steps
- Folder structure
- Crawler configuration
- Docker-compose file (local environment)
- GitHub Actions
- Testing locally
- Deploying to prod
- Making changes and re-deploying
Folder structure
For this project, we will have the following file structure:
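Based on the files we will create throughout the article, the layout looks roughly like this:

```
.
├── .github
│   └── workflows
│       └── deploy.yml      # GitHub Action that deploys and runs the crawler
├── config
│   └── crawler-config.yml  # Open Web Crawler configuration
├── docker-compose.yml      # local environment (Elasticsearch + Kibana + crawler)
└── local.sh                # helper script to test the crawl locally
```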
Crawler configuration
Under crawler-config.yml, we will put the following:
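Here is a minimal sketch of what that file can contain. The seed URLs and index name are illustrative, and the CRAWLER_ES_HOST / CRAWLER_ES_API_KEY placeholders are stand-ins that our scripts will replace at run time; double-check the option names against the Open Web Crawler example config for your version:

```yaml
# config/crawler-config.yml (sketch; verify option names against the
# Open Web Crawler example config for your version)
domains:
  - url: https://web-scraping.dev
    seed_urls:
      # Illustrative: the first three product pages
      - https://web-scraping.dev/product/1
      - https://web-scraping.dev/product/2
      - https://web-scraping.dev/product/3

# Crawl only the seed pages; don't follow the links found inside them
max_crawl_depth: 1

# Send the crawl results to Elasticsearch
output_sink: elasticsearch
output_index: web-crawler-index

elasticsearch:
  host: CRAWLER_ES_HOST        # placeholder, replaced per environment
  api_key: CRAWLER_ES_API_KEY  # placeholder, replaced per environment
  # add a separate port setting here if your crawler version expects one
```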
This will crawl from https://web-scraping.dev/products, a mock site for products. We will only crawl the first three product pages. The max_crawl_depth setting will prevent the crawler from discovering more pages than the ones defined as seed_urls by not opening the links within them. The Elasticsearch host and api_key will be populated dynamically depending on the environment in which we are running the script.

Docker-compose file (local environment)
For the local docker-compose.yml, we will deploy the crawler and a single-node Elasticsearch cluster plus Kibana, so we can easily visualize our crawling results before deploying to production.
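Here is a sketch of what that file can look like. Image tags, ports, and the crawler's mount path and CLI invocation are assumptions based on the Open Web Crawler's Docker instructions, so adjust them to your setup:

```yaml
# docker-compose.yml (local environment, sketch)
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:9.1.3
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # local only; keep security on in prod
    ports:
      - "9200:9200"
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200 >/dev/null || exit 1"]
      interval: 10s
      retries: 30

  kibana:
    image: docker.elastic.co/kibana/kibana:9.1.3
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

  crawler:
    image: docker.elastic.co/integrations/crawler:0.4.2
    volumes:
      - ./config:/app/config
    # Only start crawling once Elasticsearch reports healthy
    depends_on:
      elasticsearch:
        condition: service_healthy
    command: ["bin/crawler", "crawl", "config/crawler-config.yml"]
```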
Note how the crawler waits until Elasticsearch is ready to run.
GitHub Actions
Now we need to create a GitHub Action that will copy the new settings and run the crawler on our virtual machine on every push to main. This ensures we always have the latest configuration deployed, without having to manually log into the virtual machine to update files and run the crawler. We are going to use AWS EC2 as the virtual machine provider.
The first step is to add the host (VM_HOST), machine user (VM_USER), SSH RSA key (VM_KEY), Elasticsearch host (ES_HOST), and Elasticsearch API key (ES_API_KEY) to the GitHub Actions secrets:
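You can add them through the repository's Settings → Secrets and variables → Actions page, or with the GitHub CLI if you prefer (values below are placeholders):

```bash
# Placeholder values; VM_KEY is the private key the action will use for SSH
gh secret set VM_HOST --body "203.0.113.10"
gh secret set VM_USER --body "ubuntu"
gh secret set VM_KEY < ~/.ssh/crawler_vm_key
gh secret set ES_HOST --body "https://my-deployment.es.example.com:443"
gh secret set ES_API_KEY --body "<your-api-key>"
```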

This way, the action will be able to access our server to copy the new files over and run the crawl.
Now, let’s create our .github/workflows/deploy.yml file:
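Here is a sketch of what it can look like. It uses the community appleboy scp/ssh actions to reach the VM; the action versions, paths, placeholder names, and the docker invocation on the VM are assumptions you may need to adapt:

```yaml
# .github/workflows/deploy.yml (sketch)
name: Deploy crawler config

on:
  push:
    branches: [main]
    paths:
      - "config/**"

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # 1. Populate the Elasticsearch host and API key in the YAML config
      - name: Inject Elasticsearch credentials
        run: |
          sed -i "s|CRAWLER_ES_HOST|${{ secrets.ES_HOST }}|" config/crawler-config.yml
          sed -i "s|CRAWLER_ES_API_KEY|${{ secrets.ES_API_KEY }}|" config/crawler-config.yml

      # 2. Copy the config folder to the VM
      - name: Copy config to VM
        uses: appleboy/scp-action@v0.1.7
        with:
          host: ${{ secrets.VM_HOST }}
          username: ${{ secrets.VM_USER }}
          key: ${{ secrets.VM_KEY }}
          source: "config"
          target: "crawler"

      # 3. Connect via SSH and run the crawl with the config we just copied
      - name: Run the crawl
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.VM_HOST }}
          username: ${{ secrets.VM_USER }}
          key: ${{ secrets.VM_KEY }}
          script: |
            docker run --rm \
              -v ~/crawler/config:/app/config \
              docker.elastic.co/integrations/crawler:0.4.2 \
              bin/crawler crawl config/crawler-config.yml
```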
This action will execute the following steps every time we push changes to the crawler configuration file:
- Populate the Elasticsearch host and API key in the YAML config
- Copy the config folder to our VM
- Connect via SSH to our VM
- Run the crawl with the config we just copied from the repo
Testing locally
To test our crawler locally, we created a bash script that populates the Elasticsearch host with the local one from Docker and starts a crawl. You can run ./local.sh to execute it.
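Here is a sketch of what local.sh can look like, assuming the placeholders from the config above and the compose services we defined earlier:

```bash
#!/usr/bin/env bash
# local.sh (sketch): point the config at the local Docker Elasticsearch and start a crawl
set -euo pipefail

# Use the compose Elasticsearch service; security is disabled locally,
# so the API key line is simply dropped
sed -i.bak \
  -e "s|CRAWLER_ES_HOST|http://elasticsearch:9200|" \
  -e "/CRAWLER_ES_API_KEY/d" \
  config/crawler-config.yml

# Bring up Elasticsearch, Kibana, and the crawler (which waits for Elasticsearch)
docker compose up
```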
Let’s look at Kibana DevTools to confirm the web-crawler-index was populated correctly:
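For example, a couple of quick requests should show documents for the three seed pages:

```
GET web-crawler-index/_count

GET web-crawler-index/_search
```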

Deploying to prod
Now we are ready to push to the main branch, which will deploy the crawler on your virtual machine and start sending the crawled documents to your Serverless Elasticsearch instance.
This will trigger the GitHub Action, which will execute the deploy script within the virtual machine and start crawling.
You can confirm the action was executed by going to the GitHub repository and visiting the “Actions” tab:

Making changes and re-deploying
Something you may have noticed is that the price of each product is part of the document’s body field. It would be ideal to store the price in a separate field so we can run filters against it.
Let’s add this change to the crawler-config.yml file, using extraction rules to extract the price from the product-price CSS class:
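Something along these lines should work; the exact ruleset keys are worth verifying against the Open Web Crawler extraction rules docs for your version, and strip-price-dollar-sign is a hypothetical pipeline name we create in the next step:

```yaml
# Added to config/crawler-config.yml (sketch; verify the ruleset syntax
# against the Open Web Crawler docs for your version)
extraction_rulesets:
  - url_filters:
      - type: begins
        pattern: /product
    rules:
      - action: extract
        field_name: price
        selector: .product-price   # CSS class holding the price on the page
        join_as: string
        source: html

elasticsearch:
  # ...host and api_key as before, plus the ingest pipeline reference...
  pipeline: strip-price-dollar-sign   # hypothetical name; created below
```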
We also see that the price includes a dollar sign ($), which we must remove if we want to run range queries. We can use an ingest pipeline for that. Note that we are referencing it in our new crawler config file above:
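A pipeline along these lines does the trick: a gsub processor strips the dollar sign and a convert processor stores the result as a number (the name matches the one referenced in the config sketch above):

```
PUT _ingest/pipeline/strip-price-dollar-sign
{
  "description": "Remove the $ sign from price and store it as a number",
  "processors": [
    {
      "gsub": {
        "field": "price",
        "pattern": "\\$",
        "replacement": "",
        "ignore_missing": true
      }
    },
    {
      "convert": {
        "field": "price",
        "type": "float",
        "ignore_missing": true
      }
    }
  ]
}
```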
We can run that command in our production Elasticsearch cluster. For the development one, as it is ephemeral, we can make the pipeline creation part of the docker-compose.yml file by adding the following service. Note that we also added a depends_on to the crawler service so it starts after the pipeline is successfully created.
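Here is a sketch of that service, assuming we keep the same pipeline body in a config/pipeline.json file so it can be mounted into a small curl container (service, image, and file names are illustrative):

```yaml
# Added to docker-compose.yml (sketch)
services:
  create-pipeline:
    image: curlimages/curl:latest   # pin a specific tag in practice
    entrypoint: ["curl"]
    # PUT the ingest pipeline once Elasticsearch is healthy, then exit;
    # --fail makes the service fail if Elasticsearch rejects the request
    command: >
      -sS --fail -X PUT http://elasticsearch:9200/_ingest/pipeline/strip-price-dollar-sign
      -H "Content-Type: application/json"
      -d @/pipeline.json
    volumes:
      - ./config/pipeline.json:/pipeline.json:ro
    depends_on:
      elasticsearch:
        condition: service_healthy

  crawler:
    # ...same as before, plus:
    depends_on:
      create-pipeline:
        condition: service_completed_successfully
```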
Now let’s run `./local.sh` to see the change locally:

Great! Let’s now push the change:
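For example (the commit message is up to you):

```bash
git add config/ docker-compose.yml
git commit -m "Extract product price and strip the dollar sign"
git push origin main
```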
To confirm everything works, you can check your production Kibana, which should reflect the changes and show price as a new field without the dollar sign.
Conclusion
The Elastic Open Web Crawler allows you to manage your crawler as code, meaning you can automate the full pipeline—from development to deployment—and add ephemeral local environments and programmatic testing against the crawled data, to name a few examples.
You are invited to clone the official repository and start indexing your own data using this workflow. You can also read this article to learn how to run semantic search on indices produced by the crawler.