
Kibana Alerting: Breaking past scalability limits & unlocking 50x scale

Kibana Alerting now scales 50x better, handling up to 160,000 rules per minute. Learn how key innovations in the task manager, smarter resource allocation, and performance optimizations have helped break past our limits and enabled significant efficiency gains.

Kibana Alerting has been the monitoring solution of choice for many large organizations over the last few years. As adoption has continued to grow, so has the number of alerting rules users create to monitor their systems. With more organizations relying on Kibana for alerting at scale, we saw an opportunity to improve efficiency and ensure the system can handle future workloads.

Between Kibana 8.16 and 8.18, we tackled these issues head-on, introducing key improvements that shattered previous scalability barriers. Before these enhancements, Kibana Alerting could only support up to 3,200 rules per minute, and only with at least 16 Kibana nodes, before hitting significant performance bottlenecks. By Kibana 8.18, we’ve increased the scalability ceiling by 50x, supporting up to 160,000 lightweight alerting rules per minute. This was achieved by making Kibana scale efficiently beyond 16 Kibana nodes and by increasing per-node throughput from 200 to up to 3,500 rules per minute. These enhancements make all alerting rules run faster, with fewer delays and better resource efficiency.

In this blog, we’ll explore the scaling challenges we overcame, the key innovations that made it possible, and how you can leverage them to run Kibana Alerting at scale efficiently.

How Kibana Alerting scales with Task Manager

Kibana Alerting allows users to define rules that trigger alerts based on real-time data. Behind the scenes, the Kibana Task Manager schedules and runs these rules.

The Task Manager is Kibana’s built-in job scheduler, designed to handle asynchronous background tasks separately from user interactions. Its key responsibilities include:

  • Running one-time and recurring tasks such as alerting rules, connector actions, and reports.
  • Dynamically distributing workloads as Kibana background nodes join or leave the cluster.
  • Keeping the Kibana UI responsive by offloading tasks to dedicated background processes.

Each alerting rule translates into a recurring background task, and each task is persisted as an Elasticsearch document that is stored, fetched, and updated as the task progresses. As the number of alerting rules increases, so does the number of background tasks Kibana must manage. However, each Kibana node has a limit on how many tasks it can run simultaneously. Once capacity is reached, additional tasks must wait, leading to delays and slower task run times.
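To make this concrete, here is a simplified sketch of what a task document might look like. The field set is illustrative and abbreviated, not Task Manager's full schema:

```typescript
// A simplified, illustrative sketch of a task document stored in the
// task manager index (not the complete production schema).
interface TaskDocument {
  taskType: string;                // e.g. "alerting:.es-query"
  status: 'idle' | 'claiming' | 'running' | 'failed';
  runAt: string;                   // ISO timestamp of the next scheduled run
  schedule?: { interval: string }; // e.g. "1m" for a rule that runs every minute
  attempts: number;                // retries consumed for the current run
  params: string;                  // serialized rule-specific parameters
  state: string;                   // serialized state carried between runs
  ownerId: string | null;          // id of the Kibana node that claimed the task
}
```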

The problem: Why scaling was limited

Before these improvements, Task Manager faced several scalability constraints that prevented it from scaling beyond 3,200 tasks per minute and 16 Kibana nodes. Past that point, contention and resource inefficiencies produced diminishing returns. These numbers were based on internal performance testing using a basic Elasticsearch query alerting rule performing a no-op query. The main sources of diminishing returns were:

Task claiming contention

Task Manager uses a distributed polling approach to claim tasks within an Elasticsearch index. Kibana nodes periodically query for tasks and attempt to claim them using Elasticsearch’s optimistic concurrency control, which prevents conflicting document updates. If another node updates the task first, the original node drops it, reducing overall efficiency.

With too many Kibana nodes competing for tasks, document update conflicts increase drastically, limiting efficiency beyond 16 nodes and reducing system throughput.
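The sketch below illustrates the conflict mechanics using Elasticsearch's if_seq_no and if_primary_term parameters. The index name is real, but the document layout and claim logic are simplified assumptions, not Kibana's actual implementation:

```typescript
import { Client, errors } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative sketch of optimistic concurrency control during task claiming.
// Each node sends the _seq_no/_primary_term it last saw for the document;
// Elasticsearch rejects the update with a 409 if another node modified the
// task document in the meantime.
async function tryClaimTask(
  taskId: string,
  seqNo: number,
  primaryTerm: number,
  nodeId: string
): Promise<boolean> {
  try {
    await client.update({
      index: '.kibana_task_manager',
      id: taskId,
      if_seq_no: seqNo,
      if_primary_term: primaryTerm,
      doc: { task: { status: 'claiming', ownerId: nodeId } },
    });
    return true; // this node won the claim
  } catch (e) {
    if (e instanceof errors.ResponseError && e.statusCode === 409) {
      return false; // another node claimed the task first; drop it
    }
    throw e;
  }
}
```

The more nodes that race for the same page of candidate tasks, the larger the share of these updates that end in 409 conflicts, which is exactly the wasted work that capped scaling at 16 nodes.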

Inefficient per-node throughput

Each Kibana node has a limit on the number of tasks that can run concurrently (default: 10 tasks at a time) to prevent memory and CPU overload. This safeguard often results in underutilized CPU and memory, requiring more nodes than necessary.

Additionally, the polling interval (default: 3000ms) defines how often Task Manager claims new tasks. A shorter interval reduces task delays but increases contention as nodes compete more for updates.
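A back-of-the-envelope calculation shows how these two defaults cap per-node throughput, assuming every claimed task completes within one poll interval:

```typescript
// Theoretical per-node ceiling with the old defaults, assuming each
// claimed task finishes before the next poll.
const maxConcurrentTasks = 10; // default concurrent task limit
const pollIntervalMs = 3000;   // default polling interval

const pollsPerMinute = 60_000 / pollIntervalMs;                // 20 polls/min
const maxTasksPerMinute = maxConcurrentTasks * pollsPerMinute; // 200 tasks/min

console.log(`Theoretical per-node ceiling: ${maxTasksPerMinute} tasks/min`);
```

This ceiling lines up with the roughly 200 rules per minute per node observed before the improvements.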

Resource inefficiencies

When running a high volume of alerting rules, Kibana nodes perform repetitive Elasticsearch queries, repeatedly loading the same objects and lists for each alerting rule run, consuming more CPU, memory, and Elasticsearch resources than necessary. Scaling up requires costly infrastructure expansions to support the increasing request loads.

Why it’s important

Breaking these barriers is crucial for Kibana’s continued evolution. Improved scalability unlocks:

  • Cost optimization: Reducing infrastructure costs for large-scale operations.
  • Faster recovery: Enhancing Kibana’s ability to recover from node or cluster failures.
  • Future expansion: Enabling scalability for additional workloads, such as scheduled reports and event-driven automation.

Key innovations in Kibana Task Manager

To achieve a 50x scalability boost, we introduced several innovations:

Kibana discovery service: smarter scaling

Previously, Kibana nodes were unaware of each other’s presence, leading to inefficient task distribution. The new Kibana discovery service dynamically monitors active nodes and assigns task partitions accordingly, ensuring even load distribution and reducing contention.

Task partitioning: eliminating contention

To prevent nodes from competing for the same tasks, we introduced task partitioning. Tasks are now distributed across 256 partitions, ensuring only a subset of Kibana background nodes attempt to claim the same tasks at any given time. By default, each partition is assigned to a maximum of two Kibana nodes, while a single Kibana node can be responsible for multiple partitions.
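One plausible way to picture the scheme is sketched below; the hash and assignment logic are illustrative, not Kibana's exact algorithm:

```typescript
const NUM_PARTITIONS = 256;

// Map a task to one of 256 partitions via a simple string hash
// (illustrative; not Kibana's actual hash function).
function partitionOf(taskId: string): number {
  let h = 0;
  for (const ch of taskId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % NUM_PARTITIONS;
}

// Assign each partition to at most two nodes, spreading partitions over
// the node list reported by the Kibana discovery service. A single node
// can own many partitions.
function assignPartitions(nodeIds: string[]): Map<number, string[]> {
  const assignments = new Map<number, string[]>();
  const sorted = [...nodeIds].sort();
  for (let p = 0; p < NUM_PARTITIONS; p++) {
    const owners = [sorted[p % sorted.length]];
    if (sorted.length > 1) owners.push(sorted[(p + 1) % sorted.length]);
    assignments.set(p, owners);
  }
  return assignments;
}
```

With at most two nodes per partition, the number of competitors for any given task stays constant no matter how many nodes join the cluster, which is what removes the contention ceiling.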

Task costing: smarter resource allocation

Not all background tasks consume the same resources. We implemented a task costing system that assigns task weights based on CPU and memory usage. This allows Task Manager to dynamically adjust the number of tasks to claim, optimize resource allocation, and ensure efficient performance.
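A minimal sketch of cost-based claiming, with hypothetical cost values:

```typescript
// Illustrative cost-based claiming: instead of counting tasks, spend a
// cost budget so a few heavy tasks cannot overload a node. The cost
// values are hypothetical, not Kibana's actual weights.
type CandidateTask = { id: string; cost: number }; // e.g. cheap = 1, heavy = 10

function selectTasksWithinBudget(
  candidates: CandidateTask[],
  budget: number
): CandidateTask[] {
  const selected: CandidateTask[] = [];
  let remaining = budget;
  for (const task of candidates) {
    if (task.cost <= remaining) {
      selected.push(task);
      remaining -= task.cost;
    }
  }
  return selected;
}
```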

New task claiming algorithm

The old algorithm relied on update-by-query with a forced index refresh to identify claimed tasks. This approach was inefficient and introduced unnecessary load on Elasticsearch. The new algorithm avoids this by searching for tasks without requiring an immediate refresh. Instead, it performs two operations on the task manager index: a _search to find candidate tasks, followed by an _mget, which returns the latest version of each document, including updates that a pending refresh has not yet made visible to search. By comparing document versions between the _search and _mget results, it discards mismatches before proceeding with bulk updates. This approach reduces load on Elasticsearch and offers finer control to support task costing.

By factoring in the poll interval, task concurrency, and the index refresh rate, we can calculate the upper limit of expected conflicts and adjust the _search page size accordingly. This helps ensure enough tasks are retrieved so the _mget step doesn’t discard all of the search results due to document version mismatches.
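Below is a simplified sketch of the search-then-mget cycle using the Elasticsearch JavaScript client. The query, page size, and status values are assumptions, and the real implementation additionally factors in partitions and task cost:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const TASK_INDEX = '.kibana_task_manager';

// Illustrative sketch of the _search + _mget claim cycle.
async function findClaimableTasks(pageSize: number) {
  // 1. _search: find candidate tasks without forcing an index refresh.
  const { hits } = await client.search({
    index: TASK_INDEX,
    size: pageSize,
    seq_no_primary_term: true, // return version info for each hit
    query: { bool: { filter: [{ term: { 'task.status': 'idle' } }] } },
  });
  const candidates = hits.hits;
  if (candidates.length === 0) return [];

  // 2. _mget: fetch the same documents by id. Unlike _search, this reflects
  //    updates that a pending refresh has not yet made visible.
  const { docs } = await client.mget({
    index: TASK_INDEX,
    ids: candidates.map((h) => h._id!),
  });

  // 3. Keep only documents whose versions match between _search and _mget;
  //    a mismatch means another node updated the task in the meantime.
  const versionById = new Map(
    candidates.map((h) => [h._id, `${h._seq_no}:${h._primary_term}`])
  );
  return docs.filter(
    (d) =>
      'found' in d &&
      d.found &&
      versionById.get(d._id) === `${d._seq_no}:${d._primary_term}`
  );
  // 4. (not shown) bulk-update the survivors to mark them as claimed.
}
```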

More frequent polling for tasks

By ensuring a fixed number of nodes compete for the same tasks with task partitioning and a new lightweight task claiming algorithm, Task Manager can now poll for tasks more frequently without additional stress on Elasticsearch. This reduces delays between a task completing and the next one starting, increasing overall system throughput.

Performance optimizations in Kibana Alerting

Using Elastic APM, we analyzed alerting rule performance before our optimizations and found that the alerting framework required at least 20 Elasticsearch queries to run any alerting rule. After the optimizations, we reduced this to just 3 queries, an 85% reduction, significantly improving run times and reducing CPU overhead.

Additionally, Elasticsearch previously relied on the resource-intensive pbkdf2 hashing algorithm for API key authentication, introducing excessive overhead at scale. We optimized authentication by switching to the more efficient SHA-256 algorithm, which also allowed us to eliminate an internal Elasticsearch cache that was severely limited by the number of API keys used concurrently.
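To see why this matters, compare the cost of a deliberately slow key-derivation function against a single digest. This Node.js snippet is purely illustrative; Elasticsearch's API key hashing is implemented server-side in Java, and the iteration count here is hypothetical:

```typescript
import { createHash, pbkdf2Sync, randomBytes } from 'node:crypto';

// Illustrative comparison only: pbkdf2 runs thousands of hash iterations
// by design, while a single SHA-256 digest is orders of magnitude cheaper.
const secret = 'example-api-key-secret';
const salt = randomBytes(16);

console.time('pbkdf2 (10,000 iterations)');
pbkdf2Sync(secret, salt, 10_000, 32, 'sha512');
console.timeEnd('pbkdf2 (10,000 iterations)');

console.time('single sha256 digest');
createHash('sha256').update(salt).update(secret).digest();
console.timeEnd('single sha256 digest');
```

Per request the difference is milliseconds, but at tens of thousands of rule runs per minute, each authenticating with an API key, it adds up to a significant share of CPU.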

Impact: How users are benefiting

Early adoption has demonstrated:

  • 50% faster rule run times, reducing overall system load.
  • Increased task capacity, enabling more tasks to run on existing infrastructure.
  • Fewer under-provisioned clusters, minimizing the need for scaling infrastructure to meet demand.

Figure: drop in average task delay, due to increased per-node throughput and proper cluster provisioning.

Figure: drop in rule run duration, due to alerting framework optimizations.

Figure: drop in Elasticsearch requests, due to alerting framework optimizations.

Getting started: How to scale efficiently

Upgrading to Kibana 8.18 unlocks most of these benefits automatically. For additional optimization, consider raising the xpack.task_manager.capacity setting to maximize per-node throughput, while ensuring p999 resource usage remains below 80% for memory, CPU, and event loop utilization, and below 500ms for event loop delay.
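One way to check for headroom before raising capacity is to poll the Task Manager health API. The endpoint exists in Kibana, but the response fields read below are assumptions; consult the Task Manager health monitoring docs for the exact shape:

```typescript
// Sketch: query the Task Manager health endpoint to decide whether
// xpack.task_manager.capacity has headroom on this node.
const KIBANA_URL = 'http://localhost:5601';

async function checkCapacityHeadroom(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: { Authorization: `ApiKey ${process.env.KIBANA_API_KEY}` },
  });
  // Hypothetical reading of the runtime load percentiles; verify the
  // actual response structure against the docs for your version.
  const health: any = await res.json();
  const load = health.stats?.runtime?.value?.load;

  if (load && load.p99 < 80) {
    console.log('p99 load below 80%: consider raising xpack.task_manager.capacity');
  } else {
    console.log('Node is near capacity: scale out rather than raising capacity');
  }
}

checkCapacityHeadroom().catch(console.error);
```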

By default, Kibana has a guardrail of 32,000 alerting rules per minute. If you plan to exceed this limit, you can modify the xpack.alerting.rules.maxScheduledPerMinute setting accordingly.

The new xpack.task_manager.capacity setting lets Kibana handle workload distribution more effectively, making the following settings unnecessary in most cases; they should be removed from your kibana.yml:

  • xpack.task_manager.max_workers
  • xpack.task_manager.poll_interval

If you’re running Kibana on-prem and want to isolate background tasks into dedicated nodes, you can use the node.roles setting to separate UI-serving nodes from those handling background tasks. If you’re using Kibana on Elastic Cloud Hosted (ECH), scaling to 8GB or higher will automatically enable this isolation.

What’s next?

We’re not stopping at 50x. Our roadmap aims for 100x+ scalability, further eliminating Elasticsearch bottlenecks.

Beyond scaling, we’re also focusing on improving system monitoring at scale. Upcoming integrations will provide system administrators with deeper insights into background task performance, making it easier to decide when and how to scale.

Additionally, with task costing, we plan to increase task concurrency for Elastic Cloud Hosted (ECH) customers when configured with more CPU and memory (e.g., Kibana clusters with 2GB, 4GB, or 8GB+ of memory).

Stay tuned for even more advancements as we continue to push the limits of Kibana scalability!

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

