Elasticsearch performance monitoring
On this article, we are going to look at Elasticsearch’s capabilities and potential use cases, and how to check its status. We’ll identify key metrics that you need to monitor to maintain the health and performance of your Elasticsearch cluster.
Use cases
Elasticsearch as a distributed tool is highly scalable and offers near real-time search capabilities. All of this adds up to a tool which can support a multitude of critical business needs and use cases. Here are some of the use cases:
- Log Collection and Aggregation
- Gathering and Indexing Large Datasets
- Collection and Management of Metrics and Event Data
- Analysis of Logs, Large Datasets, and Metric Data
- Full Text and Document Search Capabilities
Importance of Monitoring your Elasticsearch Cluster
As versatile, scalable and useful as Elasticsearch is, it’s essential that the infrastructure which hosts your cluster meets its needs, and that the cluster is sized correctly to support its data store and the volume of requests it must handle. Improperly sized infrastructure and misconfigurations can result in everything from sluggish performance to the entire cluster becoming unresponsive and crashing.
Six areas of concern to monitor in Elasticsearch
On the performance of elasticsearch clusters, you want to have a high index rate, large capacity, and fast query response. For those to happen, there are seven areas you should consider monitoring: search and query performance, indexing performance, node health, cluster health, node utilization, cache utilization and JVM health. Let's look into each area and discuss why each is integral to the health and performance of your cluster.
1. Cluster Health: Shards and Node Availability
An Elasticsearch cluster can consist of one or more nodes. A node is a member of the cluster, hosted on an individual server. Adding additional nodes is what allows us to scale the cluster horizontally. Indexes organize the data within the cluster. An index is a collection of documents which share a similar characteristic.
Consider the example of an Elasticsearch cluster deployed to store log entries from an application. An index might be set up to collect all log entries for a day. Each log entry is a document which contains the contents of the log and associated metadata.
In large datasets, the size of an index might exceed the storage capacity on a single node. We also want to ensure that we have redundant copies of our index, in case something happens to a node. Elasticsearch handles this by dividing an index into a defined number of shards. Elasticsearch distributes the shards across all nodes in the cluster. By default, an Elasticsearch index has five shards with one replica. The result of this default configuration is an index divided into five shards, each with a single replica stored on a different node.
It is essential to find the right number of shards for an index because too few shards may negatively affect search efficiency and distribution of data across the nodes. Conversely, too many nodes create an excessive demand on the resources of the cluster for their management.
When monitoring your Elasticsearch cluster, you can query the cluster health endpoint and receive information about the status of the cluster, the number of nodes, and the counts of active shards. You can also see counts for relocating shards, initializing shards and unassigned shards. An example response of such a request can be seen below.
Relocating and initializing shards indicate rebalancing on the cluster or the creation of new shards. Rebalancing occurs when a node is added or removed from the cluster and will affect the performance of the cluster. By understanding these metrics and how they affect Elasticsearch cluster health, you will have more insight into the cluster and can tune the cluster for better performance. One such adjustment is adding a shard relocation delay when a node leaves the cluster, eliminating excessive overhead if it returns quickly.
Each of these metrics requires continuous monitoring and an understanding of how the cluster manages its resources to ensure it remains healthy. We’ll also talk about monitoring the system health of each of the nodes in a subsequent section.
Important Metrics for Cluster Health
| Status | The status of the cluster: - Red: No shards have been allocated. - Yellow: Only the primary shards have been allocated. - Green: All shards have been allocated. | 
| Nodes | This metric includes the total number of nodes in the cluster, and includes the count of successful and failed nodes. | 
| Count of Active Shards | The number of active shards within the cluster. | 
| Relocating Shards | Count of shards being moved due to the loss of a node. | 
| Initializing Shards | Count of shards being initialized due to the addition of an index. | 
| Unassigned Shards | Count of shards for which replicas have not been created or assigned yet. | 
2. Search Query Performance Metrics: Request Rate and Latency
A data source is only as good as it is useful, and we can measure the effectiveness of the cluster by measuring the rate at which the system is processing requests and how long each request is taking.
When the cluster receives a request, it may need to access data from multiple shards, across multiple nodes. Knowing the rate at which the system is processing and returning requests, how many requests are currently in progress, and how long requests are taking can provide valuable insights into the health and performance of the Elasticsearch cluster.
The request process itself is divided into two phases. The first is the query phase, during which the cluster distributes the request to each shard (either primary or replica) within the index. During the second, fetch phase, the results of the query are gathered, compiled and returned to the user. Typically, the fetch phase takes less time than the query phase, so if it’s increasing, make sure to investigate since it may indicate problems with the Elasticsearch node itself or the underlying hardware.
We want to be aware of spikes in any of these metrics, as well as any emerging trends which might indicate growing problems within the cluster. These metrics are calculated by index and are available from the RESTful endpoints on the cluster itself.
Please refer to the table below for metrics which are available from the index
endpoint which is found at /index_name/_stats where index_name is the name of
the index. Performance specific metrics have been highlighted.
Important Metrics for Request Performance
| Query load | Total number of queries currently in progress, being processed by the cluster. | 
| Number of fetches currently in progress | Count of fetches in progress within the cluster. | 
| Total number of queries | Aggregated number of all queries processed by the cluster | 
| Total time spent on queries | Total time consumed by all queries in milliseconds. | 
| Total number of fetches | Aggregated number of all fetches processed by the cluster. | 
| Total time spent on fetches | Total time consumed by all fetches in milliseconds. | 
3. Indexing Performance Metrics: Refresh and Merge Times
There are various metrics that count together dictate the indexing performance of your Elasticsearch nodes and cluster.
Index Refresh
As documents are updated, added, and removed from an index, the cluster needs to continually update their indexes and then refresh them across all the nodes. All of this is taken care of by the cluster, and as a user, you have limited control over this process, other than to configure the refresh interval rate.
You should monitor the number and duration of the refresh operations. If time of the refresh increases it may mean that your cluster is not able to keep up with the operations and you may need to increase the refresh interval rate, so trade how quickly your data is visible for searching for stability.
Merge Times
Additions, updates, and deletions are batched and flushed to disk as new segments, and as each segment consumes resources, it is important for performance that smaller segments are consolidated and merged into larger segments. Like indexing, this is managed by the cluster itself.
The number of merge operations and their count can and should be measured. Increasing time and number of merges performed by Elasticsearch usually lowers the indexing throughput and is one of the common performance bottlenecks. In such cases configuration adjustments, rolling indices or reconsidering the sharding strategy may be needed.
Indexing Rate
Monitoring the Elasticsearch indexing rate of documents and merge time can help with identifying anomalies and related problems before they begin to affect the performance of the cluster. Considering these metrics in parallel with the health of each node can provide essential clues to potential problems within the system, or opportunities to optimize performance.
Index performance metrics can be retrieved from the /_nodes/stats endpoint and can be summarized at the node, index or shard level. This endpoint has a plethora of information, and the sections under merges and refresh are where you’ll find relevant metrics for index performance.
Important Metrics for Index Performance
| Total refreshes | Count of the total number of refreshes. | 
| Total time spent refreshing | Aggregation of all time spent refreshing. Measure in milliseconds. | 
| Current merges | Merges currently being processed. | 
| Total merges | Count of the total number of merges. | 
| Total time spent merging | Aggregation of all time spent merging segments. | 
4. Node Health: Memory, Disk, and CPU Metrics
Each Elasticsearch node runs off physical hardware and needs access to system memory, disk storage and CPU cycles for managing the data under its control and responding to requests to the cluster.
Node Memory Usage
Elasticsearch is a system which is heavily reliant on memory to be performant, and so keeping a close eye on memory usage is particularly relevant to the health and performance of each node. Configuration changes to improve metrics may also adversely affect memory allocation and usage, so it’s important to remember to view system health holistically.
Node CPU Usage
Monitoring the CPU usage for a node and looking for spikes can help identify inefficient processes like heavy search or indexing workload or potential problems within the node. CPU performance correlates closely to the garbage collection process of the Java Virtual Machine (JVM), which we’ll discuss next. Using software to correlate metrics from one area of concern to those of another can also help reduce false alarms, or better identify anomalies.
Disk Reads and Writes
Finally, high disk reads and writes can indicate a poorly tuned system. Since accessing the disk is an expensive process in terms of time, a well-tuned system should reduce disk I/O wherever possible.
These metrics are measured at the node level, and reflect the performance of
the instance or machine on which it is running. The most succinct source of
these metrics is from the /_cat/nodes endpoint of each node, and you can pass
in headers to define the metrics it should return. The following URL will return
all the metrics listed above.
Important Metrics for Node Health
| Total disk capacity | Total disk capacity on the node’s host machine. | 
| Total disk usage | Total disk usage on the node’s host machine. | 
| Total available disk space | Total disk space available. | 
| Percentage of disk used | Percentage of disk which is already used. | 
| Current RAM usage | Current memory usage (unit of measurement). | 
| RAM percentage | Percentage of memory being used. | 
| Maximum RAM | Total amount of memory on the node’s host machine. | 
| CPU | Percentage of the CPU in use. | 
Node Utilization: Thread Pools
Each Elasticsearch node uses several thread pools to perform, queue, and potentially reject operations performed. Searching, indexing, running cluster state requests, node discovery – those are only examples of operations that use bound thread pools. All of that is done to save resources. Each request requires a certain amount of memory and CPU power to be performed. Without limits, we could easily overload Elasticsearch nodes with unlimited requests and force the cluster into a state in which it can’t operate properly.
Before discussing which thread pool types you should consider monitoring we should talk about the metrics that are visible for each of them. There are three available metrics – the active, the queue, and rejected.
The active metric tells us how many active operations of a given type are there. For example, if the active is 10 for the search thread pool it means that the number of currently processed search, count, or suggest requests is 10. The queue metric determines how many operations of a given type are waiting to be processed. Finally, the rejected tells us how many operations of a given type were rejected and not processed. This means that there were no available active threads and the queue was also full. Having rejections is usually a sign of too much traffic sent to Elasticsearch or that the nodes can’t keep up with the load. However, keep in mind that the rejected metric is cumulative since the last restart, at least in Elasticsearch API response. So it is a good idea to look at the differences in time, not the value as it is.
As mentioned there are many thread pool available in Elasticsearch, but the ones that you will be likely the most interested in are:
- search – The thread pool for the count, search, and suggest operations sent to Elasticsearch. Seeing high numbers of rejected queries may mean that there are a lot of queries sent to your cluster and it can’t keep up with processing.
- write – The thread pool dedicated for indexing – single documents, bulks, and document updates. Of course, you don’t want to drop your indexing requests, so keep an eye on high numbers in the queue and rejections. Rejections in the write thread pool may mean data loss if your indexing client is not able to repeat the request or if it repeats a limited number of times and there were enough rejections.
- management – The thread pool that is dedicated to cluster management. You shouldn’t see much when it comes to the number of threads being used here, usually one or two threads. This is a scaling thread pool, which means that it doesn’t have a limited queue, but the queue is removing tasks after 5 minutes, by default. Seeing rejections in this thread pool means that your cluster is not able to process internal communication in a timely manner and is or soon will be in a very bad state.
Each thread pool has its own type, which can be configured. There is a fixed thread pool type, for example, the search and write thread pools use that type. It means that they have a fixed number of active threads available and a fixed queue. The request that doesn’t fit in the active threads and the queue are rejected. There is a fixed_auto_queue_size type with the size of the queue is automatically adjusted, but limited as well. Finally, there is a scaling type with a dynamic number of threads proportional to the current workload and adjusted on the basis of the minimum and maximum values set in the thread pool configuration and a keep-alive for threads that are waiting to be processed. The management thread pool uses the scaling type.
Important Metrics for Thread Pools
| Active | Number of active requests of a given type currently executed | 
| Queue | Number of queue requests of a given type waiting for execution | 
| Rejected | Number of rejected requests that Elasticsearch didn’t process | 
5. Caching: Field Data, Node Query and Shard Query Cache
The caches in Elasticsearch can play an important role when it comes to performance. Some are used to support aggregations on certain types of fields while others help with the query performance allowing query caching on the node and shard level. Of course, like with all the caches, the performance improvements don’t do not come for free, you need to account for resources that are needed to have the caches operational.
The Field Data Cache
As its names stand the field data cache is used for field data to enable using aggregations on selected text fields and also for the global ordinals. Both of the mentioned data structures are kept on-heap and the cache is by default unbounded, so it is very important to monitor the memory usage of this cache. Extensive use of this cache on large datasets may lead to memory issues and as a result, make your node and cluster unstable.
The Node Query Cache
The node query cache is shared between all the shards on a given node and caches the results of the queries used in the filter context. It can be very useful if your filter context queries are repeated – in such case the contents of the node query cache can be reused to increase the performance of the queries. By default, the node query cache is enabled and is allowed to use up to 10% of the memory of the node.
The Shard Request Cache
The shard request cache is used to cache the total number of hits, aggregations, and suggestions on a shard level. When a distributed request is run in Elasticsearch it is executed partially on the shards, so that they can return their local data to the coordinating node. Those shard level queries can potentially speed up in some cases by using the mentioned cache. However, remember that the search results themselves are not cached by this cache, as well as requests containing date math with now and non-deterministic scripted queries. This cache is enabled by default and can take up to 1% of the maximum heap size given to the node.
Important Metrics for Elasticsearch Caches
| Field Data Cache Size | The amount of memory used by the field data cache per node | 
| Node Query Cache Size | The amount of memory used by the node query cache | 
| Shard Request Cache Size | The amount of memory used by the shard request cache | 
6. JVM Health Metrics: Heap, GC, and Pool Size
As a Java-based application, Elasticsearch runs within a Java Virtual Machine (JVM). Seeing as each Java process has a limit on how much heap memory it can use, monitoring JVM heap usage is critical to ensure cluster performance. At the same time, JVM comes with garbage collection, which means that garbage collection frequency and duration are just as important to measure.
JVM heap
The JVM manages its memory within its ‘heap’ allocation and evicts objects from the head with a garbage collection process.
If the needs of the application exceed the capacity of the heap, the application is forced to begin using swap space on attached storage media. While this prevents the system from crashing, it can wreak havoc on the performance of the cluster. Monitoring the available heap space to ensure the system has sufficient capacity is essential to a healthy system.
Pool size
JVM memory is allocated to different memory pools. You’ll want to keep an eye on each of these pools to ensure that they have been adequately utilized and are not in danger of being overrun.
Garbage Collection
The garbage collector (GC) is much like a physical garbage collection service. We want to keep it running regularly and ensure that the system doesn’t overburden it. Ideally, a view of GC performance should indicate regular executions of a similar size. Spikes and anomalies can be indicators of much deeper problems. If you’re already monitoring your GC and want to improve performance, check out this guide about Java garbage collection tuning or see how to fine-tune the GC for Elasticsearch nodes.
The image above shows an example of the healthy “sawtooth” pattern we expect when monitoring memory usage on a healthy JVM on a single node. Routine and predictable increases and decreases in memory usage indicate that memory is being actively managed, and the garbage collector is being taxed beyond its abilities and affecting performance as a result.
JVM metrics can be retrieved from the /_nodes/stats endpoint.
Important Metrics for JVM Health
| Memory usage | Usage statistics for heap and non-heap processes and pools. | 
| Threads | Current threads in use, and maximum number. | 
| Garbage collection | Counts and total time spent with garbage collection. | 
Summary
So there you have it — the top Elasticsearch metrics to monitor:
- Cluster Health – Nodes and Shards
- Search Performance – Request Latency and
- Search Performance – Request Rate
- Indexing Performance – Refresh Times
- Indexing Performance – Merge Times
- Node Utilization – Thread Pools
- Caching – Field Data, Node Query and Shard Query Cache
- Node Health – Memory Usage
- Node Health – Disk I/O
- Node Health – CPU
- JVM Health – Heap Usage and Garbage Collection
- JVM health – JVM Pool Size