Elasticsearch Module

Elasticsearch is composed of modules, each of which is responsible for a part of its functionality. We will discuss several of these modules in this chapter. The settings of these modules are of two types: static and dynamic.

Static Settings:

  • Static settings are configured on a node before it starts.
  • These settings must be set at the node level, on every relevant node.
  • We can configure static settings in the config file (elasticsearch.yml) before starting Elasticsearch.
  • We can also set these settings on the command line or as environment variables when starting a node.
  • For a change to these settings to take effect, every concerned node in the cluster has to be updated and restarted.
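
As a rough illustration of the options above, the following Python sketch renders the same static settings either as lines for elasticsearch.yml or as command-line arguments. The helper names and the example values are hypothetical, and the `-E key=value` form is an assumption based on recent Elasticsearch releases; older releases used a different flag syntax.

```python
def to_yaml_lines(settings):
    """Render flat settings as 'key: value' lines for elasticsearch.yml."""
    lines = []
    for key, value in sorted(settings.items()):
        # YAML booleans are lowercase, unlike Python's True/False
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return lines


def to_cli_args(settings):
    """Render the same settings as -E arguments (assumed flag syntax)."""
    args = []
    for line in to_yaml_lines(settings):
        key, value = line.split(": ", 1)
        args.append(f"-E{key}={value}")
    return args


# Hypothetical static settings for one node
static_settings = {"node.data": False, "http.port": 9200}

print("\n".join(to_yaml_lines(static_settings)))
print(" ".join(to_cli_args(static_settings)))
```

Either form has to be applied to each relevant node, because static settings are read when the node starts.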

Dynamic Settings:

  • Dynamic settings can be changed at runtime, without restarting Elasticsearch.
  • We can update these settings on a live cluster with the cluster update settings API.
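
A dynamic update could look roughly like the following Python sketch, which builds (but does not send) a `PUT /_cluster/settings` request. The function name, the `http://localhost:9200` address, and the chosen setting are assumptions made for illustration only.

```python
import json
import urllib.request


def build_settings_request(settings, persistent=True, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT /_cluster/settings request."""
    # "persistent" settings survive a full cluster restart, "transient" ones do not
    scope = "persistent" if persistent else "transient"
    body = json.dumps({scope: settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_settings_request({"cluster.routing.allocation.enable": "primaries"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To apply it on a reachable cluster: urllib.request.urlopen(request)
```

The request is only constructed here, so the sketch runs without a cluster; sending it is left as a commented-out step.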

Module | Description
Cluster-level routing and shard allocation | Provides the settings that control when, where, and how shards are allocated to nodes.
Discovery | Lets nodes discover each other and form a cluster. It also maintains the state of all nodes in the cluster.
Gateway | Just as the discovery module maintains the state of the nodes, the gateway module maintains the state of the cluster. It preserves the cluster state and the shard data across a full cluster restart.
HTTP | Manages the communication between HTTP clients and the Elasticsearch APIs.
Indices | Maintains the settings that are set globally for all indices.
Network | Controls the default network settings. By default, Elasticsearch binds to localhost.
Node Client | Starts a node that joins a cluster as a client. It cannot hold data.
Plugin | Plugins extend the basic Elasticsearch functionality in a custom manner.
Painless | A scripting language designed for Elasticsearch with security as a primary goal.
Scripting | Lets the user write scripts that evaluate custom expressions.
Snapshot/Restore | Snapshots of the entire cluster or of individual indices can be stored in a remote repository. They are used for data backup.
Thread pools | A node holds several thread pools, which help to manage the memory consumed by threads within the node.
Transport | The transport layer is used for internal communication between the nodes of a cluster. The transport networking layer can be configured.
Tribe Nodes | A tribe node joins multiple clusters and acts as a federated client across them.
Cross-Cluster Search | Executes search requests across multiple clusters without having to join them. Like a tribe node, it acts as a federated client.

We will now discuss these modules in detail.

Shard Allocation and Cluster-Level Routing

Cluster-level settings decide how shards are allocated to the different nodes. They also decide when shards are reallocated to rebalance the cluster. The following settings are used to control shard allocation:

Cluster Level Shard Allocation

The following is a list of settings for cluster-level shard allocation, along with their possible values and descriptions:

Setting | Possible value | Description
cluster.routing.allocation.enable | all | The default value. It allows shard allocation for all types of shards.
cluster.routing.allocation.enable | none | No shard allocation of any kind is allowed.
cluster.routing.allocation.enable | primaries | Shard allocation is allowed only for primary shards.
cluster.routing.allocation.enable | new_primaries | Shard allocation is allowed only for the primary shards of new indices.
cluster.routing.allocation.node_concurrent_recoveries | Numeric (default is 2) | Limits the number of concurrent shard recoveries allowed on a node.
cluster.routing.allocation.node_initial_primaries_recoveries | Numeric (default is 4) | Limits the number of initial primary recoveries that run in parallel on a node.
cluster.routing.allocation.same_shard.host | Boolean (default is false) | When true, prevents multiple copies of the same shard from being allocated on the same physical host.
indices.recovery.concurrent_streams | Numeric (default is 3) | Controls the number of open network streams per node while a shard is recovered from a peer shard.
indices.recovery.concurrent_small_file_streams | Numeric (default is 2) | Controls the number of open streams per node for small files (files smaller than 5 MB) during shard recovery.
cluster.routing.rebalance.enable | all | Allows rebalancing for all kinds of shards.
cluster.routing.rebalance.enable | none | No shard rebalancing of any kind is allowed.
cluster.routing.rebalance.enable | primaries | Allows rebalancing only for primary shards.
cluster.routing.rebalance.enable | replicas | Allows rebalancing only for replica shards.
cluster.routing.allocation.cluster_concurrent_rebalance | Numeric (default is 2) | Limits the number of concurrent shard rebalancing operations in the cluster.
cluster.routing.allocation.balance.shard | Float (default is 0.45f) | Defines the weight factor for the total number of shards allocated on each node.
cluster.routing.allocation.balance.index | Float (default is 0.55f) | Defines the weight factor for the number of shards per index allocated on a specific node.
cluster.routing.allocation.balance.threshold | Non-negative float (default is 1.0f) | The minimal optimization value of operations that should be performed.
cluster.routing.allocation.allow_rebalance | always | Rebalancing is always allowed.
cluster.routing.allocation.allow_rebalance | indices_primaries_active | Rebalancing is allowed only when all primary shards in the cluster are allocated.
cluster.routing.allocation.allow_rebalance | indices_all_active | The default value. Rebalancing is allowed only when all primary and replica shards in the cluster are allocated.
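
To make the values of cluster.routing.allocation.enable more concrete, here is a minimal Python sketch of how each value filters shards. It is a simplified model for illustration, not the actual allocation logic of Elasticsearch, and the function name is hypothetical.

```python
def allocation_allowed(enable, is_primary, is_new_index):
    """Return True if a shard may be allocated under the given enable value."""
    if enable == "all":
        return True
    if enable == "primaries":
        return is_primary
    if enable == "new_primaries":
        # only primary shards that belong to newly created indices
        return is_primary and is_new_index
    if enable == "none":
        return False
    raise ValueError(f"unknown value: {enable!r}")


for mode in ("all", "primaries", "new_primaries", "none"):
    replica_ok = allocation_allowed(mode, is_primary=False, is_new_index=False)
    new_primary_ok = allocation_allowed(mode, is_primary=True, is_new_index=True)
    print(mode, "replica:", replica_ok, "new primary:", new_primary_ok)
```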

Disk-based Shard Allocation

After the cluster-level shard allocation settings, we will look at disk-based shard allocation. The following is a list of settings for disk-based shard allocation, along with their possible values and descriptions:

Setting | Possible value | Description
cluster.routing.allocation.disk.threshold_enabled | Boolean | Enables or disables the disk allocation decider. The default value is true.
cluster.routing.allocation.disk.watermark.low | String | The low watermark for disk usage. Elasticsearch does not allocate new shards to a node whose disk usage is above this value. The default value is 85%.
cluster.routing.allocation.disk.watermark.high | String | The high watermark for disk usage. Elasticsearch tries to relocate shards away from a node whose disk usage is above this value. The default value is 90%.
cluster.info.update.interval | String | The interval between checks of the disk usage. The default value is 30s.
cluster.routing.allocation.disk.include_relocations | Boolean | Decides whether shards that are currently being relocated are taken into account when the disk usage is calculated. The default value is true.
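
The effect of the two watermarks can be sketched in Python as follows. This is a simplified model under the assumption that both watermarks are given as percentages (it does not handle absolute byte values), and the function names and return labels are hypothetical.

```python
def parse_percent(value):
    """Convert a string such as '85%' to the float 85.0."""
    text = value.strip()
    if not text.endswith("%"):
        raise ValueError(f"expected a percentage, got {value!r}")
    return float(text[:-1])


def disk_decision(used_percent, low="85%", high="90%"):
    """Classify a node by its disk usage relative to the watermarks."""
    if used_percent > parse_percent(high):
        return "relocate_away"   # above the high watermark
    if used_percent > parse_percent(low):
        return "no_new_shards"   # between the low and the high watermark
    return "allocate"            # at or below the low watermark


for usage in (50.0, 87.0, 95.0):
    print(usage, disk_decision(usage))
```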

Discovery

This module is responsible for discovering the nodes of a cluster. With its help, nodes find each other, form a cluster, and maintain the state of all nodes in it. Whenever a node joins or leaves the cluster, the state of the cluster changes. The cluster name setting logically separates one cluster from another.

Besides the built-in Zen discovery, there are discovery modules that use the APIs provided by cloud vendors. The discovery modules are as follows:

  • Google compute engine discovery
  • Azure discovery
  • Zen discovery
  • EC2 discovery

Gateway

This module maintains the cluster state and the shard data across a full cluster restart. The following are some static settings of the gateway module, with their possible values and descriptions:

Setting | Possible value | Description
gateway.expected_nodes | Numeric | The number of nodes expected to be in the cluster before the recovery of local shards begins. The default value is 0.
gateway.expected_master_nodes | Numeric | The number of master nodes expected to be in the cluster before recovery begins. The default value is 0.
gateway.expected_data_nodes | Numeric | The number of data nodes expected to be in the cluster before recovery begins. The default value is 0.
gateway.recover_after_time | String | The time for which the recovery process waits before it starts, regardless of the number of nodes that have joined the cluster.
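
The interplay between the expected node count and the waiting time can be sketched as follows. This Python snippet is a simplified model that assumes recovery begins as soon as the expected number of nodes has joined, or otherwise once the gateway.recover_after_time wait has elapsed; it ignores any other gateway settings, and the function and parameter names are hypothetical.

```python
def should_start_recovery(joined_nodes, expected_nodes, waited_seconds, recover_after_seconds):
    """Decide whether the recovery of local shards may begin."""
    if joined_nodes >= expected_nodes:
        # all expected nodes are present, so there is no reason to wait
        return True
    # otherwise recover only after the configured waiting time has passed
    return waited_seconds >= recover_after_seconds


print(should_start_recovery(3, 3, 0, 300))    # expected nodes present
print(should_start_recovery(2, 3, 60, 300))   # still waiting
print(should_start_recovery(2, 3, 300, 300))  # waiting time elapsed
```

With the default of 0 expected nodes, this model starts recovery immediately.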

HTTP

  • The HTTP module manages the communication between HTTP clients and the Elasticsearch APIs.
  • This module can be disabled and enabled again if required. We can disable it by setting http.enabled to false.
  • The module is controlled by a list of settings, which are configured in the elasticsearch.yml file.

Below is a list of HTTP settings with their descriptions:

Sr.No | Setting | Description
1 | http.port | The port used to access Elasticsearch over HTTP. The default is the range 9200-9300, and the first available port in that range (normally 9200) is used.
2 | http.bind_host | The host address to which the HTTP service binds.
3 | http.publish_port | The port that HTTP clients should use to reach the node. It is useful when the node is behind a firewall.
4 | http.publish_host | The host address that HTTP clients should use to reach the node.
5 | http.max_content_length | The maximum size of the content in an HTTP request. The default value is 100mb.
6 | http.max_initial_line_length | The maximum size of the URL. The default value is 4kb.
7 | http.max_header_size | The maximum size of the HTTP headers. The default value is 8kb.
8 | http.compression | Enables or disables support for compression. The default value is false in older releases; newer releases enable it by default.
9 | http.pipelining | Enables or disables HTTP pipelining.
10 | http.pipelining.max_events | Limits the number of events that can be queued before the HTTP connection is closed.
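
Size settings such as http.max_content_length are written as strings like 100mb. The following Python sketch shows one way such values could be parsed and checked against an incoming request. It assumes 1024-based units, the helper names are hypothetical, and it is not the parsing code that Elasticsearch itself uses.

```python
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '100mb' to a number of bytes."""
    value = text.strip().lower()
    # check the two-letter suffixes first so that 'kb' is not read as 'b'
    for suffix in ("kb", "mb", "gb", "b"):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unknown size: {text!r}")


def within_limits(content_length, header_size, limits):
    """Check a request's body and header sizes against the HTTP limits."""
    return (
        content_length <= parse_size(limits["http.max_content_length"])
        and header_size <= parse_size(limits["http.max_header_size"])
    )


limits = {"http.max_content_length": "100mb", "http.max_header_size": "8kb"}
print(parse_size("100mb"))
print(within_limits(content_length=5_000_000, header_size=2_000, limits=limits))
```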

Indices

This module maintains the settings that are set globally for every index. We will discuss a few of these settings, which are mainly related to memory usage:

Circuit Breaker

  • There are several circuit breakers in Elasticsearch.
  • Circuit breakers prevent operations from causing an OutOfMemoryError.
  • The indices.breaker.total.limit setting caps the total amount of JVM heap that the operations tracked by the circuit breakers may use.
  • By default, this limit is 70% of the JVM heap.
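
The idea behind a circuit breaker can be sketched in a few lines of Python: memory is reserved before an operation runs, and the operation is rejected if the reservation would exceed the limit. The class and method names are hypothetical, and the model is far simpler than the breakers in Elasticsearch.

```python
class CircuitBreakingError(Exception):
    """Raised when an operation would exceed the memory limit."""


class CircuitBreaker:
    def __init__(self, heap_bytes, limit_percent=70):
        # integer arithmetic keeps the limit exact
        self.limit = heap_bytes * limit_percent // 100
        self.used = 0

    def reserve(self, estimated_bytes):
        """Reserve memory for an operation, or reject the operation."""
        if self.used + estimated_bytes > self.limit:
            raise CircuitBreakingError(
                f"would use {self.used + estimated_bytes} bytes, limit is {self.limit}"
            )
        self.used += estimated_bytes

    def release(self, estimated_bytes):
        """Give back memory once the operation has finished."""
        self.used = max(0, self.used - estimated_bytes)


breaker = CircuitBreaker(heap_bytes=1000)
breaker.reserve(600)
try:
    breaker.reserve(200)
except CircuitBreakingError as error:
    print("rejected:", error)
```

Rejecting the request up front is the point of the design: a failed request is cheaper than a node that runs out of heap.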

Fielddata Cache

  • The fielddata cache is used while aggregating on a field.
  • It holds the field data in memory, so enough memory must be available for it.
  • The amount of memory used for the fielddata cache can be controlled with the indices.fielddata.cache.size setting.

Node Query Cache

  • The node query cache is used to cache the results of queries.
  • It uses an LRU (Least Recently Used) eviction policy.
  • There is one query cache per node, shared by all of its shards.
  • The indices.queries.cache.size setting controls the memory size of this cache.
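
The LRU eviction policy mentioned above can be illustrated with a small Python sketch. For simplicity it is bounded by a number of entries rather than by memory size, and the class name and the example queries are hypothetical.

```python
from collections import OrderedDict


class LRUQueryCache:
    """A tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, query):
        if query not in self._entries:
            return None
        # a hit makes this entry the most recently used one
        self._entries.move_to_end(query)
        return self._entries[query]

    def put(self, query, result):
        self._entries[query] = result
        self._entries.move_to_end(query)
        if len(self._entries) > self.max_entries:
            # drop the entry at the front, which is the least recently used
            self._entries.popitem(last=False)


cache = LRUQueryCache(max_entries=2)
cache.put("status:active", [1, 2])
cache.put("status:closed", [3])
cache.get("status:active")        # touch it so it becomes most recent
cache.put("status:pending", [4])  # evicts "status:closed"
print(cache.get("status:closed"))
```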

Indexing Buffer

  • The indexing buffer stores newly indexed documents.
  • When the buffer fills up, the documents in it are written to disk.
  • The indices.memory.index_buffer_size setting controls the amount of heap allocated for this buffer.

Shard Request Cache

  • The shard request cache holds the local search results of each shard.
  • It can cache the results of search requests.
  • Elasticsearch allows us to enable and disable the cache.
  • We can enable the cache for an index when creating the index, with the index.requests.cache.enable setting. We can also enable or disable it for a single request by sending the request_cache URL parameter.
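
The two ways of controlling the shard request cache can be sketched in Python as follows. The sketch only builds a request body and a URL; it assumes the index setting is named index.requests.cache.enable and the URL parameter is named request_cache (names taken from the Elasticsearch reference), while the index name, the address, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def index_settings_body(cache_enabled):
    """Body for creating an index with the request cache switched on or off."""
    return json.dumps({"settings": {"index.requests.cache.enable": cache_enabled}})


def search_url(index, use_request_cache, base_url="http://localhost:9200"):
    """Search URL that overrides the cache behaviour for a single request."""
    query = urlencode({"request_cache": "true" if use_request_cache else "false"})
    return f"{base_url}/{index}/_search?{query}"


print(index_settings_body(True))
print(search_url("my_index", use_request_cache=False))
```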

Indices Recovery

  • These settings control the resources used during the recovery process. They are listed below with their default values:

Setting | Default value
indices.recovery.compress | true
indices.recovery.concurrent_streams | 3
indices.recovery.concurrent_small_file_streams | 2
indices.recovery.file_chunk_size | 512kb
indices.recovery.translog_ops | 1000
indices.recovery.translog_size | 512kb
indices.recovery.max_bytes_per_sec | 40mb

TTL Interval

TTL stands for time to live. The TTL of a document defines how long the document lives before it is deleted. The following dynamic settings control this process:

Setting | Default value
indices.ttl.interval | 60s
indices.ttl.bulk_size | 1000
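
The purge process can be pictured with the following Python sketch. It assumes that indices.ttl.interval is how often the purge runs and that indices.ttl.bulk_size is the maximum number of deletions sent in one bulk request; the function name and the data layout (a mapping from document id to expiry time) are hypothetical.

```python
def expired_batches(documents, now, bulk_size=1000):
    """Yield lists of expired document ids, at most bulk_size per batch.

    documents maps a document id to the time at which its TTL runs out.
    """
    expired = [doc_id for doc_id, expires_at in documents.items() if expires_at <= now]
    for start in range(0, len(expired), bulk_size):
        yield expired[start:start + bulk_size]


# Hypothetical documents with their expiry times (in seconds)
documents = {"a": 10, "b": 20, "c": 5}

# One purge run; in this model it would be repeated every indices.ttl.interval
for batch in expired_batches(documents, now=15, bulk_size=1):
    print("delete in one bulk request:", batch)
```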

Node

Each node has the option of being a data node or not. A node is a data node when the node.data setting is true, which is the default. Setting node.data to false turns it into a non-data node.

