Elasticsearch

Topics related to Elasticsearch:

Getting started with Elasticsearch

Elasticsearch is an advanced open source search server based on Lucene and written in Java.

It provides distributed full-text and partial-text, query-based, and geolocation-based search functionality, accessible through an HTTP REST API.

Python Interface

Cluster

Cluster Health provides a lot of information about the cluster, such as the number of shards that are allocated ("active"), as well as how many are unassigned and relocating. In addition, it reports the current number of nodes and data nodes in the cluster, which allows you to poll for missing nodes (e.g., if you expect 15 but only see 14, then you are missing a node).

For someone who knows Elasticsearch, the "assigned" and "unassigned" shard counts can help to track down issues.

The most commonly checked field from Cluster Health is the status, which can be in one of three states (an example request and response follow the list below):

  • red
  • yellow
  • green

The colors each mean one -- and only one -- very simple thing:

  1. Red indicates that you are missing at least one primary shard.
    • A missing primary shard means that an index cannot be used to write (index) new data in most cases.
      • Technically, you can still index into any primary shards that are available in that index, but practically you cannot, because you do not generally control which shard receives any given document.
      • Searching is still possible against a red cluster, but it means that you will get partial results if any index you search is missing shards.
    • In normal circumstances, it just means that the primary shard is being allocated (initializing_shards).
    • If a node just left the cluster (e.g., because the machine running it lost power), then it makes sense that you will be missing some primary shards temporarily.
      • Any replica shard for that primary shard will be promoted to be the primary shard in this scenario.
  2. Yellow indicates that all primary shards are active, but at least one replica shard is missing.
    • A missing replica only impacts indexing if your write consistency settings require that replica to be available.
      • By default, there is only one replica for any primary and indexing can happen with a single missing replica.
    • In normal circumstances, it just means that the replica shard is being allocated (initializing_shards).
    • A one node cluster with replicas enabled will always be yellow at best. It can be red if a primary shard is not yet assigned.
      • If you only have a single node, then it makes sense to disable replicas because none can ever be assigned; with replicas disabled, the cluster can be green.
  3. Green indicates that all shards are active.
    • The only shard activity allowed for a green cluster is relocating_shards.
    • New indices, and therefore new shards, will cause the cluster to go from red to yellow to green, as each shard is allocated (primary first, making it yellow, then replicas if possible, making it green).
      • In Elasticsearch 5.x and later, new indices will not make your cluster red unless it takes them too long to allocate.
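As a concrete illustration, cluster health can be checked with a single REST call. This is a minimal sketch; the response below is abridged and the values are only illustrative:

GET /_cluster/health

{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5
}

Here a single node holds every primary shard, but no replicas can be assigned, so the status is yellow, matching case 2 above.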

Elasticsearch Configuration

Elasticsearch comes with a set of defaults that provide a good out-of-the-box experience for development. The implication is that those defaults are not necessarily great for production: a production deployment must be tailored to your own needs, which cannot be predicted in advance.

The default settings make it easy to download and run multiple nodes on the same machine without any configuration changes.

Where are the settings?

Inside each installation of Elasticsearch is a config/elasticsearch.yml. That is where the following settings live (a sample file sketch follows this list):

  • cluster.name
    • The name of the cluster that the node is joining. All nodes in the same cluster must share the same name.
    • Currently defaults to elasticsearch.
  • node.*
    • node.name
      • If not supplied, a random name will be generated each time the node starts. This can be fun, but it is not good for production environments.
      • Names are not required to be unique, but they should be.
    • node.master
      • A boolean setting. When true, it means that the node is an eligible master node and it can be the elected master node.
      • Defaults to true, meaning every node is an eligible master node.
    • node.data
      • A boolean setting. When true, it means that the node stores data and handles search activity.
      • Defaults to true.
  • path.*
    • path.data
      • The location where the node writes its files. All nodes use this directory to store metadata, but data nodes will also use it to store/index documents.
      • Defaults to ./data.
        • This means that data will be created for you as a peer directory to config inside of the Elasticsearch directory.
    • path.logs
      • The location where log files are written.
      • Defaults to ./logs.
  • network.*
    • network.host
      • Defaults to _local_, which is effectively localhost.
        • This means that, by default, nodes cannot be communicated with from outside of the current machine!
    • network.bind_host
      • Potentially an array, this tells Elasticsearch which addresses of the current machine to bind sockets to.
        • It is this list that enables hosts outside of the current machine (e.g., other nodes in the cluster) to talk to this node.
      • Defaults to network.host.
    • network.publish_host
      • A singular host that is used to advertise to other nodes how to best communicate with this node.
        • When supplying an array to network.bind_host, this should be the one host that is intended to be used for inter-node communication.
      • Defaults to network.host.
  • discovery.zen.*
    • discovery.zen.minimum_master_nodes
      • Defines quorum for master election. This must be set using the equation (M / 2) + 1, rounded down, where M is the number of eligible master nodes (nodes using node.master: true implicitly or explicitly).
      • Defaults to 1, which is only valid for a single node cluster!
    • discovery.zen.ping.unicast.hosts
      • The mechanism for joining this node to the rest of a cluster.
      • This should list eligible master nodes so that a node can find the rest of the cluster.
      • The value that should be used here is the network.publish_host of those other nodes.
      • Defaults to localhost, which means it only looks on the local machine for a cluster to join.
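A minimal sketch of what a config/elasticsearch.yml might look like for one node in a three-node cluster (the cluster name, node name, paths, and addresses below are all hypothetical):

cluster.name: my_cluster
node.name: node-1
node.master: true
node.data: true
path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.1.10
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]

With three eligible master nodes, (3 / 2) + 1 rounded down gives 2, which is the value used for discovery.zen.minimum_master_nodes.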

What types of settings exist?

Elasticsearch provides three different types of settings:

  • Cluster-wide settings
    • These are settings that apply to everything in the cluster, such as all nodes or all indices.
  • Node settings
    • These are settings that apply to just the current node.
  • Index settings
    • These are settings that apply to just the index.

Depending on the setting, it can be:

  • Changed dynamically at runtime (see the sketch after this list)
  • Changed following a restart (close / open) of the index
    • Some index-level settings do not require the index to be closed and reopened, but might require the index to be forcibly re-merged for the setting to fully apply.
      • The compression level of an index is an example of this type of setting: it can be changed, but only newly written segments take advantage of the change. So, if an index will no longer change, it never benefits from the new setting unless you force the index to recreate (merge) its segments.
  • Changed following a restart of the node
  • Changed following a restart of the cluster
  • Never changed
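For example, the number of replicas of an index can be changed dynamically at runtime without any restart. A minimal sketch (my_index is a hypothetical index name):

PUT /my_index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}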

Always check the documentation for your version of Elasticsearch for what you can or cannot do with a setting.

How can I apply settings?

You can apply settings in a few ways, some of which are not recommended:

  • Command Line Arguments

In Elasticsearch 1.x and 2.x, you can submit most settings as Java System Properties prefixed with es.:

$ bin/elasticsearch -Des.cluster.name=my_cluster -Des.node.name=`hostname`

In Elasticsearch 5.x, this changes to avoid using Java System Properties, instead using a custom argument type with -E taking the place of -Des.:

$ bin/elasticsearch -Ecluster.name=my_cluster -Enode.name=`hostname`

This approach to applying settings works well when using tools like Puppet, Chef, or Ansible to start and stop the cluster. However, it works poorly when starting nodes manually.

  • YAML settings
    • Shown in the elasticsearch.yml sketch above
  • Dynamic settings
    • Shown in the sketch below
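A minimal sketch of applying dynamic settings through the cluster settings API; the setting shown, cluster.routing.allocation.enable, is just one example of a dynamic cluster-wide setting:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  },
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

Persistent settings survive a full cluster restart, while transient settings do not; as the precedence list below shows, the transient value wins while both are set.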

Settings are applied in order from most dynamic to least dynamic:

  1. Transient settings
  2. Persistent settings
  3. Command line settings
  4. YAML (static) settings

If the same setting is set at more than one of those levels, then the most dynamic level (highest in the list) takes effect.

Difference Between Indices and Types

It's easy to think of a type as a table in an SQL database, where the index is the database itself. However, that is not a good way to approach types.

All About Types

In fact, types are literally just a metadata field added to each document by Elasticsearch: _type. The examples above created two types: my_type and my_other_type. That means that each document associated with the types has an extra field automatically defined like "_type": "my_type"; this is indexed with the document, thus making it a searchable or filterable field, but it does not impact the raw document itself, so your application does not need to worry about it.
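As a sketch, indexing one document into each of those two types might look like the following (the index name, document IDs, and fields here are hypothetical):

PUT /my_index/my_type/1
{
  "some_field": "some value"
}

PUT /my_index/my_other_type/2
{
  "some_field": "another value"
}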

All types live in the same index, and therefore in the same collective shards of the index. Even at the disk level, they live in the same files. The only separation that creating a second type provides is a logical one. Every type, whether it's unique or not, needs to exist in the mappings and all of those mappings must exist in your cluster state. This eats up memory and, if each type is being updated dynamically, it eats up performance as the mappings change.

As such, it is a best practice to define only a single type unless you actually need other types. It is common to see scenarios where multiple types seem desirable. For example, imagine you had a car index. It may be useful to you to break it down with multiple types:

  • bmw
  • chevy
  • honda
  • mazda
  • mercedes
  • nissan
  • rangerover
  • toyota
  • ...

This way you can search across all cars or limit the search by manufacturer on demand. The difference between those two searches is as simple as:

GET /cars/_search

and

GET /cars/bmw/_search

What is not obvious to new users of Elasticsearch is that the second form is a specialization of the first form. It literally gets rewritten to:

GET /cars/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term" : {
            "_type": "bmw"
          }
        }
      ]
    }
  }
}

It simply filters out any document that was not indexed with a _type field whose value was bmw. Since every document is indexed with its type as the _type field, this serves as a pretty simple filter. If an actual search had been provided in either example, then the filter would be added to the full search as appropriate.

As such, if the types are identical, it's much better to supply a single type (e.g., manufacturer in this example) and effectively ignore it. Then, within each document, explicitly supply a field called make, or whatever name you prefer, and manually filter on it whenever you want to limit results to it (a sample query appears after the mapping examples below). This will reduce the size of your mappings to 1/n, where n is the number of separate types. It does add another field to each document, but with the benefit of an otherwise simplified mapping.

In Elasticsearch 1.x and 2.x, such a field should be defined as part of the index's mappings:

PUT /cars
{
  "mappings": {
    "manufacturer": { <1>
      "properties": {
        "make": { <2>
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
  1. The name is arbitrary.
  2. The name is arbitrary and it could match the type name if you wanted it to.

In Elasticsearch 5.x, the above will still work (though the string type is deprecated), but the better way is to use:

PUT /cars
{
  "mappings": {
    "manufacturer": { <1>
      "properties": {
        "make": { <2>
          "type": "keyword"
        }
      }
    }
  }
}
  1. The name is arbitrary.
  2. The name is arbitrary and it could match the type name if you wanted it to.
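With a make field defined this way, limiting a search to a single manufacturer becomes an ordinary filter on that field instead of a _type filter. A sketch mirroring the rewritten _type query above:

GET /cars/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term" : {
            "make": "bmw"
          }
        }
      ]
    }
  }
}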

Types should be used sparingly within your indices because it bloats the index mappings, usually without much benefit. You must have at least one, but there is nothing that says you must have more than one.

Common Questions

  • What if I have two (or more) types that are mostly identical, but which have a few unique fields per type?

At the index level, there is no difference between a single type with a few sparsely used fields and multiple types that share a number of non-sparse fields while each adding a few fields the other types never use.

Said differently: a sparsely used field is sparse across the index regardless of types. The sparsity does not benefit -- or really hurt -- the index just because it is defined in a separate type.

You should just combine these types and add a separate type field.

  • Why do separate types need to define fields in the exact same way?

Because each field is really only defined once at the Lucene level, regardless of how many types there are. The fact that types exist at all is a feature of Elasticsearch and it is only a logical separation.

  • Can I define separate types with the same field defined differently?

No. If you manage to find a way to do so in ES 2.x or later, then you should open up a bug report. As noted in the previous question, Lucene sees them all as a single field, so there is no way to make this work appropriately.

ES 1.x left this as an implicit requirement, which allowed users to create conditions where one shard's mappings in an index actually differed from another shard in the same index. This was effectively a race condition and it could lead to unexpected issues.

Exceptions to the Rule

  • Parent/child documents require separate types to be used within the same index.
    • The parent lives in one type.
    • The child lives in a separate type (but each child lives in the same shard as its parent); see the sketch after this list.
  • Extremely niche use cases where creating tons of indices is undesirable and the impact of sparse fields is preferable to the alternative.
    • For example, the Elasticsearch monitoring plugin, Marvel (1.x and 2.x) or X-Pack Monitoring (5.x+), monitors Elasticsearch itself for changes in the cluster, the nodes, the indices, specific indices (the index level), and even the shards. It could create 5+ indices each day to isolate the documents that have unique mappings, or it can go against best practice and share a single index to reduce cluster load (note: the number of defined mappings is effectively the same, but the number of created indices is reduced from n to 1).
    • This is an advanced scenario, but you must consider the shared field definitions across types!
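As a sketch of the parent/child exception, a 1.x/2.x-style mapping might declare the relationship like this (the index and type names are hypothetical):

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}

A child document is then indexed with a parent parameter (e.g., PUT /company/employee/1?parent=london), which is what routes it to the same shard as its parent.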

Curl Commands

Analyzers

Analyzers take the text from a string field and generate the tokens that are indexed and later matched when querying.

An Analyzer operates in a sequence:

  • CharFilters (Zero or more)
  • Tokenizer (One)
  • TokenFilters (Zero or more)

An analyzer may be applied via the mappings so that, when a field is indexed, it is indexed on a per-token basis rather than as the whole string. When querying, the query input is also run through the analyzer. Therefore, if you normalize text in the analyzer, queries will match even when they contain a non-normalized form of the text.
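A sketch of a custom analyzer that follows that sequence, using only built-in components (the index and analyzer names are hypothetical):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

Here html_strip is a char filter, standard is the tokenizer, and lowercase and asciifolding are token filters; the same chain runs at both index time and query time for fields mapped to use my_analyzer.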

Search API

Learning Elasticsearch with Kibana

Difference Between Relational Databases and Elasticsearch

Aggregations