Elasticsearch

sgiven96 edited this page Sep 26, 2017 · 1 revision

Overview

Elasticsearch provides scalable, near real-time search and supports multitenancy. It is distributed: indices can be divided into shards, and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator, delegating operations to the correct shard. Rebalancing and routing are done automatically. Elasticsearch is built on Lucene and tries to make all of Lucene's features available through the JSON and Java APIs. It supports faceting and percolation; percolation is useful for sending notifications when new documents match registered queries. Elasticsearch also supports real-time GET requests, which makes it suitable as a NoSQL datastore.
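To make the idea of delegating an operation to the correct shard concrete, here is a minimal sketch of hash-based routing. It assumes the commonly documented scheme in which a document's routing key (its ID by default) is hashed and taken modulo the number of primary shards. The function name pick_shard and the use of zlib.crc32 are illustrative stand-ins, not Elasticsearch's actual hash function.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key to a primary shard number (illustrative only)."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # zlib.crc32 is a stand-in hash chosen for this sketch; Elasticsearch
    # uses its own hash function internally.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # Hypothetical document IDs spread across 5 primary shards.
    for doc_id in ("user-1", "user-2", "user-3"):
        print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, changing that number would send existing documents to different shards, which is one way to see why the primary shard count is fixed when an index is created.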

A few notes on maintaining cluster health:

A cluster sized to the smallest number of server instances needed when everything is going fine will fail completely under load as soon as one node is lost or temporarily unavailable. The larger the cluster, the greater the odds per unit time that any one of its nodes will suffer such an event.

Splitting data by characteristics, such as usage pattern and especially rate of updates, can be very helpful. For example, put the data that must update frequently into one cluster and the more static data into another. Once data is split this way, optimizing for each use case becomes much easier: each cluster can use a very different configuration, arrangement of data, and class of server instance, which cannot be achieved with everything lumped into a single cluster.

Use bulk indexing requests for optimal performance. The right bulk size depends on your data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk request. Note that this is physical size; document count is not a good metric for bulk size.
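Sizing bulk requests by bytes rather than by document count can be sketched as follows. This is a minimal illustration, assuming the newline-delimited bulk format (an action line followed by a source line for each document). The function name bulk_bodies, the index name, and the 10 MB default are hypothetical choices for this example; depending on the Elasticsearch version, the action line may also need a _type.

```python
import json
from typing import Dict, Iterable, Iterator, List


def bulk_bodies(
    docs: Iterable[Dict],
    index: str,
    max_bytes: int = 10 * 1024 * 1024,  # assumed default inside the 5-15 MB range
) -> Iterator[bytes]:
    """Yield bulk request bodies whose encoded size stays within max_bytes.

    A single document larger than max_bytes is still emitted, alone,
    in a body of its own.
    """
    action = json.dumps({"index": {"_index": index}})
    chunks: List[bytes] = []
    size = 0
    for doc in docs:
        # One action line plus one source line, each newline-terminated.
        pair = f"{action}\n{json.dumps(doc)}\n".encode("utf-8")
        if chunks and size + len(pair) > max_bytes:
            yield b"".join(chunks)
            chunks, size = [], 0
        chunks.append(pair)
        size += len(pair)
    if chunks:
        yield b"".join(chunks)


if __name__ == "__main__":
    sample = ({"n": i, "text": "x" * 100} for i in range(50))
    for body in bulk_bodies(sample, "logs", max_bytes=1000):
        print(len(body), "bytes")
```

Each yielded body would be sent as one bulk request. Measuring the encoded bytes keeps request sizes stable even when individual documents vary widely in size.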

Terms:

A node is a single running instance of Elasticsearch. A single physical or virtual server can host multiple nodes, depending on its resources, such as RAM, storage, and processing power.

A cluster is a collection of one or more nodes. It provides collective indexing and search capabilities across all of its nodes for the entire data set.

An index is a collection of documents of different types, together with their properties. Indexes also use sharding to improve performance.

A type (or mapping) is a collection of documents in the same index that share a set of common fields. For example, if an index contains the data of a social networking application, there can be one type for user profile data, another for messaging data, and another for commenting data.

A document is a collection of fields, defined in JSON format. Every document belongs to a type and resides inside an index. Every document is associated with a unique identifier, called the UID.

Shards are horizontal subdivisions of an index. Each shard contains all the properties of a document but holds fewer JSON objects than the index as a whole.

Elasticsearch allows a user to create replicas of their indexes and shards. Replication not only increases the availability of data in case of failure, but also improves search performance by running search operations on the replicas in parallel. Once replicated, each index has primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of shards.
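The per-index shard and replica settings described above can be sketched as request bodies. This is a minimal illustration, assuming the number_of_shards and number_of_replicas setting names from the index settings API. The helper names and the index name in the comments are hypothetical, and no request is actually sent.

```python
import json


def create_index_body(primary_shards: int, replicas: int) -> str:
    """Body for creating an index, e.g. PUT /my_index (hypothetical name)."""
    return json.dumps(
        {
            "settings": {
                "number_of_shards": primary_shards,  # fixed once the index exists
                "number_of_replicas": replicas,  # can be changed later
            }
        }
    )


def update_replicas_body(replicas: int) -> str:
    """Body for changing replicas later, e.g. PUT /my_index/_settings."""
    return json.dumps({"index": {"number_of_replicas": replicas}})


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary shard has `replicas` copies in addition to itself."""
    return primary_shards * (1 + replicas)


if __name__ == "__main__":
    print(create_index_body(3, 1))
    print(update_replicas_body(2))
    print(total_shard_copies(3, 1))
```

With 3 primary shards and 1 replica, the cluster holds 6 shard copies in total. Raising the replica count to 2 later would bring that to 9 without touching the primary shard count.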
