-
Best practices for index generation and storage
-
Aggregated events on the order of hundreds of millions
- Nothing yet
-
Relatively few fields
-
Defining too many fields in an index can lead to a mapping explosion, which can cause out-of-memory errors and situations that are difficult to recover from. This occurs particularly when many heterogeneous documents are indexed.
-
The more fields a query_string or multi_match query targets, the slower it is. A common technique to improve search speed over multiple fields is to copy their values into a single field at index time, and then use this field at search time.
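The copy-fields-at-index-time technique can be sketched as a mapping that uses copy_to (field and index names here are illustrative, not from this project; the mapping is shown in the typeless style of recent Elasticsearch versions, whereas older clusters nest properties under a document type as in the example further below):

```python
# Sketch: copy several fields into one combined field at index time,
# so search-time queries can target a single field instead of a slow
# multi-field query_string/multi_match.
mapping = {
    "mappings": {
        "properties": {
            "first_name": {"type": "text", "copy_to": "full_name"},
            "last_name": {"type": "text", "copy_to": "full_name"},
            "full_name": {"type": "text"},  # combined field, queried at search time
        }
    }
}

# At search time, query the single combined field:
query = {"query": {"match": {"full_name": "John Smith"}}}
```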
-
-
Using the keyword string type for identifiers
-
The fact that some data is numeric does not mean it should always be mapped as a numeric field. The way that Elasticsearch indexes numbers optimizes for range queries while keyword fields are better at term queries. Typically, fields storing identifiers such as an ISBN or any number identifying a record from another database are rarely used in range queries or aggregations. This is why they might benefit from being mapped as keyword rather than as integer or long.
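A minimal sketch of that distinction in a mapping (field names and the sample ISBN are illustrative):

```python
# Identifiers are looked up with exact term queries, not range queries,
# so keyword is usually the better mapping even when the value looks numeric.
mapping = {
    "properties": {
        "isbn": {"type": "keyword"},    # identifier: exact term matching
        "price": {"type": "integer"},   # genuinely numeric: ranges, aggregations
    }
}

# Term query against the keyword-mapped identifier:
query = {"query": {"term": {"isbn": "978-3-16-148410-0"}}}
```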
-
-
Can Logstash be used to decode JSON Web Tokens?
-
A JWT is base64url-encoded in three parts: header.payload.signature. There is a 'Cipher' plugin that can handle the base64 codec, but it won't validate the signature.
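What such an unverified decode amounts to can be sketched in Python (the function name and sample token are illustrative; like the plugin approach above, this performs no signature validation):

```python
import base64
import json

def decode_jwt_unverified(token):
    """Split header.payload.signature and base64url-decode the first two
    parts. The signature is ignored, NOT validated."""
    header_b64, payload_b64, _signature = token.split(".")

    def b64url_decode(part):
        # JWTs use unpadded base64url; restore padding before decoding.
        part += "=" * (-len(part) % 4)
        return json.loads(base64.urlsafe_b64decode(part))

    return b64url_decode(header_b64), b64url_decode(payload_b64)
```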
-
-
More generally, how do we do custom scripting in Logstash?
-
Python language plugin support has been deprecated as of ES 5.0.0
-
Ruby is still in place; Logstash plugins are written as Ruby Gems
-
ES recommends scripting in Painless
-
-
Analysis required for report generation
- Tips for increasing search speed
-
Index segmentation. There seems to be a common practice of using many smaller indices with queries that operate across them (e.g. a shared index-name prefix queried with a wildcard).
- Most APIs that accept an index parameter support execution across multiple indices
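How a wildcard index pattern selects indices can be sketched with ordinary glob matching (index names are illustrative):

```python
from fnmatch import fnmatch

def matching_indices(pattern, all_indices):
    """Return the index names a wildcard pattern would select,
    mimicking how a wildcard in the index parameter fans out."""
    return [name for name in all_indices if fnmatch(name, pattern)]

indices = ["searchlog-2017.01", "searchlog-2017.02", "eventlog-2017.01"]
# A search such as es.search(index="searchlog-*", body=query) would then
# execute across every index that "searchlog-*" matches.
```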
-
Suggested number of shards to be deployed
- The challenge is to figure out the right number of shards, because you only get to make the decision once per index. The Elastic team recommends starting with one shard, sending "realistic" traffic, and seeing where it breaks; then add more shards and retest until you find the right number. The key is to pick some kind of a timescale. You will eventually have to reshard; the only question is when.
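Because the shard count is fixed at index creation, it has to be set in the index settings up front (values and the index name below are illustrative starting points, not recommendations for this project):

```python
# number_of_shards is an index-creation-time setting and cannot be
# changed afterwards; number_of_replicas can be adjusted later.
indexsettings = {
    "settings": {
        "index": {
            "number_of_shards": 1,    # start at one, load-test, retest higher
            "number_of_replicas": 1,
        }
    }
}
# With the Python client this would be passed at creation, e.g.:
# ES.indices.create(index="searchlog-2017.01", body=indexsettings)
```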
-
Hardware suggestions, e.g. single or multiple VMs, memory, disk, CPU balance
- Nothing yet
Outline some key best practices for index creation and management.
What is the correct procedure for defining an index so that field types are optimally defined?
-
As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.
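As a sketch of that choice (field names are illustrative):

```python
# Pick the smallest integer type that covers the expected value range;
# this helps indexing and searching, though not on-disk storage.
mapping = {
    "properties": {
        "http_status": {"type": "short"},   # -32768..32767 is ample for status codes
        "bytes_sent": {"type": "long"},     # totals may exceed the 32-bit range
    }
}
```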
-
Existing field mappings cannot be updated. Changing the mapping would mean invalidating already indexed documents. Instead, you should create a new index with the correct mappings and reindex your data into that index.
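The reindex workflow can be sketched as follows (index names are illustrative; a _reindex request body is assumed, passed via the Python client):

```python
# Sketch: create a new index with the corrected mappings, then copy the
# existing documents into it. The old index is untouched until you are
# ready to switch over (e.g. via an alias) and delete it.
reindex_body = {
    "source": {"index": "searchlog-v1"},   # old index, wrong mapping
    "dest": {"index": "searchlog-v2"},     # new index, corrected mapping
}
# ES.indices.create(index="searchlog-v2", body=corrected_indexconfig)
# ES.reindex(body=reindex_body)
```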
-
To assign specific data types to individual fields, the mapping must be created when the index is created:
indexconfig = {
    "mappings": {
        "apacheLine": {
            "properties": {
                "sessionid": {"type": "long"},
                "searchevent": {"type": "boolean"}
            }
        }
    }
}
ES.indices.create(index=indexname, body=indexconfig)
-
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/about.html
-
https://github.com/elastic/curator/blob/master/docs/asciidoc/about.asciidoc
- Any information we should know about the Apache logs?
- Source machine(s)?
- File-based data source for the following:
- [Dave will provide details]
- Polling frequency?
- The Logstash config files in order of execution:
-
10-beats-input.conf
Here's what it does
and here's why
-
20-searchlog-filter.conf
Here's what it does
and here's why
-
21-eventlog-filter.conf
Here's what it does
and here's why
-
30-elasticsearch-output.conf
Here's what it does
and here's why
-