-
Best practices for index generation and storage
-
Aggregated events on the order of hundreds of millions
- Nothing yet
-
Relatively few fields
-
Defining too many fields in an index can lead to a mapping explosion, which can cause out-of-memory errors and situations that are difficult to recover from. This occurs particularly when many heterogeneous documents are indexed.
-
The more fields a query_string or multi_match query targets, the slower it is. A common technique to improve search speed over multiple fields is to copy their values into a single field at index time, and then use this field at search time.
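The copy-fields-at-index-time technique can be sketched as a mapping that uses copy_to (field and index names here are illustrative, not from this project; the mapping is shown in the typeless style of recent Elasticsearch versions, whereas older clusters nest properties under a document type as in the example further below):

```python
# Sketch: copy several fields into one combined field at index time,
# so search-time queries can target a single field instead of a slow
# multi-field query_string/multi_match.
mapping = {
    "mappings": {
        "properties": {
            "first_name": {"type": "text", "copy_to": "full_name"},
            "last_name": {"type": "text", "copy_to": "full_name"},
            "full_name": {"type": "text"},  # combined field, queried at search time
        }
    }
}

# At search time, query the single combined field:
query = {"query": {"match": {"full_name": "John Smith"}}}
```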
-
-
Using the keyword string type for identifiers
-
The fact that some data is numeric does not mean it should always be mapped as a numeric field. The way that Elasticsearch indexes numbers optimizes for range queries while keyword fields are better at term queries. Typically, fields storing identifiers such as an ISBN or any number identifying a record from another database are rarely used in range queries or aggregations. This is why they might benefit from being mapped as keyword rather than as integer or long.
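A minimal sketch of that distinction in a mapping (field names and the sample ISBN are illustrative):

```python
# Identifiers are looked up with exact term queries, not range queries,
# so keyword is usually the better mapping even when the value looks numeric.
mapping = {
    "properties": {
        "isbn": {"type": "keyword"},    # identifier: exact term matching
        "price": {"type": "integer"},   # genuinely numeric: ranges, aggregations
    }
}

# Term query against the keyword-mapped identifier:
query = {"query": {"term": {"isbn": "978-3-16-148410-0"}}}
```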
-
-
Can Logstash be used to decode JSON Web Tokens?
-
A JWT is base64url-encoded in three parts: header.payload.signature. There is a 'Cipher' plugin that can handle the base64 codec, but it won't validate the signature.
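What such an unverified decode amounts to can be sketched in Python (the function name and sample token are illustrative; like the plugin approach above, this performs no signature validation):

```python
import base64
import json

def decode_jwt_unverified(token):
    """Split header.payload.signature and base64url-decode the first two
    parts. The signature is ignored, NOT validated."""
    header_b64, payload_b64, _signature = token.split(".")

    def b64url_decode(part):
        # JWTs use unpadded base64url; restore padding before decoding.
        part += "=" * (-len(part) % 4)
        return json.loads(base64.urlsafe_b64decode(part))

    return b64url_decode(header_b64), b64url_decode(payload_b64)
```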
-
-
More generally, how do we do custom scripting in Logstash?
-
Python language plugin support has been deprecated as of ES 5.0.0
-
Ruby is still in place; Logstash plugins are written as Ruby Gems
-
ES recommends scripting in Painless
-
-
Analysis required for report generation
- Tips for increasing search speed
-
Index segmentation. There seems to be a common practice of using many smaller indices with queries that operate across them (e.g. a shared index-name prefix queried with a wildcard).
- Most APIs that accept an index parameter support execution across multiple indices
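How a wildcard index pattern selects indices can be sketched with ordinary glob matching (index names are illustrative):

```python
from fnmatch import fnmatch

def matching_indices(pattern, all_indices):
    """Return the index names a wildcard pattern would select,
    mimicking how a wildcard in the index parameter fans out."""
    return [name for name in all_indices if fnmatch(name, pattern)]

indices = ["searchlog-2017.01", "searchlog-2017.02", "eventlog-2017.01"]
# A search such as es.search(index="searchlog-*", body=query) would then
# execute across every index that "searchlog-*" matches.
```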
-
Suggested number of shards to be deployed
- The challenge is to figure out the right number of shards, because you only get to make the decision once per index. The Elastic team recommends starting with one shard, sending "realistic" traffic, and seeing where it breaks; then add more shards and retest until you find the right number. The key is to pick some kind of a timescale. You will eventually have to reshard; the only question is when.
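Because the shard count is fixed at index creation, it has to be set in the index settings up front (values and the index name below are illustrative starting points, not recommendations for this project):

```python
# number_of_shards is an index-creation-time setting and cannot be
# changed afterwards; number_of_replicas can be adjusted later.
indexsettings = {
    "settings": {
        "index": {
            "number_of_shards": 1,    # start at one, load-test, retest higher
            "number_of_replicas": 1,
        }
    }
}
# With the Python client this would be passed at creation, e.g.:
# ES.indices.create(index="searchlog-2017.01", body=indexsettings)
```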
-
Hardware suggestions, e.g. single or multiple VMs, memory, disk, CPU balance
- Nothing yet
Outline some key best practices for index creation and management.
What is the correct procedure for defining an index so that field types are optimally defined?
-
As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.
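As a sketch of that choice (field names are illustrative):

```python
# Pick the smallest integer type that covers the expected value range;
# this helps indexing and searching, though not on-disk storage.
mapping = {
    "properties": {
        "http_status": {"type": "short"},   # -32768..32767 is ample for status codes
        "bytes_sent": {"type": "long"},     # totals may exceed the 32-bit range
    }
}
```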
-
Existing field mappings cannot be updated. Changing the mapping would mean invalidating already indexed documents. Instead, you should create a new index with the correct mappings and reindex your data into that index.
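The reindex workflow can be sketched as follows (index names are illustrative; a _reindex request body is assumed, passed via the Python client):

```python
# Sketch: create a new index with the corrected mappings, then copy the
# existing documents into it. The old index is untouched until you are
# ready to switch over (e.g. via an alias) and delete it.
reindex_body = {
    "source": {"index": "searchlog-v1"},   # old index, wrong mapping
    "dest": {"index": "searchlog-v2"},     # new index, corrected mapping
}
# ES.indices.create(index="searchlog-v2", body=corrected_indexconfig)
# ES.reindex(body=reindex_body)
```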
-
To assign specific data types to individual fields, the mapping must be created when the index is created:
indexconfig = {
    "mappings": {
        "apacheLine": {
            "properties": {
                "sessionid": {"type": "long"},
                "searchevent": {"type": "boolean"}
            }
        }
    }
}
ES.indices.create(index=indexname, body=indexconfig)
-
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/about.html
-
https://github.com/elastic/curator/blob/master/docs/asciidoc/about.asciidoc
- Any information we should know about the Apache logs?
- Source machine(s)?
- File-based data source for the following:
- [Dave will provide details]
- Polling frequency?
- The Logstash config files in order of execution:
-
10-beats-input.conf
Here's what it does
and here's why
-
20-searchlog-filter.conf
Here's what it does
and here's why
-
21-eventlog-filter.conf
Here's what it does
and here's why
-
30-elasticsearch-output.conf
Here's what it does
and here's why
-