Distributed full-text search engine
Elasticsearch nodes form a cluster automatically; the default cluster name is elasticsearch
- Pull the images
docker pull elasticsearch:7.12.0
docker pull kibana:7.12.0
- Create the container
# create config, data, and plugins directories to mount into the container
mkdir -p /d/Docker/elasticsearch/config
mkdir -p /d/Docker/elasticsearch/data
mkdir -p /d/Docker/elasticsearch/plugins
# note the space after the colon: YAML requires it
echo "http.host: 0.0.0.0" >> /d/Docker/elasticsearch/config/elasticsearch.yml
# 9200 is the REST port; in cluster mode 9300 is the inter-node transport port
# -v mounts host directories
# plugins holds analyzer plugins such as ik
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms256m -Xmx256m" \
  -v /d/Docker/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /d/Docker/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /d/Docker/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.12.0
# view logs
docker logs containerId
# start automatically with the Docker daemon
docker update containerId --restart=always
For configuration, only elasticsearch.yml is overridden by the mount:
docker run --name elasticsearch -v /d/Docker/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /d/Docker/elasticsearch/data/:/usr/share/elasticsearch/data -v /d/Docker/elasticsearch/plugins/:/usr/share/elasticsearch/plugins -v /d/Docker/elasticsearch/logs/:/usr/share/elasticsearch/logs -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e ES_JAVA_OPTS="-Xms512m -Xmx512m" -d elasticsearch:7.12.0
docker update containerId --restart=always
docker run --name kibana -e ELASTICSEARCH_HOSTS=http://192.168.0.104:9200 -p 5601:5601 \
  -d kibana:7.12.0
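Once both containers are up, a quick sanity check from Kibana Dev Tools (http://192.168.0.104:5601) or any HTTP client; a sketch assuming the port mappings above:
# returns node name, cluster name, and version if the node is reachable
GET /
# for a fresh single-node cluster the status is typically green or yellow
GET /_cluster/health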
Searches can be issued in two ways
- query string (URI search)
- request body ("query DSL")
GET bank/_search?q=*&sort=account_number:asc
GET bank/_search
{
"query": {
"match_all": {}
},
# sort
"sort": [
{ "account_number": "asc" },
{ "balance": "desc" }
],
# pagination
"from": 0,
"size": 5,
# return only the listed fields
"_source": ["firstname", "balance"]
}
Full-text search runs against the inverted index
- match
- match_phrase
- multi_match
# results are ranked by _score
# match: the query string "mill movico" is analyzed and any term may match
GET bank/_search
{
  "query": {
    "match": {
      "address": "mill movico"
    }
  }
}
# match_phrase: the words must appear together as a phrase, in order
GET bank/_search
{
  "query": {
    "match_phrase": {
      "address": "mill movico"
    }
  }
}
# exact match on the keyword sub-field: "mill movico cc" would NOT be found
# "match": { "address.keyword": "mill movico" }
# multi_match: the analyzed query is matched against several fields
GET bank/_search
{
  "query": {
    "multi_match": {
      "query": "mill movico",
      # address OR city must match "mill movico"
      "fields": ["address", "city"]
    }
  }
}
GET bank/_search
{
"query": {
"bool": {
# must: every clause has to match, like =
"must": [
{
"match": {
"gender": "F"
}
},
{
"match": {
"address": "mill"
}
}
],
# must_not: like !=
"must_not": [
{
"match": {
"age": "38"
}
}
],
# should: like or; a match boosts the score
"should": [
{
"match": {
"lastname": "Wallace"
}
}
],
"filter": {
"range": {
...
}
}
}
}
}
filter and must_not run in filter context and do not contribute to the score
must and should are scored
GET bank/_search
{
"query": {
"bool": {
# "must": {
"filter": {
"range": {
"age": {
"gte": 18,
"lte": 30
}
}
}
}
}
}
Use term for exact matches on non-text fields
Use field.keyword for exact matches on text fields
GET bank/_search
{
"query": {
"term": {
"age": 28
}
}
}
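And for the text-field case, a sketch of an exact match through the keyword sub-field; the address value is illustrative:
GET bank/_search
{
  "query": {
    "match": {
      "address.keyword": "990 Mill Road"
    }
  }
}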
Integer values are mapped to long by default
keyword type: not analyzed; matched exactly
text type: analyzed when the document is saved; searched via the analyzed terms
PUT /my_index
{
"mappings": {
"properties": {
"age": {"type": "integer"},
"email": {"type": "keyword"},
"name": {"type": "text"}
}
}
}
# add a new field to an existing index (the field name here is illustrative)
PUT /my_index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      # index: false stores the field but makes it unsearchable
      "index": false
    }
  }
}
An existing field's mapping cannot be changed; instead:
- create a new index with the correct mapping
- migrate the data
# 1
PUT /newbank
{
"mappings": {
"properties": {
"account_number": {
"type": "long",
...
}
}
}
}
# 2
# since 6.0, use _reindex
POST _reindex
{
"source": {
"index": "bank"
},
"dest": {
"index": "newbank"
}
}
# migrating a pre-6.0 index to 6.0+
# all types are rewritten to _doc
POST _reindex
{
"source": {
"index": "bank",
# the source type must be specified
"type": "account"
},
"dest": {
"index": "newbank"
}
}
Index: corresponds to a database in a relational DB
Document: corresponds to a single row in a relational table
Type: defaults to "_doc"; deprecated
Shard: the file blocks that hold all ES data, and the smallest unit of data; the core job of an ES cluster is distributing, indexing, load-balancing, and routing across shards so data is stored evenly over the whole cluster
Before 7.0 an index defaulted to 5 primary shards; since 7.0 the default is 1 primary shard, each with one replica (replica shard)
A primary shard and its replica are placed on different nodes. ==One shard is one Lucene index (a file directory containing an inverted index)==
Inverted index: an ordered list of all distinct terms that occur in any document, where each term carries the list of documents it appears in
doc1: good good study, up up every day
doc2: day day up, good good study
term | doc1 | doc2 | postings |
---|---|---|---|
day | √ | √ | 1,2 |
every | √ | × | 1 |
good | √ | √ | 1,2 |
study | √ | √ | 1,2 |
up | √ | √ | 1,2 |
Tokenization: a tokenizer takes a character stream, splits it into individual tokens, and emits a stream of tokens
# standard analyzer
POST _analyze
{
"analyzer": "standard",
"text": "尚硅谷电商项目"
}
# result
"尚"
"硅"
"谷"
"电"
"商"
"项"
"目"
Chinese analyzer: ik
- ik_smart
- ik_max_word
# install: go to the plugins directory under the ES home
cd $ES_HOME/plugins && mkdir ik
# unzip into the ik folder
unzip elasticsearch-analysis-ik-7.12.0.zip -d ik
# chmod
chmod -R 777 ik/
# add business-specific terms to a custom dictionary,
# referenced in IKAnalyzer.cfg.xml as <entry key="ext_dict">my.dic</entry>
ik_smart
POST _analyze
{
"analyzer": "ik_smart",
"text": "我是中国人"
}
# result
我
是
中国人
ik_max_word
POST _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人"
}
# result
我
是
中国人
中国
国人
Hot-updating the dictionary: the ik analyzer fetches new word lists from nginx
- Install nginx
Start a throwaway nginx instance just to copy its default configuration
docker run -p 80:80 --name nginx -d nginx:1.10
Copy nginx's runtime files to the host
docker container cp nginx:/etc/nginx ./conf
# copies the /etc/nginx folder from the container into ./conf
Remove the temporary instance
docker stop nginx && docker rm nginx
Then run
mkdir nginx
mv conf nginx/
docker run -p 80:80 --name nginx \
  -v /yangzl/docker/nginx/html:/usr/share/nginx/html \
  -v /yangzl/docker/nginx/logs:/var/log/nginx \
  -v /yangzl/docker/nginx/conf:/etc/nginx \
  -d nginx:1.10
The nginx directory now holds three folders
- conf
- html
- logs
- Create an es folder for the remote dictionary
vim html/index.html
mkdir es
cd es
vim terms.txt
尚硅谷
巧碧螺
# save and quit vim with ZZ
- Back in the ik plugin's config directory
cd ik/conf
vim IKAnalyzer.cfg.xml
# remote dictionary entry
<entry key="remote_ext_dict">http://192.168.0.104:80/es/terms.txt</entry>
# local dictionary entry
<entry key="ext_dict">my.dic</entry>
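After restarting ES, terms served by nginx take effect without reinstalling the plugin; a quick check against a term from terms.txt above:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "尚硅谷"
}
# expected: "尚硅谷" comes back as one token instead of three single characters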
==API reference:== es-api
- PUT
create (explicit document ID): localhost:9200/idx/_doc/id
- POST
create (auto-generated document ID): localhost:9200/idx/_doc
update: localhost:9200/idx/_doc/docId/_update
search all documents: localhost:9200/idx/_doc/_search
- DELETE
delete: localhost:9200/idx/_doc/docId
- GET
fetch a document by ID: localhost:9200/idx/_doc/docId
# all nodes
GET /_cat/nodes
# cluster health
GET /_cat/health
# the elected master
GET /_cat/master
# all indices, like SHOW DATABASES;
GET /_cat/indices
- PUT (must carry an ID)
- POST
# index1: index name
# _doc: typeless
# 1001: document id
PUT index1/_doc/1001
- Postman
- IDEA HTTP Client
###
POST http://127.0.0.1:9200/index1/_doc/1001
Content-Type: application/json
{
"name": "Jone Doe"
}
### optimistic-lock update, request 1
POST http://127.0.0.1:9200/index1/_doc/1001?if_seq_no=3&if_primary_term=1
Content-Type: application/json
{
"name": "John Doe2"
}
### optimistic-lock update, request 2
POST http://127.0.0.1:9200/index1/_doc/1001?if_seq_no=3&if_primary_term=1
Content-Type: application/json
{
"name": "John Doe2"
}
# the second update fails with 409
{
"error": {
"type": "version_conflict_engine_exception",
"reason": "[1001]: version conflict, required seqNo [3], primary term [1]. current document has seqNo [4] and primary term [1]",
"index_uuid": "e-eWHvqrRx2FDF4Fsyo4ng",
"shard": "0",
"index": "index1"
},
"status": 409
}
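To recover from the conflict, take the current seq_no from the 409 response (or refetch the document) and retry with it; a sketch, with the payload value made up:
### retry with the current sequence number
POST http://127.0.0.1:9200/index1/_doc/1001?if_seq_no=4&if_primary_term=1
Content-Type: application/json
{
"name": "John Doe3"
}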
- Postman
- IDEA HTTP Client
###
GET http://127.0.0.1:9200/index1/_doc/1001
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
{
"_index": "index1",
"_type": "_doc",
"_id": "1001",
"_version": 3,
"_seq_no": 3,
"_primary_term": 1,
"found": true,
"_source": {
"name": "Jone Doe"
}
}
# business code can use these fields for its own optimistic updates
# ?if_seq_no=1&if_primary_term=1
_seq_no: the concurrency-control field, incremented on every write; together with _primary_term it provides optimistic locking
# option 1
# _update diffs against the stored source; if nothing changed it returns "noop"
# the payload must be wrapped in "doc"
POST index1/_doc/1001/_update
{
  "doc": {
    "name": "John3"
  }
}
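For reference, an abridged sketch of the _update response when nothing changed; the version is not bumped:
{
  "_index": "index1",
  "_id": "1001",
  "result": "noop"
}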
# option 2
# a plain POST always bumps _seq_no and _version, even for identical data
POST index1/_doc/1001
# option 3
PUT index1/_doc/1001
- delete a document
- delete an index
DELETE index1/_doc/1001
# delete the whole index
DELETE index1
POST index1/_doc/_bulk
Content-Type: application/x-ndjson
# each pair of lines forms one operation
{"index": {"_id": "1"}}
{"name": "John Doe"}
{"index": {"_id": "2"}}
{"name": "Jane Doe"}
POST /_bulk
{"delete":{"_index": "index1", "_id": "1001"}}
{"create":{"_index": "index1", "_id": "1002"}}
{"title": "My first blog"}
{"index": {"_index": "index1"}}
{"title": "My second blog"}
{"update": {"_index": "index1", "_id": "1002"}}
{"doc": {"title": "My first update blog"}}
a delete action is a single line with no source
every other action is followed by its data line, two lines per operation
- strings
  - text
  - keyword: not parsed by the analyzer
- numeric types
  long, integer, short, byte, double, float, half_float, scaled_float
- date type: date
- boolean type: boolean
- binary: binary
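A sketch of a mapping that exercises these types; the index name and fields are made up for illustration:
PUT /typed_idx
{
  "mappings": {
    "properties": {
      "title":     { "type": "text" },
      "sku":       { "type": "keyword" },
      "stock":     { "type": "integer" },
      "price":     { "type": "scaled_float", "scaling_factor": 100 },
      "created":   { "type": "date" },
      "on_sale":   { "type": "boolean" },
      "thumbnail": { "type": "binary" }
    }
  }
}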
# send REST requests from Kibana Dev Tools
# RESTful-style create
# without an explicit mapping, field types are inferred
PUT /myidx/_doc/1
{
"name": "yangzl",
"age": 24
}
# explicit mapping: keyword fields are not analyzed
PUT /myidx
{
"mappings": {
"properties": {
"name": { "type": "keyword" },
"desc": { "type": "text" }
}
}
}
PUT /myidx/_doc/1
{
"name": "yangzl",
"desc": "我是一颗小虎牙Python"
}
After creation the stored data looks like this:
_index | _type | _id | _score | name | age |
---|---|---|---|---|---|
myidx | _doc | 1 | 1 | yangzl | 24 |
# update
# PUT works too, but it replaces the whole document: omitted fields are dropped
POST /myidx/_doc/1/_update
{
"doc": {
"name": "法外狂徒"
}
}
# RESTful-style delete
DELETE /myidx # delete the index
DELETE /myidx/_doc/1 # delete a document
# query
# query parameters (these apply to the _cat APIs):
# h: columns to return, comma-separated
# format: json, yaml, etc.
# local: if true, answer from the local node only instead of the master node
# v: include column headings
# s: columns to sort by, comma-separated
GET /myidx/_doc/1
GET /myidx/_search?q=name:yangzl
#*****************************************
# complex queries (when using a Client, each construct below maps onto a builder API)
GET /myidx/_doc/_search
{
"query": {
"match": { "name": "yangzl" }
},
# return only these fields
"_source": ["name", "age"],
# sort
"sort": [
{
"age": { "order": "asc" }
}
],
# pagination, like LIMIT
"from": 0,
"size": 10,
# highlight matched terms in the results
"highlight": {
"pre_tags": "<em>",
"post_tags": "</em>",
"fields": {
"name": {}
}
}
}
# compound queries
# a bool clause combines multiple sub-clauses with and/or/not logic to decide whether a document matches
GET /myidx/_doc/_search
{
"query": {
"bool": {
# like =; multiple terms are separated by spaces (quick fast)
"must": { "match": { "title": "quick fast" }},
# !=
"must_not": { "match": { "title": "lazy" }},
# or
"should": [
{ "match": { "title": "brown" }},
{ "match": { "title": "dog" }}
],
# filter, unscored
"filter": {
"range": {
"age": {
"gte": 10,
"lte": 25
}
}
}
}
}
}
bulk
# batch operations
POST /index/type/_bulk
# each pair of lines is one piece of data; _id sets the document ID
# the second line is the payload
{"index": {"_id": "1101"}}
{"name": "doug lea"}
"match": analyzed, fuzzy full-text matching
"term": exact matching
Spring Data ElasticSearch
API docs: es-api
- port 9300: transport client
  keeps a long-lived connection to the cluster
  deprecated; not recommended
- port 9200: REST clients
  the low-level REST client is only a thin wrapper over the ES API
  ==the high-level REST client wraps the ES API and query DSL much more fully; recommended==
Create the mall-search module, selecting the web starter
- Add the high-level-client dependency
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-high-level-client</artifactId>
<version>7.12.0</version>
</dependency>
dependencies {
compile 'org.elasticsearch.client:elasticsearch-rest-high-level-client:7.12.0'
}
- mall-search.pom
<properties>
<elasticsearch.version>7.12.0</elasticsearch.version>
</properties>
- Configuration
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ElasticSearchConfig {

    /** Auth token placeholder; only needed if the cluster requires one */
    private static final String TOKEN = "";

    /**
     * Common request options (e.g. shared headers) passed with every client call.
     */
    public static final RequestOptions COMMON_OPTIONS;

    static {
        RequestOptions.Builder builder = RequestOptions.DEFAULT.toBuilder();
        builder.addHeader("Authorization", TOKEN);
        COMMON_OPTIONS = builder.build();
    }

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        RestClientBuilder builder = RestClient.builder(new HttpHost("192.168.0.104", 9200, "http"));
        return new RestHighLevelClient(builder);
    }
}
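A minimal usage sketch, assuming the bean above is wired in; the users index name and payload are illustrative:
import java.io.IOException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class UserIndexService {

    @Autowired
    private RestHighLevelClient client;

    public void indexUser() throws IOException {
        // index one document into a hypothetical "users" index
        IndexRequest request = new IndexRequest("users").id("1");
        request.source("{\"name\": \"yangzl\", \"age\": 24}", XContentType.JSON);
        IndexResponse response = client.index(request, ElasticSearchConfig.COMMON_OPTIONS);
        System.out.println(response.getResult()); // CREATED or UPDATED
    }
}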
# practice data: bulk-import the accounts.json test set from GitHub
POST /accounts/_doc/_bulk
Query DSL: Elasticsearch's domain-specific query language
# match_phrase: full-text phrase search; the words must appear together, in order
GET accounts/_search
{
"query": {
"match_phrase": { "address": "mill lane" }
}
}
# multi_match: the query is analyzed and matched across several fields
GET accounts/_search
{
"query": {
"multi_match": {
"query": "mill lane",
"fields": ["address", "city"]
}
}
}
# compound query: bool
# must is like =
# must_not is like !=, and contributes no score
# should is like or
GET accounts/_search
{
"query": {
"bool": {
"must": [
{ "match": { "gender": "F"}
}, {
"match": { "address": "mill" }
}
],
"must_not": [
{
"match": { "age": "38" }
}
]
}
}
}
# filter: matching documents are not scored
GET accounts/_search
{
"query": {
"bool": {
"filter": [
{"range": {
"age": {
"gte": 20,
"lte": 30
}
}}
]
}
}
}
# match: full-text search; term: exact match, meant for non-text fields
# for text fields, match on field.keyword requires the entire stored value to match,
# which is stricter than match_phrase (the phrase only has to appear somewhere)
# TODO
GET accounts/_search
{
"query": {
"term": {
"age": "28"
}
}
}
GET accounts/_search
{
"query": {
"match": {
"address.keyword": "mill lane"
}
}
}
# aggregations
# age distribution and average age of the matching documents
GET accounts/_search
{
"query": {
"match": { "address": "mill lane" }
},
"aggs": {
"aggAgg": {
"terms":{
"field": "age",
"size": 10
}
},
"ageAvg": {
"avg": {
"field": "age"
}
}
}
}
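Aggregations nest; a sketch of a sub-aggregation computing the average balance inside each age bucket, assuming the accounts data set's balance field:
GET accounts/_search
{
  "aggs": {
    "ageAgg": {
      "terms": { "field": "age", "size": 10 },
      "aggs": {
        "balanceAvg": { "avg": { "field": "balance" } }
      }
    }
  },
  # size 0 returns only the aggregation results, no hits
  "size": 0
}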
GET accounts/_mapping
# data migration
POST _reindex
{
"source": { "index": "accounts" },
"dest": { "index": "new_acccounts" }
}
POST _analyze
{
"analyzier": "stardard",
"text": "this is a english word"
}
// TODO