Distributed full-text search engine
Elasticsearch nodes form a cluster automatically; the default cluster name is elasticsearch
- Pull the images
docker pull elasticsearch:7.12.0
docker pull kibana:7.12.0
- Create the container
# create config, data, and plugins directories to mount into the container
mkdir -p /d/Docker/elasticsearch/config
mkdir -p /d/Docker/elasticsearch/data
mkdir -p /d/Docker/elasticsearch/plugins
# note the space after the colon: YAML requires it
echo "http.host: 0.0.0.0" >> /d/Docker/elasticsearch/config/elasticsearch.yml
# 9200 is the REST port; in cluster mode 9300 is the inter-node transport port
# -v mounts host directories
# plugins holds analyzer plugins such as ik
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms256m -Xmx256m" \
  -v /d/Docker/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /d/Docker/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /d/Docker/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.12.0
# view logs
docker logs containerId
# start automatically with the Docker daemon
docker update containerId --restart=always
For configuration, only elasticsearch.yml is overridden by the mount:
docker run --name elasticsearch -v /d/Docker/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /d/Docker/elasticsearch/data/:/usr/share/elasticsearch/data -v /d/Docker/elasticsearch/plugins/:/usr/share/elasticsearch/plugins -v /d/Docker/elasticsearch/logs/:/usr/share/elasticsearch/logs -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e ES_JAVA_OPTS="-Xms512m -Xmx512m" -d elasticsearch:7.12.0
docker update containerId --restart=always
docker run --name kibana -e ELASTICSEARCH_HOSTS=http://192.168.0.104:9200 -p 5601:5601 \
  -d kibana:7.12.0
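Once both containers are up, a quick sanity check from Kibana Dev Tools (http://192.168.0.104:5601) or any HTTP client; a sketch assuming the port mappings above:
# returns node name, cluster name, and version if the node is reachable
GET /
# for a fresh single-node cluster the status is typically green or yellow
GET /_cluster/health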
Searches can be issued in two ways
- query string (URI search)
- request body ("query DSL")
GET bank/_search?q=*&sort=account_number:asc
GET bank/_search
{
"query": {
"match_all": {}
},
# sort
"sort": [
{ "account_number": "asc" },
{ "balance": "desc" }
],
# pagination
"from": 0,
"size": 5,
# return only the listed fields
"_source": ["firstname", "balance"]
}
Full-text search runs against the inverted index
- match
- match_phrase
- multi_match
# results are ranked by _score
# match: the query string "mill movico" is analyzed and any term may match
GET bank/_search
{
  "query": {
    "match": {
      "address": "mill movico"
    }
  }
}
# match_phrase: the words must appear together as a phrase, in order
GET bank/_search
{
  "query": {
    "match_phrase": {
      "address": "mill movico"
    }
  }
}
# exact match on the keyword sub-field: "mill movico cc" would NOT be found
# "match": { "address.keyword": "mill movico" }
# multi_match: the analyzed query is matched against several fields
GET bank/_search
{
  "query": {
    "multi_match": {
      "query": "mill movico",
      # address OR city must match "mill movico"
      "fields": ["address", "city"]
    }
  }
}
GET bank/_search
{
"query": {
"bool": {
# must: every clause has to match, like =
"must": [
{
"match": {
"gender": "F"
}
},
{
"match": {
"address": "mill"
}
}
],
# must_not: like !=
"must_not": [
{
"match": {
"age": "38"
}
}
],
# should: like or; a match boosts the score
"should": [
{
"match": {
"lastname": "Wallace"
}
}
],
"filter": {
"range": {
...
}
}
}
}
}
filter and must_not run in filter context and do not contribute to the score
must and should are scored
GET bank/_search
{
"query": {
"bool": {
# "must": {
"filter": {
"range": {
"age": {
"gte": 18,
"lte": 30
}
}
}
}
}
}
Use term for exact matches on non-text fields
Use field.keyword for exact matches on text fields
GET bank/_search
{
"query": {
"term": {
"age": 28
}
}
}
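And for the text-field case, a sketch of an exact match through the keyword sub-field; the address value is illustrative:
GET bank/_search
{
  "query": {
    "match": {
      "address.keyword": "990 Mill Road"
    }
  }
}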
Integer values are mapped to long by default
keyword type: not analyzed; matched exactly
text type: analyzed when the document is saved; searched via the analyzed terms
PUT /my_index
{
"mappings": {
"properties": {
"age": {"type": "integer"},
"email": {"type": "keyword"},
"name": {"type": "text"}
}
}
}
# add a new field to an existing index (the field name here is illustrative)
PUT /my_index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      # index: false stores the field but makes it unsearchable
      "index": false
    }
  }
}
An existing field's mapping cannot be changed; instead:
- create a new index with the correct mapping
- migrate the data
# 1
PUT /newbank
{
"mappings": {
"properties": {
"account_number": {
"type": "long",
...
}
}
}
}
# 2
# since 6.0, use _reindex
POST _reindex
{
"source": {
"index": "bank"
},
"dest": {
"index": "newbank"
}
}
# migrating a pre-6.0 index to 6.0+
# all types are rewritten to _doc
POST _reindex
{
"source": {
"index": "bank",
# the source type must be specified
"type": "account"
},
"dest": {
"index": "newbank"
}
}
Index: corresponds to a database in a relational DB
Document: corresponds to a single row in a relational table
Type: defaults to "_doc"; deprecated
Shard: the file blocks that hold all ES data, and the smallest unit of data; the core job of an ES cluster is distributing, indexing, load-balancing, and routing across shards so data is stored evenly over the whole cluster
Before 7.0 an index defaulted to 5 primary shards; since 7.0 the default is 1 primary shard, each with one replica (replica shard)
A primary shard and its replica are placed on different nodes. ==One shard is one Lucene index (a file directory containing an inverted index)==
Inverted index: an ordered list of all distinct terms that occur in any document, where each term carries the list of documents it appears in
doc1: good good study, up up every day
doc2: day day up, good good study
term | doc1 | doc2 | postings |
---|---|---|---|
day | √ | √ | 1,2 |
every | √ | × | 1 |
good | √ | √ | 1,2 |
study | √ | √ | 1,2 |
up | √ | √ | 1,2 |
Tokenization: a tokenizer takes a character stream, splits it into individual tokens, and emits a stream of tokens
# standard analyzer
POST _analyze
{
"analyzer": "standard",
"text": "尚硅谷电商项目"
}
# result
"尚"
"硅"
"谷"
"电"
"商"
"项"
"目"
Chinese analyzer: ik
- ik_smart
- ik_max_word
# install: go to the plugins directory under the ES home
cd $ES_HOME/plugins && mkdir ik
# unzip into the ik folder
unzip elasticsearch-analysis-ik-7.12.0.zip -d ik
# chmod
chmod -R 777 ik/
# add business-specific terms to a custom dictionary,
# referenced in IKAnalyzer.cfg.xml as <entry key="ext_dict">my.dic</entry>
ik_smart
POST _analyze
{
"analyzer": "ik_smart",
"text": "我是中国人"
}
# result
我
是
中国人
ik_max_word
POST _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人"
}
# result
我
是
中国人
中国
国人
Hot-updating the dictionary: the ik analyzer fetches new word lists from nginx
- Install nginx
Start a throwaway nginx instance just to copy its default configuration
docker run -p 80:80 --name nginx -d nginx:1.10
Copy nginx's runtime files to the host
docker container cp nginx:/etc/nginx ./conf
# copies the /etc/nginx folder from the container into ./conf
Remove the temporary instance
docker stop nginx && docker rm nginx
Then run
mkdir nginx
mv conf nginx/
docker run -p 80:80 --name nginx \
  -v /yangzl/docker/nginx/html:/usr/share/nginx/html \
  -v /yangzl/docker/nginx/logs:/var/log/nginx \
  -v /yangzl/docker/nginx/conf:/etc/nginx \
  -d nginx:1.10
The nginx directory now holds three folders
- conf
- html
- logs
- Create an es folder for the remote dictionary
vim html/index.html
mkdir es
cd es
vim terms.txt
尚硅谷
巧碧螺
# save and quit vim with ZZ
- Back in the ik plugin's config directory
cd ik/conf
vim IKAnalyzer.cfg.xml
# remote dictionary entry
<entry key="remote_ext_dict">http://192.168.0.104:80/es/terms.txt</entry>
# local dictionary entry
<entry key="ext_dict">my.dic</entry>
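After restarting ES, terms served by nginx take effect without reinstalling the plugin; a quick check against a term from terms.txt above:
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "尚硅谷"
}
# expected: "尚硅谷" comes back as one token instead of three single characters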
==API reference:== es-api
- PUT
create (explicit document ID): localhost:9200/idx/_doc/id
- POST
create (auto-generated document ID): localhost:9200/idx/_doc
update: localhost:9200/idx/_doc/docId/_update
search all documents: localhost:9200/idx/_doc/_search
- DELETE
delete: localhost:9200/idx/_doc/docId
- GET
fetch a document by ID: localhost:9200/idx/_doc/docId
# all nodes
GET /_cat/nodes
# cluster health
GET /_cat/health
# the elected master
GET /_cat/master
# all indices, like SHOW DATABASES;
GET /_cat/indices
- PUT (must carry an ID)
- POST
# index1: index name
# _doc: typeless
# 1001: document id
PUT index1/_doc/1001
- Postman
- IDEA HTTP Client
###
POST http://127.0.0.1:9200/index1/_doc/1001
Content-Type: application/json
{
"name": "Jone Doe"
}
### optimistic-lock update, request 1
POST http://127.0.0.1:9200/index1/_doc/1001?if_seq_no=3&if_primary_term=1
Content-Type: application/json
{
"name": "John Doe2"
}
### optimistic-lock update, request 2
POST http://127.0.0.1:9200/index1/_doc/1001?if_seq_no=3&if_primary_term=1
Content-Type: application/json
{
"name": "John Doe2"
}
# the second update fails with 409
{
"error": {
"type": "version_conflict_engine_exception",
"reason": "[1001]: version conflict, required seqNo [3], primary term [1]. current document has seqNo [4] and primary term [1]",
"index_uuid": "e-eWHvqrRx2FDF4Fsyo4ng",
"shard": "0",
"index": "index1"
},
"status": 409
}
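To recover from the conflict, take the current seq_no from the 409 response (or refetch the document) and retry with it; a sketch, with the payload value made up:
### retry with the current sequence number
POST http://127.0.0.1:9200/index1/_doc/1001?if_seq_no=4&if_primary_term=1
Content-Type: application/json
{
"name": "John Doe3"
}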
- Postman
- IDEA HTTP Client
###
GET http://127.0.0.1:9200/index1/_doc/1001
HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
{
"_index": "index1",
"_type": "_doc",
"_id": "1001",
"_version": 3,
"_seq_no": 3,
"_primary_term": 1,
"found": true,
"_source": {
"name": "Jone Doe"
}
}
# business code can use these fields for its own optimistic updates
# ?if_seq_no=1&if_primary_term=1
_seq_no: the concurrency-control field, incremented on every write; together with _primary_term it provides optimistic locking
# option 1
# _update diffs against the stored source; if nothing changed it returns "noop"
# the payload must be wrapped in "doc"
POST index1/_doc/1001/_update
{
  "doc": {
    "name": "John3"
  }
}
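For reference, an abridged sketch of the _update response when nothing changed; the version is not bumped:
{
  "_index": "index1",
  "_id": "1001",
  "result": "noop"
}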
# option 2
# a plain POST always bumps _seq_no and _version, even for identical data
POST index1/_doc/1001
# option 3
PUT index1/_doc/1001
- delete a document
- delete an index
DELETE index1/_doc/1001
# delete the whole index
DELETE index1
POST index1/_doc/_bulk
Content-Type: application/x-ndjson
# each pair of lines forms one operation
{"index": {"_id": "1"}}
{"name": "John Doe"}
{"index": {"_id": "2"}}
{"name": "Jane Doe"}
POST /_bulk
{"delete":{"_index": "index1", "_id": "1001"}}
{"create":{"_index": "index1", "_id": "1002"}}
{"title": "My first blog"}
{"index": {"_index": "index1"}}
{"title": "My second blog"}
{"update": {"_index": "index1", "_id": "1002"}}
{"doc": {"title": "My first update blog"}}
a delete action is a single line with no source
every other action is followed by its data line, two lines per operation
- strings
  - text
  - keyword: not parsed by the analyzer
- numeric types
  long, integer, short, byte, double, float, half_float, scaled_float
- date type: date
- boolean type: boolean
- binary: binary
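A sketch of a mapping that exercises these types; the index name and fields are made up for illustration:
PUT /typed_idx
{
  "mappings": {
    "properties": {
      "title":     { "type": "text" },
      "sku":       { "type": "keyword" },
      "stock":     { "type": "integer" },
      "price":     { "type": "scaled_float", "scaling_factor": 100 },
      "created":   { "type": "date" },
      "on_sale":   { "type": "boolean" },
      "thumbnail": { "type": "binary" }
    }
  }
}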
# send REST requests from Kibana Dev Tools
# RESTful-style create
# without an explicit mapping, field types are inferred
PUT /myidx/_doc/1
{
"name": "yangzl",
"age": 24
}
# explicit mapping: keyword fields are not analyzed
PUT /myidx
{
"mappings": {
"properties": {
"name": { "type": "keyword" },
"desc": { "type": "text" }
}
}
}
PUT /myidx/_doc/1
{
"name": "yangzl",
"desc": "我是一颗小虎牙Python"
}
After creation the stored data looks like this:
_index | _type | _id | _score | name | age |
---|---|---|---|---|---|
myidx | _doc | 1 | 1 | yangzl | 24 |
# update
# PUT works too, but it replaces the whole document: omitted fields are dropped
POST /myidx/_doc/1/_update
{
"doc": {
"name": "法外狂徒"
}
}
# RESTful-style delete
DELETE /myidx # delete the index
DELETE /myidx/_doc/1 # delete a document
# query
# query parameters (these apply to the _cat APIs):
# h: columns to return, comma-separated
# format: json, yaml, etc.
# local: if true, answer from the local node only instead of the master node
# v: include column headings
# s: columns to sort by, comma-separated
GET /myidx/_doc/1
GET /myidx/_search?q=name:yangzl
#*****************************************
# complex queries (when using a Client, each construct below maps onto a builder API)
GET /myidx/_doc/_search
{
"query": {
"match": { "name": "yangzl" }
},
# return only these fields
"_source": ["name", "age"],
# sort
"sort": [
{
"age": { "order": "asc" }
}
],
# pagination, like LIMIT
"from": 0,
"size": 10,
# highlight matched terms in the results
"highlight": {
"pre_tags": "<em>",
"post_tags": "</em>",
"fields": {
"name": {}
}
}
}
# compound queries
# a bool clause combines multiple sub-clauses with and/or/not logic to decide whether a document matches
GET /myidx/_doc/_search
{
"query": {
"bool": {
# like =; multiple terms are separated by spaces (quick fast)
"must": { "match": { "title": "quick fast" }},
# !=
"must_not": { "match": { "title": "lazy" }},
# or
"should": [
{ "match": { "title": "brown" }},
{ "match": { "title": "dog" }}
],
# filter, unscored
"filter": {
"range": {
"age": {
"gte": 10,
"lte": 25
}
}
}
}
}
}
bulk
# batch operations
POST /index/type/_bulk
# each pair of lines is one piece of data; _id sets the document ID
# the second line is the payload
{"index": {"_id": "1101"}}
{"name": "doug lea"}
"match": analyzed, fuzzy full-text matching
"term": exact matching
Spring Data ElasticSearch
API docs: es-api
- port 9300: transport client
  keeps a long-lived connection to the cluster
  deprecated; not recommended
- port 9200: REST clients
  the low-level REST client is only a thin wrapper over the ES API
  ==the high-level REST client wraps the ES API and query DSL much more fully; recommended==
Create the mall-search module, selecting the web starter
- Add the high-level-client dependency
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-high-level-client</artifactId>
<version>7.12.0</version>
</dependency>
dependencies {
compile 'org.elasticsearch.client:elasticsearch-rest-high-level-client:7.12.0'
}
- mall-search.pom
<properties>
<elasticsearch.version>7.12.0</elasticsearch.version>
</properties>
- Configuration
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ElasticSearchConfig {

    /** Auth token placeholder; only needed if the cluster requires one */
    private static final String TOKEN = "";

    /**
     * Common request options (e.g. shared headers) passed with every client call.
     */
    public static final RequestOptions COMMON_OPTIONS;

    static {
        RequestOptions.Builder builder = RequestOptions.DEFAULT.toBuilder();
        builder.addHeader("Authorization", TOKEN);
        COMMON_OPTIONS = builder.build();
    }

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        RestClientBuilder builder = RestClient.builder(new HttpHost("192.168.0.104", 9200, "http"));
        return new RestHighLevelClient(builder);
    }
}
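A minimal usage sketch, assuming the bean above is wired in; the users index name and payload are illustrative:
import java.io.IOException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class UserIndexService {

    @Autowired
    private RestHighLevelClient client;

    public void indexUser() throws IOException {
        // index one document into a hypothetical "users" index
        IndexRequest request = new IndexRequest("users").id("1");
        request.source("{\"name\": \"yangzl\", \"age\": 24}", XContentType.JSON);
        IndexResponse response = client.index(request, ElasticSearchConfig.COMMON_OPTIONS);
        System.out.println(response.getResult()); // CREATED or UPDATED
    }
}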
# practice data: bulk-import the accounts.json test set from GitHub
POST /accounts/_doc/_bulk
Query DSL: Elasticsearch's domain-specific query language
# match_phrase: full-text phrase search; the words must appear together, in order
GET accounts/_search
{
"query": {
"match_phrase": { "address": "mill lane" }
}
}
# multi_match: the query is analyzed and matched across several fields
GET accounts/_search
{
"query": {
"multi_match": {
"query": "mill lane",
"fields": ["address", "city"]
}
}
}
# compound query: bool
# must is like =
# must_not is like !=, and contributes no score
# should is like or
GET accounts/_search
{
"query": {
"bool": {
"must": [
{ "match": { "gender": "F"}
}, {
"match": { "address": "mill" }
}
],
"must_not": [
{
"match": { "age": "38" }
}
]
}
}
}
# filter: matching documents are not scored
GET accounts/_search
{
"query": {
"bool": {
"filter": [
{"range": {
"age": {
"gte": 20,
"lte": 30
}
}}
]
}
}
}
# match: full-text search; term: exact match, meant for non-text fields
# for text fields, match on field.keyword requires the entire stored value to match,
# which is stricter than match_phrase (the phrase only has to appear somewhere)
# TODO
GET accounts/_search
{
"query": {
"term": {
"age": "28"
}
}
}
GET accounts/_search
{
"query": {
"match": {
"address.keyword": "mill lane"
}
}
}
# aggregations
# age distribution and average age of the matching documents
GET accounts/_search
{
"query": {
"match": { "address": "mill lane" }
},
"aggs": {
"aggAgg": {
"terms":{
"field": "age",
"size": 10
}
},
"ageAvg": {
"avg": {
"field": "age"
}
}
}
}
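Aggregations nest; a sketch of a sub-aggregation computing the average balance inside each age bucket, assuming the accounts data set's balance field:
GET accounts/_search
{
  "aggs": {
    "ageAgg": {
      "terms": { "field": "age", "size": 10 },
      "aggs": {
        "balanceAvg": { "avg": { "field": "balance" } }
      }
    }
  },
  # size 0 returns only the aggregation results, no hits
  "size": 0
}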
GET accounts/_mapping
# data migration
POST _reindex
{
"source": { "index": "accounts" },
"dest": { "index": "new_acccounts" }
}
POST _analyze
{
"analyzier": "stardard",
"text": "this is a english word"
}
// TODO