ES 入门之后,就要来学习一下 Query DSL 了,因为 ES 中的全部搜索功能全局集中在 Query DSL 中了,学会 Query DSL 就可以随意搜索你自己的文档了。不过这块内容还是比较多的,需要花一段时间。
本文完全参考官方文档:Query DSL
默认地,ES的搜索结果是根据相关性得分排序的,相关性得分是一个文档根据匹配得到的分数。
query 子句与 filter 子句
query 子句的意思是:此文档与此查询子句的匹配程度如何?query 子句是要计算文档得分的。
filter 子句的意思是:此文档是否与此查询子句匹配?filter 子句不需要就算文档得分。
bool 子句中的 must_not 也是一种 filter 。filter 常出现在 bool 子句,constant_score 子句和 filter 聚合查询中。
一个关于 query 和 filter 的例子:
GET /_search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "Search" }},
{ "match": { "content": "Elasticsearch" }}
],
"filter": [
{ "term": { "status": "published" }},
{ "range": { "publish_date": { "gte": "2015-01-01" }}}
]
}
}
}
bool 有四种子句,分别是:must、filter、must_not、should
- must:查询内容必须出现在文档中,匹配成功则加分
- filter:与 must 一样,不过不计算得分
- should:查询内容可以出现,也可以不出现,但是一旦出现,会加分
- must_not:查询内容一定不出现,不计算得分,是一种 filter
例子:
POST _search
{
"query": {
"bool" : {
"must" : {
"term" : { "user" : "kimchy" }
},
"filter": {
"term" : { "tag" : "tech" }
},
"must_not" : {
"range" : {
"age" : { "gte" : 10, "lte" : 20 }
}
},
"should" : [
{ "term" : { "tag" : "wow" } },
{ "term" : { "tag" : "elasticsearch" } }
],
"minimum_should_match" : 1,
"boost" : 1.0
}
}
}
文档的最终得分就是 must 中的分数加上 should 中的分数。
minimum_should_match的意思是 should 中必须匹配到的子句数量,如果 bool 中至少包含一个 should ,并且没有 must 和 filter,那这个值默认就是1,如果含有 must 或 filter 的话,默认就是0.
filter 匹配到文档得分为0,如果 must 中的查询语句为 match_all ,则每个文档的得分为1.0:
GET _search
{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": {
"term": {
"status": "active"
}
}
}
}
}
上面这个查询和 constant_score 子句实现的效果是一模一样的,得分都是1.0:
GET _search
{
"query": {
"constant_score": {
"filter": {
"term": {
"status": "active"
}
}
}
}
}
boosting 有三个子句:positive、negative、和 negative_boost
positive 表示必须匹配到才返回文档。
negative 中的子句如果匹配到了 positive 中返回的文档,则对文档分数进行计算,计算方式是原始得分乘以 negative_boost 的分数
下边是实例:
GET /_search
{
"query": {
"boosting" : {
"positive" : {
"term" : {
"text" : "apple"
}
},
"negative" : {
"term" : {
"text" : "pie tart fruit crumble tree"
}
},
"negative_boost" : 0.5
}
}
}
negative_boost 的值位于 0 到 1 之间。
constant_score 子句用于固定得分,主要包含两个子句:filter、boost
意思就是 filter 匹配到的所有子句都是 boost 指定的分数,boost 默认为 1.0:
GET /_search
{
"query": {
"constant_score" : {
"filter" : {
"term" : { "user" : "kimchy"}
},
"boost" : 1.2
}
}
}
dis_max 可以实现 best fields 策略多字段搜索。
什么意思那,首先来个例子,加入以下数据到ES。
POST /forum/_bulk?pretty&refresh
{ "index": { "_id": "1"} }
{ "doc" : {"title": "this is elasticsearch blog", "content" : "i like to write best elasticsearch article"} }
{ "index": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language", "title": "this is java blog"} }
{ "index": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner", "title": "this is elasticsearch blog"} }
{ "index": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner", "title": "this is java, elasticsearch, hadoop blog"} }
{ "index": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java", "title": "this is spark blog"} }
然后执行一个 bool 子句查询:
GET /forum/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"doc.title": "java solution"
}
}, {
"match": {
"doc.content": "java solution"
}
}
]
}
}
}
这是的返回结果是:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.8488126,
"hits" : [
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.8488126,
"_source" : {
"doc" : {
"content" : "i think java is the best programming language",
"title" : "this is java blog"
}
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.5563383,
"_source" : {
"doc" : {
"content" : "elasticsearch and hadoop are all very good solution, i am a beginner",
"title" : "this is java, elasticsearch, hadoop blog"
}
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.4233949,
"_source" : {
"doc" : {
"content" : "spark is best big data solution based on scala ,an programming language similar to java",
"title" : "this is spark blog"
}
}
}
]
}
}
可以看到,title 和 content 都匹配到单词的得分更高,而 id=5 的 content 中,java 和 solution 都存在,但是得分却低。
bool 语句的分数生成策略是:
(第一条得分 + 第二条得分)* 匹配的条数 / 总条数
注意这里 匹配的条数
,因为 id=5 的文档中,title 中并没有匹配到关键词,所以在相乘中很吃亏。
那么这种情况下,使用 dis_max 子句就可以解决这个问题:
GET /forum/_search
{
"query": {
"dis_max": {
"queries": [
{
"match": {
"doc.title": "java solution"
}
},{
"match": {
"doc.content": "java solution"
}
}
]
}
}
}
其结果为:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.4233949,
"hits" : [
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.4233949,
"_source" : {
"doc" : {
"content" : "spark is best big data solution based on scala ,an programming language similar to java",
"title" : "this is spark blog"
}
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.9395274,
"_source" : {
"doc" : {
"content" : "i think java is the best programming language",
"title" : "this is java blog"
}
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.7942397,
"_source" : {
"doc" : {
"content" : "elasticsearch and hadoop are all very good solution, i am a beginner",
"title" : "this is java, elasticsearch, hadoop blog"
}
}
}
]
}
}
dis_max 的策略很简单,直接取做高分数的 query。
dis_max 还可以加一个 tie_breaker 子句,用于平衡两者,它默认是0,所以默认时直接取最高 query 的分数:
GET /forum/_search
{
"query": {
"dis_max": {
"queries": [
{
"match": {
"doc.title": "java solution"
}
},{
"match": {
"doc.content": "java solution"
}
}
],
"tie_breaker" : 0.7
}
}
}
加上 tie_breaker 后,策略如下:
- 取最高query的分数
- 将其他每一个query的分数乘以 tie_breaker
- 将最高 query 的分数加上第二步中计算得到的每一个值
function_score 子句可以自定义文档得分!例如:
GET /_search
{
"query": {
"function_score": {
"query": { "match_all": {} },
"boost": "5",
"random_score": {},
"boost_mode":"multiply"
}
}
}