数据来源:http://dataju.cn/Dataju/web/datasetInstanceDetail/268
下载下来之后是一个 CSV 数据,通过 kibana 的 Data Visualizer 功能导入到 Elasticsearch 中,之后就可以查询了。
一个简单的 bool 查询:
GET /race-for-president-2016/_search
{
"query": {
"bool": {
"must": [
{ "match": { "Text": "New York"}}
],
"filter": [
{"term": { "Speaker": "Holt" }}
]
}
}
}
这样子查询出来的文档时带有 _score
即分数的,默认是按分数倒排序的。
如果人为加上排序:
GET /race-for-president-2016/_search
{
"query": {
"bool": {
"must": [
{ "match": { "Text": "New York"}}
],
"filter": [
{"term": { "Speaker": "Holt" }}
]
}
},
"sort": [
{"Line": { "order": "desc" }}
]
}
就不会有分数了。
当然也可以按分数正排:
GET /race-for-president-2016/_search
{
"query": {
"bool": {
"must": [
{ "match": { "Text": "New York"}}
],
"filter": [
{"term": { "Speaker": "Holt" }}
]
}
},
"sort": [
{"_score": { "order": "asc" }}
]
}
如果想按 Line 排序,也想要分数怎么办,参见:https://www.elastic.co/guide/en/elasticsearch/guide/current/_sorting.html
可以这样:
GET /race-for-president-2016/_search
{
"query": {
"bool": {
"must": [
{ "match": { "Text": "New York"}}
],
"filter": [
{"term": { "Speaker": "Holt" }}
]
}
},
"sort": [
{"_score": { "order": "asc" }},
{"Line": { "order": "desc" }}
]
}
如果想高亮显示某个单词怎么办,参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html
POST /race-for-president-2016/_search
{
"query": {
"match": {
"Text": "New York"
}
},
"highlight": {
"fields": {
"Text": {}
}
}
}
部分返回如下:
{
"_index" : "race-for-president-2016",
"_type" : "_doc",
"_id" : "Jsmf13MBam6_7L8uDJbD",
"_score" : 8.121967,
"_source" : {
"Line" : 167,
"@timestamp" : "2016-09-26T00:00:00.000+08:00",
"Text" : "Your two -- your two minutes expired, but I do want to follow up. Stop-and-frisk was ruled unconstitutional in New York, because it largely singled out black and Hispanic young men.",
"Date" : "9/26/16",
"Speaker" : "Holt"
},
"highlight" : {
"Text" : [
"Stop-and-frisk was ruled unconstitutional in <em>New</em> <em>York</em>, because it largely singled out black and Hispanic"
]
}
}
可以发现,在 highlight 中有 em,并且只显示与关键词相关的一句话,多余的部分不会显示。
上边的搜索有一部分会将 New York
两个单词分开,如果想连为一体查询,可以使用短语搜索:
POST /race-for-president-2016/_search
{
"query": {
"match_phrase": {
"Text": "New York"
}
},
"highlight": {
"fields": {
"Text": {}
}
}
}
查询主持人的发言,主持人名为 Holt
POST /race-for-president-2016/_search
{
"query": {
"bool": {
"filter": [
{"term": {"Speaker": "Holt"}}
]
}
}
}
这样查出来是分数是 0.0,如果想要带分数,可以使用:
POST /race-for-president-2016/_search
{
"query": {
"constant_score": {
"filter": {
"term": {"Speaker": "Holt"}
}
}
}
}
这样查出来的分数是 1.0
上边的俩查询是可以显示文档数量的,但是默认只会显示十条文档,如果想显示多一些,可以这样:
POST /race-for-president-2016/_search
{
"query": {
"bool": {
"filter": [
{"term": {"Speaker": "Holt"}}
]
}
},
"size": 20
}
不过从性能上考虑,一次查询不要返回过多文档,如果文档过多可以利用分页,比如从第二个文档开始查询:
POST /race-for-president-2016/_search
{
"query": {
"bool": {
"filter": [
{"term": {"Speaker": "Holt"}}
]
}
},
"size": 20,
"from": 2
}
如果只想看看主持人发言的总数量,可以这样:
参见:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html
POST /race-for-president-2016/_count
{
"query": {
"bool": {
"filter": [
{"term": {"Speaker": "Holt"}}
]
}
}
}
以上就是分页查询的全部内容了。
查看有多少发言人:
POST /race-for-president-2016/_search
{
"size": 0,
"aggs": {
"Number of speakers": {
"cardinality": {
"field": "Speaker"
}
}
}
}
部分返回如下:
"aggregations" : {
"Number of speakers" : {
"value" : 11
}
}
发现总共有 11 个发言人。
聚合查看每个发言人都发言了多少次:
POST /race-for-president-2016/_search
{
"size": 0,
"aggs": {
"Speakers and number of speeches": {
"terms": {
"field": "Speaker"
}
}
}
}
部分返回如下:
"aggregations" : {
"Speakers and number of speeches" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8,
"buckets" : [
{
"key" : "Trump",
"doc_count" : 224
},
{
"key" : "Clinton",
"doc_count" : 159
},
{
"key" : "Pence",
"doc_count" : 134
},
{
"key" : "Kaine",
"doc_count" : 124
},
{
"key" : "Holt",
"doc_count" : 98
},
{
"key" : "Quijano",
"doc_count" : 76
},
{
"key" : "Cooper",
"doc_count" : 75
},
{
"key" : "Raddatz",
"doc_count" : 62
},
{
"key" : "CANDIDATES",
"doc_count" : 37
},
{
"key" : "Audience",
"doc_count" : 29
}
]
}
}
川建国同志最能说了。