Elasticsearch: The Definitive Guide_01_Getting Started_01 You Know, for Search…

混江龙づ霸主 posted on 2019-12-05 13:10:36

4 Distributed Document Store

Elasticsearch is much more than just Lucene and much more than “just” full-text search. It can also be described as follows:

  • A distributed real-time document store where every field is indexed and searchable
  • A distributed search engine with real-time analytics
  • Capable of scaling to hundreds of servers and petabytes (PB) of structured and unstructured data

And it packages up all this functionality into a standalone server that your application can talk to via a simple RESTful API, using a web client from your favorite programming language, or even from the command line.

4.2 Talking to Elasticsearch (API, 9300)

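Java API
Elasticsearch comes with two built-in Java clients:

  1. Node client. Joins a local cluster as a non-data node. It holds no data itself, but it has the cluster metadata, knows where the data lives, and can forward requests directly to the node that holds it.
  2. Transport client. A lightweight client that sends requests to a remote cluster. It does not join the cluster; it only forwards requests to a node in the cluster.

Both Java clients talk to the cluster over port 9300, using the native Elasticsearch transport protocol.
The nodes in the cluster also communicate with each other over this port.

Note: the client's major version must match that of the server.
For more, see Elasticsearch Clients.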

RESTful API with JSON over HTTP(9200)

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

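PATH: the API endpoint.
?v: adds column headers to the output of the _cat APIs.
?pretty: pretty-prints the JSON response so it is easier to read.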
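
To make the request template concrete, here is a minimal Python sketch that assembles the URL and the matching curl command from its parts. The host localhost, the _count endpoint, and the match_all body are illustrative values chosen for this example, and build_url and build_curl are hypothetical helper names, not part of any Elasticsearch client.

```python
from urllib.parse import urlencode, urlunsplit


def build_url(protocol, host, port, path, query=None):
    """Assemble <PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>."""
    netloc = f"{host}:{port}"
    query_string = urlencode(query or {})
    return urlunsplit((protocol, netloc, "/" + path.lstrip("/"), query_string, ""))


def build_curl(verb, url, body=None):
    """Render the equivalent curl command line as a string."""
    command = f"curl -X{verb} '{url}'"
    if body is not None:
        command += f" -d '{body}'"
    return command


# Illustrative values: a local node on the default HTTP port 9200.
url = build_url("http", "localhost", 9200, "_count", {"pretty": "true"})
print(build_curl("GET", url, '{"query": {"match_all": {}}}'))
```

The sketch only builds strings; it does not send anything, so it can be run without a cluster.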

4.3 Document Oriented

Elasticsearch is document oriented: it stores entire objects or documents. It not only stores them, but also indexes the contents of each document in order to make them searchable.

Elasticsearch uses JSON as the serialization format for documents.
The official Elasticsearch clients handle the conversion to and from JSON for you automatically.
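
As a rough illustration of what "JSON as the serialization format" means, the sketch below turns an in-memory object into a JSON string and back using Python's standard json module. The employee object mirrors the sample documents used later in these notes; this is plain Python, not the official client.

```python
import json

# An in-memory object, shaped like the employee documents used later on.
employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

# Serialize to the JSON text that would be sent as a request body.
body = json.dumps(employee)
print(body)

# Deserialize the JSON text back into an object.
restored = json.loads(body)
print(restored["interests"])
```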

4.4 Finding Your Feet

The act of storing data in Elasticsearch is called indexing.
An Elasticsearch cluster can contain multiple indices, which in turn contain multiple types.
These types hold multiple documents, and each document has multiple fields.

The word index is overloaded with several meanings in the context of Elasticsearch.

  • Index (noun). An index is like a database in a traditional relational database. The plural of index is indices or indexes.
  • Index (verb). To index a document is to store it in an index (noun) so that it can be retrieved and queried.
  • Inverted index. Relational databases add an index, such as a B-tree index, to specific columns in order to improve the speed of data retrieval. Elasticsearch uses a structure called an inverted index for exactly the same purpose.
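
To give a feel for the inverted index, here is a toy Python sketch that maps each term to the set of documents containing it. It is a simplification under stated assumptions: it splits on whitespace and lowercases, whereas a real analyzer does more, and build_inverted_index is a hypothetical name. The sample texts are the about fields of the employee documents indexed in these notes.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
    3: "I like to build cabinets",
}

index = build_inverted_index(docs)
print(sorted(index["rock"]))      # documents containing "rock"
print(sorted(index["cabinets"]))  # documents containing "cabinets"
```

Looking up a term is a single dictionary access, which is why this structure suits full-text search better than scanning every document.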

Index three employee documents:
PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
PUT /megacorp/employee/2
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}
PUT /megacorp/employee/3
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}
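  • The path /megacorp/employee/1 carries the index name, the type name, and a custom document ID.
  • By default, Elasticsearch creates the index and the type automatically and maps the fields dynamically.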

Retrieving a document:

GET /megacorp/employee/1

CRUD: create (POST/PUT), delete (DELETE), check existence (HEAD), update (PUT), retrieve (GET)
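
The verb-to-operation mapping can be written down as a small lookup table. This is only a sketch: CRUD_VERBS and request_line are hypothetical names, and the function just formats the request line shown in these notes without contacting a cluster.

```python
# Which HTTP verb each document operation uses.
CRUD_VERBS = {
    "create": "PUT",      # POST also creates, letting Elasticsearch generate the ID
    "exists": "HEAD",
    "retrieve": "GET",
    "update": "PUT",      # re-indexing under the same ID replaces the document
    "delete": "DELETE",
}


def request_line(operation, index, doc_type, doc_id):
    """Format the '<VERB> /<index>/<type>/<id>' line for one operation."""
    return f"{CRUD_VERBS[operation]} /{index}/{doc_type}/{doc_id}"


for operation in CRUD_VERBS:
    print(request_line(operation, "megacorp", "employee", 1))
```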

4.5 Search Lite

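Search for all employees:
GET /megacorp/employee/_search
Response:
{
   "took":      6, // time taken, in milliseconds
   "timed_out": false, // whether the search timed out; if it did, the results may be partial
   "_shards": {
      "total": 5, // number of shards searched (5 primary shards by default)
      "successful": 5, // shards on which the search succeeded
      "failed": 0 // shards on which the search failed
   },
   "hits": {
      "total":      3, // total number of matching documents
      "max_score":  1, // highest score among the hits
      "hits": [
         { // 10 hits are returned by default
            "_index":         "megacorp",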
            "_type":          "employee",
            "_id":            "3",
            "_score":         1,
            "_source": {
               "first_name":  "Douglas",
               "last_name":   "Fir",
               "age":         35,
               "about":       "I like to build cabinets",
               "interests": [ "forestry" ]
            }
         }
         ...
      ]
   }
}

Who has the last name Smith?
The q parameter carries the query in the form field:value.

GET /megacorp/employee/_search?q=last_name:Smith
{
    "size":1 // return only 1 document; the default is 10
}
Response:
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2, // 2 documents match in total
      "max_score": 0.2876821, // highest score
      "hits": [
         { // size is 1, so only 1 hit is returned
            "_index": "megacorp",
            "_type": "employee",
            "_id": "2",
            "_score": 0.2876821,
            "_source": {
               "first_name": "Jane",
               "last_name": "Smith",
               "age": 32,
               "about": "I like to collect rock albums",
               "interests": ["music"]
            }
         }
      ]
   }
}

4.6 Search with Query DSL(JSON)

Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see Search Lite).

Elasticsearch provides a rich, flexible query language called the query DSL, which allows us to build much more complicated, robust queries.

The domain-specific language (DSL) is specified using a JSON request body.

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
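
To show how the query-string form relates to the request body above, here is a small Python sketch that converts a single field:value pair into the equivalent match body. It is a deliberate simplification: the real query-string syntax is far richer, and query_string_to_match is a hypothetical helper that handles only this one simple case.

```python
import json


def query_string_to_match(q):
    """Turn 'field:value' into a query DSL body with a single match clause."""
    field, _, value = q.partition(":")
    if not field or not value:
        raise ValueError(f"expected 'field:value', got {q!r}")
    return {"query": {"match": {field: value}}}


body = query_string_to_match("last_name:Smith")
print(json.dumps(body, indent=4))
```

Building the body as a dictionary rather than a string is what makes the DSL easy to extend with further clauses.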

4.7 More-Complicated Searches

Find all employees whose last name is Smith, but keep only those older than 30 (30 itself is excluded):

GET /megacorp/employee/_search
{
    "query" : {
        "bool" : { // the well-known bool query
            "must" : {
                "match" : {
                    "last_name" : "smith" 
                }
            },
            "filter" : { // a filter is a yes-or-no check; it plays no part in relevance scoring
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            }
        }
    }
}
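
The logic of this bool query can be mimicked in memory. The Python sketch below is only an emulation of the semantics, not how Elasticsearch executes it: the must clause is reduced to a case-insensitive comparison (the query text "smith" still finds "Smith" because analyzed text is lowercased), and the filter is the strict age > 30 check. bool_search is a hypothetical name.

```python
employees = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]


def bool_search(docs, last_name, older_than):
    """Keep documents that satisfy both the must clause and the filter clause."""
    hits = []
    for doc in docs:
        must_ok = doc["last_name"].lower() == last_name.lower()  # "must": match
        filter_ok = doc["age"] > older_than                      # "filter": range gt
        if must_ok and filter_ok:
            hits.append(doc)
    return hits


hits = bool_search(employees, "smith", 30)
print([hit["first_name"] for hit in hits])
```

Only Jane (age 32) passes both clauses; John is a Smith but is 25, and Douglas is older than 30 but is not a Smith.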

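The match and bool queries so far could be done just as easily in a traditional database. Now for a full-text search, a task that traditional databases really struggle with: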

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing" 
        }
    }
}
Response:
{
   "took": 43,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.53484553,
      "hits": [
         {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "1",
            "_score": 0.53484553,
            "_source": {
               "first_name": "John",
               "last_name": "Smith",
               "age": 25,
               "about": "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            "_index": "megacorp",
            "_type": "employee",
            "_id": "2",
            "_score": 0.26742277,
            "_source": {
               "first_name": "Jane",
               "last_name": "Smith",
               "age": 32,
               "about": "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

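By default, results are sorted by relevance score.
Query: "about" : "rock climbing"
Result 1: "I love to go rock climbing" matches both terms.
Result 2: "I like to collect rock albums" matches one term.

Takeaway: understand the concept of relevance.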
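
The ranking above can be approximated with a toy score that simply counts how many query terms a document contains. This is an assumption-laden sketch: Elasticsearch's real scores (such as 0.53484553) come from Lucene's similarity model, not from a raw count, and overlap_score is a hypothetical name. It is enough to see why the first document ranks higher.

```python
def overlap_score(query, text):
    """Count the distinct query terms that also appear in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms)


abouts = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
    "3": "I like to build cabinets",
}

# Score every document, drop non-matches, and sort best-first.
ranked = sorted(
    ((overlap_score("rock climbing", text), doc_id) for doc_id, text in abouts.items()),
    reverse=True,
)
hits = [(doc_id, score) for score, doc_id in ranked if score > 0]
print(hits)
```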

4.9 Phrase Search

Search for two terms that must appear together as a phrase, adjacent and in the given order.

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : { // a variant of the match query
            "about" : "rock climbing"
        }
    }
}
The phrase query matches only one document:
{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         }
      ]
   }
}
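
The difference between a plain match and a phrase match can be sketched in a few lines of Python. This is a simplified emulation, assuming whitespace tokenization and lowercasing; phrase_match is a hypothetical name, and the real match_phrase query works on term positions stored in the index.

```python
def phrase_match(phrase, text):
    """Return True if the phrase terms occur in the text, adjacent and in order."""
    phrase_terms = phrase.lower().split()
    text_terms = text.lower().split()
    n = len(phrase_terms)
    # Slide a window of n terms over the text and compare it with the phrase.
    return any(
        text_terms[i:i + n] == phrase_terms
        for i in range(len(text_terms) - n + 1)
    )


print(phrase_match("rock climbing", "I love to go rock climbing"))
print(phrase_match("rock climbing", "I like to collect rock albums"))
```

The second document contains "rock" but not the full phrase, which is why it drops out of the phrase search results.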

4.10 Highlighting Our Searches

Many applications like to highlight the part of each search result that matched the query.

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": { // the new highlight parameter
        "fields" : {
            "about" : {}
        }
    }
}

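Response: each hit now has a highlight section at the same level as _source. It holds the matched part of the text, wrapped in HTML tags by default (<em></em>).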

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            },
            "highlight": {
               "about": [
                  "I love to go <em>rock</em> <em>climbing</em>" 
               ]
            }
         }
      ]
   }
}

Other forms are possible, for example:

    "highlight" : {
        "pre_tags" : ["<tag1>"], // replaces the default em tags
        "post_tags" : ["</tag1>"],
        "fields" : {
            "_all" : {}
        }
    }

See also https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-request-highlighting.html
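
What the highlighter produces can be imitated with a regular expression. The sketch below is only an approximation of the output format, not Elasticsearch's highlighter: highlight is a hypothetical function whose pre_tag and post_tag arguments play the role of pre_tags and post_tags.

```python
import re


def highlight(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap every whole-word occurrence of the given terms in the tags."""
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)


print(highlight("I love to go rock climbing", ["rock", "climbing"]))
print(highlight("I love to go rock climbing", ["rock"], "<tag1>", "</tag1>"))
```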

4.11 Analytics (aggs)

Find the most popular interests enjoyed by our employees:

GET /megacorp/employee/_search
{
   "size": 0, // only the aggregation matters here; a query clause can be added too
   "aggs": {
      "all_interests": {
         "terms": { // on ES 2.x, use "interests" as the field
            "field": "interests.keyword" // ES 5.x
         }
      }
   }
}

Response:

{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            { "key":"music", "doc_count": 2 },
            { "key":"forestry", "doc_count": 1 },
            { "key":"sports", "doc_count": 1 }
         ]
      }
   }
}
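
The buckets above are essentially a group-and-count over the interests field. Here is a minimal in-memory Python sketch of that idea; terms_aggregation is a hypothetical name, and ordering ties by key is an assumption made so the output lines up with the response shown.

```python
from collections import Counter

employees = [
    {"first_name": "John", "interests": ["sports", "music"]},
    {"first_name": "Jane", "interests": ["music"]},
    {"first_name": "Douglas", "interests": ["forestry"]},
]


def terms_aggregation(docs, field):
    """Count how many documents carry each distinct value of the field."""
    counts = Counter(value for doc in docs for value in set(doc[field]))
    # Largest bucket first; ties are ordered by key (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


buckets = terms_aggregation(employees, "interests")
for bucket in buckets:
    print(bucket)
```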

Aggregations allow hierarchical rollups too. For example, find the average age of the employees who share each interest:

GET /megacorp/employee/_search
{
    "size":0,
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests.keyword" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

Response:

 "all_interests": {
    "buckets": [
       {
          "key": "music",
          "doc_count": 2,
          "avg_age": {  "value": 28.5 }
       }
       ...
    ]
 }
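
The rollup can be checked by hand: the two employees interested in music are 25 and 32, so the average is (25 + 32) / 2 = 28.5. The Python sketch below does the same grouping in memory; avg_age_by_interest is a hypothetical name, and this is only an emulation of what the nested avg aggregation computes.

```python
from collections import defaultdict

employees = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]


def avg_age_by_interest(docs):
    """Group ages by interest (outer terms bucket), then average them (inner avg)."""
    ages = defaultdict(list)
    for doc in docs:
        for interest in set(doc["interests"]):
            ages[interest].append(doc["age"])
    return {interest: sum(values) / len(values) for interest, values in ages.items()}


averages = avg_age_by_interest(employees)
print(averages["music"])
```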

Even if you don’t understand the syntax yet, you can easily see how complex aggregations and groupings can be accomplished using this feature.

The sky is the limit as to what kind of data you can extract! (Whoever wrote that line is being rather cheeky!)

4.12 Distributed Nature

Elasticsearch is distributed by nature. It handles the following automatically:

  • Partitioning your documents into different containers or shards
  • Balancing these shards across the nodes
  • Duplicating each shard to provide redundant copies of your data
  • Routing requests from any node in the cluster to the nodes that hold the data you’re interested in. You can send a request to any node; if that node cannot handle it, it forwards the request to a node that can.
  • Seamlessly integrating new nodes, which makes it easy to scale out
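
The partitioning and routing steps in the list above can be sketched as a toy model. Documents are assigned with the rule shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. Everything else here is an assumption for illustration: zlib.crc32 stands in for the real hash function, the round-robin placement is a toy balancing policy, and the node names and the functions pick_shard, allocate, and route are hypothetical.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (crc32 as a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def allocate(shards, nodes):
    """Spread shards across nodes round-robin (a toy balancing policy)."""
    return {shard: nodes[shard % len(nodes)] for shard in shards}


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
allocation = allocate(range(5), nodes)  # 5 primary shards, as in the responses above


def route(doc_id):
    """Find the shard for a document, then the node that holds that shard."""
    shard = pick_shard(doc_id, 5)
    return shard, allocation[shard]


for doc_id in ["1", "2", "3"]:
    print(doc_id, route(doc_id))
```

Because the shard is derived from the document ID alone, any node can work out where a document lives and forward the request there.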

See also:
Life Inside a Cluster
Distributed Document Store
Distributed Search Execution
Inside a Shard
