[Elasticsearch] 2. Data Storage: Documents and Indices

こ雲淡風輕ζ, submitted on 2020-08-14 02:00:16

Data storage: documents and indices

Series in progress...

Data in: Documents and indices

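Elasticsearch is a distributed document store. Data is not stored as rows and columns; it is serialized and stored as JSON documents. If an Elasticsearch cluster has multiple nodes, the documents are spread across those nodes and can be accessed from any one of them. Pretty magical, isn't it?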

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.

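When a document is stored, it is indexed and becomes available for full-text search in near real-time (within 1 second). How can search be that fast? Internally, Elasticsearch uses a structure called an inverted index to support fast full-text search. So what is an inverted index? It lists every unique word that appears in the documents, together with the documents that contain it. Picture a map: the key is a word, and the value is the IDs of all documents containing that word.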

When a document is stored, it is indexed and fully searchable in near real-time--within 1 second. Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
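The map described above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not how Elasticsearch (or Lucene underneath it) implements it; the function name `build_inverted_index` and the sample documents are made up for the example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy tokenization: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical sample documents keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick thinking",
}

index = build_inverted_index(docs)
print(sorted(index["brown"]))  # [1, 2]
print(sorted(index["quick"]))  # [1, 3]
```

Looking up a word is a single dictionary access, no matter how many documents there are, which is the intuition behind the fast full-text search.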

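So what is an index? Think of it as a collection optimized for storing documents. And what is a document? A document is a collection of fields. And what is a field? A field is a key-value pair holding a name and a value. By default, Elasticsearch indexes every field in a document and automatically detects each field's data type so it can apply the matching optimization. For example, text fields use an inverted index, while numeric and geo fields use BKD trees. Applying type-specific optimizations at both index time and search time is a big part of why Elasticsearch is fast. There is no free lunch: the speed comes from a lot of work behind the scenes.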

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. The ability to use the per-field data structures to assemble and return search results is what makes Elasticsearch so fast.

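Elasticsearch is also relaxed about schema. What is a schema? See the previous post's explanation of structured, semi-structured, and unstructured data; in short, it is the definition of the data's format. And what does being relaxed about schema mean? Just like the semi-structured data mentioned in the previous post, suddenly storing a field that did not exist before will not raise an error. When dynamic mapping is enabled, Elasticsearch automatically detects the type of each newly submitted field and indexes it accordingly. Doesn't that let us store and explore data quickly? Elasticsearch works out whether a value is a boolean, a floating point number, an integer, a date, or a string. Adding a field does not require altering a table first, as it does in MySQL. Hold on, what was the SQL statement for adding a column to a MySQL table again? Not only are queries fast, development is fast too!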

Elasticsearch also has the ability to be schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document. When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. This default behavior makes it easy to index and explore your data—just start indexing documents and Elasticsearch will detect and map booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch data types.
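To make the detection step concrete, here is a rough Python sketch of type guessing over a parsed JSON document. It is a deliberate simplification and not Elasticsearch's actual logic: the function names are hypothetical, date detection is reduced to a single `YYYY-MM-DD` pattern, and arrays and nulls are not handled. The returned names (`boolean`, `long`, `float`, `date`, `text`) mirror the field types that dynamic mapping assigns by default.

```python
import json
from datetime import datetime


def guess_field_type(value):
    """Guess a field type name from a parsed JSON value (toy rules only)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            # Assumption: only one date pattern is recognized in this sketch.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"


def guess_mapping(document):
    """Return a {field name: guessed type} dict for a flat document."""
    return {name: guess_field_type(value) for name, value in document.items()}


# Hypothetical document, as it might arrive in an index request body.
doc = json.loads(
    '{"title": "hello world", "price": 9.5, "count": 3,'
    ' "active": true, "created": "2020-08-14"}'
)
mapping = guess_mapping(doc)
print(mapping)
```

The point of the sketch is only that detection is rule-based: the rules look at the JSON value, not at what the field means.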

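Still, you know (and should know) your data better than Elasticsearch does. Elasticsearch decides data types by following a set of rules, whereas you are the most advanced intelligence of the 21st century. If the automatically detected types do not fit your actual needs, or there is room for further optimization, you can define dynamic mapping rules or assign explicit types to fields to control how they are stored and indexed.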

Ultimately, however, you know more about your data and how you want to use it than Elasticsearch can. You can define rules to control dynamic mapping and explicitly define mappings to take full control of how fields are stored and indexed.

With custom mappings we can:

Defining your own mappings enables you to:

  • Distinguish which string fields are used for full-text search and which are not (Text versus Keyword)

  • > Distinguish between full-text string fields and exact value string fields

  • Use a language-specific analyzer (for example, the ik analyzer for Chinese)

  • > Perform language-specific text analysis

  • Optimize fields for particular matching scenarios, such as partial matching

  • > Optimize fields for partial matching

  • Use custom date formats

  • > Use custom date formats

  • Use types that cannot be detected automatically, such as geo_point and geo_shape

  • > Use data types such as geo_point and geo_shape that cannot be automatically detected
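As an illustration of the list above, an explicit mapping might look roughly like the following, written here as a Python dict that serializes to the JSON body of an index-creation request. The field names (`title`, `status`, `created`, `location`) are invented for the example, and the sketch assumes a typeless mapping as used in Elasticsearch 7.x; `type`, `analyzer`, and `format` are standard mapping parameters, and `english` is one of the built-in language analyzers.

```python
import json

# Hypothetical fields; only the mapping parameters themselves are standard.
explicit_mapping = {
    "mappings": {
        "properties": {
            # Full-text field, analyzed with a language-specific analyzer.
            "title": {"type": "text", "analyzer": "english"},
            # Exact-value string, not analyzed.
            "status": {"type": "keyword"},
            # Date with a custom format.
            "created": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            # A type that dynamic mapping cannot detect on its own.
            "location": {"type": "geo_point"},
        }
    }
}

body = json.dumps(explicit_mapping, indent=2)
print(body)
```

The printed JSON is what would be sent when creating the index; no request is made here, so the snippet runs without a cluster.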

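We often need to index the same field in different ways. For example, a string field might be indexed as a Text field to support full-text search and also as a Keyword field for sorting and aggregations. Another example: running both a Chinese analyzer and an English analyzer over the same field. After all, some of us mix Chinese and English in everyday speech. Let's make Money together, a lot, a lot!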

It’s often useful to index the same field in different ways for different purposes. For example, you might want to index a string field as both a text field for full-text search and as a keyword field for sorting or aggregating your data. Or, you might choose to use more than one language analyzer to process the contents of a string field that contains user input.
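Indexing one field several ways is usually expressed with multi-fields, that is, the `fields` parameter of a mapping. The helper below is a small sketch that builds such a mapping as a Python dict; the helper name `multi_field` and the field name `comment` are hypothetical, while `english` and `cjk` are built-in language analyzers (the ik analyzer mentioned earlier is a separate plugin and is not assumed here).

```python
import json


def multi_field(analyzers):
    """Build a text field mapping with a keyword sub-field
    plus one text sub-field per named analyzer."""
    fields = {"keyword": {"type": "keyword"}}
    for name in analyzers:
        fields[name] = {"type": "text", "analyzer": name}
    return {"type": "text", "fields": fields}


# Hypothetical field "comment", indexed four ways: default text analysis,
# exact keyword, English analysis, and CJK analysis.
mapping = {
    "mappings": {
        "properties": {"comment": multi_field(["english", "cjk"])}
    }
}
print(json.dumps(mapping, indent=2))
```

With a mapping like this, the sub-fields would be addressed as `comment.keyword`, `comment.english`, and `comment.cjk`, so sorting and aggregations can use the keyword version while searches pick whichever analyzed version suits the query.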

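At search time, Elasticsearch analyzes the query text with the same analysis chain that was applied to the full-text (Text) field when the document was stored. Only when indexing and searching use the same analysis can the search return the right results.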

The analysis chain that is applied to a full-text field during indexing is also used at search time. When you query a full-text field, the query text undergoes the same analysis before the terms are looked up in the index.
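Why the analysis has to match can be shown with a toy example. The `analyze` function below stands in for an analysis chain (lowercasing plus punctuation stripping); it is an assumption made for illustration and far simpler than a real analyzer. The toy `search` also requires every term to match, which is a simplification chosen to keep the code short.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: lowercase, split on whitespace, strip punctuation."""
    tokens = (token.strip(".,!?") for token in text.lower().split())
    return [token for token in tokens if token]


inverted = defaultdict(set)


def index_doc(doc_id, text):
    """Index a document using the toy analysis chain."""
    for term in analyze(text):
        inverted[term].add(doc_id)


def search(query_text, analyzer=analyze):
    """Return IDs of documents containing every term of the analyzed query."""
    terms = analyzer(query_text)
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result


index_doc(1, "Quick Brown Fox!")

# Same analysis at search time: "QUICK fox" becomes ["quick", "fox"].
print(search("QUICK fox"))  # {1}

# Mismatched analysis (plain split, no lowercasing): "QUICK" is not in the index.
print(search("QUICK fox", analyzer=str.split))  # set()
```

The second query fails only because its terms were produced differently from the indexed terms, which is exactly the situation the shared analysis chain avoids.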
