Importing nested JSON documents into Elasticsearch and making them searchable

Question


We have a MongoDB collection that we want to import into Elasticsearch (for now as a one-off effort). To this end, we have exported the collection with mongoexport. The result is a huge JSON file with entries like the following:

    {
     "RefData" : {
      "DebtInstrmAttrbts" : {
        "NmnlValPerUnit" : "2000",
        "IntrstRate" : {
          "Fxd" : "3.1415"
        },
        "MtrtyDt" : "2020-01-01",
        "TtlIssdNmnlAmt" : "200000000",
        "DebtSnrty" : "SNDB"
      },
      "TradgVnRltdAttrbts" : {
        "IssrReq" : "false",
        "Id" : "BMTF",
        "FrstTradDt" : "2019-04-01T12:34:56.789"
      },
      "TechAttrbts" : {
        "PblctnPrd" : {
          "FrDt" : "2019-04-04"
        },
        "RlvntCmptntAuthrty" : "GB"
      },
      "FinInstrmGnlAttrbts" : {
        "ClssfctnTp" : "DBFNXX",
        "ShrtNm" : "AVGO  3.625  10/16/24 c24 (URegS)",
        "FullNm" : "AVGO 3  5/8  10/15/24 BOND",
        "NtnlCcy" : "USD",
        "Id" : "USU1109MAXXX",
        "CmmdtyDerivInd" : "false"
      },
      "Issr" : "549300WV6GIDOZJTVXXX"
      }
    }

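For reference, the export itself was a plain mongoexport call along these lines (the database and collection names here are placeholders):

    mongoexport --db=firds --collection=instruments --out=FIRDS.json

By default, mongoexport writes one JSON document per line, which matches what the file input with the json codec in the configuration below expects.
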
We are using the following Logstash configuration file to import this data set into Elasticsearch:

    input {
      file {
        path => "/home/elastic/FIRDS.json"
        start_position => "beginning"   # read the file from the top
        sincedb_path => "/dev/null"     # do not remember the read position (one-off import)
        codec => json                   # each line of the export is one JSON document
      }
    }
    filter {
      mutate {
        # drop MongoDB's _id and Logstash's own bookkeeping fields
        remove_field => [ "_id", "path", "host" ]
      }
    }
    output {
      elasticsearch {
        hosts => [ "localhost:9200" ]
        index => "firds"
      }
    }
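
We run the import as a one-off invocation (the name of the configuration file is our own):

    bin/logstash -f firds.conf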

All this works fine: the data ends up in the firds index in Elasticsearch, and a GET /firds/_search returns all the entries within the _source field. We understand that this field is not indexed and thus not searchable, and searchability is exactly what we are after: we want to make all of the entries within the original nested JSON searchable in Elasticsearch.
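
To make concrete what we mean by searchable: a query along these lines (the field path is taken from the sample document above) should return the matching entries:

    GET /firds/_search
    {
      "query": {
        "match": {
          "RefData.FinInstrmGnlAttrbts.NtnlCcy": "USD"
        }
      }
    }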

We assume that we have to adjust the filter {} part of our Logstash configuration, but how? For consistency, it would be nice to keep the original nested JSON structure, but that is not a must; flattening would also be an option, so that e.g.

"RefData" : {
      "DebtInstrmAttrbts" : {
        "NmnlValPerUnit" : "2000" ... 

becomes a single key-value pair "RefData.DebtInstrmAttrbts.NmnlValPerUnit" : "2000".

It would be great if we could do that directly with Logstash, without using an additional Python script operating on the JSON file we exported from MongoDB.
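
One direction we have considered but not tested is a ruby filter that walks the nested hash and re-emits dotted top-level keys. A rough sketch (assuming the RefData value arrives as a plain Ruby hash; event.get, event.set and event.remove are Logstash's Ruby event API):

    filter {
      ruby {
        code => '
          # recursively turn { "a" => { "b" => 1 } } into { "a.b" => 1 }
          def flatten_hash(prefix, value, out)
            if value.is_a?(Hash)
              value.each { |k, v| flatten_hash("#{prefix}.#{k}", v, out) }
            else
              out[prefix] = value
            end
          end
          flat = {}
          flatten_hash("RefData", event.get("RefData"), flat)
          flat.each { |k, v| event.set(k, v) }
          event.remove("RefData")
        '
      }
    }

Since Elasticsearch 5.x, field names containing dots are accepted again and are mapped as the corresponding object path, so the index side should not need any special handling.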

EDIT (workaround): Our current workaround is to (1) dump the MongoDB database to dump.json, (2) flatten it with jq using the expression below, and (3) manually import the result into Elasticsearch.

ad (2): This is the flattening step:

    jq -c '. as $in
           | reduce leaf_paths as $path ({}; . + { ($path | join(".")): $in | getpath($path) })
           | del(."_id.$oid")' dump.json > flattened.json
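
ad (3): The manual import is then just a loop over the flattened lines, roughly like this (assuming a local Elasticsearch 7.x node and the firds index from above):

    while read -r doc; do
      curl -s -X POST "localhost:9200/firds/_doc" \
           -H 'Content-Type: application/json' \
           -d "$doc"
    done < flattened.json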

References

  • Walker Rowe: ElasticSearch Nested Queries: How to Search for Embedded Documents
  • ElasticSearch search in document and in dynamic nested document
  • Mapping for Nested JSON document in Elasticsearch
  • Logstash - import nested JSON into Elasticsearch

Remark for the curious: The JSON shown is a (modified) entry from the Financial Instruments Reference Data System (FIRDS), available from the European Securities and Markets Authority (ESMA), a European financial regulatory agency overseeing the capital markets.

Source: https://stackoverflow.com/questions/56293134/importing-nested-json-documents-into-elasticsearch-and-making-them-searchable
