Importing nested JSON documents into Elasticsearch and making them searchable

Question


We have a MongoDB collection that we want to import into Elasticsearch (for now as a one-off effort). To this end, we have exported the collection with mongoexport. The result is a huge JSON file with entries like the following:

    {
     "RefData" : {
      "DebtInstrmAttrbts" : {
        "NmnlValPerUnit" : "2000",
        "IntrstRate" : {
          "Fxd" : "3.1415"
        },
        "MtrtyDt" : "2020-01-01",
        "TtlIssdNmnlAmt" : "200000000",
        "DebtSnrty" : "SNDB"
      },
      "TradgVnRltdAttrbts" : {
        "IssrReq" : "false",
        "Id" : "BMTF",
        "FrstTradDt" : "2019-04-01T12:34:56.789"
      },
      "TechAttrbts" : {
        "PblctnPrd" : {
          "FrDt" : "2019-04-04"
        },
        "RlvntCmptntAuthrty" : "GB"
      },
      "FinInstrmGnlAttrbts" : {
        "ClssfctnTp" : "DBFNXX",
        "ShrtNm" : "AVGO  3.625  10/16/24 c24 (URegS)",
        "FullNm" : "AVGO 3  5/8  10/15/24 BOND",
        "NtnlCcy" : "USD",
        "Id" : "USU1109MAXXX",
        "CmmdtyDerivInd" : "false"
      },
      "Issr" : "549300WV6GIDOZJTVXXX"
      }
    }

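For reference, the export itself was a plain mongoexport call along these lines (the database and collection names here are placeholders):

    mongoexport --db=firds --collection=instruments --out=FIRDS.json

By default, mongoexport writes one JSON document per line, which matches what the file input with the json codec in the configuration below expects.
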
We are using the following Logstash configuration file to import this data set into Elasticsearch:

    input {
      file {
        path => "/home/elastic/FIRDS.json"
        start_position => "beginning"   # read the file from the top
        sincedb_path => "/dev/null"     # do not remember the read position (one-off import)
        codec => json                   # each line of the export is one JSON document
      }
    }
    filter {
      mutate {
        # drop MongoDB's _id and Logstash's own bookkeeping fields
        remove_field => [ "_id", "path", "host" ]
      }
    }
    output {
      elasticsearch {
        hosts => [ "localhost:9200" ]
        index => "firds"
      }
    }
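
We run the import as a one-off invocation (the name of the configuration file is our own):

    bin/logstash -f firds.conf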

All this works fine: the data ends up in the firds index in Elasticsearch, and a GET /firds/_search returns all the entries within the _source field. We understand that this field is not indexed and thus not searchable, and searchability is exactly what we are after: we want to make all of the entries within the original nested JSON searchable in Elasticsearch.
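
To make concrete what we mean by searchable: a query along these lines (the field path is taken from the sample document above) should return the matching entries:

    GET /firds/_search
    {
      "query": {
        "match": {
          "RefData.FinInstrmGnlAttrbts.NtnlCcy": "USD"
        }
      }
    }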

We assume that we have to adjust the filter {} part of our Logstash configuration, but how? For consistency, it would be nice to keep the original nested JSON structure, but that is not a must; flattening would also be an option, so that e.g.

"RefData" : {
      "DebtInstrmAttrbts" : {
        "NmnlValPerUnit" : "2000" ... 

becomes a single key-value pair "RefData.DebtInstrmAttrbts.NmnlValPerUnit" : "2000".

It would be great if we could do that directly with Logstash, without using an additional Python script operating on the JSON file we exported from MongoDB.
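
One direction we have considered but not tested is a ruby filter that walks the nested hash and re-emits dotted top-level keys. A rough sketch (assuming the RefData value arrives as a plain Ruby hash; event.get, event.set and event.remove are Logstash's Ruby event API):

    filter {
      ruby {
        code => '
          # recursively turn { "a" => { "b" => 1 } } into { "a.b" => 1 }
          def flatten_hash(prefix, value, out)
            if value.is_a?(Hash)
              value.each { |k, v| flatten_hash("#{prefix}.#{k}", v, out) }
            else
              out[prefix] = value
            end
          end
          flat = {}
          flatten_hash("RefData", event.get("RefData"), flat)
          flat.each { |k, v| event.set(k, v) }
          event.remove("RefData")
        '
      }
    }

Since Elasticsearch 5.x, field names containing dots are accepted again and are mapped as the corresponding object path, so the index side should not need any special handling.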

EDIT (workaround): Our current workaround is to (1) dump the MongoDB database to dump.json, (2) flatten it with jq using the expression below, and (3) manually import the result into Elasticsearch.

ad (2): This is the flattening step:

    jq -c '. as $in
           | reduce leaf_paths as $path ({}; . + { ($path | join(".")): $in | getpath($path) })
           | del(."_id.$oid")' dump.json > flattened.json
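
ad (3): The manual import is then just a loop over the flattened lines, roughly like this (assuming a local Elasticsearch 7.x node and the firds index from above):

    while read -r doc; do
      curl -s -X POST "localhost:9200/firds/_doc" \
           -H 'Content-Type: application/json' \
           -d "$doc"
    done < flattened.json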

References

  • Walker Rowe: ElasticSearch Nested Queries: How to Search for Embedded Documents
  • ElasticSearch search in document and in dynamic nested document
  • Mapping for Nested JSON document in Elasticsearch
  • Logstash - import nested JSON into Elasticsearch

Remark for the curious: The JSON shown is a (modified) entry from the Financial Instruments Reference Data System (FIRDS), available from the European Securities and Markets Authority (ESMA), a European financial regulatory agency overseeing the capital markets.

Source: https://stackoverflow.com/questions/56293134/importing-nested-json-documents-into-elasticsearch-and-making-them-searchable
