Question
This question is more about approach than about source code.
I have an ES 2.x node holding more than 1.2 TB of data. We have 40+ indices, each with at least one type. Here, ES 2.x is used as a database rather than as a search engine. The source that was used to dump data into ES 2.x is lost. The data is also not normalised: a single ES document has multiple embedded documents. Our aim is to recreate the data source and normalise it at the same time.
What we are planning is:
- Retrieve data from ES, analyse it and dump it into a new MongoDB, into specific collections, maintaining the relations between the data, i.e. save it in normalised form.
- Index the new MongoDB data on a new ES 6 node.
We are using JRuby 9.1.15.0, Rails 5, Ruby 2.4 and Sidekiq.
Currently, we retrieve data from ES for a specific date-time range. Sometimes we receive 0 records and sometimes 100,000+. The problem arises when we receive a huge number of records.
Here is a sample script that works when the data for a date range is small but fails when it is large. The average index size is 1.2 TB / 40 indices, i.e. roughly 30 GB per index.
class DataRetrieverWorker
  include Sidekiq::Worker
  include Sidekiq::Status::Worker

  def perform(indx_name, interval = 24, start_time = nil, end_time = nil)
    client = ElasticSearchClient.instance.client

    unless start_time || end_time
      # No range given yet: work out the next time slot to fetch (usually 24 hrs).
      last_retrieved_at = RetrievedIndex.where(name: indx_name).desc(:created_at).first
      start_time, end_time =
        if last_retrieved_at
          # Continue from where the last run stopped.
          [last_retrieved_at.end_time, last_retrieved_at.end_time + interval.hours]
        else
          # First run: start from the day of the oldest document in the index.
          data = client.search index: indx_name, size: 1, sort: [{ insert_time: { order: 'asc' } }]
          first_day = DateTime.parse(data['hits']['hits'].first['_source']['insert_time'])
          [first_day.beginning_of_day, first_day.end_of_day]
        end
      DataRetrieverWorker.perform_async(indx_name, interval, start_time, end_time)
    else
      # Start a scroll over the specified range and retrieve the data.
      query = { range: { insert_time: { gt: DateTime.parse(start_time).utc.iso8601, lt: DateTime.parse(end_time).utc.iso8601 } } }
      data = client.search index: indx_name, scroll: '10m', size: SCROLL_SIZE, body: { query: query }

      # Skip ranges that have already been processed.
      ri = RetrievedIndex.find_by(name: indx_name, start_time: start_time, end_time: end_time)
      if ri
        DataRetrieverWorker.perform_at(2.seconds.from_now, indx_name, interval)
        return
      end

      ri = RetrievedIndex.create!(name: indx_name, start_time: start_time, end_time: end_time, documents_cnt: data['hits']['total'])
      if data['hits']['total'] > 0
        if data['hits']['total'] > 2000
          # Large range: hand whole pages of hits to a bulk worker.
          BulkJobsHandlerWorker.perform_async(ri.id.to_s, data['hits']['hits'])
          while (data = client.scroll(body: { scroll_id: data['_scroll_id'] }, scroll: '10m')) && !data['hits']['hits'].empty?
            BulkJobsHandlerWorker.perform_async(ri.id.to_s, data['hits']['hits'])
          end
        else
          # Small range: schedule each document individually and track it.
          data['hits']['hits'].each do |r|
            schedule(r)
            ri.retrieved_documents.find_or_create_by!(es_id: r['_id'], es_index: indx_name)
          end
          while (data = client.scroll(body: { scroll_id: data['_scroll_id'] }, scroll: '10m')) && !data['hits']['hits'].empty?
            data['hits']['hits'].each do |r|
              schedule(r)
              ri.retrieved_documents.find_or_create_by!(es_id: r['_id'], es_index: indx_name)
            end
          end
        end
      else
        DataRetrieverWorker.perform_async(indx_name, interval)
        return
      end
      # Move on to the next time slot.
      DataRetrieverWorker.perform_async(indx_name, interval)
    end
  end

  private

  def schedule(data)
    DataPersisterWorker.perform_async(data)
  end
end
Questions:
- What should be the ideal approach to retrieve data from ES 2.x? We retrieve data for a date range and then use the scroll API to fetch the result set. Is this right?
- What should be done when we get a large result for a particular time range? Sometimes we get 20,000+ records for a time range of a few minutes. What should be the ideal approach?
- Is Sidekiq the right library for this amount of data processing?
- What should be the ideal configuration of the server running Sidekiq?
- Is using a date range the right approach to retrieve data? The number of documents varies a lot: 0 or 100,000+.
- Is there any better approach that would give a uniform number of records irrespective of the time range?
- I tried using the scroll API independently of the time range, but for an index with 100 crore (1 billion) records, is it right to use a scroll with size 100 (100 results per API call to ES)?
- The data in the indices is continuously being added to. None of the documents are updated.
We have tested our code and it handles nominal data (say 4-5k documents) per datetime range (say 6 hrs). We are also planning to shard the data. Since we need some Ruby callbacks to be executed whenever we add/update records in some collections, we will be using Mongoid. Direct data insertion into MongoDB without Mongoid is not an option.
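For illustration, a minimal sketch of the kind of Mongoid model this implies, where a callback re-indexes the normalised record into the new ES 6 node (the model, field and worker names below are placeholders, not our actual schema):

# Placeholder model: an after_save callback enqueues indexing into ES 6
# whenever a normalised record is created or updated in MongoDB.
class Customer
  include Mongoid::Document

  field :name,  type: String
  field :es_id, type: String   # id of the original ES 2.x document

  after_save :reindex_in_es6

  private

  def reindex_in_es6
    Es6IndexerWorker.perform_async(self.class.name, id.to_s)
  end
end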
Any pointers would be helpful. Thanks.
Answer 1:
In my opinion you should assume the process may fail at any stage.
IMHO, you shouldn't download the full documents up front, just the IDs of the documents matching the date range. This will significantly decrease the amount of data returned by Elasticsearch.
With these IDs you could enqueue another background worker (let's call it ImporterWorker) that takes the IDs as input, downloads the whole documents from Elasticsearch and exports them to MongoDB.
Additionally, if you get, say, 1,000,000 IDs, you could split them into N smaller chunks (e.g. 200 × 5,000) and enqueue N jobs (see the sketch after the list of benefits below).
Benefits:
Benefits:
- with splitting into chunks, you don't risk getting high-volume responses from Elasticsearch, because the chunk size determines the maximum size of an Elasticsearch response
- when something fails (a temporary networking problem or anything else), you can rerun the ImporterWorker with the IDs it was originally triggered with, and everything continues without interrupting the whole process; even if it fails again, you know the exact IDs that were not imported
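A minimal sketch of that flow, assuming the same elasticsearch-ruby client as in the question; IdCollectorWorker, ImporterWorker, the page size and the chunk size are illustrative names and values, not something prescribed by the answer:

# Sketch of the IDs-first idea: collect only the IDs for a date range, then
# enqueue fixed-size ImporterWorker jobs that fetch and persist the documents.
class IdCollectorWorker
  include Sidekiq::Worker

  CHUNK_SIZE = 5_000

  def perform(indx_name, start_time, end_time)
    client = ElasticSearchClient.instance.client
    query  = { range: { insert_time: { gt: start_time, lt: end_time } } }

    # _source: false keeps responses small: only hit metadata (IDs) comes back.
    data = client.search index: indx_name, scroll: '5m', size: 1_000,
                         body: { query: query, _source: false, sort: ['_doc'] }
    ids = []
    until data['hits']['hits'].empty?
      ids.concat(data['hits']['hits'].map { |h| h['_id'] })
      data = client.scroll(body: { scroll_id: data['_scroll_id'] }, scroll: '5m')
    end
    client.clear_scroll(body: { scroll_id: data['_scroll_id'] })

    # N chunks of at most CHUNK_SIZE IDs => N small, independently retryable jobs.
    ids.each_slice(CHUNK_SIZE) { |chunk| ImporterWorker.perform_async(indx_name, chunk) }
  end
end

class ImporterWorker
  include Sidekiq::Worker

  def perform(indx_name, ids)
    client = ElasticSearchClient.instance.client
    # Fetch the full documents for exactly these IDs and hand them to the
    # existing persistence worker (which writes to MongoDB via Mongoid).
    hits = client.search(index: indx_name, size: ids.size,
                         body: { query: { ids: { values: ids } } })['hits']['hits']
    hits.each { |h| DataPersisterWorker.perform_async(h) }
  end
end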
Answer 2:
- What should be the ideal approach to retrieve data from ES 2.x? We retrieve data for a date range and then use the scroll API to fetch the result set. Is this right?
Is the data continuously increasing in ES?
- What should be done when we get a large result for a particular time range? Sometimes we get 20,000+ records for a time range of a few minutes. What should be the ideal approach?
You are using the scroll API, which is a good approach. You can also give Elasticsearch's Sliced Scroll API a try.
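Note that sliced scroll was only introduced in Elasticsearch 5.0, so it would not run against the ES 2.x source node, but where it is available the idea is to split one scroll into N independent slices that can be consumed in parallel, e.g. one Sidekiq job per slice. A rough sketch, assuming the same elasticsearch-ruby client; worker name, slice count and page size are illustrative:

# Illustrative only: sliced scroll needs ES >= 5.0. Each worker consumes one
# slice (id) out of `max_slices` slices of the same query, independently.
class SlicedScrollWorker
  include Sidekiq::Worker

  def perform(indx_name, slice_id, max_slices)
    client = ElasticSearchClient.instance.client
    data = client.search index: indx_name, scroll: '5m', size: 1_000,
                         body: {
                           slice: { id: slice_id, max: max_slices },
                           sort:  ['_doc'],
                           query: { match_all: {} }
                         }
    loop do
      hits = data['hits']['hits']
      break if hits.empty?
      hits.each { |r| DataPersisterWorker.perform_async(r) }
      data = client.scroll(body: { scroll_id: data['_scroll_id'] }, scroll: '5m')
    end
    client.clear_scroll(body: { scroll_id: data['_scroll_id'] })
  end
end

# Enqueue one job per slice, e.g. 4 slices consumed in parallel:
# 4.times { |i| SlicedScrollWorker.perform_async('my_index', i, 4) }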
- Is Sidekiq the right library for this amount of data processing?
Yes, Sidekiq is a good fit and can process this amount of data.
- What should be the ideal configuration of the server running Sidekiq?
What is your current configuration of the server running Sidekiq?
- Is using a date range the right approach to retrieve data? The number of documents varies a lot: 0 or 100,000+.
You are not holding 100,000+ results at a time; you process them in chunks using the scroll API. If data is not continually being added to ES, use a match_all: {} query with the scroll API. If data is continually being added, a date range is a fine approach.
- Is there any better approach that would give a uniform number of records irrespective of the time range?
Yes, if you drop the date range and scan all documents from first to last with the scroll API: each scroll page then returns a fixed number of results, regardless of how the documents are distributed in time.
- I tried using the scroll API independently of the time range, but for an index with 100 crore (1 billion) records, is it right to use a scroll with size 100 (100 results per API call to ES)?
You can increase the scroll size, since MongoDB supports bulk insertion of documents (see MongoDB Bulk Insert).
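As an illustration of the bulk-insert suggestion (not the question's Mongoid-with-callbacks setup): a hypothetical worker that writes one scroll page per MongoDB round trip via the driver's insert_many. Note that insert_many goes through the Mongo driver directly and skips Mongoid callbacks, which the question requires, so this only shows why a larger scroll size pairs well with batched writes; RawDocument is a placeholder model.

# Hypothetical worker: persist a whole page of scroll hits in one bulk write.
class BulkPersisterWorker
  include Sidekiq::Worker

  def perform(hits)
    docs = hits.map do |h|
      h['_source'].merge('es_id' => h['_id'], 'es_index' => h['_index'])
    end
    # One round trip to MongoDB per scroll page instead of one per document.
    RawDocument.collection.insert_many(docs)
  end
end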
The points below may also help:
- Clearing the scroll context after processing each batch may improve performance.
- Scroll requests have optimisations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option.
- The scroll parameter tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m) does not need to be long enough to process all of the data; it just needs to be long enough to process the previous batch of results. Each scroll request sets a new expiry time.
- Search contexts are automatically removed when the scroll timeout has been exceeded. However, keeping scrolls open has a cost, so scrolls should be explicitly cleared as soon as they are no longer being used, via the clear-scroll API.
- The background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request regardless of subsequent changes to documents. Keeping older segments alive means that more file handles are needed, so ensure that nodes have ample free file handles and that the scroll context is cleared soon after the data is fetched. You can check how many search contexts are open with the nodes stats API.
It is therefore important to clear the scroll context, as described above, as soon as you are done with it; a short sketch follows below.
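For reference, a rough sketch, assuming the same elasticsearch-ruby client as in the question, of iterating with sort: ['_doc'], releasing the scroll context when done, and checking open search contexts; the index name is a placeholder and the stats field path should be verified against your ES version:

# Scroll in _doc order, free the search context in an ensure block, then
# report open search contexts per node from the nodes stats API.
client = ElasticSearchClient.instance.client

data = client.search index: 'my_index', scroll: '10m', size: 1_000,
                     body: { sort: ['_doc'], query: { match_all: {} } }
begin
  until data['hits']['hits'].empty?
    data['hits']['hits'].each { |r| DataPersisterWorker.perform_async(r) }
    data = client.scroll(body: { scroll_id: data['_scroll_id'] }, scroll: '10m')
  end
ensure
  # Release the server-side search context as soon as the scroll is finished.
  client.clear_scroll(body: { scroll_id: data['_scroll_id'] })
end

# Open search contexts per node ("indices" -> "search" -> "open_contexts"):
stats = client.nodes.stats(metric: 'indices')
stats['nodes'].each do |node_id, node|
  puts "#{node_id}: #{node['indices']['search']['open_contexts']} open search contexts"
end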
Source
Answer 3:
A very handy tool to backup/restore or re-index data based on Elasticsearch queries is Elasticdump.
To backup complete indices, the Elasticsearch snapshot API is the right tool. The snapshot API provides operations to create and restore snapshots of whole indices, stored in files, or in Amazon S3 buckets.
Let’s have a look at a few examples for Elasticdump and snapshot backups and recovery.
Install elasticdump with the node package manager
npm i elasticdump -g
Backup by query to a gzipped file:
elasticdump --input='http://username:password@localhost:9200/myindex' --searchBody '{"query" : {"range" :{"timestamp" : {"lte": 1483228800000}}}}' --output=$ --limit=1000 | gzip > /backups/myindex.gz
Restore from a gzipped file:
zcat /backups/myindex.gz | elasticdump --input=$ --output=http://username:password@localhost:9200/index_name
Examples for backing up and restoring data with snapshots to Amazon S3 or files
First configure the snapshot destination
S3 example
curl 'localhost:9200/_snapshot/my_repository?pretty' -XPUT -H 'Content-Type: application/json' -d '{ "type" : "s3", "settings" : { "bucket" : "test-bucket", "base_path" : "backup-2017-01", "max_restore_bytes_per_sec" : "1gb", "max_snapshot_bytes_per_sec" : "1gb", "compress" : "true", "access_key" : "<ACCESS_KEY_HERE>", "secret_key" : "<SECRET_KEY_HERE>" } }'
Local disk or mounted NFS example
curl 'localhost:9200/_snapshot/my_repository?pretty' -XPUT -H 'Content-Type: application/json' -d '{ "type" : "fs", "settings" : { "location": "<PATH … for example /mnt/storage/backup>" } }'
Trigger snapshot
curl -XPUT 'localhost:9200/_snapshot/my_repository/<snapshot_name>'
Show all backups
curl 'localhost:9200/_snapshot/my_repository/_all'
Restore – the most important part of backup is verifying that backup restore actually works!
curl -XPOST 'localhost:9200/_snapshot/my_repository/<snapshot_name>/_restore'
This text was found in:
https://sematext.com/blog/elasticsearch-security-authentication-encryption-backup/
Source: https://stackoverflow.com/questions/48605000/dump-elasticsearch-2-x-to-mongodb-and-back-to-es-6-x