Is there a smarter way to reindex elasticsearch?

独厮守ぢ 2020-12-04 05:18

I ask because our search is in a state of flux as we work things out, but each time we make a change to the index (change tokenizer or filter, or number of shards/replicas),

4 answers
  • 2020-12-04 06:12

    I think @karmi has it right, but let me explain it a bit more simply. I occasionally needed to upgrade a production schema with new properties or analysis settings. I recently started using the scenario described below to do live, zero-downtime index migrations under constant load. You can do this remotely.

    Here are the steps:

    Assumptions:

    • you have an index real1 and the aliases real_write and real_read pointing to it,
    • the client writes only to real_write and reads only from real_read,
    • the _source property of each document is available.

    1. New index

    Create the real2 index with the new mapping and settings of your choice.
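As a rough sketch of step 1, the new index can be created with a single PUT request that carries both settings and mappings. The analyzer name `my_analyzer`, the `title` field, and the shard and replica counts below are illustrative assumptions, not values from this answer, and the typeless mapping body assumes Elasticsearch 7 or later. The request is only built here, not sent.

```python
import json
import urllib.request

ES_URL = "http://esserver:9200"  # host taken from the curl examples in this answer


def build_create_index_request(index, settings, mappings):
    """Build (but do not send) a PUT request that creates `index`."""
    body = json.dumps({"settings": settings, "mappings": mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical settings and mapping, for illustration only.
settings = {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
        "analyzer": {
            "my_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        }
    },
}
mappings = {"properties": {"title": {"type": "text", "analyzer": "my_analyzer"}}}

request = build_create_index_request("real2", settings, mappings)
# To actually create the index: urllib.request.urlopen(request)
```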

    2. Writer alias switch

    Switch the write alias with the following request, which bundles both alias actions:

    curl -XPOST 'http://esserver:9200/_aliases' -H 'Content-Type: application/json' -d '
    {
        "actions" : [
            { "remove" : { "index" : "real1", "alias" : "real_write" } },
            { "add" : { "index" : "real2", "alias" : "real_write" } }
        ]
    }'
    

    This is an atomic operation. From this point on, real2 receives new client data on all nodes. Readers still use the old real1 via real_read, so they will not see the new documents until step 4; in that sense the system is eventually consistent.
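Since the same remove-and-add pair comes up again for the read alias, and in reverse for a rollback, it can help to generate the body. This is a minimal sketch; the helper name `alias_swap_body` is made up for illustration, and the result is meant to be sent as the body of `POST /_aliases`.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Return the JSON body for POST /_aliases that moves `alias` in one call."""
    return json.dumps({
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })


write_switch = alias_swap_body("real_write", "real1", "real2")  # step 2
rollback = alias_swap_body("real_write", "real2", "real1")      # undo, if needed
```

Because both actions travel in one request, there is no moment at which the alias points to neither index.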

    3. Old data migration

    Data must be migrated from real1 to real2, but new documents in real2 must not be overwritten by old entries. The migration script should therefore use the bulk API with the create operation (not index or update). I use a simple Ruby script, es-reindex, which shows a nice ETA status:

    $ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2
    

    UPDATE 2017: You may want to consider the newer Reindex API instead of the script. It has a lot of interesting features, such as conflict reporting.
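For the Reindex API route, a request body along these lines should give the same "create only" behaviour as the script. This is a sketch under the assumption that the cluster supports `_reindex`: `op_type: create` makes the copy skip documents that already exist in the destination, and `conflicts: proceed` reports those skips as version conflicts instead of aborting the run. The helper name `reindex_body` is made up for illustration.

```python
import json


def reindex_body(source_index, dest_index):
    """Return a body for POST /_reindex that never overwrites existing documents."""
    return json.dumps({
        "conflicts": "proceed",  # count conflicts and keep going
        "source": {"index": source_index},
        "dest": {"index": dest_index, "op_type": "create"},  # create only
    })


body = reindex_body("real1", "real2")
# Send with: POST http://esserver:9200/_reindex  (Content-Type: application/json)
```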

    4. Reader alias switch

    Now real2 is up to date and clients are writing to it, but they are still reading from real1. Let's update the reader alias:

    curl -XPOST 'http://esserver:9200/_aliases' -H 'Content-Type: application/json' -d '
    {
        "actions" : [
            { "remove" : { "index" : "real1", "alias" : "real_read" } },
            { "add" : { "index" : "real2", "alias" : "real_read" } }
        ]
    }'
    

    5. Backup and delete old index

    Writes and reads now go to real2. You can back up and delete the real1 index from the ES cluster.

    Done!

  • 2020-12-04 06:16

    Yes, there are smarter ways to re-index your data without downtime.

    First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.

    Second, create an alias "articles" pointing to that index. Your application will then use this alias, so you'll never need to manually change the index name, restart the application, etc.

    Third, when you want or need to re-index the data, re-index it into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchain support you here.

    When you're done, point the alias to the new index. In this way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can just switch back to the old one, until you resolve the issue.
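The naming convention above can be sketched as a pair of small helpers. The function names are made up for illustration, and the date suffix is just one of the schemes mentioned ("articles-A" or "articles-1" would work equally well).

```python
from datetime import date


def physical_index_name(alias, day):
    """Derive a physical index name such as 'articles-2012-12-12' from the alias."""
    return f"{alias}-{day.isoformat()}"


def initial_alias_action(alias, day):
    """Action for POST /_aliases that points the alias at its physical index."""
    return {"add": {"index": physical_index_name(alias, day), "alias": alias}}


name = physical_index_name("articles", date(2012, 12, 12))
action = initial_alias_action("articles", date(2012, 12, 12))
```

The application only ever sees "articles"; the dated name stays an operational detail.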

  • 2020-12-04 06:16

    Maybe create another index, reindex all the data into it, and then make the switch once the re-indexing is done?

  • 2020-12-04 06:17

    Wrote up a blog post about how I handled reindexing with no downtime recently. Takes some time to figure out all the little things that need to be in place to do so. Hope this helps!

    https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html

    To summarize:

    Step 1: Prepare New Index

    Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.

    Step 2: Keep Indexes Up To Date

    While you're reindexing you want to keep both your new and old indexes up to date. For a write operation, this can be done by sending the write operation to a background worker on both the new and old index.

    Deletes are a bit trickier because there is a race condition between deleting and reindexing the record into the new index. So, you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way would be to eliminate the possibility of a delete during your reindex.
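One way to picture step 2 is a small coordinator that fans writes out to both indexes and holds back deletes for the new index until the reindex has finished. This is only an in-memory sketch: the class name `DualWriter` and the `ops` list are assumptions for illustration, and a real background worker would send these operations to Elasticsearch instead of recording them.

```python
class DualWriter:
    """Fan writes out to both indexes and defer deletes on the new one."""

    def __init__(self, old_index, new_index):
        self.old_index = old_index
        self.new_index = new_index
        self.reindexing = True
        self.pending_deletes = []  # ids to remove from new_index later
        self.ops = []              # stand-in for requests sent to Elasticsearch

    def write(self, doc_id, doc):
        # Every write goes to both indexes so neither falls behind.
        for index in (self.old_index, self.new_index):
            self.ops.append(("index", index, doc_id, doc))

    def delete(self, doc_id):
        self.ops.append(("delete", self.old_index, doc_id, None))
        if self.reindexing:
            # The reindex could re-create the document, so delete it later.
            self.pending_deletes.append(doc_id)
        else:
            self.ops.append(("delete", self.new_index, doc_id, None))

    def finish_reindex(self):
        # Replay the deferred deletes once the bulk copy is complete.
        self.reindexing = False
        for doc_id in self.pending_deletes:
            self.ops.append(("delete", self.new_index, doc_id, None))
        self.pending_deletes.clear()


writer = DualWriter("old_index", "new_index")
writer.write("1", {"title": "hello"})
writer.delete("1")
writer.finish_reindex()
```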

    Step 3: Perform Reindexing

    You’ll want to use a scrolled search for reading the data and bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.

    This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.
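A minimal sketch of what such a bulk body looks like, assuming each scrolled hit exposes `_id` and `_source` and that the cluster uses typeless indexes (older versions also expect a `_type` in the action line). The helper name and the sample hits are made up for illustration.

```python
import json


def bulk_create_body(hits, dest_index):
    """Turn scrolled search hits into a newline-delimited bulk body of create actions."""
    lines = []
    for hit in hits:
        # Action line: "create" fails for ids that already exist in dest_index.
        lines.append(json.dumps({"create": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


hits = [
    {"_id": "1", "_source": {"title": "first"}},
    {"_id": "2", "_source": {"title": "second"}},
]
body = bulk_create_body(hits, "new_index")
# Send with: POST /_bulk  (Content-Type: application/x-ndjson)
```

Items that hit an existing document come back as per-item conflicts in the bulk response, which is exactly the case where the newer copy should win.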

    There's a great script on github to help you with this process: es-reindex.

    Step 4: Switch Over

    Once you’re finished reindexing, it’s time to switch your search over to the new index. You’ll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.

    Make any code changes you need so your application starts searching the new index. You can continue writing to the old index in case you run into problems and need to roll back. If you feel this is unnecessary, you can stop writing to it.

    Step 5: Clean Up

    At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:

    • Delete the old index host if it’s different from the new one
    • Remove serialization code related to your old index