ElasticSearch river JDBC MySQL not deleting records

Question


I'm using the JDBC plugin for ElasticSearch to update my MySQL database. It picks up new and changed records, but does not delete records that have been removed from MySQL. They remain in the index.

This is the code I use to create the river:

curl -XPUT 'localhost:9200/_river/account_river/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "test_user",
        "password" : "test_pass",
        "sql" : "SELECT `account`.`id` as `_id`, `account`.`id`, `account`.`reference`, `account`.`company_name`, `account`.`also_known_as` from `account` WHERE NOT `account`.`deleted`",
        "strategy" : "simple",
        "poll" : "5s",
        "versioning" : true,
        "digesting" : false,
        "autocommit" : true,
        "index" : "headphones",
        "type" : "Account"
    }
}'

I installed ElasticSearch via Homebrew on OS X Mountain Lion with no errors or problems, and everything responds as expected. Permissions are OK, and there are no errors in the logs.

I have removed and included (and set to true and false) every combination of autocommit, versioning, and digesting that I could think of. It's a dev database, so I'm sure the records are fully deleted, not cached and not soft-deleted. If I delete all the records on the ES side (i.e. leave the river intact and just delete what was indexed), the next time the river runs it does not re-add them, which leads me to believe I have missed something regarding versioning and deleting.

Note that I've also tried various ways of specifying the _id column, and I verified from the JSON response that it has a value.
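
For reference, a plain search against the type returns the indexed documents, with each hit's _id visible in the response:

curl -XGET 'localhost:9200/headphones/Account/_search?pretty'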

Cheers.


Answer 1:


Since this question was asked, the parameters have changed greatly: versioning and digesting have been deprecated, and poll has been replaced by schedule, which takes a cron expression specifying how often to rerun the river (the example below is scheduled to run every 5 minutes):

    curl -XPUT 'localhost:9200/_river/account_river/_meta' -d '{
        "type" : "jdbc",
        "jdbc" : {
            "driver" : "com.mysql.jdbc.Driver",
            "url" : "jdbc:mysql://localhost:3306/test",
            "user" : "test_user",
            "password" : "test_pass",
            "sql" : "SELECT `account`.`id` as `_id`, `account`.`id`, `account`.`reference`, `account`.`company_name`, `account`.`also_known_as` from `account` WHERE NOT `account`.`deleted`",
            "strategy" : "simple",
            "schedule": "0 0/5 * * * ?" ,
            "autocommit" : true,
            "index" : "headphones",
            "type" : "Account"
        }
    }'

But for the main question, the answer I got from the developer is this: https://github.com/jprante/elasticsearch-river-jdbc/issues/213

Deletion of rows is no longer detected.

I tried housekeeping with versioning, but this did not work well together with incremental updates and adding rows.

A good method would be windowed indexing. Each timeframe (maybe once per day or per week) a new index is created for the river, and added to an alias. Old indices are to be dropped after a while. This maintenance is similar to logstash indexing, but it is outside the scope of a river.
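
A minimal sketch of that windowed approach using the index aliases API (the dated index names and the alias name are assumptions for illustration, not from the original setup):

    # Create today's index, then atomically swap the alias "headphones"
    # from yesterday's index to today's. Searches go through the alias,
    # so clients never see the switchover.
    curl -XPUT 'localhost:9200/headphones_20140123'
    curl -XPOST 'localhost:9200/_aliases' -d '{
        "actions" : [
            { "add"    : { "index" : "headphones_20140123", "alias" : "headphones" } },
            { "remove" : { "index" : "headphones_20140122", "alias" : "headphones" } }
        ]
    }'
    # Drop the old index once it is no longer needed
    curl -XDELETE 'localhost:9200/headphones_20140122'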

The method I am currently using while I research aliasing is to recreate the index and river nightly, and schedule the river to run every few hours. This ensures that new data will be indexed the same day, and deletions will be reflected every 24 hours.
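
A sketch of that nightly rebuild (run it from cron at whatever time suits you; the index and river names are the ones from the examples above):

    # Drop the river and the index, then recreate the river with the same
    # curl -XPUT shown above; the freshly built index no longer contains
    # rows that were deleted from MySQL.
    curl -XDELETE 'localhost:9200/_river/account_river'
    curl -XDELETE 'localhost:9200/headphones'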




Answer 2:


I am still relatively new to Elasticsearch and have been using the JDBC river for my project. If I understand correctly (which may not necessarily be the case), this is how it works:

  1. Fetch all rows (specified by the SQL statement in the river) from the database.
  2. Calculate a digest from the (id, type, index) of all the fetched rows (if rows were added or deleted, this should change).
  3. Re-index the documents for all rows. This will automatically increment the version of each document.
  4. Increment the version of the river stored in the _river index (custom).
  5. If the digest calculated in #2 is different from the one stored in the _river index, then:
    • store it
    • run the housekeeping function (which deletes all docs with lower version numbers).

So, given that you want housekeeping to run, you need versioning set to true, and this in turn implies that digesting should be set to true as well.
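
One way to see this mechanism at work is to watch a document's _version increase after each poll: a row deleted from MySQL stops being re-indexed, its version falls behind the river's, and housekeeping removes it (document id 1 here is just an example):

curl -XGET 'localhost:9200/headphones/Account/1?pretty'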

Having said that, your river should look like this:

curl -XPUT 'localhost:9200/_river/account_river/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "driver" : "com.mysql.jdbc.Driver",
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "test_user",
        "password" : "test_pass",
        "sql" : "SELECT `account`.`id` as `_id`, `account`.`id`, `account`.`reference`, `account`.`company_name`, `account`.`also_known_as` from `account` WHERE NOT `account`.`deleted`",
        "strategy" : "simple",
        "poll" : "5s",
        "autocommit" : true,
        "index": {
          "index" : "headphones",
          "type" : "Account",
          "versioning" : true,
          "digesting" : true
        }
    }
}'

Note that versioning and digesting should be part of the index definition, not the jdbc definition.



Source: https://stackoverflow.com/questions/21260086/elasticsearch-river-jdbc-mysql-not-deleting-records
