Question
I have an ES cluster with multiple indices that all receive updates at random time intervals. I have a Logstash instance extracting data from ES and passing it into Kafka.
What would be a good method to run this every minute and pick up any updates in ES?
Conf:
input {
  elasticsearch {
    hosts  => [ "hostname1.com:5432", "hostname2.com" ]
    index  => "myindex-*"
    query  => "*"
    size   => 10000
    scroll => "5m"
  }
}
output {
  kafka {
    bootstrap_servers => "abc-kafka.com:1234"
    topic_id          => "my.topic.test"
  }
}
I would like to use the documents' @timestamp in the query and save it in a temp file, then rerun the query (on a schedule) and pick up only the latest updates/inserts (something like what Logstash's jdbc input plugin supports).
Any ideas?
Thank you in advance
Answer 1:
Someone asked the same thing a few months ago but that issue didn't get much traffic. You can +1 it, maybe.
In the meantime, you could modify the query in your elasticsearch input to be like this:
query => '{"query":{"range":{"timestamp":{"gt": "now-1m"}}}}'
i.e. you query all documents whose timestamp field (an arbitrary name; change it to match yours) is within the past minute.
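Dropped into the input from the question, that would look something like this (a sketch reusing the original hosts and the @timestamp field mentioned in the question; adjust both to your setup):
input {
  elasticsearch {
    hosts  => [ "hostname1.com:5432", "hostname2.com" ]
    index  => "myindex-*"
    query  => '{"query":{"range":{"@timestamp":{"gt":"now-1m"}}}}'
    size   => 10000
    scroll => "5m"
  }
}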
Then you need to set up a cron job that starts your Logstash process every minute. Due to the latency between the moment the cron fires, the moment Logstash starts running, and the moment the query arrives on the ES server side, just know that 1m might not be sufficient and you risk missing some docs. You need to test this and find out what works best.
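For example, a crontab entry along these lines (the binary and config paths are assumptions; adjust them to your installation):
# run the extraction pipeline once every minute
* * * * * /opt/logstash/bin/logstash -f /etc/logstash/es-to-kafka.conf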
According to this recent blog post, another way could be to record the last time Logstash ran in an environment variable called LAST_RUN and use that variable in the query:
query => '{"query":{"range":{"timestamp":{"gt": "${LAST_RUN}"}}}}'
In this scenario, you'd create a shell script that is run by the cron and that does basically this (a full sketch follows the list):
- run logstash -f your_config_file.conf
- when done, set LAST_RUN=$(date +"%FT%T")
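A minimal sketch of such a wrapper script, assuming LAST_RUN is persisted in a state file between runs (the state-file path and the now-1m first-run fallback are assumptions, not part of the original answer):
#!/bin/sh
# Cron-driven wrapper: exports LAST_RUN for the Logstash config, runs the
# extraction, then records the new marker for the next invocation.
STATE_FILE=/var/lib/logstash/last_run   # assumed location, adjust as needed

# Seed LAST_RUN from the previous run, or look back one minute on the first run.
if [ -f "$STATE_FILE" ]; then
  LAST_RUN=$(cat "$STATE_FILE")
else
  LAST_RUN="now-1m"
fi
export LAST_RUN

# ${LAST_RUN} in the config's query is resolved from the environment.
logstash -f your_config_file.conf

# When done, record the new marker.
date +"%FT%T" > "$STATE_FILE"
Note that, depending on your Logstash version, environment-variable interpolation in config files may need to be enabled explicitly (older releases gated it behind the --allow-env flag).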
Source: https://stackoverflow.com/questions/35886921/extract-from-elasticsearch-into-kafka-continuously-any-new-es-updates-using-lo