Question
I've been looking at Elasticsearch as a solution to get better search and analytics functionality at my company. All of our data is in SQL Server at the moment, and I've successfully installed the JDBC River and gotten some test data into ES.
Rivers look like they may be deprecated in a future release, and the JDBC river is maintained by a third party. Logstash doesn't seem to support indexing from SQL Server yet (I don't know if it's a planned feature).
So for my situation, where I want to move data from SQL Server to Elasticsearch, what's the preferred method of indexing the data and keeping the index current as SQL Server gets updated with new data?
From the linked thread:
We recommend that you own your indexing process out-of-band from ES and make sure it scales with your needs.
I'm not quite sure where to start with this. Is it on me to use one of the APIs ES provides?
Answer 1:
We use RabbitMQ to pipe data from SQL Server to ES. That way Rabbit takes care of the queuing and processing.
As a note, we can run over 4000 records per second from SQL into Rabbit. We do a bit more processing before putting the data into ES but we still insert into ES at over 1000 records per second. Pretty damn impressive on both ends. Rabbit and ES are both awesome!
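The answer doesn't show any code, but a minimal sketch of the Rabbit-to-ES consumer leg might look like this in Python, using the pika and elasticsearch clients. The queue name, index name, and the one-JSON-document-per-message format are all assumptions, not details from the answer:

```python
import json

import pika
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def on_message(channel, method, properties, body):
    doc = json.loads(body)
    # Use the SQL primary key as the ES _id so redeliveries overwrite
    # instead of duplicating (assumes each message carries an "id" field).
    es.index(index="sqlserver-data", id=doc["id"], document=doc)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="es-indexing", durable=True)
channel.basic_consume(queue="es-indexing", on_message_callback=on_message)
channel.start_consuming()
```

A producer on the SQL Server side would publish one message per changed row into the same queue; Rabbit then absorbs bursts so ES ingestion can run at its own pace.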
Answer 2:
There are a lot of things you can do. You can put your data in RabbitMQ or Redis, but your main problem is staying up to date. Ideally you'd look into an event-based approach, where the application publishes a change event whenever it writes to the database. But if SQL Server really is your only data source, you could work with timestamps and a query that checks for updates. Depending on the size of your database, you could also just reindex the complete dataset periodically.
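For illustration, a minimal sketch of the timestamp-polling variant with pyodbc. The connection string, table, and LastModified column are assumptions; a rowversion column would work the same way:

```python
import time

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Shop;Trusted_Connection=yes;"
)

last_seen = "1900-01-01"
while True:
    cursor = conn.cursor()
    # Fetch only rows modified since the last poll, oldest first.
    cursor.execute(
        "SELECT Id, Name, Price, LastModified FROM dbo.Products "
        "WHERE LastModified > ? ORDER BY LastModified",
        last_seen,
    )
    rows = cursor.fetchall()
    if rows:
        last_seen = rows[-1].LastModified
        # hand the changed rows to the indexing step (see below)
    time.sleep(30)  # poll interval; tune to your freshness needs
```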
Using events or the query-based solution, you can push these updates to Elasticsearch, probably using the bulk API.
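A sketch of that push step using the Python client's bulk helper; the index name and document shape are made-up examples:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def to_actions(rows):
    # The SQL primary key becomes the ES _id, so re-pushing an updated
    # row overwrites the old document instead of duplicating it.
    for row in rows:
        yield {
            "_index": "products",
            "_id": row["id"],
            "_source": {"name": row["name"], "price": row["price"]},
        }

rows = [{"id": 1, "name": "widget", "price": 9.99}]  # stand-in for real SQL rows
ok, errors = bulk(es, to_actions(rows), raise_on_error=False)
print(f"indexed {ok} docs, {len(errors)} failures")
```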
The good part about a custom solution like this is that you get to think carefully about your mapping. That matters if you really want to do something smart with your data.
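For example, a sketch of creating the index with an explicit mapping up front, rather than relying on dynamic mapping; the field names and types here are illustrative assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Declare field types explicitly instead of letting ES guess them.
es.indices.create(
    index="products",
    mappings={
        "properties": {
            "name": {"type": "text", "analyzer": "english"},  # full-text search
            "sku": {"type": "keyword"},                       # exact match / aggregations
            "price": {"type": "double"},
            "modified": {"type": "date"},
        }
    },
)
```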
Source: https://stackoverflow.com/questions/22237111/preferred-method-of-indexing-bulk-data-into-elasticsearch