Question
Since I'm not allowed to set up Flume on the prod servers, I have to download the logs, put them in a Flume spoolDir, and have a sink consume from the channel and write to Cassandra. Everything is working fine.
However, as I have a lot of log files in the spoolDir and the current setup only processes one file at a time, it's taking a while. I want to be able to process many files concurrently. One way I thought of is to keep using the spoolDir but distribute the files across 5-10 different directories and define multiple sources/channels/sinks, but this is a bit clumsy. Is there a better way to achieve this?
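For example, the split could look something like this in bash (the paths, glob, and directory count are made up for illustration):

    #!/usr/bin/env bash
    # Round-robin downloaded log files into N spool directories.
    # N, SRC, and DEST are placeholder values for this sketch.
    shopt -s nullglob          # skip the loop cleanly if there are no matches
    N=5
    SRC=/data/downloaded-logs
    DEST=/data/spool

    for d in $(seq 1 "$N"); do mkdir -p "$DEST/part$d"; done

    i=0
    for f in "$SRC"/*.log; do
        # mv is atomic within one filesystem, which suits the spoolDir
        # source's requirement that files be complete when they appear
        mv "$f" "$DEST/part$(( i % N + 1 ))/"
        i=$(( i + 1 ))
    done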
Thanks
Answer 1:
Just for the record, this has been answered in Flume's mailing list:
Hari Shreedharan wrote:
Unfortunately, no. The spoolDir source was kept single-threaded so that deserializer implementations can be kept simple. The approach with multiple spoolDir sources is the correct one, though they can all write to the same channel(s) - so you'd need only a larger number of sources, they can all share the same channel(s) and you don't need more sinks unless you want to pull data out faster.
http://mail-archives.apache.org/mod_mbox/flume-user/201409.mbox/browser
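To make that layout concrete, here is a minimal sketch of an agent config with three spoolDir sources sharing one channel and one sink. The directory paths, channel sizing, and the Cassandra sink class are placeholders; substitute whatever custom sink the actual setup uses:

    # Hypothetical agent "agent": several sources, one shared channel, one sink
    agent.sources = src1 src2 src3
    agent.channels = ch1
    agent.sinks = k1

    # Each source watches its own directory (paths are placeholders)
    agent.sources.src1.type = spooldir
    agent.sources.src1.spoolDir = /data/spool/part1
    agent.sources.src1.channels = ch1

    agent.sources.src2.type = spooldir
    agent.sources.src2.spoolDir = /data/spool/part2
    agent.sources.src2.channels = ch1

    agent.sources.src3.type = spooldir
    agent.sources.src3.spoolDir = /data/spool/part3
    agent.sources.src3.channels = ch1

    # One shared channel; capacities are illustrative
    agent.channels.ch1.type = memory
    agent.channels.ch1.capacity = 10000
    agent.channels.ch1.transactionCapacity = 1000

    # Single sink draining the shared channel; the class name is a placeholder
    # for the custom Cassandra sink used in the original setup
    agent.sinks.k1.type = com.example.CassandraSink
    agent.sinks.k1.channel = ch1

Add more src entries (and matching spool directories) to increase parallelism on the ingest side; per the answer above, extra sinks are only needed if the single sink can't drain the channel fast enough.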
Source: https://stackoverflow.com/questions/25875574/reading-flume-spooldir-in-parallel