Question
It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.
I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real-time processing. The output of the Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then writing two outputs from Storm. However, I'd like to eliminate the dependency on Storm for raw data storage.
Is this possible? Are you aware of any documentation/examples/implementations like this?
Also, does Kafka have good support for S3 storage?
I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?
Thanks -- I appreciate it!
Answer 1:
Regarding Camus: yes, a scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
Regarding Camus with S3: I don't think that is currently in place.
Answer 2:
Regarding Kafka support for S3 storage, there are several Kafka S3 consumers you can easily plug in to get your data saved to S3. kafka-s3-storage is one of them.
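As a rough illustration of what such a consumer does (a minimal sketch, not the actual kafka-s3-storage code), the following uses the modern Kafka consumer API and the AWS S3 SDK; the broker address, topic name, consumer group, bucket, and batching policy are all placeholder assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class SimpleKafkaToS3 {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "s3-archiver");  // its own consumer group, independent of Storm
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events"));  // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                StringBuilder batch = new StringBuilder();
                for (ConsumerRecord<String, String> r : records) {
                    batch.append(r.value()).append('\n');
                }
                if (batch.length() > 0) {
                    // One object per poll for brevity; a real archiver would batch by
                    // size/time and encode topic/partition/offset in the object key.
                    String key = "raw-events/" + System.currentTimeMillis() + ".log";
                    s3.putObject("my-archive-bucket", key, batch.toString());  // assumed bucket
                }
            }
        }
    }
}
```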
Answer 3:
There are many possible ways to feed Storm with translated data. The main question that is not clear to me is which dependency you wish to eliminate and which tasks you wish to keep Storm from doing. If it is acceptable for Storm to receive XML or JSON, you could easily read from the original topic using two consumers. Since each consumer controls the messages it reads, both can read the same messages: one consumer inserts the data into your storage, and the other translates the information and sends it to Storm. This is entirely feasible, but I believe it is not the ideal solution, for the following reasons:
Maintainability - a consumer needs supervision, so you would need to supervise your running consumers. Depending on your deployment and the way you handle data types, this might be a non-trivial effort, especially when you already have Storm installed and therefore supervised.
Storm connectivity - you still need to figure out how to connect this data to Storm. Storm has a Kafka spout, which I have used, and it works very well. But with the suggested architecture, this means an additional Kafka topic to place the translated messages on. This is not very efficient, as the spout could also read the information directly from the original topic and translate it using a simple bolt.
The suggested way to handle this would be to form a topology, using the Kafka spout to read the raw data, with one bolt sending the raw data to storage and another one translating it; a sketch of such a topology follows. But this solution depends on the reasons you wish to keep Storm out of the raw data business.
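As a minimal sketch of that topology (using the storm-kafka spout API of that era; the ZooKeeper address, topic name, and bolt bodies are placeholders, and the storage/translation logic is only stubbed out):

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class RawAndTranslateTopology {

    // Bolt 1: hands the raw message to long-term storage (HDFS/S3 writer omitted).
    public static class RawStorageBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String raw = tuple.getString(0);
            // store(raw);  // plug in your HDFS/S3 writer here
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    // Bolt 2: translates the message and feeds the rest of the analysis topology.
    public static class TranslateBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String raw = tuple.getString(0);
            // collector.emit(new Values(translate(raw)));  // emit the translated form
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declarer.declare(new Fields("translated"));
        }
    }

    public static void main(String[] args) {
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("localhost:2181"), "raw-events", "/kafka-spout", "storm-consumer");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig));
        // Both bolts subscribe to the same spout stream, so every raw message
        // reaches storage and translation independently.
        builder.setBolt("raw-storage", new RawStorageBolt()).shuffleGrouping("kafka-spout");
        builder.setBolt("translate", new TranslateBolt()).shuffleGrouping("kafka-spout");

        new LocalCluster().submitTopology("raw-and-translate", new Config(),
                builder.createTopology());
    }
}
```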
Answer 4:
Kafka actually retains events for a configurable period of time -- unlike many other message or queue systems, events are not purged immediately upon consumption. This allows you to have multiple consumers that read from Kafka either from the beginning (subject to the configurable retention time) or from a given offset.
For the use case described, you would use Camus to batch-load events to Hadoop, and Storm to read events off the same Kafka topic. Just ensure both processes read new events before the configurable retention time expires.
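To illustrate the point about offsets, here is a minimal sketch using the modern Kafka consumer API (which post-dates these answers); the broker address, topic, partition, and offset are placeholder assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-reader");  // independent of other consumers' groups
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition partition = new TopicPartition("raw-events", 0);  // assumed topic
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(partition));

            // Re-read everything still within the retention window...
            consumer.seekToBeginning(Collections.singletonList(partition));
            // ...or start from an arbitrary stored offset instead:
            // consumer.seek(partition, 42_000L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```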
Regarding Camus, ggupta1612 answered this aspect best:
A scheduler that launches the job should work. What they use at LinkedIn is Azkaban, you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
Source: https://stackoverflow.com/questions/17255714/kafka-storm-hdfs-s3-data-flow