Streaming Data Processing and nanosecond time resolution


Question


I'm just getting started with real-time stream data processing frameworks, and I have a question to which I have not yet found a conclusive answer:

Do the usual suspects (Apache Spark, Kafka, Storm, Flink, etc.) support processing data with an event time resolution of nanoseconds (or even picoseconds)?

Most people and most documentation talk about millisecond or microsecond resolution, but I was unable to find a definite answer on whether higher resolution is possible or a problem. The only framework I infer to have this capability is InfluxData's Kapacitor, since their TSDB InfluxDB seems to store timestamps at nanosecond resolution.

Can anybody here offer some insight, or even some informed facts, on this? Are there alternative solutions/frameworks offering this capability?

Anything would be much appreciated!

Thanks and regards,

Simon


Background of my question: I'm working in an environment with quite a number of proprietary implementations for data storage and processing, and I am currently thinking about some reorganization/optimization. We are doing plasma physics experiments with a lot of different diagnostic/measurement systems at various sampling rates, now reaching above a gigasample per second. The one common fact/assumption across our systems is that each sample has a recorded event time at nanosecond resolution. When trying to employ an established stream (or batch) processing framework, we would have to keep this timestamp resolution, or even go further, as we recently breached the 1 Gsps threshold with some systems. Hence my question.


Answer 1:


In case this is not clear, you should be aware of the difference between event time and processing time:

event time - time of generation of the event at the source

processing time - time of event execution within processing engine

src: Flink docs
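For instance, in the Flink 1.x API that was current when this answer was written, event time had to be selected explicitly on the execution environment. A minimal sketch:

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Measure progress by timestamps embedded in the events rather
        // than by the wall clock of the processing machines.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    }
}
```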

AFAIK Storm doesn't support event time and Spark has limited support. That leaves Kafka Streams and Flink for consideration.

Flink uses the long type for timestamps. The docs mention that this value represents milliseconds since 1970-01-01T00:00:00Z, but AFAIK, when you use the event time characteristic, the only measure of progress is the event timestamps themselves. So, if you can fit your values into the long range, it should be doable.
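To make the "it's just longs" point concrete, here is a minimal, hypothetical sketch using the Flink 1.x AssignerWithPeriodicWatermarks interface (explained in the edit below) that interprets the timestamp long as nanoseconds since the epoch. The Sample type and the out-of-orderness bound are assumptions for illustration:

```java
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Hypothetical event type carrying a nanosecond-resolution event time.
class Sample {
    long nanosSinceEpoch;
}

// Flink never inspects the unit of the timestamp long; it only compares
// longs, so nanoseconds work as long as the values fit (nanoseconds
// since 1970 overflow a signed long in the year 2262).
class NanoTimestampAssigner implements AssignerWithPeriodicWatermarks<Sample> {
    private static final long MAX_OUT_OF_ORDER_NANOS = 1_000_000L; // 1 ms, assumed
    // Offset the initial value so the first watermark does not underflow.
    private long maxSeenTimestamp = Long.MIN_VALUE + MAX_OUT_OF_ORDER_NANOS;

    @Override
    public long extractTimestamp(Sample element, long previousElementTimestamp) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, element.nanosSinceEpoch);
        return element.nanosSinceEpoch;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // The watermark is also a plain long in the same (nanosecond) unit.
        return new Watermark(maxSeenTimestamp - MAX_OUT_OF_ORDER_NANOS);
    }
}
```

You would attach it with stream.assignTimestampsAndWatermarks(new NanoTimestampAssigner()). One caveat: any API that takes a duration (e.g., window sizes) would then also have to be expressed in the nanosecond unit, much like the workaround described in the second answer below.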

edit:

In general, watermarks (based on timestamps) are used to measure the progress of event time in windows, triggers, etc. So, if you use:

  • AssignerWithPeriodicWatermarks, then a new watermark is emitted at intervals defined in the config (auto-watermark interval) in the processing time domain, even when the event time characteristic is used. For details see, e.g., the org.apache.flink.streaming.runtime.operators.TimestampsAndPeriodicWatermarksOperator#open() method, where a timer in processing time is registered. So, if the auto-watermark interval is set to 500 ms, then every 500 ms of processing time (as taken from System.currentTimeMillis()) a new watermark is emitted, but the timestamp of the watermark is based on the timestamps of the events.

  • AssignerWithPunctuatedWatermarks, then the best description can be found in the docs for org.apache.flink.streaming.api.datastream.DataStream#assignTimestampsAndWatermarks(org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks<T>), quoted below (a sketch follows the quoted docs):

Assigns timestamps to the elements in the data stream and creates watermarks to signal event time progress based on the elements themselves.

This method creates watermarks based purely on stream elements. For each element that is handled via AssignerWithPunctuatedWatermarks#extractTimestamp(Object, long), the AssignerWithPunctuatedWatermarks#checkAndGetNextWatermark(Object, long) method is called, and a new watermark is emitted, if the returned watermark value is non-negative and greater than the previous watermark.

This method is useful when the data stream embeds watermark elements, or certain elements carry a marker that can be used to determine the current event time watermark. This operation gives the programmer full control over the watermark generation. Users should be aware that too aggressive watermark generation (i.e., generating hundreds of watermarks every second) can cost some performance.
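A minimal sketch of such a punctuated assigner, again using the Flink 1.x interface and nanosecond-valued longs; the MarkedSample type and its endOfBatch flag are hypothetical stand-ins for whatever marker your stream carries:

```java
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Hypothetical event type: some elements carry an end-of-batch marker.
class MarkedSample {
    long nanosSinceEpoch;
    boolean endOfBatch;
}

// Emits a watermark only when an element is flagged as an end-of-batch
// marker; all other elements just contribute their timestamp.
class PunctuatedNanoAssigner
        implements AssignerWithPunctuatedWatermarks<MarkedSample> {

    @Override
    public long extractTimestamp(MarkedSample element, long previousElementTimestamp) {
        return element.nanosSinceEpoch;
    }

    @Override
    public Watermark checkAndGetNextWatermark(MarkedSample lastElement,
                                              long extractedTimestamp) {
        // Returning null means "no watermark for this element".
        return lastElement.endOfBatch ? new Watermark(extractedTimestamp) : null;
    }
}
```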

To understand how watermarks work, this read is highly recommended: Tyler Akidau on Streaming 102




Answer 2:


While Kafka Streams uses millisecond resolution, the runtime is actually largely agnostic about the unit. In the end, timestamps are just longs.

Having said this, the "problem" is the definition of time windows. If you specify a time window of 1 minute but your timestamps have a finer resolution than milliseconds, your window would cover less than 1 minute of event time. As a workaround, you can make the window correspondingly larger, e.g., 1,000 minutes for microsecond resolution or 1,000,000 minutes for nanosecond resolution (a sketch follows below).
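A hypothetical sketch of this workaround: a custom TimestampExtractor feeds nanosecond-valued longs into the runtime, and the window size is scaled by the same factor. The NanoSample payload type is an assumption; TimestampExtractor and TimeWindows are the real Kafka Streams APIs:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Hypothetical payload carrying nanoseconds since the epoch.
class NanoSample {
    long nanosSinceEpoch;
}

// Supplies the payload's nanosecond timestamp to the runtime in place of
// the broker-assigned millisecond timestamp; registered via the
// default.timestamp.extractor Streams config.
class NanoTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        return ((NanoSample) record.value()).nanosSinceEpoch;
    }
}

class WindowDefinitions {
    // A window meant to span 1 minute of event time must be declared
    // 1,000,000 times larger, because the runtime compares the window
    // size (a long, nominally milliseconds) against nanosecond-valued
    // timestamps: 1 minute = 6e10 ns, and Duration.ofMinutes(1_000_000)
    // yields exactly 6e10 "milliseconds".
    static final TimeWindows ONE_MINUTE_IN_NANOS =
            TimeWindows.of(Duration.ofMinutes(1_000_000L));
}
```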

Another "problem" is, that brokers only understand milli-seconds resolution and that retention time is bases on this. Thus, you would need to set retention time much higher to "trick" the broker and avoid it deletes data too early.



Source: https://stackoverflow.com/questions/54402759/streaming-data-processing-and-nano-second-time-resolution
