flink-streaming

Can I use a custom partitioner with group by?

Submitted by 喜欢而已 on 2019-12-13 02:44:10
Question: Let's say I know that my dataset is unbalanced and I know the distribution of the keys. I'd like to leverage this to write a custom partitioner and get the most out of the operator instances. I know about DataStream#partitionCustom. However, if my stream is keyed, will it still work properly? My job would look something like:

    KeyedDataStream afterCustomPartition = keyedStream.partitionCustom(new MyPartitioner(), new MyPartitionKeySelector())
    DataStreamUtils.reinterpretAsKeyedStream
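An illustration of the pattern (a sketch only; MyPartitioner and the hot key are hypothetical). Note that partitionCustom returns a plain DataStream rather than a KeyedStream, and reinterpretAsKeyedStream is only safe when the custom partitioning reproduces exactly the distribution that keyBy would produce, which a hand-written skew-aware partitioner generally does not:

    import org.apache.flink.api.common.functions.Partitioner
    import org.apache.flink.streaming.api.scala._

    // Hypothetical skew-aware partitioner: pins a known hot key to its own
    // channel and spreads all other keys over the remaining ones.
    class MyPartitioner extends Partitioner[String] {
      override def partition(key: String, numPartitions: Int): Int =
        if (key == "hot-key" || numPartitions == 1) 0
        else 1 + Math.floorMod(key.hashCode, numPartitions - 1)
    }

    object CustomPartitionJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val events: DataStream[(String, Long)] =
          env.fromElements(("hot-key", 1L), ("other", 2L))

        // The result is a DataStream, not a KeyedStream, so keyed state and
        // windows are not directly available downstream.
        val partitioned = events.partitionCustom(new MyPartitioner, _._1)
        partitioned.print()
        env.execute("custom partitioning sketch")
      }
    }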

Read data from Redis to Flink

Submitted by 帅比萌擦擦* on 2019-12-12 23:16:04
Question: I have been trying to find a connector to read data from Redis into Flink. Flink's documentation describes a connector for writing to Redis, but I need to read data from Redis in my Flink job. In "Using Apache Flink for data streaming", Fabian mentioned that it is possible to read data from Redis. Which connector can be used for this purpose?

Answer 1: We are running one in production that looks roughly like this:

    class RedisSource extends RichSourceFunction[SomeDataType] {
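The answer's snippet is cut off above; below is a minimal sketch of how such a source might continue, assuming the Jedis client, a list-based queue, and a hypothetical SomeDataType (none of which come from the original thread):

    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction
    import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
    import redis.clients.jedis.Jedis

    case class SomeDataType(value: String)

    class RedisSource extends RichSourceFunction[SomeDataType] {
      @volatile private var running = true
      @transient private var jedis: Jedis = _

      // One connection per parallel source instance.
      override def open(parameters: Configuration): Unit =
        jedis = new Jedis("localhost", 6379)

      override def run(ctx: SourceContext[SomeDataType]): Unit =
        while (running) {
          // BLPOP blocks (here up to 1s) so we do not busy-spin when the
          // list is empty; it returns [key, value] or null on timeout.
          val res = jedis.blpop(1, "input-list")
          if (res != null && res.size() == 2)
            ctx.getCheckpointLock.synchronized {
              ctx.collect(SomeDataType(res.get(1)))
            }
        }

      override def cancel(): Unit = { running = false }

      override def close(): Unit = if (jedis != null) jedis.close()
    }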

Enriching DataStream using static DataSet in Flink streaming

Submitted by ⅰ亾dé卋堺 on 2019-12-12 19:06:08
Question: I am writing a Flink streaming program in which I need to enrich a DataStream of user events using a static data set (an information base, IB). For example, let's say we have a static data set of buyers and an incoming clickstream of events; for each event we want to add a boolean flag indicating whether the doer of the event is a buyer or not. An ideal way to achieve this would be to partition the incoming stream by user id and have the buyers set available in a DataSet partitioned again by
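One common way to do this (a sketch, not from the thread; ClickEvent and the loader are hypothetical) is to load the static set once per parallel instance in open() of a rich function and flag each event as it flows past:

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.scala._

    case class ClickEvent(userId: String, url: String)
    case class EnrichedClick(event: ClickEvent, isBuyer: Boolean)

    class BuyerEnricher extends RichMapFunction[ClickEvent, EnrichedClick] {
      @transient private var buyers: Set[String] = _

      override def open(parameters: Configuration): Unit = {
        // Placeholder loader: in practice read the buyer set from a file,
        // a database, or ship it in via broadcast state.
        buyers = loadBuyerIds()
      }

      override def map(e: ClickEvent): EnrichedClick =
        EnrichedClick(e, buyers.contains(e.userId))

      private def loadBuyerIds(): Set[String] = Set("u1", "u2")
    }

Keying the stream by user id first (clicks.keyBy(_.userId).map(new BuyerEnricher)) keeps each user's events on one instance, though with a plain in-memory set every instance holds the full copy anyway.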

What's the difference between a watermark and a trigger in Flink?

Submitted by 你。 on 2019-12-12 18:27:27
Question: I read that "...The ordering operator has to buffer all elements it receives. Then, when it receives a watermark it can sort all elements that have a timestamp that is lower than the watermark and emit them in the sorted order. This is correct because the watermark signals that no more elements can arrive that would be intermixed with the sorted elements..." (https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams). Hence, it seems that the watermark serves as a signal to
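To contrast the two concretely, here is a minimal event-time sketch (the (key, timestamp) tuples are invented): the watermark asserts how far event time has progressed, while the trigger decides when a window's buffered contents are emitted. The EventTimeTrigger below, which fires when the watermark passes the end of the window, is spelled out even though it is already the default for event-time windows:

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger

    object WatermarkVsTrigger {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        env.fromElements(("a", 1000L), ("a", 2000L), ("b", 12000L))
          // Watermark: "no element with a timestamp more than 5s behind
          // the largest one seen so far will arrive anymore."
          .assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor[(String, Long)](Time.seconds(5)) {
              override def extractTimestamp(e: (String, Long)): Long = e._2
            })
          .keyBy(_._1)
          .window(TumblingEventTimeWindows.of(Time.seconds(10)))
          // Trigger: decides when to emit; this one fires on the watermark.
          .trigger(EventTimeTrigger.create())
          .sum(1)
          .print()

        env.execute("watermark vs trigger sketch")
      }
    }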

Flink exactly-once message processing

Submitted by 前提是你 on 2019-12-12 18:17:27
Question: I've set up a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers, and I'm using JMeter to load-test it by producing Kafka messages/events which are then processed. The processing job runs on a TaskManager and usually handles ~15K events/s. The job is configured with EXACTLY_ONCE checkpointing and persists state and checkpoints to Amazon S3. If I shut down the TaskManager running the job, it takes a few seconds and then the job is resumed on a different TaskManager. The job mainly
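For reference, a sketch of the setup the question describes, with a placeholder S3 path and an invented 10-second interval (the question does not state the actual one):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend
    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala._

    object CheckpointedJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Snapshot state every 10s with exactly-once semantics for the
        // job's internal state.
        env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE)

        // Persist checkpoints to S3 (placeholder bucket/path).
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"))

        env.fromElements(1, 2, 3).print() // stand-in for the real pipeline
        env.execute("checkpointed job")
      }
    }

Note that EXACTLY_ONCE here covers Flink's internal state; end-to-end exactly-once delivery additionally needs a replayable source (Kafka qualifies) and a transactional or idempotent sink.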

Confused about Flink task slots

Submitted by 廉价感情. on 2019-12-12 11:07:25
Question: I know a task manager can have several task slots. But what is a task slot? A JVM process, an object in memory, or a thread?

Answer 1: The answer might come late, but: a TaskManager (TM) is a JVM process, whereas a task slot (TS) is a thread within the respective JVM process (TM). The managed memory of a TM is split up equally between the task slots within it. No CPU isolation happens between the slots; only the managed memory is divided. Moreover, task slots in the same TM share TCP connections (via
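For completeness, the number of slots a TaskManager offers is a plain configuration knob (the value below is illustrative):

    # flink-conf.yaml: each TaskManager JVM offers this many slots, i.e.
    # this many parallel pipelines can run in it, sharing its managed memory.
    taskmanager.numberOfTaskSlots: 4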

ClassNotFoundException: org.apache.flink.streaming.api.checkpoint.CheckpointNotifier while consuming a Kafka topic

Submitted by 强颜欢笑 on 2019-12-12 10:56:03
Question: I am using the latest Flink-1.1.2-Hadoop-27 and flink-connector-kafka-0.10.2-hadoop1 jars. The Flink consumer is as below:

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    if (properties == null) {
        properties = new Properties();
        InputStream props = Resources.getResource(KAFKA_CONFIGURATION_FILE).openStream();
        properties.load(props);
    }
    DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer082<>(KAFKA_SIP_TOPIC, new SimpleStringSchema(), properties));
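A guess at the root cause rather than a confirmed fix: flink-connector-kafka-0.10.2 is the connector from Flink 0.10.2, whose CheckpointNotifier interface no longer exists in Flink 1.x, hence the ClassNotFoundException on a 1.1.2 cluster. The connector artifact would have to match the cluster's Flink version, e.g. in Maven:

    <!-- Kafka 0.8 connector built for Flink 1.1.2; the consumer class is
         then FlinkKafkaConsumer08 rather than FlinkKafkaConsumer082. -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka-0.8_2.10</artifactId>
      <version>1.1.2</version>
    </dependency>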

How to stop a Flink streaming job from the program

Submitted by 让人想犯罪 __ on 2019-12-12 08:36:41
Question: I am trying to create a JUnit test for a Flink streaming job which writes data to a Kafka topic and reads data from the same Kafka topic, using FlinkKafkaProducer09 and FlinkKafkaConsumer09 respectively. I pass test data to the producer:

    DataStream<String> stream = env.fromElements("tom", "jerry", "bill");

and check whether the same data comes back from the consumer:

    List<String> expected = Arrays.asList("tom", "jerry", "bill");
    List<String> result = resultSink.getResult();
    assertEquals
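One way to make such a job terminate from a test (a sketch, not from the thread; the marker exception and sink are invented) is to have the verifying sink fail the job on purpose once it has seen everything, and to treat that failure as success around env.execute():

    import org.apache.flink.streaming.api.functions.sink.SinkFunction
    import org.apache.flink.streaming.api.scala._

    // Invented marker: thrown by the test sink once it has seen all the
    // records it expects, failing the job on purpose so control returns
    // to the test. Reliable with sink parallelism 1.
    class SuccessException extends RuntimeException

    class FinishingSink(expected: Int) extends SinkFunction[String] {
      private var seen = 0
      override def invoke(value: String): Unit = {
        seen += 1
        if (seen >= expected) throw new SuccessException
      }
    }

    object TestRunner {
      // Run the job and treat the marker exception as a clean shutdown.
      def runUntilDone(env: StreamExecutionEnvironment): Unit =
        try env.execute("test job")
        catch {
          case e: Exception if causedBySuccess(e) => () // expected
        }

      private def causedBySuccess(t: Throwable): Boolean =
        t != null && (t.isInstanceOf[SuccessException] || causedBySuccess(t.getCause))
    }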

Flink throwing serialization error when reading from HBase

Submitted by 非 Y 不嫁゛ on 2019-12-12 05:08:58
Question: When I read from HBase using a RichFlatMapFunction inside a map, I get a serialization error. What I am trying to do: if a datastream element equals a particular string, read from HBase; otherwise ignore it. Below is the sample program and the error I am getting.

    package com.abb.Flinktest

    import java.text.SimpleDateFormat
    import java.util.Properties

    import scala.collection.concurrent.TrieMap

    import org.apache.flink.addons.hbase.TableInputFormat
    import org.apache.flink.api.common.functions.RichFlatMapFunction
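The usual cause of this error is a non-serializable field (the HBase connection or table) being captured when Flink serializes the function. The standard remedy, sketched here with ordinary HBase client calls and a hypothetical lookup, is to mark those fields @transient and create them in open(), which runs on the worker after deserialization:

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.util.Collector
    import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Table}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

    class HBaseLookup(tableName: String) extends RichFlatMapFunction[String, String] {
      // @transient: never serialized with the function; built per task instead.
      @transient private var connection: Connection = _
      @transient private var table: Table = _

      override def open(parameters: Configuration): Unit = {
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        table = connection.getTable(TableName.valueOf(tableName))
      }

      override def flatMap(key: String, out: Collector[String]): Unit = {
        val result = table.get(new Get(Bytes.toBytes(key)))
        if (!result.isEmpty)
          out.collect(Bytes.toString(result.value())) // first cell's value
      }

      override def close(): Unit = {
        if (table != null) table.close()
        if (connection != null) connection.close()
      }
    }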

Flink with Kafka Consumer doesn't work

Submitted by 大城市里の小女人 on 2019-12-12 05:08:30
Question: I want to benchmark Spark vs. Flink, and for this purpose I am running several tests. However, Flink doesn't work with Kafka, while with Spark it works perfectly. The code is very simple:

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "myGroup")
    println("topic: " + args(0))
    val stream = env.addSource(new FlinkKafkaConsumer09[String]
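Completing the truncated snippet as a minimal runnable job (a sketch; the topic name comes from args as in the question, and the schema import is the one used by the FlinkKafkaConsumer09-era API). A frequent reason a Flink job appears to "not work" is a missing env.execute() call, since nothing is submitted without it:

    import java.util.Properties

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema

    object KafkaBenchmark {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val properties = new Properties()
        properties.setProperty("bootstrap.servers", "localhost:9092")
        properties.setProperty("group.id", "myGroup")

        val stream = env.addSource(
          new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
        stream.print()

        // Builds and submits the job graph; without this call the consumer
        // never starts and the program exits silently.
        env.execute("kafka benchmark")
      }
    }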