apache-flink

How to check whether a DataStream in Flink is empty or has data

╄→гoц情女王★ submitted on 2021-01-29 10:33:01
Question: I am new to Apache Flink. I have a DataStream that goes through a process function; if certain conditions are met the data is valid, and if the conditions are not met I write it to a side output. I am able to print the DataStream. Is it possible to check whether the DataStream is empty or null? I tried the datastream.equals(null) method but it does not work. Please suggest how to tell whether a DataStream is empty or not. Answer 1: By "empty", I assume you mean that no data is flowing. What are
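
A minimal sketch of the routing the question describes, i.e. a ProcessFunction that keeps valid records on the main stream and sends everything else to a side output. "Event", isValid(...) and the stream variable names are placeholders, not code from the question:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    // Inside the job's main method, given some DataStream<Event> input:
    final OutputTag<Event> invalidTag = new OutputTag<Event>("invalid") {};

    SingleOutputStreamOperator<Event> valid = input.process(new ProcessFunction<Event, Event>() {
        @Override
        public void processElement(Event e, Context ctx, Collector<Event> out) {
            if (isValid(e)) {
                out.collect(e);            // valid records stay on the main stream
            } else {
                ctx.output(invalidTag, e); // everything else goes to the side output
            }
        }
    });

    DataStream<Event> invalid = valid.getSideOutput(invalidTag);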

About StateTtlConfig

假装没事ソ submitted on 2021-01-29 10:02:24
Question: I'm configuring a StateTtlConfig for a MapState. What I want is for the objects in the state to have, for example, 3 hours of life and then disappear from the state, be handed to the GC to be cleaned up and release some memory, and the checkpoints should lose some weight too, I think. I had this configuration before and it does not seem to be working, because the checkpoints kept growing: private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api
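
A minimal sketch of a 3-hour TTL attached to a MapState descriptor, assuming a String key and a hypothetical MyValue type. Depending on the Flink version, expired entries may only be dropped from snapshots when a cleanup strategy such as cleanupFullSnapshot() is configured, which would be consistent with checkpoints that keep growing:

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.time.Time;

    StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.hours(3))                                  // entries live for 3 hours
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)  // TTL refreshed on writes
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .cleanupFullSnapshot()                                      // drop expired entries from full snapshots
            .build();

    MapStateDescriptor<String, MyValue> descriptor =
            new MapStateDescriptor<>("myMapState", String.class, MyValue.class);
    descriptor.enableTimeToLive(ttlConfig);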

Apache Flink - Partitioning the stream equally as the input Kafka topic

余生颓废 submitted on 2021-01-29 09:46:30
Question: I would like to implement the following scenario in Apache Flink: given a Kafka topic with 4 partitions, I would like to process the intra-partition data independently in Flink using different logic, depending on the event's type. In particular, suppose the input Kafka topic contains the events depicted in the previous images. Each event has a different structure: partition 1 has the field "a" as key, partition 2 has the field "b" as key, etc. In Flink I would like to apply different
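
One simple way to apply different logic per event type is to split the consumed stream on the field that identifies each type. This is only a sketch: "Event", hasField(...), LogicForA and LogicForB are placeholders, while the field names "a" and "b" come from the question:

    import org.apache.flink.streaming.api.datastream.DataStream;

    // events is the DataStream<Event> read from the Kafka topic
    DataStream<Event> typeA = events.filter(e -> e.hasField("a"));  // partition-1 style events
    DataStream<Event> typeB = events.filter(e -> e.hasField("b"));  // partition-2 style events

    typeA.map(new LogicForA()).print();  // replace print() with the real per-type pipeline
    typeB.map(new LogicForB()).print();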

Flink: what's the best way to handle exceptions inside Flink jobs

China☆狼群 submitted on 2021-01-29 09:27:15
Question: I have a Flink job that takes in Kafka topics and goes through a bunch of operators. I'm wondering what the best way is to deal with exceptions that happen in the middle. My goal is to have a centralized place to handle those exceptions that may be thrown from different operators, and here is my current solution: use a ProcessFunction and, in the catch block, output to a side output via the context whenever there is an exception, and have a separate sink function for the side output at the end where it calls
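
A sketch of the pattern the question describes: catch inside processElement, push a description of the failure to a side output, and attach a single sink to that side output. "Input", "Output", transform(...) and ErrorSink are placeholders:

    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    final OutputTag<String> errorTag = new OutputTag<String>("errors") {};

    SingleOutputStreamOperator<Output> result = source.process(new ProcessFunction<Input, Output>() {
        @Override
        public void processElement(Input value, Context ctx, Collector<Output> out) {
            try {
                out.collect(transform(value));
            } catch (Exception e) {
                // Keep the job running and route the failure to the side output instead.
                ctx.output(errorTag, value + " failed: " + e.getMessage());
            }
        }
    });

    result.getSideOutput(errorTag).addSink(new ErrorSink());  // one central place for all captured errors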

Unable to execute HTTP request: Timeout waiting for connection from pool in Flink

≯℡__Kan透↙ submitted on 2021-01-29 09:02:26
Question: I'm working on an app which uploads some files to an S3 bucket and, at a later point, reads files from the S3 bucket and pushes them to my database. I'm using Flink 1.4.2 and the fs.s3a API for reading and writing files from the S3 bucket. Uploading files to the S3 bucket works fine without any problem, but when the second phase of my app, which reads those uploaded files from S3, starts, my app throws the following error: Caused by: java.io.InterruptedIOException: Reopen at position 0 on s3a:/
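
"Timeout waiting for connection from pool" usually means the s3a client's HTTP connection pool is exhausted, often because S3 input streams are opened faster than they are closed. A hedged sketch of raising the pool limits via the Hadoop s3a settings (the property names are standard Hadoop s3a options; the values are only examples, and closing every stream is still required):

    <!-- core-site.xml (or wherever the Hadoop configuration used by Flink lives) -->
    <property>
      <name>fs.s3a.connection.maximum</name>
      <value>100</value>
    </property>
    <property>
      <name>fs.s3a.threads.max</name>
      <value>20</value>
    </property>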

Issue with job submission from Flink Job UI (Exception:org.apache.flink.client.program.OptimizerPlanEnvironment$ProgramAbortException)

强颜欢笑 submitted on 2021-01-29 08:01:43
Question: I have simple Java code for a Flink job: List<Tuple2> list = new ArrayList<>(); for (int i = 0; i < 10; i++) { list.add(new Tuple2(Integer.valueOf(i), "test" + i)); } StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.fromCollection(list).print(); env.execute("job1"); I packaged this code and created a jar, say flink-processor-0.1-SNAPSHOT.jar, and uploaded it to the JobManager from the Submit Job UI. No issues with the upload. I see the EntryClass has the main class (com.abc
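
For reference, a self-contained version of the snippet from the question, with explicit Tuple2<Integer, String> type parameters; the class name is a placeholder:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class Job1 {
        public static void main(String[] args) throws Exception {
            List<Tuple2<Integer, String>> list = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                list.add(new Tuple2<>(i, "test" + i));
            }
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromCollection(list).print();
            env.execute("job1");
        }
    }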

How to configure external jar libraries for the Flink Docker container

纵然是瞬间 submitted on 2021-01-29 07:59:20
Question: I am running a Flink Docker image with the following configuration:

    version: '2.1'
    services:
      jobmanager:
        build: .
        image: flink
        volumes:
          - .:/usr/local/lib/python3.7/site-packages/pyflink/lib
        hostname: "jobmanager"
        expose:
          - "6123"
        ports:
          - "8081:8081"
        command: jobmanager
        environment:
          - JOB_MANAGER_RPC_ADDRESS=jobmanager
      taskmanager:
        image: flink
        volumes:
          - .:/usr/local/lib/python3.7/site-packages/pyflink/lib
        expose:
          - "6121"
          - "6122"
        depends_on:
          - jobmanager
        command: taskmanager
        links:
          -
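
A common way to make external jars visible to the Flink processes (assuming the stock flink image, which keeps its libraries under /opt/flink/lib) is to mount each extra jar into that directory on both services, by adding entries to the existing volumes lists. The host path and jar name below are placeholders:

    volumes:
      - .:/usr/local/lib/python3.7/site-packages/pyflink/lib
      - ./ext-jars/my-connector.jar:/opt/flink/lib/my-connector.jar  # hypothetical external jar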

Create an input format for Elasticsearch using the Flink RichInputFormat

感情迁移 submitted on 2021-01-29 07:06:09
Question: We are using Elasticsearch 6.8.4 and Flink 1.0.18. We have an index with 1 shard and 1 replica in Elasticsearch, and I want to create a custom input format to read and write data in Elasticsearch using the Apache Flink DataSet API with more than one input split, in order to achieve better performance. So is there any way I can achieve this requirement? Note: the per-document size is large (almost 8 MB) and I can read only 10 documents at a time because of the size constraint, and per read request, we
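
A skeleton of a custom input format that exposes several input splits to the DataSet API, one per Elasticsearch scroll slice. This is only a sketch: the class name and the use of JSON strings as the record type are assumptions, and the Elasticsearch sliced-scroll calls are left as comments:

    import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
    import org.apache.flink.api.common.io.RichInputFormat;
    import org.apache.flink.api.common.io.statistics.BaseStatistics;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.io.GenericInputSplit;
    import org.apache.flink.core.io.InputSplitAssigner;

    public class ElasticsearchInputFormat extends RichInputFormat<String, GenericInputSplit> {

        private final int numSlices;                              // parallel Elasticsearch scroll slices
        private transient java.util.Iterator<String> currentPage; // documents of the current scroll page

        public ElasticsearchInputFormat(int numSlices) {
            this.numSlices = numSlices;
        }

        @Override
        public void configure(Configuration parameters) { }

        @Override
        public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
            return cachedStatistics;
        }

        @Override
        public GenericInputSplit[] createInputSplits(int minNumSplits) {
            int n = Math.max(minNumSplits, numSlices);
            GenericInputSplit[] splits = new GenericInputSplit[n];
            for (int i = 0; i < n; i++) {
                splits[i] = new GenericInputSplit(i, n);
            }
            return splits;
        }

        @Override
        public InputSplitAssigner getInputSplitAssigner(GenericInputSplit[] splits) {
            return new DefaultInputSplitAssigner(splits);
        }

        @Override
        public void open(GenericInputSplit split) {
            // Start a sliced scroll for slice id = split.getSplitNumber() out of
            // split.getTotalNumberOfSplits(), with a small page size (e.g. 10) because of
            // the ~8 MB documents, and remember the scroll id for paging.
        }

        @Override
        public boolean reachedEnd() {
            // Fetch the next scroll page here when the current one is exhausted.
            return currentPage == null || !currentPage.hasNext();
        }

        @Override
        public String nextRecord(String reuse) {
            return currentPage.next();
        }

        @Override
        public void close() {
            // Clear the scroll and close the Elasticsearch client.
        }
    }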

Apache Flink Mapping at Runtime

霸气de小男生 submitted on 2021-01-29 07:01:17
Question: I have built a Flink streaming job that reads an XML file from Kafka, converts the file, and writes it to a database. As the attributes in the XML file don't match the database column names, I built a switch case for the mapping. As this is not really flexible, I want to take this hard-wired mapping information out of the code. First of all I came up with the idea of a mapping file which could look like this: path.in.xml.to.attribut=database.column.name The current job logic looks like this: switch
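
A sketch of replacing the hard-wired switch with a lookup table loaded from such a properties file. The file name mapping.properties is a placeholder; in a Flink job this would typically run once in the open() method of a rich function:

    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    // Load "xml path -> database column" pairs from the mapping file.
    Properties props = new Properties();
    try (InputStream in = getClass().getResourceAsStream("/mapping.properties")) {
        props.load(in);
    }
    Map<String, String> xmlPathToColumn = new HashMap<>();
    for (String key : props.stringPropertyNames()) {
        xmlPathToColumn.put(key, props.getProperty(key));
    }

    // At runtime, one lookup replaces one branch of the switch:
    String column = xmlPathToColumn.get("path.in.xml.to.attribut");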
