apache-beam-io

How to infer an Avro schema from a Kafka topic in Apache Beam KafkaIO

柔情痞子 submitted on 2020-07-03 12:59:10
Question: I'm using Apache Beam's KafkaIO to read from a topic that has an Avro schema in the Confluent schema registry. I'm able to deserialize the messages and write them to files, but ultimately I want to write to BigQuery. My pipeline isn't able to infer the schema. How do I extract/infer the schema and attach it to the data in the pipeline so that my downstream processes (writing to BigQuery) can infer the schema? Here is the code where I use the schema registry URL to set the deserializer and where I read
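A minimal Java sketch of one common approach, assuming hypothetical broker/registry endpoints and subject name (none of this is the asker's actual code): KafkaIO can take a ConfluentSchemaRegistryDeserializerProvider so values arrive as Avro GenericRecords whose schema comes from the registry; a later step would map each GenericRecord to a TableRow and hand BigQueryIO a TableSchema derived from the same Avro schema.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.ConfluentSchemaRegistryDeserializerProvider;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaAvroToBigQuerySketch {
  public static void main(String[] args) {
    // Placeholder endpoints and subject -- substitute your own.
    String registryUrl = "http://my-schema-registry:8081";
    String subject = "my-topic-value";

    Pipeline p = Pipeline.create();
    p.apply(KafkaIO.<String, GenericRecord>read()
            .withBootstrapServers("my-broker:9092")
            .withTopic("my-topic")
            .withKeyDeserializer(StringDeserializer.class)
            // Fetches the Avro schema from the Confluent registry and deserializes
            // each value into a GenericRecord.
            .withValueDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(registryUrl, subject))
            .withoutMetadata())
        // From here, map GenericRecord -> TableRow and write with BigQueryIO,
        // passing a TableSchema built from the same Avro schema. Shown as a
        // placeholder toString() mapping only.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, GenericRecord> kv) -> kv.getValue().toString()));
    p.run();
  }
}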

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read()

白昼怎懂夜的黑 submitted on 2020-06-29 04:20:09
Question: I am reading a PCollection mongodata from MongoDB and using this PCollection as a side input to my ParDo(DoFN).withSideInputs(PCollection). On the backend, my MongoDB collection is updated on a daily, monthly, or perhaps yearly basis, and I need those newly added values in my pipeline. You can think of this as refreshing the Mongo collection's values in a running pipeline. For example, if the Mongo collection has 20K documents in total and after one day three more records are added to the Mongo collection
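A sketch of the "slowly updating side input" pattern from the Beam documentation, adapted to this case; the refresh interval, the ReadMongoFn body, and all names are placeholders, and the actual Mongo query is left as a comment. A periodic GenerateSequence tick re-reads the collection, and each firing replaces the snapshot seen by the consuming ParDo.

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class RefreshingMongoSideInputSketch {

  /** Placeholder DoFn: on every tick, re-query MongoDB and emit the full snapshot. */
  static class ReadMongoFn extends DoFn<Long, List<String>> {
    @ProcessElement
    public void process(ProcessContext c) {
      List<String> docs = new ArrayList<>();
      // Query the collection here with the MongoDB Java driver, e.g.
      // docs.add(document.toJson());
      c.output(docs);
    }
  }

  public static PCollectionView<List<String>> buildSideInput(Pipeline p) {
    return p
        // One tick per day; shorten the period for hourly refreshes.
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new ReadMongoFn()))
        // Each firing replaces the previous snapshot visible to the side input.
        .apply(View.asSingleton());
  }
}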

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read() Part 2

回眸只為那壹抹淺笑 submitted on 2020-06-17 15:57:29
Question: I'm not sure how GenerateSequence would work for me, since I have to read values from Mongo periodically on an hourly or daily basis. I created a ParDo that reads MongoDB and also added Window.into(GlobalWindows) with a trigger (the trigger I will update as per requirement). But the code snippet below gives a return-type error, so could you please help me correct the lines of code below? Also find a snapshot of the error attached. Also, how does GenerateSequence help in my case? PCollectionView<List<String>> list_of
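On the return-type error: a PCollectionView<List<String>> comes either from View.asList() over a PCollection<String> or, as in the previous entry's sketch, from View.asSingleton() over a PCollection<List<String>>; if the Mongo-reading ParDo outputs a different type, the declaration will not compile. Below is a small hypothetical sketch of consuming such a view, assuming it was built as in the previous entry.

import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputConsumerSketch {
  /** Enriches the main stream with the latest Mongo snapshot emitted by the refreshing view. */
  public static PCollection<String> enrich(
      PCollection<String> main, PCollectionView<List<String>> mongoView) {
    return main.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Reads the most recently fired snapshot of the side input.
        List<String> lookup = c.sideInput(mongoView);
        c.output(c.element() + " | lookup size=" + lookup.size());
      }
    }).withSideInputs(mongoView));
  }
}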

Reading an XML file in Apache Beam using XmlIO

不羁岁月 submitted on 2020-06-17 09:45:14
Question: Problem statement: I am trying to read and print the contents of an XML file in Beam using the direct runner. Here is the code snippet: public class BookStore{ public static void main (String args[]){ BookOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BookOptions.class); Pipeline pipeline = Pipeline.create(options); PCollection<Book> output = pipeline.apply(XmlIO.<Book>read().from("sample.xml") .withRootElement("book") .withRecordElement("name") .withRecordClass(Book
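A sketch of what a working read might look like, assuming the file is shaped like <bookstore><book><name>...</name></book>...</bookstore> (the element names and the Book class here are illustrative): withRootElement names the element that wraps the whole document and withRecordElement names the repeated per-record element, which appears to be swapped in the snippet above, and the record class must be JAXB-annotated.

import javax.xml.bind.annotation.XmlRootElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.xml.XmlIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class BookStoreSketch {

  /** JAXB-annotated record type; field names must match the XML child elements. */
  @XmlRootElement(name = "book")
  public static class Book {
    public String name;
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        .apply(XmlIO.<Book>read()
            .from("sample.xml")
            .withRootElement("bookstore")   // the element wrapping the whole document
            .withRecordElement("book")      // the repeated element, one per record
            .withRecordClass(Book.class))
        .apply(ParDo.of(new DoFn<Book, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Prints to the console when run on the direct runner.
            System.out.println("Book: " + c.element().name);
          }
        }));
    pipeline.run().waitUntilFinish();
  }
}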

Why are increments not supported in the Dataflow-BigTable connector?

狂风中的少年 submitted on 2020-05-13 08:14:32
Question: We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like #items finished processing), for which we need the increment operation. Looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I know

Writing TFRecords in apache_beam with Java

牧云@^-^@ submitted on 2020-04-18 01:06:00
Question: How can I write the following code in Java? If I have a list of records/dicts in Java, how can I write the Beam code to write them to TFRecords in which tf.train.Examples are serialized? There are lots of examples of doing this with Python; below is one example in Python. How can I write the same logic in Java? import tensorflow as tf import apache_beam as beam from apache_beam.runners.interactive import interactive_runner from apache_beam.coders import ProtoCoder class Foo(beam.DoFn): def process
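A rough Java equivalent, as a sketch only: it assumes the TensorFlow Example protos (org.tensorflow.example.*) are available on the classpath (for instance via a TensorFlow proto artifact), and the feature name, sample records, and output path are placeholders. The idea is to serialize each record into a tf.train.Example yourself and hand the resulting byte[] elements to TFRecordIO.

import com.google.protobuf.ByteString;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.tensorflow.example.BytesList;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

public class WriteTfRecordsSketch {
  // Builds a tf.train.Example with a single bytes feature and returns its serialized form.
  static byte[] toSerializedExample(String value) {
    Feature feature = Feature.newBuilder()
        .setBytesList(BytesList.newBuilder().addValue(ByteString.copyFromUtf8(value)))
        .build();
    Example example = Example.newBuilder()
        .setFeatures(Features.newBuilder().putFeature("value", feature))
        .build();
    return example.toByteArray();
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(Create.of(Arrays.asList("record-1", "record-2")))
        .apply(MapElements.into(TypeDescriptor.of(byte[].class))
            .via(WriteTfRecordsSketch::toSerializedExample))
        // TFRecordIO expects a PCollection<byte[]> of already-serialized records.
        .apply(TFRecordIO.write().to("/tmp/output/tfrecords"));
    p.run().waitUntilFinish();
  }
}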

Apache-Beam: Read parquet files from nested HDFS directories

谁说我不能喝 submitted on 2020-01-24 00:27:30
Question: How could I read all Parquet files stored in HDFS using the Apache Beam 2.13.0 Python SDK with the direct runner if the directory structure is the following: data/ ├── a │ ├── file_1.parquet │ └── file_2.parquet └── b ├── file_3.parquet └── file_4.parquet I tried beam.io.ReadFromParquet and hdfs://data/*/* : import apache_beam as beam from apache_beam.options.pipeline_options import PipelineOptions HDFS_HOSTNAME = 'my-hadoop-master-node.com' HDFS_PORT = 50070 HDFS_USER = "my-user-name" pipeline

How to solve a "Duplicate values" exception when creating a PCollectionView<Map<String,String>>

巧了我就是萌 submitted on 2020-01-23 01:39:26
Question: I'm setting up a slowly-changing lookup Map in my Apache Beam pipeline, which continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits this exception: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey Is anything wrong with this snippet of code? If I use .discardingFiredPanes() instead, I will lose information in the last emit. pipeline
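One way this is commonly resolved, shown as a sketch rather than the asker's actual design: with accumulatingFiredPanes() every pane re-contains earlier values, so View.asMap() can see the same key more than once. Collapsing to a single value per key before building the map, for example with Latest.perKey(), which keeps the element carrying the latest timestamp, avoids the duplicate-key error; View.asMultimap() is an alternative if all values per key should be retained.

import java.util.Map;
import org.apache.beam.sdk.transforms.Latest;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class LookupMapViewSketch {
  /** Builds a map view that keeps only the newest value per key across accumulated panes. */
  public static PCollectionView<Map<String, String>> build(
      PCollection<KV<String, String>> updates) {
    return updates
        .apply(Window.<KV<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .accumulatingFiredPanes())
        // Collapse duplicates per key before building the map, keeping the element
        // with the latest timestamp, so View.asMap() never sees two values per key.
        .apply(Latest.<String, String>perKey())
        .apply(View.asMap());
  }
}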