apache-beam-io

How to infer an Avro schema from a Kafka topic in Apache Beam KafkaIO

柔情痞子 submitted on 2020-07-03 12:59:10
Question: I'm using Apache Beam's KafkaIO to read from a topic that has an Avro schema in the Confluent schema registry. I'm able to deserialize the messages and write them to files, but ultimately I want to write to BigQuery. My pipeline isn't able to infer the schema. How do I extract/infer the schema and attach it to the data in the pipeline so that my downstream processes (writing to BigQuery) can infer the schema? Here is the code where I use the schema registry URL to set the deserializer and where I read
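A minimal Java sketch of one common approach, assuming hypothetical broker/registry endpoints and subject name (none of this is the asker's actual code): KafkaIO can take a ConfluentSchemaRegistryDeserializerProvider so values arrive as Avro GenericRecords whose schema comes from the registry; a later step would map each GenericRecord to a TableRow and hand BigQueryIO a TableSchema derived from the same Avro schema.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.ConfluentSchemaRegistryDeserializerProvider;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaAvroToBigQuerySketch {
  public static void main(String[] args) {
    // Placeholder endpoints and subject -- substitute your own.
    String registryUrl = "http://my-schema-registry:8081";
    String subject = "my-topic-value";

    Pipeline p = Pipeline.create();
    p.apply(KafkaIO.<String, GenericRecord>read()
            .withBootstrapServers("my-broker:9092")
            .withTopic("my-topic")
            .withKeyDeserializer(StringDeserializer.class)
            // Fetches the Avro schema from the Confluent registry and deserializes
            // each value into a GenericRecord.
            .withValueDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(registryUrl, subject))
            .withoutMetadata())
        // From here, map GenericRecord -> TableRow and write with BigQueryIO,
        // passing a TableSchema built from the same Avro schema. Shown as a
        // placeholder toString() mapping only.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, GenericRecord> kv) -> kv.getValue().toString()));
    p.run();
  }
}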

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read()

白昼怎懂夜的黑 submitted on 2020-06-29 04:20:09
Question: I am reading a PCollection mongodata from MongoDB and using this PCollection as a side input to my ParDo(DoFN).withSideInputs(PCollection). On the backend, my MongoDB collection is updated on a daily, monthly, or perhaps yearly basis, and I need those newly added values in my pipeline. You can think of this as refreshing the Mongo collection's values in a running pipeline. For example, if the Mongo collection has 20K documents in total and after one day three more records are added to the Mongo collection
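A sketch of the "slowly updating side input" pattern from the Beam documentation, adapted to this case; the refresh interval, the ReadMongoFn body, and all names are placeholders, and the actual Mongo query is left as a comment. A periodic GenerateSequence tick re-reads the collection, and each firing replaces the snapshot seen by the consuming ParDo.

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class RefreshingMongoSideInputSketch {

  /** Placeholder DoFn: on every tick, re-query MongoDB and emit the full snapshot. */
  static class ReadMongoFn extends DoFn<Long, List<String>> {
    @ProcessElement
    public void process(ProcessContext c) {
      List<String> docs = new ArrayList<>();
      // Query the collection here with the MongoDB Java driver, e.g.
      // docs.add(document.toJson());
      c.output(docs);
    }
  }

  public static PCollectionView<List<String>> buildSideInput(Pipeline p) {
    return p
        // One tick per day; shorten the period for hourly refreshes.
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new ReadMongoFn()))
        // Each firing replaces the previous snapshot visible to the side input.
        .apply(View.asSingleton());
  }
}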

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read() Part 2

回眸只為那壹抹淺笑 submitted on 2020-06-17 15:57:29
Question: I'm not sure how GenerateSequence would work for me, since I have to read values from Mongo periodically on an hourly or daily basis. I created a ParDo that reads MongoDB and also added Window.into(GlobalWindows) with a trigger (the trigger I will update as per requirement). But the code snippet below gives a return-type error, so could you please help me correct the lines of code below? Also find a snapshot of the error attached. Also, how does GenerateSequence help in my case? PCollectionView<List<String>> list_of
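On the return-type error: a PCollectionView<List<String>> comes either from View.asList() over a PCollection<String> or, as in the previous entry's sketch, from View.asSingleton() over a PCollection<List<String>>; if the Mongo-reading ParDo outputs a different type, the declaration will not compile. Below is a small hypothetical sketch of consuming such a view, assuming it was built as in the previous entry.

import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputConsumerSketch {
  /** Enriches the main stream with the latest Mongo snapshot emitted by the refreshing view. */
  public static PCollection<String> enrich(
      PCollection<String> main, PCollectionView<List<String>> mongoView) {
    return main.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Reads the most recently fired snapshot of the side input.
        List<String> lookup = c.sideInput(mongoView);
        c.output(c.element() + " | lookup size=" + lookup.size());
      }
    }).withSideInputs(mongoView));
  }
}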

Reading an XML file in Apache Beam using XmlIO

不羁岁月 submitted on 2020-06-17 09:45:14
Question: Problem statement: I am trying to read and print the contents of an XML file in Beam using the direct runner. Here is the code snippet: public class BookStore{ public static void main (String args[]){ BookOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BookOptions.class); Pipeline pipeline = Pipeline.create(options); PCollection<Book> output = pipeline.apply(XmlIO.<Book>read().from("sample.xml") .withRootElement("book") .withRecordElement("name") .withRecordClass(Book
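A sketch of what a working read might look like, assuming the file is shaped like <bookstore><book><name>...</name></book>...</bookstore> (the element names and the Book class here are illustrative): withRootElement names the element that wraps the whole document and withRecordElement names the repeated per-record element, which appears to be swapped in the snippet above, and the record class must be JAXB-annotated.

import javax.xml.bind.annotation.XmlRootElement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.xml.XmlIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class BookStoreSketch {

  /** JAXB-annotated record type; field names must match the XML child elements. */
  @XmlRootElement(name = "book")
  public static class Book {
    public String name;
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        .apply(XmlIO.<Book>read()
            .from("sample.xml")
            .withRootElement("bookstore")   // the element wrapping the whole document
            .withRecordElement("book")      // the repeated element, one per record
            .withRecordClass(Book.class))
        .apply(ParDo.of(new DoFn<Book, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // Prints to the console when run on the direct runner.
            System.out.println("Book: " + c.element().name);
          }
        }));
    pipeline.run().waitUntilFinish();
  }
}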

Why are increments not supported in the Dataflow-BigTable connector?

狂风中的少年 submitted on 2020-05-13 08:14:32
Question: We have a use case in streaming mode where we want to keep track of a counter in BigTable from the pipeline (something like #items finished processing), for which we need the increment operation. Looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once, why would supporting it be a bad idea, since I know

Writing TFRecords in apache_beam with Java

牧云@^-^@ submitted on 2020-04-18 01:06:00
Question: How can I write the following code in Java? If I have a list of records/dicts in Java, how can I write the Beam code to write them to TFRecords in which tf.train.Examples are serialized? There are lots of examples of doing this with Python; below is one example in Python. How can I write the same logic in Java? import tensorflow as tf import apache_beam as beam from apache_beam.runners.interactive import interactive_runner from apache_beam.coders import ProtoCoder class Foo(beam.DoFn): def process
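A rough Java equivalent, as a sketch only: it assumes the TensorFlow Example protos (org.tensorflow.example.*) are available on the classpath (for instance via a TensorFlow proto artifact), and the feature name, sample records, and output path are placeholders. The idea is to serialize each record into a tf.train.Example yourself and hand the resulting byte[] elements to TFRecordIO.

import com.google.protobuf.ByteString;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.tensorflow.example.BytesList;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

public class WriteTfRecordsSketch {
  // Builds a tf.train.Example with a single bytes feature and returns its serialized form.
  static byte[] toSerializedExample(String value) {
    Feature feature = Feature.newBuilder()
        .setBytesList(BytesList.newBuilder().addValue(ByteString.copyFromUtf8(value)))
        .build();
    Example example = Example.newBuilder()
        .setFeatures(Features.newBuilder().putFeature("value", feature))
        .build();
    return example.toByteArray();
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(Create.of(Arrays.asList("record-1", "record-2")))
        .apply(MapElements.into(TypeDescriptor.of(byte[].class))
            .via(WriteTfRecordsSketch::toSerializedExample))
        // TFRecordIO expects a PCollection<byte[]> of already-serialized records.
        .apply(TFRecordIO.write().to("/tmp/output/tfrecords"));
    p.run().waitUntilFinish();
  }
}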

Apache-Beam: Read parquet files from nested HDFS directories

谁说我不能喝 submitted on 2020-01-24 00:27:30
Question: How could I read all Parquet files stored in HDFS using the Apache Beam 2.13.0 Python SDK with the direct runner if the directory structure is the following: data/ ├── a │ ├── file_1.parquet │ └── file_2.parquet └── b ├── file_3.parquet └── file_4.parquet I tried beam.io.ReadFromParquet and hdfs://data/*/* : import apache_beam as beam from apache_beam.options.pipeline_options import PipelineOptions HDFS_HOSTNAME = 'my-hadoop-master-node.com' HDFS_PORT = 50070 HDFS_USER = "my-user-name" pipeline

How to solve a "Duplicate values" exception when creating a PCollectionView<Map<String,String>>

巧了我就是萌 submitted on 2020-01-23 01:39:26
Question: I'm setting up a slowly-changing lookup Map in my Apache Beam pipeline, which continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits this exception: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey Is anything wrong with this snippet of code? If I use .discardingFiredPanes() instead, I will lose information in the last emit. pipeline
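One way this is commonly resolved, shown as a sketch rather than the asker's actual design: with accumulatingFiredPanes() every pane re-contains earlier values, so View.asMap() can see the same key more than once. Collapsing to a single value per key before building the map, for example with Latest.perKey(), which keeps the element carrying the latest timestamp, avoids the duplicate-key error; View.asMultimap() is an alternative if all values per key should be retained.

import java.util.Map;
import org.apache.beam.sdk.transforms.Latest;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class LookupMapViewSketch {
  /** Builds a map view that keeps only the newest value per key across accumulated panes. */
  public static PCollectionView<Map<String, String>> build(
      PCollection<KV<String, String>> updates) {
    return updates
        .apply(Window.<KV<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .accumulatingFiredPanes())
        // Collapse duplicates per key before building the map, keeping the element
        // with the latest timestamp, so View.asMap() never sees two values per key.
        .apply(Latest.<String, String>perKey())
        .apply(View.asMap());
  }
}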