apache-beam-io

How to log incoming messages in an Apache Beam pipeline

为君一笑 submitted on 2020-01-16 19:06:49
Question: I am writing a simple Apache Beam streaming pipeline, taking input from a Pub/Sub topic and storing it into BigQuery. For hours I thought I was not even able to read a message, as I was simply trying to log the input to the console: events = p | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION) logging.info(events) When I write this to text it works fine! However, my call to the logger never happens. How do people develop / debug these streaming pipelines? I have tried adding the
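Note that logging.info(events) only logs the PCollection handle at pipeline-construction time; per-element logging has to happen inside a transform that runs on the workers. A minimal sketch of that idea, shown with the Java SDK to match the other examples on this page (the class and step names are illustrative, not from the question):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Logs every incoming message and passes it through unchanged.
    class LogElements extends DoFn<String, String> {
      private static final Logger LOG = LoggerFactory.getLogger(LogElements.class);

      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) {
        LOG.info("Received message: {}", element);  // runs once per element, on the workers
        out.output(element);
      }
    }

    // usage: events.apply("LogMessages", ParDo.of(new LogElements()));

The same pattern in the Python SDK is a beam.Map or DoFn that calls logging.info on each element rather than on the PCollection itself.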

Dataflow GroupBy -> multiple outputs based on keys

本小妞迷上赌 submitted on 2020-01-07 05:02:13
Question: Is there any simple way to redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create()) .apply(ParDo.named("Print Bins").of( ... ) .apply(TextIO.Write.to(*Output file based on key*)) If Sink is the solution, would you please share sample code with me? Thanks! Answer 1: Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API
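In current Beam releases the per-key routing the answer refers to is usually expressed with FileIO.writeDynamic(). A hedged sketch of the pattern, using a simplified element type and an illustrative bucket path (neither is from the question):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Routes each (key, line) pair to its own set of files under an illustrative prefix.
    static void writePerKey(PCollection<KV<String, String>> keyedLines) {
      keyedLines.apply(
          FileIO.<String, KV<String, String>>writeDynamic()
              .by(KV::getKey)                                    // the group key picks the destination
              .via(Contextful.fn(KV::getValue), TextIO.sink())   // write the value as a text line
              .to("gs://my-bucket/bins/")                        // illustrative output prefix
              .withDestinationCoder(StringUtf8Coder.of())
              .withNaming(key -> FileIO.Write.defaultNaming("bin-" + key, ".txt")));
    }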

How to use FileIO.writeDynamic() in Apache Beam 2.6 to write to multiple output paths?

本小妞迷上赌 submitted on 2020-01-06 14:13:40
Question: I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it reads multiple topics and writes them out as gs://bucket/topic/... When reading only a single topic I used TextIO in the last step of my pipeline: TextIO.write() .to( new DateNamedFiles( String.format("gs://bucket/data%s/", suffix), currentMillisString)) .withWindowedWrites() .withTempDirectory( FileBasedSink
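The same FileIO.writeDynamic() pattern sketched above can key the destination on the Kafka topic. A hedged fragment (imports as in the previous sketch), assuming each element arrives as a KV of (topic, line), which the question does not show:

    static void writePerTopic(PCollection<KV<String, String>> topicAndLine) {
      topicAndLine.apply(
          FileIO.<String, KV<String, String>>writeDynamic()
              .by(KV::getKey)                                    // the topic name decides the output directory
              .via(Contextful.fn(KV::getValue), TextIO.sink())
              .to("gs://bucket/")
              .withDestinationCoder(StringUtf8Coder.of())
              .withNaming(topic -> FileIO.Write.defaultNaming(topic + "/part", ".txt"))
              .withNumShards(3)                                  // streaming writes need an explicit shard count
              .withTempDirectory("gs://bucket/tmp/"));           // temp files are staged here before the final rename
    }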

JDBC Fetch from Oracle with Beam

岁酱吖の submitted on 2019-12-24 06:49:18
Question: The program below connects to Oracle 11g and fetches records. However, it gives me a NullPointerException for the coder at pipeline.apply(). I have added ojdbc14.jar to the project dependencies. public static void main(String[] args) { Pipeline p = Pipeline.create(PipelineOptionsFactory.create()); p.apply(JdbcIO.<KV<Integer, String>>read() .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create( "oracle.jdbc.driver.OracleDriver", "jdbc:oracle:thin:@hostdnsname:port
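A NullPointerException about the coder in older JdbcIO versions usually means .withCoder(...) was not set on the read. For reference, a hedged sketch of a complete read with an explicit coder (the connection string, credentials, query, and column mapping are illustrative placeholders):

    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;

    p.apply(JdbcIO.<KV<Integer, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@//dbhost:1521/service")      // illustrative connection string
            .withUsername("user")                               // placeholder credentials
            .withPassword("password"))
        .withQuery("SELECT id, name FROM some_table")           // illustrative query
        .withRowMapper(resultSet ->
            KV.of(resultSet.getInt(1), resultSet.getString(2)))
        // older JdbcIO versions fail at pipeline construction if no output coder is given
        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of())));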

Using CoGroupByKey with custom type ends up in a Coder error

与世无争的帅哥 submitted on 2019-12-24 03:01:29
Question: I want to join two PCollections (each from a different input), following the steps described in the "Joins with CoGroupByKey" section here: https://cloud.google.com/dataflow/model/group-by-key In my case, I want to join GeoIP "block" information with "location" information, so I defined Block and Location as custom classes and then wrote the following: final TupleTag<Block> t1 = new TupleTag<Block>(); final TupleTag<Location> t2 = new TupleTag<Location>(); PCollection<KV<Long,
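A Coder error on a custom type typically means Beam cannot infer how to serialize it. One common fix is to declare a coder on the class, for example with AvroCoder; a hedged sketch (the fields are illustrative, not GeoIP's actual schema):

    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.coders.DefaultCoder;

    @DefaultCoder(AvroCoder.class)   // lets Beam pick a coder for this custom type
    public class Block {
      long geonameId;
      String network;

      public Block() {}              // AvroCoder requires a no-argument constructor
    }

Alternatively, the coder can be set explicitly, e.g. collection.setCoder(SerializableCoder.of(Block.class)) if Block implements Serializable, or registered once on the pipeline's CoderRegistry.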

Using defaultNaming for dynamic windowed writes in Apache Beam

安稳与你 submitted on 2019-12-24 00:45:16
Question: I am following along with the answer to this post and the documentation in order to perform a dynamic windowed write on my data at the end of a pipeline. Here is what I have so far: static void applyWindowedWrite(PCollection<String> stream) { stream.apply( FileIO.<String, String>writeDynamic() .by(Event::getKey) .via(TextIO.sink()) .to("gs://some_bucket/events/") .withNaming(key -> defaultNaming(key, ".json"))); } But NetBeans warns me about a syntax error on the last line: FileNaming is not
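For reference, a variant of the same method that compiles on recent Beam versions; it fully qualifies the naming helper and adds the pieces windowed writes typically need. This is a hedged sketch, not necessarily the fix for this exact NetBeans error, and the key-extraction lambda stands in for the poster's Event::getKey:

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(element -> element.substring(0, 4))                          // stand-in for Event::getKey (String -> String)
              .via(TextIO.sink())
              .to("gs://some_bucket/events/")
              .withDestinationCoder(StringUtf8Coder.of())                      // coder for the String destination key
              .withNumShards(1)                                                // windowed/unbounded writes need explicit sharding
              .withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));   // fully qualified instead of a static import
    }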

How to speed up bulk importing into Google Cloud Datastore with multiple workers?

不想你离开。 submitted on 2019-12-23 19:28:47
Question: I have an Apache Beam based Dataflow job that reads a single text file (stored in Google Cloud Storage) using the vcf source, transforms the text lines into Datastore entities, and writes them into the Datastore sink. The workflow works fine, but the downside I noticed is that the write speed into Datastore is at most around 25-30 entities per second. I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100 but the execution seems to prefer one worker (see graph below:
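A frequent reason a job like this sticks to one worker is fusion: the read from a single file and everything downstream get fused into one step, so there is little for autoscaling to spread out. A common remedy is to break fusion with a reshuffle between the read and the expensive stage. A hedged sketch of that idea in the Java SDK (the question's pipeline is Python, and the DoFn and project id below are hypothetical):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    p.apply("ReadLines", TextIO.read().from("gs://bucket/input.vcf"))
        .apply("BreakFusion", Reshuffle.viaRandomKey())        // redistributes elements so later steps can scale out
        .apply("ToEntities", ParDo.of(new LineToEntityFn()))   // hypothetical DoFn producing Datastore Entity objects
        .apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId("my-project"));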

How can I improve performance of TextIO or AvroIO when reading a very large number of files?

早过忘川 submitted on 2019-12-22 08:24:56
Question: TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform very well in current Apache Beam runners when reading a filepattern that expands into a very large number of files - for example, 1M files. How can I read such a large number of files efficiently? Answer 1: When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which
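A minimal sketch of the hint mentioned in the answer (the file pattern is illustrative):

    PCollection<String> lines = p.apply(
        "ReadManyFiles",
        TextIO.read()
            .from("gs://bucket/logs/*/*.txt")       // a pattern that expands into very many files
            .withHintMatchesManyFiles());           // tells Beam to expect a huge expansion and read accordingly

On newer Beam versions a similar effect can also be achieved by combining FileIO.match() and FileIO.readMatches() with TextIO.readFiles().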