apache-beam-io

How to log incoming messages in an Apache Beam pipeline

为君一笑 submitted on 2020-01-16 19:06:49
Question: I am writing a simple Apache Beam streaming pipeline, taking input from a Pub/Sub topic and storing it into BigQuery. For hours I thought I was not even able to read a message, as I was simply trying to log the input to the console: events = p | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION) logging.info(events) When I write this to text it works fine! However, my call to the logger never happens. How do people develop / debug these streaming pipelines? I have tried adding the
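Note that logging.info(events) only logs the PCollection handle at pipeline-construction time; per-element logging has to happen inside a transform that runs on the workers. A minimal sketch of that idea, shown with the Java SDK to match the other examples on this page (the class and step names are illustrative, not from the question):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Logs every incoming message and passes it through unchanged.
    class LogElements extends DoFn<String, String> {
      private static final Logger LOG = LoggerFactory.getLogger(LogElements.class);

      @ProcessElement
      public void processElement(@Element String element, OutputReceiver<String> out) {
        LOG.info("Received message: {}", element);  // runs once per element, on the workers
        out.output(element);
      }
    }

    // usage: events.apply("LogMessages", ParDo.of(new LogElements()));

The same pattern in the Python SDK is a beam.Map or DoFn that calls logging.info on each element rather than on the PCollection itself.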

Dataflow GroupBy -> multiple outputs based on keys

本小妞迷上赌 submitted on 2020-01-07 05:02:13
Question: Is there any simple way to redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create()) .apply(ParDo.named("Print Bins").of( ... ) .apply(TextIO.Write.to(*Output file based on key*)) If Sink is the solution, would you please share sample code with me? Thanks! Answer 1: Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API
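In current Beam releases the per-key routing the answer refers to is usually expressed with FileIO.writeDynamic(). A hedged sketch of the pattern, using a simplified element type and an illustrative bucket path (neither is from the question):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Routes each (key, line) pair to its own set of files under an illustrative prefix.
    static void writePerKey(PCollection<KV<String, String>> keyedLines) {
      keyedLines.apply(
          FileIO.<String, KV<String, String>>writeDynamic()
              .by(KV::getKey)                                    // the group key picks the destination
              .via(Contextful.fn(KV::getValue), TextIO.sink())   // write the value as a text line
              .to("gs://my-bucket/bins/")                        // illustrative output prefix
              .withDestinationCoder(StringUtf8Coder.of())
              .withNaming(key -> FileIO.Write.defaultNaming("bin-" + key, ".txt")));
    }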

How to use FileIO.writeDynamic() in Apache Beam 2.6 to write to multiple output paths?

本小妞迷上赌 submitted on 2020-01-06 14:13:40
Question: I am using Apache Beam 2.6 to read from a single Kafka topic and write the output to Google Cloud Storage (GCS). Now I want to alter the pipeline so that it reads multiple topics and writes them out as gs://bucket/topic/... When reading only a single topic I used TextIO in the last step of my pipeline: TextIO.write() .to( new DateNamedFiles( String.format("gs://bucket/data%s/", suffix), currentMillisString)) .withWindowedWrites() .withTempDirectory( FileBasedSink
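The same FileIO.writeDynamic() pattern sketched above can key the destination on the Kafka topic. A hedged fragment (imports as in the previous sketch), assuming each element arrives as a KV of (topic, line), which the question does not show:

    static void writePerTopic(PCollection<KV<String, String>> topicAndLine) {
      topicAndLine.apply(
          FileIO.<String, KV<String, String>>writeDynamic()
              .by(KV::getKey)                                    // the topic name decides the output directory
              .via(Contextful.fn(KV::getValue), TextIO.sink())
              .to("gs://bucket/")
              .withDestinationCoder(StringUtf8Coder.of())
              .withNaming(topic -> FileIO.Write.defaultNaming(topic + "/part", ".txt"))
              .withNumShards(3)                                  // streaming writes need an explicit shard count
              .withTempDirectory("gs://bucket/tmp/"));           // temp files are staged here before the final rename
    }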

JDBC Fetch from Oracle with Beam

岁酱吖の submitted on 2019-12-24 06:49:18
Question: The program below connects to Oracle 11g and fetches records. However, it gives me a NullPointerException for the coder at pipeline.apply(). I have added ojdbc14.jar to the project dependencies. public static void main(String[] args) { Pipeline p = Pipeline.create(PipelineOptionsFactory.create()); p.apply(JdbcIO.<KV<Integer, String>>read() .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create( "oracle.jdbc.driver.OracleDriver", "jdbc:oracle:thin:@hostdnsname:port
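A NullPointerException about the coder in older JdbcIO versions usually means .withCoder(...) was not set on the read. For reference, a hedged sketch of a complete read with an explicit coder (the connection string, credentials, query, and column mapping are illustrative placeholders):

    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;

    p.apply(JdbcIO.<KV<Integer, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "oracle.jdbc.driver.OracleDriver",
                "jdbc:oracle:thin:@//dbhost:1521/service")      // illustrative connection string
            .withUsername("user")                               // placeholder credentials
            .withPassword("password"))
        .withQuery("SELECT id, name FROM some_table")           // illustrative query
        .withRowMapper(resultSet ->
            KV.of(resultSet.getInt(1), resultSet.getString(2)))
        // older JdbcIO versions fail at pipeline construction if no output coder is given
        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of())));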

Using CoGroupByKey with custom type ends up in a Coder error

与世无争的帅哥 submitted on 2019-12-24 03:01:29
Question: I want to join two PCollections (each from a different input), following the steps described in the "Joins with CoGroupByKey" section here: https://cloud.google.com/dataflow/model/group-by-key In my case, I want to join GeoIP "block" information with "location" information, so I defined Block and Location as custom classes and then wrote the following: final TupleTag<Block> t1 = new TupleTag<Block>(); final TupleTag<Location> t2 = new TupleTag<Location>(); PCollection<KV<Long,
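A Coder error on a custom type typically means Beam cannot infer how to serialize it. One common fix is to declare a coder on the class, for example with AvroCoder; a hedged sketch (the fields are illustrative, not GeoIP's actual schema):

    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.coders.DefaultCoder;

    @DefaultCoder(AvroCoder.class)   // lets Beam pick a coder for this custom type
    public class Block {
      long geonameId;
      String network;

      public Block() {}              // AvroCoder requires a no-argument constructor
    }

Alternatively, the coder can be set explicitly, e.g. collection.setCoder(SerializableCoder.of(Block.class)) if Block implements Serializable, or registered once on the pipeline's CoderRegistry.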

Using defaultNaming for dynamic windowed writes in Apache Beam

安稳与你 submitted on 2019-12-24 00:45:16
Question: I am following along with the answer to this post and the documentation in order to perform a dynamic windowed write on my data at the end of a pipeline. Here is what I have so far: static void applyWindowedWrite(PCollection<String> stream) { stream.apply( FileIO.<String, String>writeDynamic() .by(Event::getKey) .via(TextIO.sink()) .to("gs://some_bucket/events/") .withNaming(key -> defaultNaming(key, ".json"))); } But NetBeans warns me about a syntax error on the last line: FileNaming is not
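For reference, a variant of the same method that compiles on recent Beam versions; it fully qualifies the naming helper and adds the pieces windowed writes typically need. This is a hedged sketch, not necessarily the fix for this exact NetBeans error, and the key-extraction lambda stands in for the poster's Event::getKey:

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(element -> element.substring(0, 4))                          // stand-in for Event::getKey (String -> String)
              .via(TextIO.sink())
              .to("gs://some_bucket/events/")
              .withDestinationCoder(StringUtf8Coder.of())                      // coder for the String destination key
              .withNumShards(1)                                                // windowed/unbounded writes need explicit sharding
              .withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));   // fully qualified instead of a static import
    }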

How to speed up bulk importing into Google Cloud Datastore with multiple workers?

不想你离开。 submitted on 2019-12-23 19:28:47
Question: I have an Apache Beam based Dataflow job that reads a single text file (stored in Google Cloud Storage) using the vcf source, transforms the text lines into Datastore entities, and writes them into the Datastore sink. The workflow works fine, but the downside I noticed is that the write speed into Datastore is at most around 25-30 entities per second. I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100 but the execution seems to prefer one worker (see graph below:
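A frequent reason a job like this sticks to one worker is fusion: the read from a single file and everything downstream get fused into one step, so there is little for autoscaling to spread out. A common remedy is to break fusion with a reshuffle between the read and the expensive stage. A hedged sketch of that idea in the Java SDK (the question's pipeline is Python, and the DoFn and project id below are hypothetical):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;

    p.apply("ReadLines", TextIO.read().from("gs://bucket/input.vcf"))
        .apply("BreakFusion", Reshuffle.viaRandomKey())        // redistributes elements so later steps can scale out
        .apply("ToEntities", ParDo.of(new LineToEntityFn()))   // hypothetical DoFn producing Datastore Entity objects
        .apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId("my-project"));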

How can I improve performance of TextIO or AvroIO when reading a very large number of files?

早过忘川 submitted on 2019-12-22 08:24:56
Question: TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform very well in current Apache Beam runners when reading a filepattern that expands into a very large number of files - for example, 1M files. How can I read such a large number of files efficiently? Answer 1: When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which
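A minimal sketch of the hint mentioned in the answer (the file pattern is illustrative):

    PCollection<String> lines = p.apply(
        "ReadManyFiles",
        TextIO.read()
            .from("gs://bucket/logs/*/*.txt")       // a pattern that expands into very many files
            .withHintMatchesManyFiles());           // tells Beam to expect a huge expansion and read accordingly

On newer Beam versions a similar effect can also be achieved by combining FileIO.match() and FileIO.readMatches() with TextIO.readFiles().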