spotify-scio

Beam pipeline does not produce any output after GroupByKey with windowing, and I get a memory error

Submitted by 自闭症网瘾萝莉.ら on 2021-02-07 08:39:47
Question: Purpose: I want to load streaming data, add a key, and then count elements by key. Problem: My Apache Beam Dataflow pipeline gets a memory error when I try to load and group-by-key a large dataset using the streaming approach (unbounded data), because data seems to accumulate in the group-by step instead of being fired earlier by each window's trigger. If I decrease the element size (the element count does not change) it works, because the group-by step actually waits for all the data to be grouped
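A minimal Scio sketch of the usual fix, assuming an unbounded keyed input and fixed one-minute windows: add an early-firing trigger with discarding accumulation so each pane is emitted (and its buffered state released) before the window closes. The events collection, the window length, and the 30-second delay are placeholder assumptions, not part of the question.

    import com.spotify.scio.values.{SCollection, WindowOptions}
    import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, AfterWatermark}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    // events: a hypothetical unbounded (key, value) stream, e.g. from Pub/Sub.
    def countPerKey(events: SCollection[(String, String)]): SCollection[(String, Long)] =
      events
        .withFixedWindows(
          Duration.standardMinutes(1),
          options = WindowOptions(
            // Fire an early pane 30s after the first element in the pane,
            // then a final pane when the watermark passes the window end.
            trigger = AfterWatermark
              .pastEndOfWindow()
              .withEarlyFirings(
                AfterProcessingTime
                  .pastFirstElementInPane()
                  .plusDelayOf(Duration.standardSeconds(30))),
            // Discard fired panes so grouped state is not held for the
            // whole window, which is what exhausts worker memory.
            accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
            allowedLateness = Duration.ZERO
          )
        )
        .countByKey

With discarding mode, downstream consumers see partial counts per pane and must be able to combine them; that is the trade-off for the bounded memory footprint.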

Dataflow / Apache Beam: trigger a window on the number of bytes in the window

Submitted by 老子叫甜甜 on 2020-06-27 15:14:52
Question: I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic with many different message types of varying size. I want the result in GCS to be vertically partitioned accordingly: schema/version/year/month/day/. Under that parent key there should be a group of files for that day, and the files should be a reasonable size, i.e. 10-200 MB. I'm using Scio and I am able to do a groupBy operation to make a P/SCollection of [String, Iterable[Event]] where the key is based on the
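Beam has no built-in trigger on the number of bytes in a pane; the closest built-in proxy is an element-count trigger combined with a processing-time cap. Below is a sketch under the assumption of roughly uniform message sizes; the events stream, the 10,000-element threshold, and both durations are placeholders to tune against the 10-200 MB target.

    import com.spotify.scio.values.{SCollection, WindowOptions}
    import org.apache.beam.sdk.transforms.windowing.{AfterFirst, AfterPane, AfterProcessingTime, Repeatedly}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    // events: a hypothetical (partitionKey, payload) stream from Pub/Sub.
    def windowForGcsWrites(
        events: SCollection[(String, Array[Byte])]): SCollection[(String, Array[Byte])] =
      events.withFixedWindows(
        Duration.standardMinutes(10),
        options = WindowOptions(
          // Fire once ~10k elements have buffered, or 5 minutes after the
          // first element, whichever comes first; repeat until window end.
          trigger = Repeatedly.forever(
            AfterFirst.of(
              AfterPane.elementCountAtLeast(10000),
              AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(5)))),
          accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
          allowedLateness = Duration.ZERO
        )
      )

If message sizes vary wildly per type, the threshold could instead be applied per key with custom stateful logic, but that moves the batching out of the windowing trigger entirely.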

Scio: groupByKey doesn't work when using Pub/Sub as collection source

Submitted by 人盡茶涼 on 2019-12-23 22:15:42
Question: I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub as shown below. I published the Shakespeare file's data to Pub/Sub, and it did get fetched properly, but none of the transformations after .groupByKey seem to work.

    sc.pubsubSubscription[String](psSubscription)
      .withFixedWindows(windowSize) // apply windowing logic
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .withWindow[IntervalWindow]
      .swap
      .groupByKey
      .map { s => println("\n\n\n\n\n\n\n This
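A likely culprit with a Pub/Sub source is the default trigger: panes only fire once the watermark passes the end of the window, so steps after the group-by can look stuck if the watermark is not advancing. Here is a sketch of the same pipeline with an early-firing trigger; the subscription path and window length are placeholder values.

    import com.spotify.scio.values.WindowOptions
    import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, AfterWatermark, IntervalWindow}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    val psSubscription = "projects/my-project/subscriptions/shakespeare" // placeholder
    val windowSize = Duration.standardMinutes(1)                         // placeholder

    sc.pubsubSubscription[String](psSubscription)
      .withFixedWindows(windowSize, options = WindowOptions(
        // Emit a speculative pane 10s after the first element, plus the
        // final on-time pane when the watermark passes the window end.
        trigger = AfterWatermark.pastEndOfWindow()
          .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(10))),
        accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
        allowedLateness = Duration.ZERO))
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .withWindow[IntervalWindow]
      .swap
      .groupByKey
      .map { case (window, counts) =>
        println(s"$window: ${counts.size} distinct words")
      }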

How to match multiple files with names using TextIO.Read in Cloud Dataflow

Submitted by 橙三吉。 on 2019-12-23 18:28:01
Question: I have a GCS folder laid out as below:

    gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-02/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-03/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-04/part-0000.tsv
    ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which uses TextIO.Read.from() underneath as far as I know. I've tried both

    gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both match zero
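Beam's GCS glob matching supports wildcard characters but not {a,b} brace alternation or (a|b) regex groups, which would explain the zero matches. One workaround is to expand the date set in Scala and union the per-prefix reads; a sketch using the question's placeholder paths:

    // Expand the dates in code and union the reads, since the glob syntax
    // understood by TextIO cannot express alternation.
    val dates = Seq("2017-12-02", "2017-12-03")
    val lines = dates
      .map(dt => sc.textFile(s"gs://<bucket-name>/<folder-name>/dt=$dt/*.tsv"))
      .reduce(_ ++ _) // ++ is SCollection union in Scio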

Why is my PCollection (SCollection) size so large compared to the BigQuery table input size?

Submitted by 北战南征 on 2019-12-13 03:32:23
Question: (The original post includes an image of the table schema, not reproduced here.) The table is a BigQuery table that is the input to an Apache Beam Dataflow job running on Spotify's Scio. If you aren't familiar with Scio, it's a Scala wrapper around the Apache Beam Java SDK; in particular, an SCollection wraps a PCollection. My input table on BigQuery disk is 136 GB, but looking at the size of my SCollection in the Dataflow UI, it is 504.91 GB. I understand that BigQuery is likely much better at data compression and representation, but
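A plausible contributor: Scio's default BigQuery read materializes each row as a TableRow, a JSON-style map in which every record repeats its field names and boxes its values, so the in-flight serialized size can be several times the compressed, columnar size on BigQuery disk. A minimal sketch of such a read; the table spec is a placeholder and the Table.Spec form assumes a recent Scio version:

    import com.spotify.scio.bigquery._

    // Each element is a TableRow: field names are repeated per record and
    // values are boxed, unlike BigQuery's compressed columnar storage.
    val rows = sc.bigQueryTable(Table.Spec("my-project:my_dataset.my_table"))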

Read a file in order in Google Cloud Dataflow

Submitted by 房东的猫 on 2019-12-12 05:48:31
Question: I'm using Spotify Scio to read logs that are exported from Stackdriver to Google Cloud Storage. They are JSON files where every line is a single entry. Looking at the worker logs, it seems that each file is split into chunks, which are then read in arbitrary order. I've already limited my job to exactly 1 worker in this case. Is there a way to force these chunks to be read and processed in order? As an example (textFile is basically a TextIO.Read):

    val sc = ScioContext(myOptions)
    sc.textFile(myFile)
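Dataflow gives no ordering guarantee across a file's chunks, even with a single worker, so the usual workaround is to re-establish order from the data itself rather than from read order. A sketch that sorts entries by their own timestamps, assuming sc and myFile from the question, a JSON "timestamp" field in RFC3339 form (which sorts correctly as a plain string), and an input small enough to sort within a single group:

    // Hypothetical helper: pull the RFC3339 timestamp out of a JSON log line.
    // A real implementation would use a JSON parser instead of a regex.
    val TimestampPattern = "\"timestamp\"\\s*:\\s*\"([^\"]+)\"".r
    def parseTimestamp(line: String): String =
      TimestampPattern.findFirstMatchIn(line).map(_.group(1)).getOrElse("")

    val ordered = sc
      .textFile(myFile)
      .map(line => (parseTimestamp(line), line))
      .groupBy(_ => ()) // collapse into one group; only viable for modest inputs
      .flatMap { case (_, entries) => entries.toSeq.sortBy(_._1).map(_._2) }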