spotify-scio

Beam pipeline does not produce any output after GroupByKey with windowing, and I get a memory error

Submitted by 自闭症网瘾萝莉.ら on 2021-02-07 08:39:47
Question: Purpose: I want to load streaming data, add a key, and then count elements by key. Problem: My Apache Beam Dataflow pipeline gets a memory error when I try to load and group-by-key a large dataset using the streaming approach (unbounded data), because data seems to accumulate in the group-by step instead of being fired earlier by each window's trigger. If I decrease the element size (the element count does not change) it works, because the group-by step actually waits for all the data to be grouped
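A minimal Scio sketch of the usual fix, assuming an unbounded keyed input and fixed one-minute windows: add an early-firing trigger with discarding accumulation so each pane is emitted (and its buffered state released) before the window closes. The events collection, the window length, and the 30-second delay are placeholder assumptions, not part of the question.

    import com.spotify.scio.values.{SCollection, WindowOptions}
    import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, AfterWatermark}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    // events: a hypothetical unbounded (key, value) stream, e.g. from Pub/Sub.
    def countPerKey(events: SCollection[(String, String)]): SCollection[(String, Long)] =
      events
        .withFixedWindows(
          Duration.standardMinutes(1),
          options = WindowOptions(
            // Fire an early pane 30s after the first element in the pane,
            // then a final pane when the watermark passes the window end.
            trigger = AfterWatermark
              .pastEndOfWindow()
              .withEarlyFirings(
                AfterProcessingTime
                  .pastFirstElementInPane()
                  .plusDelayOf(Duration.standardSeconds(30))),
            // Discard fired panes so grouped state is not held for the
            // whole window, which is what exhausts worker memory.
            accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
            allowedLateness = Duration.ZERO
          )
        )
        .countByKey

With discarding mode, downstream consumers see partial counts per pane and must be able to combine them; that is the trade-off for the bounded memory footprint.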

Dataflow / Apache Beam: trigger a window on the number of bytes in the window

Submitted by 老子叫甜甜 on 2020-06-27 15:14:52
Question: I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic with many different message types of varying size. I want the result in GCS to be vertically partitioned accordingly: schema/version/year/month/day/. Under that parent key there should be a group of files for that day, and the files should be a reasonable size, i.e. 10-200 MB. I'm using Scio and I am able to do a groupBy operation to make a P/SCollection of [String, Iterable[Event]] where the key is based on the
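Beam has no built-in trigger on the number of bytes in a pane; the closest built-in proxy is an element-count trigger combined with a processing-time cap. Below is a sketch under the assumption of roughly uniform message sizes; the events stream, the 10,000-element threshold, and both durations are placeholders to tune against the 10-200 MB target.

    import com.spotify.scio.values.{SCollection, WindowOptions}
    import org.apache.beam.sdk.transforms.windowing.{AfterFirst, AfterPane, AfterProcessingTime, Repeatedly}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    // events: a hypothetical (partitionKey, payload) stream from Pub/Sub.
    def windowForGcsWrites(
        events: SCollection[(String, Array[Byte])]): SCollection[(String, Array[Byte])] =
      events.withFixedWindows(
        Duration.standardMinutes(10),
        options = WindowOptions(
          // Fire once ~10k elements have buffered, or 5 minutes after the
          // first element, whichever comes first; repeat until window end.
          trigger = Repeatedly.forever(
            AfterFirst.of(
              AfterPane.elementCountAtLeast(10000),
              AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(5)))),
          accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
          allowedLateness = Duration.ZERO
        )
      )

If message sizes vary wildly per type, the threshold could instead be applied per key with custom stateful logic, but that moves the batching out of the windowing trigger entirely.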

Scio: groupByKey doesn't work when using Pub/Sub as collection source

Submitted by 人盡茶涼 on 2019-12-23 22:15:42
Question: I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub as shown below. I published the Shakespeare file's data to Pub/Sub, and it did get fetched properly, but none of the transformations after .groupByKey seem to work.

    sc.pubsubSubscription[String](psSubscription)
      .withFixedWindows(windowSize) // apply windowing logic
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .withWindow[IntervalWindow]
      .swap
      .groupByKey
      .map { s => println("\n\n\n\n\n\n\n This
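A likely culprit with a Pub/Sub source is the default trigger: panes only fire once the watermark passes the end of the window, so steps after the group-by can look stuck if the watermark is not advancing. Here is a sketch of the same pipeline with an early-firing trigger; the subscription path and window length are placeholder values.

    import com.spotify.scio.values.WindowOptions
    import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, AfterWatermark, IntervalWindow}
    import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
    import org.joda.time.Duration

    val psSubscription = "projects/my-project/subscriptions/shakespeare" // placeholder
    val windowSize = Duration.standardMinutes(1)                         // placeholder

    sc.pubsubSubscription[String](psSubscription)
      .withFixedWindows(windowSize, options = WindowOptions(
        // Emit a speculative pane 10s after the first element, plus the
        // final on-time pane when the watermark passes the window end.
        trigger = AfterWatermark.pastEndOfWindow()
          .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(10))),
        accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
        allowedLateness = Duration.ZERO))
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue
      .withWindow[IntervalWindow]
      .swap
      .groupByKey
      .map { case (window, counts) =>
        println(s"$window: ${counts.size} distinct words")
      }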

How to match multiple files with names using TextIO.Read in Cloud Dataflow

Submitted by 橙三吉。 on 2019-12-23 18:28:01
Question: I have a GCS folder laid out as below:

    gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-02/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-03/part-0000.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-04/part-0000.tsv
    ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which uses TextIO.Read.from() underneath as far as I know. I've tried both

    gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv
    gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both match zero
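Beam's GCS glob matching supports wildcard characters but not {a,b} brace alternation or (a|b) regex groups, which would explain the zero matches. One workaround is to expand the date set in Scala and union the per-prefix reads; a sketch using the question's placeholder paths:

    // Expand the dates in code and union the reads, since the glob syntax
    // understood by TextIO cannot express alternation.
    val dates = Seq("2017-12-02", "2017-12-03")
    val lines = dates
      .map(dt => sc.textFile(s"gs://<bucket-name>/<folder-name>/dt=$dt/*.tsv"))
      .reduce(_ ++ _) // ++ is SCollection union in Scio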

Why is my PCollection (SCollection) size so large compared to the BigQuery table input size?

Submitted by 北战南征 on 2019-12-13 03:32:23
Question: (The original post includes an image of the table schema, not reproduced here.) The table is a BigQuery table that is the input to an Apache Beam Dataflow job running on Spotify's Scio. If you aren't familiar with Scio, it's a Scala wrapper around the Apache Beam Java SDK; in particular, an SCollection wraps a PCollection. My input table on BigQuery disk is 136 GB, but looking at the size of my SCollection in the Dataflow UI, it is 504.91 GB. I understand that BigQuery is likely much better at data compression and representation, but
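A plausible contributor: Scio's default BigQuery read materializes each row as a TableRow, a JSON-style map in which every record repeats its field names and boxes its values, so the in-flight serialized size can be several times the compressed, columnar size on BigQuery disk. A minimal sketch of such a read; the table spec is a placeholder and the Table.Spec form assumes a recent Scio version:

    import com.spotify.scio.bigquery._

    // Each element is a TableRow: field names are repeated per record and
    // values are boxed, unlike BigQuery's compressed columnar storage.
    val rows = sc.bigQueryTable(Table.Spec("my-project:my_dataset.my_table"))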

Read a file in order in Google Cloud Dataflow

Submitted by 房东的猫 on 2019-12-12 05:48:31
Question: I'm using Spotify Scio to read logs that are exported from Stackdriver to Google Cloud Storage. They are JSON files where every line is a single entry. Looking at the worker logs, it seems that each file is split into chunks, which are then read in arbitrary order. I've already limited my job to exactly 1 worker in this case. Is there a way to force these chunks to be read and processed in order? As an example (textFile is basically a TextIO.Read):

    val sc = ScioContext(myOptions)
    sc.textFile(myFile)
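Dataflow gives no ordering guarantee across a file's chunks, even with a single worker, so the usual workaround is to re-establish order from the data itself rather than from read order. A sketch that sorts entries by their own timestamps, assuming sc and myFile from the question, a JSON "timestamp" field in RFC3339 form (which sorts correctly as a plain string), and an input small enough to sort within a single group:

    // Hypothetical helper: pull the RFC3339 timestamp out of a JSON log line.
    // A real implementation would use a JSON parser instead of a regex.
    val TimestampPattern = "\"timestamp\"\\s*:\\s*\"([^\"]+)\"".r
    def parseTimestamp(line: String): String =
      TimestampPattern.findFirstMatchIn(line).map(_.group(1)).getOrElse("")

    val ordered = sc
      .textFile(myFile)
      .map(line => (parseTimestamp(line), line))
      .groupBy(_ => ()) // collapse into one group; only viable for modest inputs
      .flatMap { case (_, entries) => entries.toSeq.sortBy(_._1).map(_._2) }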