Question
I have a simple job that moves data from Pub/Sub to GCS. The Pub/Sub topic is a shared topic with many different message types of varying size.
I want the result in GCS to be vertically partitioned accordingly:
Schema/version/year/month/day/
Under that parent key there should be a group of files for that day, and the files should be a reasonable size, i.e. 10-200 MB.
I'm using Scio and I am able to do a groupBy operation to make an SCollection of (String, Iterable[Event]) where the key is based on the partitioning scheme above.
I am unable to use the default text sinks since they do not support vertical partitioning; they can only write the entire PCollection to one location. Instead, following the advice in these answers:
How do I write to multiple files in Apache Beam?
Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn
I have created a simple function that writes my group to GCS.
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

import com.google.cloud.storage.{BlobInfo, Storage, StorageOptions}

object GcsWriter {
  private val gcs: Storage = StorageOptions.getDefaultInstance.getService

  val EXTENSION = ".jsonl.gz"

  // TODO: no idea if this is OK - org.apache.beam.sdk.io.WriteFiles is a PTransform that
  // writes text files and seems very complex; maybe Beam is aimed at a different use case.
  // This is effectively an output 'transform' that writes text files, like
  // org.apache.beam.sdk.io.TextIO.write().to("output").

  // Gzip-compresses a byte array in memory.
  def gzip(bytes: Array[Byte]): Array[Byte] = {
    val byteOutputStream = new ByteArrayOutputStream()
    val compressedStream = new GZIPOutputStream(byteOutputStream)
    compressedStream.write(bytes)
    compressedStream.close()
    byteOutputStream.toByteArray
  }

  // Joins the items into newline-delimited text, gzips it and writes a single object to GCS.
  def writeAsTextToGcs(bucketName: String, key: String, items: Iterable[String]): Unit = {
    val bytes = items.mkString(start = "", sep = "\n", end = "\n").getBytes("UTF-8")
    val compressed = gzip(bytes)
    val blobInfo = BlobInfo.newBuilder(bucketName, key + System.currentTimeMillis() + EXTENSION).build()
    gcs.create(blobInfo, compressed)
  }
}
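For context, this writer is invoked from a simple map over the grouped collection, roughly like the sketch below (here `grouped`, the bucket name and `toJsonLine` are placeholders for my real code):

grouped // SCollection[(String, Iterable[Event])], keyed by Schema/version/year/month/day/
  .map { case (key, events) =>
    GcsWriter.writeAsTextToGcs(
      bucketName = "my-output-bucket",  // placeholder bucket name
      key = key,
      items = events.map(toJsonLine)    // placeholder Event -> JSON line encoder
    )
  }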
This works and writes the files how I like. I use the following triggering rules with fixed windows:
val WINDOW_DURATION: Duration = Duration.standardMinutes(10)
val WINDOW_ELEMENT_MAX_COUNT = 5000
val LATE_FIRING_DELAY: Duration = Duration.standardMinutes(10) // the delay after receiving late data before re-firing
val ALLOWED_LATENESS: Duration = Duration.standardHours(1)

val WINDOW_OPTIONS = WindowOptions(
  trigger = AfterFirst.of(
    ListBuffer(
      AfterPane.elementCountAtLeast(WINDOW_ELEMENT_MAX_COUNT),
      AfterWatermark.pastEndOfWindow().withLateFirings(
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(LATE_FIRING_DELAY)))),
  allowedLateness = ALLOWED_LATENESS,
  accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
)
Basically, it's a compound trigger that fires at the end of the window according to the watermark, or when x elements are received.
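For completeness, these options are applied with Scio's withFixedWindows before the groupBy, roughly like the sketch below (the topic name, parseEvent and partitionKey are placeholders for my real code):

val grouped = sc
  .pubsubTopic[String]("projects/my-project/topics/shared-topic") // placeholder topic
  .map(parseEvent)                                                // placeholder String -> Event decoder
  .withFixedWindows(WINDOW_DURATION, options = WINDOW_OPTIONS)
  .keyBy(partitionKey)                                            // builds the Schema/version/year/month/day/ key
  .groupByKey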
The problem is that the source data can have messages of varying size, so if I choose a fixed number of elements to trigger on I will either:
1) choose too big a number, and for the larger event groups it will blow up the Java heap on the worker, or
2) choose a smaller number, and then I end up with some tiny files for the quiet events, where I would want to accumulate more events per file.
I don't see a custom trigger where I can pass a lambda which outputs a metric for each element, or anything like that. Is there a way I can implement my own trigger that fires on the number of bytes in the window?
Some other questions:
Am I correct in assuming the Iterable for the elements in each group is held in memory rather than streamed from storage? If not, I could stream from the iterator to GCS in a more memory-efficient way.
For my GCS writer I am simply doing it in a map or a ParDo. It does not implement the file output sink or look anything like TextIO. Will there be issues with this simple implementation? The docs say that if a transform throws an exception it is simply retried (indefinitely for streaming apps).
Source: https://stackoverflow.com/questions/46428605/dataflow-apache-beam-trigger-window-on-number-of-bytes-in-window