Streaming data to Google Cloud Storage from PubSub using Cloud Dataflow

Submitted by 纵然是瞬间 on 2020-01-15 10:06:21

Question


I am reading data from Pub/Sub in a streaming Dataflow pipeline. I then need to upload the raw data to storage, process it, and upload the processed data to BigQuery.

here is my code:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class BotPipline {

    public static void main(String[] args) {

        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setProject(MY_PROJECT);
        options.setStagingLocation(MY_STAGING_LOCATION);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);

        // Read messages from the Pub/Sub subscription.
        PCollection<String> input = pipeline
                .apply(PubsubIO.Read.maxNumRecords(1).subscription(MY_SUBSCRIBTION));

        // Branch 1: write the raw messages to Cloud Storage.
        input.apply(TextIO.Write.to(MY_STORAGE_LOCATION));

        // Branch 2: process the messages and write them to BigQuery.
        input
                .apply(someDataProcessing(...).named("update json"))
                .apply(convertToTableRow(...).named("convert json to table row"))
                .apply(BigQueryIO.Write.to(MY_BQ_TABLE).withSchema(tableSchema));

        pipeline.run();
    }
}

When I run the code with the write to storage commented out, it works well, but when I try uploading to BigQuery I get this error (which is expected):

Write can only be applied to a Bounded PCollection

I am not using a bounded collection since I need this to run all the time, and I need the data to be uploaded immediately. Any solution?

EDIT: this is my desired behavior:

I am receiving messages via Pub/Sub. Each message should be stored in its own file in GCS as raw data; then some processing is performed on the data, which is then saved to BigQuery, with the file name included in the data.

Data should be visible in BQ immediately after it is received. Example:

data published to pubsub : {a:1, b:2} 
data saved to GCS file UUID: A1F432 
data processing :  {a:1, b:2} -> 
                   {a:11, b: 22} -> 
                   {fileName: A1F432, data: {a:11, b: 22}} 
data in BQ : {fileName: A1F432, data: {a:11, b: 22}} 

The idea is that the processed data is stored in BQ with a link to the raw data stored in GCS.
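
To make the BQ step concrete, a conversion like this could produce the rows above (a minimal sketch, assuming the Dataflow SDK 1.x used in the question; the ConvertToTableRowFn name and the upstream KV<fileName, json> pairing are assumptions, not part of the original code):

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;

// Hypothetical conversion step: turns (fileName, processedJson) pairs into
// BigQuery rows shaped like {fileName: A1F432, data: {a:11, b:22}}.
class ConvertToTableRowFn extends DoFn<KV<String, String>, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
        c.output(new TableRow()
                .set("fileName", c.element().getKey())    // UUID of the GCS file
                .set("data", c.element().getValue()));    // processed JSON payload
    }
}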


Answer 1:


Currently we don't support writing unbounded collections in TextIO.Write. See related question.

Could you clarify what you would like the behavior of unbounded TextIO.Write to be? E.g., would you like to have one constantly growing file; one file per window, closed when the window closes; or something else? Or does it only matter that the total contents of the files written eventually contain all the Pub/Sub messages, no matter how the files are structured?
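
(For the "one file per window" option, the unbounded input would first be windowed. A minimal sketch, assuming Dataflow SDK 1.x and an arbitrary one-minute window size:)

import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Sketch: partition the unbounded stream into fixed one-minute windows so a
// custom writer can close one file per window.
PCollection<String> windowed = input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));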

As a workaround, you can implement writing to GCS as your own DoFn, using IOChannelFactory to interact with GCS (in fact, TextIO.Write is, under the hood, just a composite transform that a user could have written themselves from scratch).
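
For example, a per-element writer could look roughly like this (a minimal sketch, assuming the Dataflow SDK 1.x IOChannelUtils/IOChannelFactory API; the WriteOneFilePerElementFn name, the bucket path, and the KV output are assumptions made for illustration):

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;

// Sketch: write each element to its own GCS file and emit (fileName, element)
// so the BigQuery branch can record which file the raw data landed in.
class WriteOneFilePerElementFn extends DoFn<String, KV<String, String>> {
    private final String bucketPrefix; // e.g. "gs://my-bucket/raw" (placeholder)

    WriteOneFilePerElementFn(String bucketPrefix) {
        this.bucketPrefix = bucketPrefix;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        String fileName = UUID.randomUUID().toString();
        // IOChannelUtils resolves "gs://..." specs to the GCS channel factory.
        try (WritableByteChannel channel =
                IOChannelUtils.create(bucketPrefix + "/" + fileName, "text/plain")) {
            channel.write(ByteBuffer.wrap(c.element().getBytes(StandardCharsets.UTF_8)));
        }
        c.output(KV.of(fileName, c.element()));
    }
}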

You can access the window of the data using the optional BoundedWindow parameter on @ProcessElement. I'd be able to provide more advice if you explain the desired behavior.
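
In the annotation-based DoFn style, that parameter looks like the following (a minimal sketch with Apache Beam package names; the body is only illustrative):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

// Sketch: declaring a BoundedWindow parameter on the @ProcessElement method
// makes the runner inject the window the current element belongs to.
class WindowAwareFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        // e.g. derive a per-window file name from the window's end timestamp
        c.output(window.maxTimestamp() + ": " + c.element());
    }
}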



Source: https://stackoverflow.com/questions/40384870/streaming-data-to-google-cloud-storage-from-pubsub-using-cloud-dataflow
