Question
I would like to know whether there is a way to find the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method like this:
int getCount(String fileName) {}
The method above should return the total row count, and its implementation should be Dataflow code.
Thanks
Answer 1:
It seems your use case doesn't require distributed processing, because the file is compressed and hence cannot be read in parallel. However, you may still find the Dataflow APIs useful for their easy access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner. It runs in-process, without talking to the Dataflow service or doing any distributed processing; in return, it lets you extract PCollections into Java objects.
Something like this:
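// Classes below come from the Dataflow Java SDK 1.x (com.google.cloud.dataflow.sdk.* packages).
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);

// Read the file line by line and count all lines into a single-element PCollection.
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());

// Run in-process and pull the single count value out as a Java object.
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);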
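
For comparison, since a compressed file has to be read sequentially anyway, here is a minimal plain-Java sketch that counts rows without Dataflow. It rests on assumptions that are not in the question: the file is gzip-compressed, it sits on the local filesystem rather than in GCS, and rows are newline-delimited UTF-8 text. The class name GzipLineCounter and the method countLines are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of any Dataflow API.
class GzipLineCounter {

    // Counts newline-delimited rows in a gzip file (assumed format) by streaming it once.
    static long countLines(Path gzipFile) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzipFile)),
                StandardCharsets.UTF_8))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: GzipLineCounter <file.gz>");
            return;
        }
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

This returns a long rather than the int in your getCount signature, matching the Long that Count.globally() produces. It does not give you the GCS access or automatic decompression mentioned above, so it only fits if the file is available locally.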
Source: https://stackoverflow.com/questions/39237656/how-to-count-total-number-of-rows-in-a-file-using-google-dataflow