How to count the total number of rows in a file using Google Dataflow

Posted by 不羁岁月 on 2021-02-08 06:54:18

Question


I would like to know whether there is a way to find the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method like this:

int getCount(String fileName) {}

The method above should return the total row count, and its implementation should be Dataflow code.

Thanks


Answer 1:


It seems your use case doesn't require distributed processing: the file is compressed and therefore cannot be read in parallel. You may still find the Dataflow APIs useful, though, because they give easy access to GCS and decompress files automatically.

Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner. It runs in-process, without talking to the Dataflow service or doing any distributed processing; in return, it lets you extract the contents of a PCollection into Java objects.

Something like this:

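PipelineOptions options = ...;
// The Direct runner executes the pipeline in-process.
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
// Read the file line by line, then count all lines into a single-element PCollection.
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());
// Run through the runner instance so the results can be inspected afterwards.
DirectPipelineRunner.EvaluationResults results = runner.run(p);
// The global count is the only element of countPC.
long count = results.getPCollection(countPC).get(0);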
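
If the Dataflow APIs turn out not to be needed at all, the point above about a compressed file being read sequentially can be illustrated with plain JDK classes. The following is only a minimal sketch under several assumptions that are not in the question: the file is on the local filesystem (this does not read gs:// paths), a ".gz" suffix means gzip compression, and the text is UTF-8. The class name LineCounter is hypothetical, and the method returns long rather than int to match the count type used above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: counts lines in a local file with a single sequential pass.
class LineCounter {

    // Assumption: a ".gz" suffix marks a gzip-compressed file; anything else is read as-is.
    static long getCount(String fileName) throws IOException {
        try (InputStream raw = Files.newInputStream(Paths.get(fileName));
             InputStream in = fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long count = 0;
            // readLine() returns null at end of stream.
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

A call would look like LineCounter.getCount("rows.txt.gz"), where the file name is just a placeholder. A file stored in GCS would first have to be downloaded or opened through a GCS client, which this sketch does not cover.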


Source: https://stackoverflow.com/questions/39237656/how-to-count-total-number-of-rows-in-a-file-using-google-dataflow
