Reading nested JSON in Google Dataflow / Apache Beam

Asked by 遥遥无期 on 2021-01-13 16:57

It is possible to read unnested JSON files on Cloud Storage with Dataflow via:

p.apply("read logfiles", TextIO.Read.from("gs://bucket/*").withCoder(TableRowJsonCoder.of()));

1 Answer
  • Answered 2021-01-13 17:31

    Your best bet is probably to do what you described in #2 and use Jackson directly. It makes the most sense to let the TextIO read do what it is built for -- reading lines from a file with the string coder -- and then use a DoFn to actually parse the elements. Something like the following:

    PCollection<String> lines = pipeline
      .apply(TextIO.Read.from("gs://bucket/..."));
    PCollection<TableRow> objects = lines
      .apply(ParDo.of(new DoFn<String, TableRow>() {
        @Override
        public void processElement(ProcessContext c) throws Exception {
          String json = c.element();
          SomeObject object = /* parse json using Jackson, etc. */;
          TableRow row = /* create a table row from object */;
          c.output(row);
        }
      }));
    

    Note that you could also do this using multiple ParDos (for example, one that parses the JSON into a POJO and a second that converts the POJO into a TableRow).
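
    For reference, here is one possible sketch of that parsing DoFn as a standalone class, assuming the pre-2.0 Dataflow SDK and Jackson's ObjectMapper; the LogEntryToTableRowFn name and the "user"/"action" field names are made up for illustration and should be replaced with your own schema:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;

    public class LogEntryToTableRowFn extends DoFn<String, TableRow> {
      // ObjectMapper is thread-safe once configured, so one shared instance is enough.
      private static final ObjectMapper MAPPER = new ObjectMapper();

      @Override
      public void processElement(ProcessContext c) throws Exception {
        // Parse the raw JSON line into a tree so nested fields are reachable.
        JsonNode root = MAPPER.readTree(c.element());

        // Flatten the nested structure into a BigQuery-style TableRow.
        // The field names below are hypothetical placeholders.
        TableRow row = new TableRow()
            .set("user", root.path("user").path("id").asText())
            .set("action", root.path("action").asText());

        c.output(row);
      }
    }

    You would then plug it in as lines.apply(ParDo.of(new LogEntryToTableRowFn())), which keeps the parsing logic in one place and easier to unit test than an inline anonymous DoFn.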
