How to read a nested collection in Spark

借酒劲吻你 · 2021-01-31 10:53

I have a parquet table with one of the columns being

array<struct<...>>

Can run queries against this table in Hive using LATERAL VIEW syntax.

4 Answers
  •  旧时难觅i
    2021-01-31 11:27

    The answers above are all great and tackle this question from different angles; Spark SQL is also a quite useful way to access nested data.

    Here's an example of how to use explode() in SQL directly to query a nested collection.

    SELECT hholdid, tsp.person_seq_no
    FROM (  SELECT hholdid, explode(tsp_ids) AS tsp
            FROM disc_mrt.unified_fact uf
         ) t

    tsp_ids is a nested array of structs with many attributes, including person_seq_no, which I'm selecting in the outer query above.
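
    Conceptually, explode() produces one output row per array element, copying the parent columns onto each. A plain-Python sketch of that flattening, using hypothetical hholdid/tsp_ids data shaped like the table above (no Spark required):

    ```python
    # Hypothetical rows mirroring the hholdid / tsp_ids shape from the query above.
    rows = [
        {"hholdid": 1, "tsp_ids": [{"person_seq_no": 10}, {"person_seq_no": 11}]},
        {"hholdid": 2, "tsp_ids": [{"person_seq_no": 20}]},
    ]

    def explode(rows, array_col):
        """Yield one output row per element of row[array_col], keeping parent columns."""
        for row in rows:
            for elem in row[array_col]:
                out = {k: v for k, v in row.items() if k != array_col}
                out["tsp"] = elem  # exploded element, like 'AS tsp' in the SQL
                yield out

    exploded = [(r["hholdid"], r["tsp"]["person_seq_no"])
                for r in explode(rows, "tsp_ids")]
    print(exploded)  # [(1, 10), (1, 11), (2, 20)]
    ```

    Note how hholdid 1 appears twice in the output, once per element of its nested collection.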

    The above was tested in Spark 2.0. I did a small test and it doesn't work in Spark 1.6. This question was asked before Spark 2 was around, so this answer adds nicely to the list of available options for dealing with nested structures.

    Have a look also at the following JIRA for a Hive-compatible way to query nested data using LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g. when a nested collection is empty, but you still want the attributes from the parent record):

    • SPARK-13721: Add support for LATERAL VIEW OUTER explode()
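
    The OUTER variant keeps the parent row even when its nested collection is empty. Extending the same plain-Python sketch of the semantics (hypothetical data; treating an empty or null array as a single None element is the assumption here):

    ```python
    def explode_outer(rows, array_col):
        """Like explode(), but emit one row with a None element when the array is empty or null."""
        for row in rows:
            elems = row.get(array_col) or [None]  # empty/null array -> still keep the parent row
            for elem in elems:
                out = {k: v for k, v in row.items() if k != array_col}
                out["tsp"] = elem
                yield out

    rows = [
        {"hholdid": 1, "tsp_ids": [{"person_seq_no": 10}]},
        {"hholdid": 2, "tsp_ids": []},  # empty nested collection
    ]
    result = [(r["hholdid"], r["tsp"]["person_seq_no"] if r["tsp"] else None)
              for r in explode_outer(rows, "tsp_ids")]
    print(result)  # [(1, 10), (2, None)]
    ```

    A plain explode would drop hholdid 2 entirely; the outer variant keeps it with a null in place of the missing element.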

    A notable unresolved JIRA on explode() for SQL access:

    • SPARK-7549: Support aggregating over nested fields
