How to read a nested collection in Spark

借酒劲吻你 · 2021-01-31 10:53

I have a parquet table where one of the columns is a nested collection, with a type along the lines of

    array<struct<...>>

I can run queries against this table in Hive using LATERAL VIEW syntax. How do I read this table in Spark, and more importantly, how do I filter and map over the nested collection?
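
For reference, a minimal sketch of loading such a table as a DataFrame in PySpark; the path is hypothetical:

    # Hypothetical path; `sqlContext` is the SQLContext/HiveContext from the shell.
    df = sqlContext.read.parquet("/path/to/parquet_table")
    df.printSchema()  # shows the nested array column and its element type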

4 Answers

遥遥无期 · 2021-01-31 11:29

    I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.

    The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.

    Create a test dataframe:

    from pyspark.sql import Row
    
    # `sqlContext` comes predefined in the Spark 1.x shell; in a standalone
    # script, create one with SQLContext(sc).
    df = sqlContext.createDataFrame([Row(a=1, intlist=[1, 2, 3]), Row(a=2, intlist=[4, 5, 6])])
    df.show()
    
    ## +-+--------------------+
    ## |a|             intlist|
    ## +-+--------------------+
    ## |1|ArrayBuffer(1, 2, 3)|
    ## |2|ArrayBuffer(4, 5, 6)|
    ## +-+--------------------+
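
    As a quick sanity check, printSchema confirms the array element type (output as printed by a Spark 1.x shell):

    df.printSchema()
    
    ## root
    ##  |-- a: long (nullable = true)
    ##  |-- intlist: array (nullable = true)
    ##  |    |-- element: long (containsNull = true)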
    

    Use explode to flatten the list column:

    from pyspark.sql.functions import explode
    
    # explode produces one output row per array element, repeating `a` alongside.
    df.select(df.a, explode(df.intlist)).show()
    
    ## +-+---+
    ## |a|_c0|
    ## +-+---+
    ## |1|  1|
    ## |1|  2|
    ## |1|  3|
    ## |2|  4|
    ## |2|  5|
    ## |2|  6|
    ## +-+---+
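
    The exploded column gets a default name (_c0 above); explode(df.intlist).alias("intval") would give it a readable one.

    Since the question's column is an array of structs rather than a plain array of integers, here is a minimal sketch of the same approach on struct elements, reusing the imports above; the column and field names (pairs, col1, col2) are made up for illustration:

    nested = sqlContext.createDataFrame([
        Row(a=1, pairs=[Row(col1="x", col2=10), Row(col1="y", col2=20)]),
    ])
    
    # Alias the exploded struct, then reach into its fields with dot paths.
    flat = nested.select(nested.a, explode(nested.pairs).alias("p"))
    flat.select("a", "p.col1", "p.col2").show()
    
    ## +-+----+----+
    ## |a|col1|col2|
    ## +-+----+----+
    ## |1|   x|  10|
    ## |1|   y|  20|
    ## +-+----+----+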
    
