How to read a nested collection in Spark


I have a parquet table with one of the columns being

, array<struct<col1, col2, ..., colN>>

I can run queries against this table in Hive using LATERAL VIEW syntax. How can I read this table into an RDD, and more importantly, how can I filter, map, etc. this nested collection in Spark?

4 Answers
  • 2021-01-31 11:15

    There is no magic in the case of nested collections. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way.

    Reading such a nested collection from Parquet files can be tricky, though.

    Let's take an example from the spark-shell (1.3.1):

    scala> import sqlContext.implicits._
    import sqlContext.implicits._
    
    scala> case class Inner(a: String, b: String)
    defined class Inner
    
    scala> case class Outer(key: String, inners: Seq[Inner])
    defined class Outer
    

    Write the parquet file:

    scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
    outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
    
    scala> outers.toDF.saveAsParquetFile("outers.parquet")
    

    Read the parquet file:

    scala> import org.apache.spark.sql.catalyst.expressions.Row
    import org.apache.spark.sql.catalyst.expressions.Row
    
    scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
    dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]   
    
    scala> val outers = dataFrame.map { row =>
         |   val key = row.getString(0)
         |   val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
         |   Outer(key, inners)
         | }
    outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
    

    The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
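
    For illustration, later Spark versions (roughly 1.5+) expose a couple of alternatives on the Row trait; a minimal sketch, assuming the same schema as above:

    // Alternatives to index-based getAs in later Spark versions (a sketch):
    // getSeq avoids spelling out the full type, and getAs also takes a field name.
    val innersBySeq  = row.getSeq[Row](1).map(r => Inner(r.getString(0), r.getString(1)))
    val innersByName = row.getAs[Seq[Row]]("inners").map(r => Inner(r.getString(0), r.getString(1)))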

    Now that you have an RDD[Outer], you can apply any transformation or action you want.

    // Filter the outers
    outers.filter(_.inners.nonEmpty)
    
    // Filter the inners
    outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
    

    Note that we used the Spark SQL library only to read the parquet file. You could, for example, select only the wanted columns directly on the DataFrame before mapping it to an RDD; a fuller sketch follows the one-liner below.

    dataFrame.select('col1, 'col2).map { row => ... }
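
    A fuller, hedged version of that one-liner, using the key and inners columns from the example above:

    // Select only the columns we need before dropping down to an RDD.
    // 'key and 'inners are the columns of the Outer schema defined above.
    val keyToInnerCount = dataFrame.select('key, 'inners).map { row =>
      (row.getString(0), row.getAs[Seq[Row]](1).size)
    }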
    
  • 2021-01-31 11:25

    Another approach would be to use pattern matching, like this:

    val rdd: RDD[(String, List[(String, String)])] = dataFrame.map(_.toSeq.toList match {
      case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
        case List(a: String, b: String) => (a, b)
      }).toList
    })
    

    You can pattern match directly on Row, but it is likely to fail for a few reasons; for one, the element type of the nested Seq is erased at runtime, so the compiler can only emit an unchecked warning instead of verifying the match.
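
    For illustration, a minimal sketch of matching on Row directly, assuming the key/inners schema from the first answer; it compiles, but only with unchecked warnings:

    // Sketch: match on Row directly. Seq[_] sidesteps the erased element type;
    // collect then matches each element against the Row extractor.
    val pairs = dataFrame.map {
      case Row(key: String, inners: Seq[_]) =>
        key -> inners.collect { case Row(a: String, b: String) => (a, b) }.toList
    }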

  • 2021-01-31 11:27

    The answers above are all great and tackle this question from different sides; Spark SQL is also quite a useful way to access nested data.

    Here's an example of how to use explode() directly in SQL to query a nested collection.

    SELECT hholdid, tsp.person_seq_no 
    FROM (  SELECT hholdid, explode(tsp_ids) as tsp 
            FROM disc_mrt.unified_fact uf
         )
    

    tsp_ids is a nested collection of structs with many attributes, including person_seq_no, which I'm selecting in the outer query above.

    The above was tested in Spark 2.0. I did a small test and it doesn't work in Spark 1.6. This question was asked when Spark 2 wasn't around, so this answer adds nicely to the list of available options for dealing with nested structures.

    Also have a look at the following JIRA for a Hive-compatible way to query nested data using the LATERAL VIEW OUTER syntax; since Spark 2.2, OUTER explode is also supported (e.g. when a nested collection is empty but you still want the attributes from the parent record). A sketch of the syntax follows the link:

    • SPARK-13721: Add support for LATERAL VIEW OUTER explode()
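
    A minimal sketch of that syntax, assuming a Spark 2.2+ session named spark and the same disc_mrt.unified_fact table as above:

    // Hive-style LATERAL VIEW OUTER: parent rows are kept even when
    // tsp_ids is empty or NULL (OUTER requires Spark 2.2+).
    spark.sql("""
      SELECT hholdid, tsp.person_seq_no
      FROM disc_mrt.unified_fact uf
      LATERAL VIEW OUTER explode(uf.tsp_ids) t AS tsp
    """).show()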

    A notable unresolved JIRA on explode() for SQL access:

    • SPARK-7549: Support aggregating over nested fields
  • 2021-01-31 11:29

    I'll give a Python-based answer, since that's what I'm using. I think Scala has something similar; a sketch of the Scala equivalent appears at the end of this answer.

    The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.

    Create a test dataframe:

    from pyspark.sql import Row
    
    df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
    df.show()
    
    ## +-+--------------------+
    ## |a|             intlist|
    ## +-+--------------------+
    ## |1|ArrayBuffer(1, 2, 3)|
    ## |2|ArrayBuffer(4, 5, 6)|
    ## +-+--------------------+
    

    Use explode to flatten the list column:

    from pyspark.sql.functions import explode
    
    df.select(df.a, explode(df.intlist)).show()
    
    ## +-+---+
    ## |a|_c0|
    ## +-+---+
    ## |1|  1|
    ## |1|  2|
    ## |1|  3|
    ## |2|  4|
    ## |2|  5|
    ## |2|  6|
    ## +-+---+
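
    As mentioned at the top of this answer, Scala has an equivalent; a minimal sketch (Spark 1.4+, assuming a sqlContext as in the first answer):

    import org.apache.spark.sql.functions.explode
    import sqlContext.implicits._

    // Build the same test DataFrame and flatten the array column with explode.
    val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5, 6))).toDF("a", "intlist")
    df.select($"a", explode($"intlist")).show()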
    