How to read a nested collection in Spark


I have a parquet table with one of the columns being

array<struct<col1, col2>>

Can run queries against this table in Hive using LATERAL VIEW syntax. How can I read this table into an RDD, and more importantly, how can I filter, map, etc. this nested collection in Spark?

4 Answers

    There is no magic in the case of a nested collection. Spark will handle an RDD[(String, Seq[String])] the same way it handles an RDD[(String, String)].
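
    For instance, the usual RDD transformations apply to both shapes in exactly the same way. A minimal sketch (the values here are illustrative):

    import org.apache.spark.rdd.RDD
    
    val flat: RDD[(String, String)] = sc.parallelize(Seq(("k1", "v1")))
    val nested: RDD[(String, Seq[String])] = sc.parallelize(Seq(("k1", Seq("v1", "v2"))))
    
    // Same API on both; only the element type of the value changes.
    flat.mapValues(_.toUpperCase)
    nested.mapValues(_.map(_.toUpperCase))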

    Reading such nested collection from Parquet files can be tricky, though.

    Let's take an example from the spark-shell (1.3.1):

    scala> import sqlContext.implicits._
    import sqlContext.implicits._
    
    scala> case class Inner(a: String, b: String)
    defined class Inner
    
    scala> case class Outer(key: String, inners: Seq[Inner])
    defined class Outer
    

    Write the parquet file:

    scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
    outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
    
    scala> outers.toDF.saveAsParquetFile("outers.parquet")
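    
    // Note: on Spark 1.4 and later, saveAsParquetFile is deprecated;
    // the equivalent call there is outers.toDF.write.parquet("outers.parquet").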
    

    Read the parquet file:

    scala> import org.apache.spark.sql.catalyst.expressions.Row
    import org.apache.spark.sql.catalyst.expressions.Row
    
    scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
    dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
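    
    // Note: on Spark 1.4 and later, parquetFile is deprecated;
    // the equivalent call there is sqlContext.read.parquet("outers.parquet").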
    
    scala> val outers = dataFrame.map { row =>
         |   val key = row.getString(0)
         |   val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
         |   Outer(key, inners)
         | }
    outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
    

    The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
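
    For instance, the Row trait also exposes a typed getSeq accessor; on versions that have it, the extraction step could equivalently be written as (a sketch against the same schema):

    // Inside the map over rows, an alternative to row.getAs[Seq[Row]](1):
    val inners = row.getSeq[Row](1).map(r => Inner(r.getString(0), r.getString(1)))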

    Now that you have an RDD[Outer], you can apply any transformation or action you want.

    // Filter the outers
    outers.filter(_.inners.nonEmpty)
    
    // Filter the inners
    outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
    

    Note that we used the Spark SQL library only to read the parquet file. You could, for example, select only the columns you want directly on the DataFrame before mapping it to an RDD.

    dataFrame.select('col1, 'col2).map { row => ... }
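
    A complete version of that pattern, as a sketch against this example's schema (the column names key and inners come from the case classes above):

    dataFrame.select('key, 'inners).map { row =>
      val key = row.getString(0)
      val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
      Outer(key, inners)
    }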
    
