Why would I want .union over .unionAll in Spark for SchemaRDDs?

Submitted by 心不动则不痛 on 2019-12-08 15:07:15

Question


I'm trying to wrap my head around these two functions in the Spark SQL documentation:

  • def union(other: RDD[Row]): RDD[Row]

    Return the union of this RDD and another one.

  • def unionAll(otherPlan: SchemaRDD): SchemaRDD

    Combines the tuples of two RDDs with the same schema, keeping duplicates.

This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.

My code below, borrowing from the Spark SQL documentation, has the two functions returning the same results.

scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2), Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])

Why would I prefer one over the other?


Answer 1:


In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in place for backward compatibility (I guess).

In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).
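For example, here is a minimal Spark 2.x sketch (assuming a spark-shell session where spark is the SparkSession; the sample DataFrames are hypothetical):

import spark.implicits._

val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
val df2 = Seq(("Alpha", 1), ("Beta", 2), ("Gamma", 3)).toDF("name", "age")

// In Spark 2.x, union() behaves like SQL's UNION ALL: the duplicate
// ("Alpha", 1) and ("Beta", 2) rows survive, so this prints 5 rows.
df1.union(df2).show()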




Answer 2:


unionAll() was deprecated in Spark 2.0; going forward, union() is the recommended method.

Neither union() nor unionAll() performs SQL-style deduplication. To remove duplicate rows, follow union() with distinct().
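A minimal sketch, assuming two DataFrames df1 and df2 with identical schemas:

// union() keeps duplicates; chaining distinct() removes them,
// which matches the behavior of SQL's UNION.
val deduped = df1.union(df2).distinct()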




Answer 3:


Judging from its type signature and (questionable) semantics, I believe union() was vestigial.

The more modern DataFrame API offers only unionAll().
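For instance, in a Spark 1.3-1.6 shell (a sketch, assuming sqlContext is in scope and using hypothetical sample data):

import sqlContext.implicits._

val df1 = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
val df2 = Seq(("Alpha", 1), ("Gamma", 3)).toDF("name", "age")

// DataFrame offers unionAll() only; like SQL's UNION ALL, it keeps duplicates.
val combined = df1.unionAll(df2)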



Source: https://stackoverflow.com/questions/29022530/why-would-i-want-union-over-unionall-in-spark-for-schemardds
