Join two DataFrames without a common column in Spark (Scala)

Submitted by 纵然是瞬间 on 2019-12-01 13:49:11
Shankar Koirala

monotonically_increasing_id() is increasing and unique, but not consecutive, so its values cannot be relied on to line up row-for-row across two DataFrames.

Instead, you can use zipWithIndex: convert each DataFrame to an RDD, append the index to every row, and rebuild each DataFrame with its original schema plus an index column.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}


val df1 = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")


val df2 = Seq(
  ("watch", 1),
  ("cattoy", 2)
).toDF("Order_name", "Order_ID")

val df11 = spark.createDataFrame(
  df1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(df1.schema.fields :+ StructField("index", LongType, false))
)


val df22 = spark.createDataFrame(
  df2.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(df2.schema.fields :+ StructField("index", LongType, false))
)

Now join the two indexed DataFrames on the index column and drop it:

df11.join(df22, Seq("index")).drop("index")

Output:

+-------------+--------------+------------+----------+--------+
|Customer_name|Customer_phone|Customer_age|Order_name|Order_ID|
+-------------+--------------+------------+----------+--------+
|karti        |9685684551    |24          |watch     |1       |
|raja         |8595456552    |22          |cattoy    |2       |
+-------------+--------------+------------+----------+--------+
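The zipWithIndex pairing above can be sketched in plain Scala without Spark; the sample collections below are hypothetical stand-ins mirroring df1 and df2, and the map lookups play the role of the join on the index column:

    // Rows of the two "DataFrames", in order
    val customers = Seq(("karti", "9685684551", 24), ("raja", "8595456552", 22))
    val orders = Seq(("watch", 1), ("cattoy", 2))

    // zipWithIndex tags each row with its position, like the RDD version does
    val byIdxCustomers = customers.zipWithIndex.map(_.swap).toMap
    val byIdxOrders = orders.zipWithIndex.map(_.swap).toMap

    // Joining on the index pairs rows by position
    val joined = byIdxCustomers.keys.toSeq.sorted
      .flatMap(i => byIdxOrders.get(i).map(o => (byIdxCustomers(i), o)))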

Alternatively, add an index column to both DataFrames with monotonically_increasing_id (note that withColumn returns a new DataFrame, so the result must be assigned; the camel-case monotonicallyIncreasingId is deprecated):

val df1WithId = df1.withColumn("id1", monotonically_increasing_id())
val df2WithId = df2.withColumn("id2", monotonically_increasing_id())

Then join the two DataFrames on the generated ids and drop the index columns:

df1WithId.join(df2WithId, col("id1") === col("id2"), "inner")
  .drop("id1", "id2")

Be aware that this only pairs rows correctly when both DataFrames produce the same ids, which requires identical partitioning, because the generated id depends on the partition a row lives in.
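As a hedged illustration of that caveat, Spark documents that monotonically_increasing_id puts the partition index in the upper bits and the in-partition record number in the lower 33 bits. The helper below is not a Spark API, just plain Scala reproducing that layout, showing how the same rows get different ids under different partitioning:

    // Illustrative only: mimics the documented id layout of
    // monotonically_increasing_id (partition index << 33 | row number)
    def monoId(partition: Int, rowInPartition: Int): Long =
      (partition.toLong << 33) | rowInPartition

    // Two rows in one partition: ids are 0 and 1
    val idsSinglePartition = Seq(monoId(0, 0), monoId(0, 1))

    // The same two rows split across two partitions: ids jump to 0 and 2^33
    val idsTwoPartitions = Seq(monoId(0, 0), monoId(1, 0))

So a join on these ids silently mismatches (or drops) rows unless both DataFrames were built with the same partitioning; the zipWithIndex approach above avoids this.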