How do I compare each column in a table using a DataFrame in Scala?

独厮守ぢ 2021-01-27 05:59

There are two tables: Table 1 holds pairs of IDs, and Table 2 holds the attributes for each ID.

Table 1

Table 2

If the IDs in the same row of Table 1 have the same

2 Answers
  • 2021-01-27 06:31

    This should do the trick

    import org.apache.spark.sql.functions.when
    import org.apache.spark.sql.types.IntegerType
    import spark.implicits._
    
    val t1 = List(
      ("id1","id2"),
      ("id1","id3"),
      ("id2","id3")
    ).toDF("id_x", "id_y")
    
    val t2 = List(
      ("id1","blue","m"),
      ("id2","red","s"),
      ("id3","blue","s")
    ).toDF("id", "color", "size")
    
    t1
      .join(t2.as("x"), $"id_x" === $"x.id", "inner")
      .join(t2.as("y"), $"id_y" === $"y.id", "inner")
      .select(
        $"id_x",
        $"id_y",
        // 1 when both IDs share the attribute value, otherwise 0
        when($"x.color" === $"y.color", 1).otherwise(0).cast(IntegerType).alias("color"),
        when($"x.size" === $"y.size", 1).otherwise(0).cast(IntegerType).alias("size")
      )
      .show()
    

    Resulting in:

    +----+----+-----+----+
    |id_x|id_y|color|size|
    +----+----+-----+----+
    | id1| id2|    0|   0|
    | id1| id3|    1|   0|
    | id2| id3|    0|   1|
    +----+----+-----+----+
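
    If Table 2 has more attribute columns than color and size, the same double join can be driven by the column list instead of one hand-written when per attribute. A minimal sketch, assuming the pair columns are named id_x and id_y as above and the key column defaults to id; compareAttributes is a hypothetical helper name, not part of Spark:

    ```scala
    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{col, when}

    // Hypothetical helper: joins the attribute table once per ID column and
    // emits one 0/1 flag for every non-key column of the attribute table.
    def compareAttributes(pairs: DataFrame, attrs: DataFrame, key: String = "id"): DataFrame = {
      // Every column except the key is treated as an attribute to compare.
      val names: Seq[String] = attrs.columns.toSeq.filterNot(_ == key)

      // 1 when both sides hold the same value, otherwise 0 (also 0 if either is null).
      val flags: Seq[Column] = names.map { n =>
        when(col(s"x.$n") === col(s"y.$n"), 1).otherwise(0).alias(n)
      }

      pairs
        .join(attrs.as("x"), col("id_x") === col(s"x.$key"), "inner")
        .join(attrs.as("y"), col("id_y") === col(s"y.$key"), "inner")
        .select((Seq(col("id_x"), col("id_y")) ++ flags): _*)
    }
    ```

    With the sample data, compareAttributes(t1, t2).show() should give the same color and size flags as the table above, although row order is not guaranteed. Because === yields null when either side is null, such pairs get 0; Column.eqNullSafe could be swapped in if two nulls should count as equal.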
    
  • 2021-01-27 06:38

    Here is how you can do it with a UDF, which should make the logic easier to follow. However, the repeated code can be reduced, and doing so may also improve performance.

    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._
    
    val df1 = spark.sparkContext.parallelize(Seq(
        ("id1", "id2"),
        ("id1","id3"),
        ("id2","id3")
      )).toDF("idA", "idB")
    
    val df2 = spark.sparkContext.parallelize(Seq(
      ("id1", "blue", "m"),
      ("id2", "red", "s"),
      ("id3", "blue", "s")
    )).toDF("id", "color", "size")
    
    val firstJoin = df1.join(df2, df1("idA") === df2("id"), "inner")
      .withColumnRenamed("color", "colorA")
      .withColumnRenamed("size", "sizeA")
      .withColumnRenamed("id", "idx")
    
    val secondJoin = firstJoin.join(df2, firstJoin("idB") === df2("id"), "inner")
    
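    // Returns 1 when both values match ignoring case, 0 otherwise.
    // The null check avoids a NullPointerException when v1 is missing.
    val check = udf((v1: String, v2: String) => {
      if (v1 != null && v1.equalsIgnoreCase(v2)) 1 else 0
    })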
    
    val result = secondJoin
      .withColumn("color", check(col("colorA"), col("color")))
      .withColumn("size", check(col("sizeA"), col("size")))
    
    val finalResult = result.select("idA", "idB", "color", "size")
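
    The repetition mentioned above can be folded into a loop, and the UDF can be swapped for built-in column functions, which Spark's optimizer can inspect (a UDF is opaque to it). This is a sketch under assumptions: flagMatches and sameIgnoreCase are hypothetical helper names, the code relies on the "A" suffix naming used in firstJoin, and comparing lower(a) with lower(b) only approximates equalsIgnoreCase for unusual Unicode case mappings:

    ```scala
    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{col, lower, when}

    // Case-insensitive match as a 0/1 column; a null on either side gives 0.
    def sameIgnoreCase(a: Column, b: Column): Column =
      when(lower(a) === lower(b), 1).otherwise(0)

    // Hypothetical helper: for each attribute name n, overwrite column n with the
    // comparison of column n + "A" (first ID) against column n (second ID),
    // then keep only the two ID columns and the flags.
    def flagMatches(joined: DataFrame, attrs: Seq[String]): DataFrame =
      attrs
        .foldLeft(joined) { (df, n) =>
          df.withColumn(n, sameIgnoreCase(col(n + "A"), col(n)))
        }
        .select((Seq("idA", "idB") ++ attrs).map(n => col(n)): _*)
    ```

    Under those assumptions, flagMatches(secondJoin, Seq("color", "size")) would stand in for the result and finalResult steps above.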
    

    Hope this helps!
