Spark SQL referencing attributes of UDT

予麋鹿 2020-12-03 23:29

I am trying to implement a custom UDT and be able to reference it from Spark SQL (as explained in the Spark SQL whitepaper, section 4.4.2).

The real example is to h

1 Answer
  • 2020-12-03 23:48

    You get these errors because the schema defined by sqlType is never exposed and is not intended to be accessed directly. It simply provides a way to express complex data types using native Spark SQL types.
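
    For context, here is a minimal sketch of what such a UDT can look like with the Spark 1.x API (the MockPerson class and its fields are assumptions reconstructed from the accessors used below, and the exact serialize / deserialize contracts vary between Spark versions):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    @SQLUserDefinedType(udt = classOf[MockPersonUDT])
    class MockPerson(val firstName: String, val lastName: String, val age: Int)
        extends Serializable {
      def getFirstName: String = firstName
      def getLastName: String = lastName
      def getAge: Int = age
    }

    class MockPersonUDT extends UserDefinedType[MockPerson] {
      // This schema is internal only -- it never shows up on the DataFrame:
      override def sqlType: DataType = StructType(Seq(
        StructField("firstName", StringType),
        StructField("lastName", StringType),
        StructField("age", IntegerType)))

      override def serialize(obj: Any): Any = obj match {
        case p: MockPerson => Row(p.getFirstName, p.getLastName, p.getAge)
      }

      override def deserialize(datum: Any): MockPerson = datum match {
        case Row(firstName: String, lastName: String, age: Int) =>
          new MockPerson(firstName, lastName, age)
      }

      override def userClass: Class[MockPerson] = classOf[MockPerson]
    }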

    You can access individual attributes using UDFs, but first let's show that the internal structure is indeed not exposed:

    dataFrame.printSchema
    // root
    //  |-- person_id: integer (nullable = true)
    //  |-- person: mockperson (nullable = true)
    

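    Trying to reference a nested attribute directly, the way you would with a struct column, fails at analysis time (a sketch; the field name and the exact exception depend on the UDT and the Spark version):

    // The analyzer sees an opaque UDT, not a struct, so field
    // extraction fails with an AnalysisException:
    // dataFrame.select($"person.firstName")
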
    To create UDFs we need functions that take as an argument an object of the type represented by the given UDT:

    import org.apache.spark.sql.functions.udf
    
    val getFirstName = (person: MockPerson) => person.getFirstName
    val getLastName = (person: MockPerson) => person.getLastName
    val getAge = (person: MockPerson) => person.getAge
    

    which can be wrapped using the udf function:

    import sqlContext.implicits._  // required for the $"colName" syntax below

    val getFirstNameUDF = udf(getFirstName)
    val getLastNameUDF = udf(getLastName)
    val getAgeUDF = udf(getAge)
    
    dataFrame.select(
      getFirstNameUDF($"person").alias("first_name"),
      getLastNameUDF($"person").alias("last_name"),
      getAgeUDF($"person").alias("age")
    ).show()
    
    // +----------+---------+---+
    // |first_name|last_name|age|
    // +----------+---------+---+
    // |    First1|    Last1|  1|
    // |    First2|    Last2|  2|
    // +----------+---------+---+
    

    To use these with raw SQL you have to register the functions through SQLContext:

    sqlContext.udf.register("first_name", getFirstName)
    sqlContext.udf.register("last_name", getLastName)
    sqlContext.udf.register("age", getAge)
    
    sqlContext.sql("""
      SELECT first_name(person) AS first_name, last_name(person) AS last_name
      FROM person
      WHERE age(person) < 100""").show
    
    // +----------+---------+
    // |first_name|last_name|
    // +----------+---------+
    // |    First1|    Last1|
    // |    First2|    Last2|
    // +----------+---------+
    

    Unfortunately, it comes with a price tag attached. First of all, every operation requires deserialization. It also substantially limits the ways in which the query can be optimized; in particular, any join operation on one of these fields requires a Cartesian product.

    In practice, if you want to encode a complex structure which contains attributes that can be expressed using built-in types, it is better to use a StructType:

    case class Person(first_name: String, last_name: String, age: Int)
    
    val df = sc.parallelize(
      (1 to 2).map(i => (i, Person(s"First$i", s"Last$i", i)))).toDF("id", "person")
    
    df.printSchema
    
    // root
    //  |-- id: integer (nullable = false)
    //  |-- person: struct (nullable = true)
    //  |    |-- first_name: string (nullable = true)
    //  |    |-- last_name: string (nullable = true)
    //  |    |-- age: integer (nullable = false)
    
    df
      .where($"person.age" < 100)
      .select($"person.first_name", $"person.last_name")
      .show
    
    // +----------+---------+
    // |first_name|last_name|
    // +----------+---------+
    // |    First1|    Last1|
    // |    First2|    Last2|
    // +----------+---------+
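
    The same works from raw SQL, without any UDFs, since struct fields support dot notation natively (a small sketch; the table name people is arbitrary):

    df.registerTempTable("people")

    sqlContext.sql("""
      SELECT person.first_name, person.last_name
      FROM people
      WHERE person.age < 100""").show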
    

    In general, reserve UDTs for actual type extensions, like the built-in VectorUDT, or for things that can benefit from a specific representation, like enumerations.
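
    For illustration, MLlib's Vector is exactly such a type extension: it is backed by the built-in VectorUDT (a minimal sketch, assuming spark.mllib is on the classpath):

    import org.apache.spark.mllib.linalg.Vectors

    val vectors = sc.parallelize(Seq(
      (1, Vectors.dense(1.0, 2.0)),
      (2, Vectors.sparse(2, Array(0), Array(3.0)))
    )).toDF("id", "features")

    vectors.printSchema
    // root
    //  |-- id: integer (nullable = false)
    //  |-- features: vector (nullable = true)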
