Defining DataFrame Schema for a table with 1500 columns in Spark

轻奢々 2021-01-23 02:29

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into Oracle.

3 Answers
  • 2021-01-23 02:44

    The options for reading a table with 1500 columns:

    1) Using a case class

    A case class would not work because it is limited to 22 fields (for Scala versions < 2.11).

    2) Using StructType

    You can use StructType to define the schema and create the DataFrame.

    3) Using the spark-csv package

    You can use the spark-csv package (the built-in csv source in Spark 2.x). With it, you can use .option("inferSchema", "true"), which will automatically infer the schema from the file.
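
    A minimal sketch of options 2 and 3, assuming Spark 2.x and a hypothetical CSV export of the table; the column list and file path are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().appName("wide-table").getOrCreate()

    // Option 2: build the wide schema programmatically instead of hand-writing 1500 fields.
    // columnNames stands in for the real column list (e.g. read from table metadata).
    val columnNames: Seq[String] = (1 to 1500).map(i => s"col_$i")
    val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

    val dfWithSchema = spark.read
      .schema(schema)
      .option("header", "true")
      .csv("/path/to/export.csv")

    // Option 3: let Spark infer the column types from the file instead
    val dfInferred = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/export.csv")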

  • 2021-01-23 02:51

    You can keep your schema, with its hundreds of columns, in JSON format, and then read this JSON file to construct your custom schema.

    For example, your schema JSON might be:

    [
        {
            "columnType": "VARCHAR",
            "columnName": "NAME",
            "nullable": true
        },
        {
            "columnType": "VARCHAR",
            "columnName": "AGE",
            "nullable": true
        },
        .
        .
        .
    ]
    

    Now you can read the JSON and parse it into a case class to build the StructType.

    // the field names must match the keys in the JSON schema so json4s can extract them
    case class Field(columnName: String, columnType: String, nullable: Boolean)
    

    You can create a Map from the column-type strings used in the JSON schema (the source database types) to the corresponding Spark DataTypes.

    import scala.io.Source
    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse
    import org.apache.spark.sql.types._

    // map the type strings used in the JSON schema to Spark DataTypes
    val dataType = Map(
       "VARCHAR" -> StringType,
       "NUMERIC" -> LongType,
       "TIMESTAMP" -> TimestampType
       // ... remaining type mappings
    )

    def parseJsonForSchema(jsonFilePath: String): StructType = {
       implicit val formats: Formats = DefaultFormats
       val jsonString = Source.fromFile(jsonFilePath).mkString
       val parsedJson = parse(jsonString)
       // the schema file is a JSON array, so extract a List[Field]
       val fields = parsedJson.extract[List[Field]]
       val schemaColumns = fields.map(field =>
          StructField(field.columnName, dataType(field.columnType), field.nullable))
       StructType(schemaColumns)
    }
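
    A usage sketch (not from the original answer), assuming Spark 2.4+ (for StructType.toDDL), an existing SparkSession named spark, and placeholder connection details; the JDBC source accepts a custom read schema as a DDL string via the customSchema option (Spark 2.3+):

    val schema = parseJsonForSchema("/path/to/schema.json") // placeholder path

    // Read the SQL Server table over JDBC, applying the parsed schema;
    // "customSchema" expects a DDL string, which StructType.toDDL produces.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb") // placeholder URL
      .option("dbtable", "dbo.MY_WIDE_TABLE")                        // placeholder table name
      .option("user", "...")
      .option("password", "...")
      .option("customSchema", schema.toDDL)
      .load()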
    
  • 2021-01-23 02:55

    For this type of requirement, I'd suggest the case class / Product approach to prepare a DataFrame.

    Yes, there are some limitations, such as product arity, but they can be overcome. For Scala versions below 2.11 you can do something like the following:

    Prepare a class that extends Product and overrides these methods:

    • productArity(): Int: returns the number of attributes. In our case, it's 33 (see the sketch after this list).

    • productElement(n: Int): Any: given an index, returns the attribute at that position. As protection, there is also a default case, which throws an IndexOutOfBoundsException.

    • canEqual(that: Any): Boolean: the last of the three methods; it serves as a boundary condition when an equality check is done against the class.

    • For an example implementation, you can refer to a Student case class which has 33 fields in it.
    • An example student dataset description is available alongside it.
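
    A shortened sketch of this pattern (not the linked Student class), showing only 3 of the 33 fields for brevity; a plain class is used because case classes below Scala 2.11 cannot exceed 22 fields:

    class Student(val id: Long, val name: String, val age: Int)
        extends Product with Serializable {

      // productArity: the number of attributes (33 in the real class, 3 here)
      override def productArity: Int = 3

      // productElement: return the attribute at the given index;
      // the default case guards against invalid indices
      override def productElement(n: Int): Any = n match {
        case 0 => id
        case 1 => name
        case 2 => age
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }

      // canEqual: boundary condition for equality checks against this class
      override def canEqual(that: Any): Boolean = that.isInstanceOf[Student]
    }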

    Another option:

    Use StructType to define the schema and create the DataFrame (if you don't want to use the Spark CSV API).
