How to construct a DataFrame from an Excel (xls, xlsx) file in Scala Spark?

自闭症患者 2020-11-28 08:13

I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined to other DataFrames later.

5 Answers
  • 2020-11-28 08:41

    Here are read and write examples showing how to read from and write to Excel with the full set of options.

    Source: spark-excel from crealytics

    Scala API Spark 2.0+:

    Create a DataFrame from an Excel file

        import org.apache.spark.sql._

        val spark: SparkSession = ???
        val df = spark.read
          .format("com.crealytics.spark.excel")
          .option("sheetName", "Daily") // Required
          .option("useHeader", "true") // Required
          .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
          .option("inferSchema", "false") // Optional, default: false
          .option("addColorColumns", "true") // Optional, default: false
          .option("startColumn", 0) // Optional, default: 0
          .option("endColumn", 99) // Optional, default: Int.MaxValue
          .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
          .option("maxRowsInMemory", 20) // Optional, default: None. If set, uses a streaming reader which can help with big files
          .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
          .schema(myCustomSchema) // Optional, default: either inferred schema, or all columns as Strings
          .load("Worktime.xlsx")
    

    Write a DataFrame to an Excel file

        df.write
          .format("com.crealytics.spark.excel")
          .option("sheetName", "Daily")
          .option("useHeader", "true")
          .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
          .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
          .mode("overwrite")
          .save("Worktime2.xlsx")
    
    

    Note: instead of sheet1 or sheet2, you can use the sheet's actual name as well; in the example given above, Daily is the sheet name.

    • If you want to use it from the spark shell...

    This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

        $SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.13.1
    
    
    • The dependency needs to be added (for Maven etc.); note that the _2.11 suffix must match the Scala version of your Spark build:
    groupId: com.crealytics
    artifactId: spark-excel_2.11
    version: 0.13.1
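
    For sbt, the equivalent coordinate would be (%% appends your project's Scala suffix automatically):

        libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.1"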
    

    Further reading: see my article (How to do Simple reporting with Excel sheets using Apache Spark, Scala ?) on how to write into an Excel file after aggregations, across many Excel sheets.
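
    As a sketch of that pattern, using the write options shown above (region and sales are hypothetical column names):

        import org.apache.spark.sql.functions.sum

        // Aggregate first, then write the summary to its own sheet
        val summary = df.groupBy("region").agg(sum("sales").as("total_sales"))

        summary.write
          .format("com.crealytics.spark.excel")
          .option("sheetName", "Summary")
          .option("useHeader", "true")
          .mode("overwrite")
          .save("Summary.xlsx")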

    Tip: this approach is particularly useful for writing Maven test cases, where you can place Excel sheets with sample data in the src/main/resources folder and access them from your unit test cases (Scala/Java), creating DataFrames out of the Excel sheets, as sketched below.
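
    A minimal sketch of that setup, assuming a hypothetical sample.xlsx under src/main/resources with a sheet named Sheet1:

        // Resolve the test resource from the classpath (sample.xlsx is a placeholder name)
        val path = getClass.getResource("/sample.xlsx").getPath

        val sampleDf = spark.read
          .format("com.crealytics.spark.excel")
          .option("sheetName", "Sheet1")
          .option("useHeader", "true")
          .load(path)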

    • Another option you could consider is spark-hadoopoffice-ds

    A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:

    Excel datasource format: org.zuinnote.spark.office.Excel. Loading and saving of old Excel (.xls) and new Excel (.xlsx). This datasource is available on Spark-packages.org and on Maven Central.
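
    A minimal read sketch, assuming the HadoopOffice Spark 2 datasource is on the classpath (the lowercase format string and the locale option are taken from the HadoopOffice wiki; verify them against your version):

        val hoDf = spark.read
          .format("org.zuinnote.spark.office.excel")
          .option("read.locale.bcp47", "us") // locale used to parse cell values
          .load("hdfs:///user/tester/my.xlsx") // placeholder path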

  • 2020-11-28 08:44

    The solution to your problem is to use the spark-excel dependency in your project.

    spark-excel has flexible options to play with.

    I have tested the following code to read from Excel and convert it to a DataFrame, and it works perfectly:

    import org.apache.spark.sql.DataFrame

    def readExcel(file: String): DataFrame = sqlContext.read
        .format("com.crealytics.spark.excel")
        .option("location", file)
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .load()

    val data = readExcel("path to your excel file")

    data.show(false)
    

    You can pass the sheet name as an option if your Excel file has multiple sheets:

    .option("sheetName", "Sheet2")
    

    I hope it's helpful.

  • 2020-11-28 08:47

    Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, among other features. Of course, Spark is also supported.

  • 2020-11-28 09:05

    I have used the com.crealytics.spark.excel 0.11 jar and wrote this in Spark with Java; it would be the same in Scala, you just need to change JavaSparkContext to SparkContext.

    Dataset<Row> tempTable = new SQLContext(javaSparkContext).read()
        .format("com.crealytics.spark.excel")
        .option("sheetName", "sheet1")
        .option("useHeader", "false") // Required
        .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
        .option("inferSchema", "false") // Optional, default: false
        .option("addColorColumns", "false") // Optional, default: false
        .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
        .schema(schema)
        .load("hdfs://localhost:8020/user/tester/my.xlsx");
    
  • 2020-11-28 09:07

    Hope this helps.

    val df_excel = spark.read.
        format("com.crealytics.spark.excel").
        option("useHeader", "true").
        option("treatEmptyValuesAsNulls", "false").
        option("inferSchema", "false").
        option("addColorColumns", "false").
        load(file_path)

    display(df_excel) // display() is a Databricks notebook function; use df_excel.show() elsewhere
    