Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv

Posted by ↘锁芯ラ on 2021-01-28 08:01:10

Question


I am trying to create a DataFrame with the following code in Spark 2.0. When I run it in Jupyter or from the console, I get the error below. How can I get rid of it?

Error:

Py4JJavaError: An error occurred while calling o34.csv. : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$.error(package.scala:27)

Code:

   from pyspark.sql import SparkSession

   if __name__ == "__main__":
       # Build (or reuse) a local Spark session.
       session = SparkSession.builder.master('local') \
                      .appName("RealEstateSurvey").getOrCreate()
       # Read the CSV with a header row and let Spark infer the column types.
       df = session \
            .read \
            .option("inferSchema", value = True) \
            .option('header','true') \
            .csv("/home/senthiljdpm/RealEstate.csv")

       print("=== Print out schema ===")
       df.printSchema()
       session.stop()

Answer 1:


The error occurs because both CSV data sources (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, which is built into Spark 2.0, and com.databricks.spark.csv.DefaultSource from the external spark-csv package) are on your classpath. Both register under the short name "csv", so Spark cannot tell which one to use.
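
To see which implementations are competing, the class names can be read straight out of the exception text. The following is only a minimal sketch: `conflicting_sources` is a hypothetical helper, and the regular expression assumes the message keeps the wording shown in the question.

```python
import re


def conflicting_sources(message):
    """Return (short_name, [class names]) parsed from a
    'Multiple sources found for ...' error message.

    Returns (None, []) when the message does not match that wording.
    """
    match = re.search(r"Multiple sources found for (\w+) \(([^)]*)\)", message)
    if match is None:
        return None, []
    short_name = match.group(1)
    # The candidates are listed inside the parentheses, separated by commas.
    classes = [name.strip() for name in match.group(2).split(",") if name.strip()]
    return short_name, classes


# Message text copied from the question.
message = (
    "java.lang.RuntimeException: Multiple sources found for csv "
    "(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, "
    "com.databricks.spark.csv.DefaultSource15), "
    "please specify the fully qualified class name."
)

short_name, classes = conflicting_sources(message)
print(short_name)
for cls in classes:
    print(cls)
```

Each class listed is a separate data source registered under the same short name; any one of them can be selected by passing its fully qualified name to `format`.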

All you need to do is tell Spark to use com.databricks.spark.csv.DefaultSource by setting the format explicitly:

  df = session \
       .read \
       .format("com.databricks.spark.csv") \
       .option("inferSchema", value = True) \
       .option('header','true') \
       .csv("/home/senthiljdpm/RealEstate.csv")

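Another alternative, and probably the safer one, is to finish with load instead of csv. As far as I can tell, csv() sets the format back to the short name "csv" internally, which would bring the ambiguity back: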

  df = session \
       .read \
       .format("com.databricks.spark.csv") \
       .option("inferSchema", value = True) \
       .option('header','true') \
       .load("/home/senthiljdpm/RealEstate.csv")
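
Since the error message asks for a fully qualified class name, the built-in reader can be selected the same way. This is a sketch rather than a definitive fix: `read_csv_with_source` and `BUILTIN_CSV` are hypothetical names introduced here, and the class name is the one printed in the error above.

```python
# Fully qualified name of Spark's built-in CSV source, as printed in the error.
BUILTIN_CSV = "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat"


def read_csv_with_source(session, path, source=BUILTIN_CSV):
    """Read a CSV file through an explicitly named data source.

    `session` is expected to be a SparkSession. The chain ends with load()
    so that the format chosen here is the one actually used.
    """
    return (
        session.read
        .format(source)                  # bypass the ambiguous short name "csv"
        .option("header", "true")        # first line holds the column names
        .option("inferSchema", "true")   # let Spark guess the column types
        .load(path)
    )
```

With the session from the question, `df = read_csv_with_source(session, "/home/senthiljdpm/RealEstate.csv")` would read the file through the built-in source. Spark 2.0 ships CSV support itself, so if the spark-csv package is not otherwise needed, removing it from the classpath (for example from `--packages`) should also make the conflict disappear.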



Answer 2:


If anyone runs into a similar issue with Spark in Java, it could be because there are multiple versions of the spark-sql jar on your classpath. Just FYI.
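
One quick way to check for this is to group the jar files in a directory by artifact name and report any artifact that appears more than once. The sketch below is illustrative only: `find_duplicate_artifacts` and `scan_jar_dir` are hypothetical helpers, the file names in the example are made up, and the pattern assumes the common `<artifact>[_<scala version>]-<version>.jar` naming convention.

```python
import os
import re
from collections import defaultdict

# Assumed naming convention: <artifact>[_<scala version>]-<version>.jar
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)(?:_\d+\.\d+)?-\d[\w.\-]*\.jar$")


def find_duplicate_artifacts(jar_names):
    """Map each artifact that occurs in more than one jar to its jar files."""
    jars_by_artifact = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:
            jars_by_artifact[match.group("artifact")].add(name)
    return {
        artifact: sorted(names)
        for artifact, names in jars_by_artifact.items()
        if len(names) > 1
    }


def scan_jar_dir(directory):
    """Run the duplicate check over the jar files found in one directory."""
    return find_duplicate_artifacts(
        name for name in os.listdir(directory) if name.endswith(".jar")
    )


# Made-up listing for illustration.
example = [
    "spark-core_2.11-2.0.0.jar",
    "spark-sql_2.11-2.0.0.jar",
    "spark-sql_2.11-2.1.0.jar",
    "spark-csv_2.11-1.5.0.jar",
]
duplicates = find_duplicate_artifacts(example)
print(duplicates)
```

Here only spark-sql is reported, because it is the only artifact present in two versions. In a Maven or Gradle build, the dependency tree is the more reliable place to look; this check only covers jars that sit together in one directory.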



Source: https://stackoverflow.com/questions/50884599/apache-spark-2-0-pyspark-dataframe-error-multiple-sources-found-for-csv
