Reading data from a URL using Spark on the Databricks platform

野趣味 2021-02-09 17:20

I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried spark.read.csv and SparkFiles, but I am still missing some simple point.

2 answers
  • Try this.

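    from pyspark import SparkFiles

    # Download the file so that it is available on every node of the cluster
    url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
    spark.sparkContext.addFile(url)

    # SparkFiles.get returns a local path, so prefix it with file://
    df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)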
    

    Selecting just a few columns from the CSV at your URL:

    >>> df.select("age","workclass","fnlwgt","education").show(10)
    +---+----------------+------+---------+
    |age|       workclass|fnlwgt|education|
    +---+----------------+------+---------+
    | 39|       State-gov| 77516|Bachelors|
    | 50|Self-emp-not-inc| 83311|Bachelors|
    | 38|         Private|215646|  HS-grad|
    | 53|         Private|234721|     11th|
    | 28|         Private|338409|Bachelors|
    | 37|         Private|284582|  Masters|
    | 49|         Private|160187|      9th|
    | 52|Self-emp-not-inc|209642|  HS-grad|
    | 31|         Private| 45781|  Masters|
    | 42|         Private|159449|Bachelors|
    +---+----------------+------+---------+
    

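    SparkFiles.get returns the absolute path of the file on the local filesystem of the driver or worker. Without the file:// prefix, Spark resolves that path against the default filesystem (DBFS on Databricks) rather than the local disk, which is most likely why it was not able to find the file.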
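The steps in this answer can be wrapped in a small helper so the file name does not have to be typed by hand. This is only a sketch: file_name_from_url and read_csv_from_url are hypothetical names, and it assumes addFile stores the download under the last path segment of the URL, as it does for adult.csv above.

```python
import posixpath
from urllib.parse import unquote, urlparse


def file_name_from_url(url: str) -> str:
    """Return the last path segment of the URL, ignoring query and fragment."""
    return posixpath.basename(unquote(urlparse(url).path))


def read_csv_from_url(spark, url: str):
    """Download `url` with addFile and read the local copy as a CSV DataFrame."""
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark import SparkFiles

    spark.sparkContext.addFile(url)
    # SparkFiles.get returns an absolute local path; the explicit file://
    # scheme keeps Spark from resolving it against the default filesystem.
    local_path = SparkFiles.get(file_name_from_url(url))
    return spark.read.csv("file://" + local_path, header=True, inferSchema=True)
```

In a Databricks notebook, where spark already exists, this would be called as df = read_csv_from_url(spark, url).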

  • 2021-02-09 17:56

    The answer above works, but it can be error-prone: SparkFiles.get sometimes returns null.

    Option 1 is the preferred way of getting a file from any URL or public S3 location.


    Option 1 :

    IOUtils.toString will do the trick; see the Apache Commons IO docs. The jar is already present on any Spark cluster, whether it is Databricks or any other Spark installation.

    Below is the Scala way of doing this. I have used a raw GitHub CSV file for this example; change the URL to suit your requirements.

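    import java.net.URL
    import org.apache.commons.io.IOUtils // already on the Spark cluster classpath
    import spark.implicits._ // needed for toDS(); pre-imported in notebooks and spark-shell

    val urlfile = new URL("https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv")
    // Fetch the file on the driver and turn its lines into a Dataset[String].
    // linesIterator avoids the clash with java.lang.String.lines on JDK 11+.
    val testcsvgit = IOUtils.toString(urlfile, "UTF-8").linesIterator.toList.toDS()
    val testcsv = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv(testcsvgit)
    testcsv.show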
    

    Result :

    +-----------+------+----+----+---+-----+
    |Experiment |Virus |Cell| MOI|hpi|Titer|
    +-----------+------+----+----+---+-----+
    |      EXP I| C159S|OFTu| 0.1|  0| 4.75|
    |      EXP I| C159S|OFTu| 0.1|  6| 2.75|
    |      EXP I| C159S|OFTu| 0.1| 12| 2.75|
    |      EXP I| C159S|OFTu| 0.1| 24|  5.0|
    |      EXP I| C159S|OFTu| 0.1| 48|  5.5|
    |      EXP I| C159S|OFTu| 0.1| 72|  7.0|
    |      EXP I| C159S| STU| 0.1|  0| 4.75|
    |      EXP I| C159S| STU| 0.1|  6| 3.75|
    |      EXP I| C159S| STU| 0.1| 12|  4.0|
    |      EXP I| C159S| STU| 0.1| 24| 3.75|
    |      EXP I| C159S| STU| 0.1| 48| 3.25|
    |      EXP I| C159S| STU| 0.1| 72| 3.25|
    |      EXP I| C159S|OFTu|10.0|  0|  6.5|
    |      EXP I| C159S|OFTu|10.0|  6| 4.75|
    |      EXP I| C159S|OFTu|10.0| 12| 4.75|
    |      EXP I| C159S|OFTu|10.0| 24| 6.25|
    |      EXP I| C159S|OFTu|10.0| 48|  6.5|
    |      EXP I| C159S|OFTu|10.0| 72|  7.0|
    |      EXP I| C159S| STU|10.0|  0|  7.0|
    |      EXP I| C159S| STU|10.0|  6| 4.75|
    +-----------+------+----+----+---+-----+
    only showing top 20 rows
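For PySpark users, Option 1 might look roughly like the sketch below. It swaps IOUtils for the standard library's urlopen and relies on spark.read.csv accepting an RDD of strings; text_to_rows and read_csv_via_driver are hypothetical names introduced for illustration.

```python
from typing import List
from urllib.request import urlopen


def text_to_rows(text: str) -> List[str]:
    """Split CSV text into rows, dropping blank lines."""
    return [line for line in text.splitlines() if line.strip()]


def read_csv_via_driver(spark, url: str):
    """Fetch `url` on the driver and parse its rows as a CSV DataFrame."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    # spark.read.csv accepts an RDD of strings as well as a path.
    rows = spark.sparkContext.parallelize(text_to_rows(text))
    return spark.read.csv(rows, header=True, inferSchema=True)
```

As with the Scala version, the whole file passes through driver memory, so this suits small files. Splitting on line breaks also means quoted fields that contain newlines will not parse correctly.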
    

    Option 2 : in Scala

    import org.apache.spark.SparkFiles

    val urlfile = "https://raw.githubusercontent.com/lrjoshi/webpage/master/public/post/c159s.csv"
    spark.sparkContext.addFile(urlfile)

    // SparkFiles.get returns a local path, so prefix it with file://
    val df = spark.read
      .option("inferSchema", true)
      .option("header", true)
      .csv("file://" + SparkFiles.get("c159s.csv"))
    df.show
    

    Result : the same as for Option 1.
