Invalid status code '400' from .. error payload: "requirement failed: Session isn't active

[愿得一人] 2021-01-06 23:05

I am running PySpark scripts to write a DataFrame to a CSV file in a Jupyter Notebook, as shown below:

df.coalesce(1).write.csv('Data1.csv', header='true')

2 Answers
  • 2021-01-06 23:48

    I am not well versed in PySpark, but in Scala the solution would involve something like this.

    First we need to create a method for creating a header file:

    import java.io.PrintWriter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
      // Build the full path of the header file
      val fileName = "dfheader.csv"
      val headerFileFullName = "%s/%s".format(headerFilePath, fileName)

      // Write the column names to HDFS as one comma-separated line (no trailing comma)
      val hadoopConfig = new Configuration()
      val fileSystem = FileSystem.get(hadoopConfig)
      val output = fileSystem.create(new Path(headerFileFullName))
      val writer = new PrintWriter(output)
      try {
        writer.write(colNames.mkString(",") + "\n")
      } finally {
        writer.close()
      }
    }
    

    You will also need a method that calls Hadoop to merge the part files written by df.write:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // Note: FileUtil.copyMerge is available in Hadoop 2.x; it was removed in Hadoop 3.0.
    def mergeOutputFiles(sourcePath: String, destLocation: String): Unit = {
      val hadoopConfig = new Configuration()
      val hdfs = FileSystem.get(hadoopConfig)

      // For several source paths, take an Array[String] and loop over it instead.
      // The merged file is named after the last segment of the source path.
      val destPath = "%s/%s".format(destLocation, sourcePath.split("/").last)

      // Merge all files in the source directory into one file and delete the source
      FileUtil.copyMerge(hdfs, new Path(sourcePath), hdfs, new Path(destPath), true, hadoopConfig, null)

      // Delete the temporary folder that held the partitioned files
      hdfs.delete(new Path(sourcePath).getParent, true)
    }
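
For readers coming from the PySpark side of the question, the merge step above can be sketched in plain Python. This is only a minimal sketch, not the answer's method: it assumes the part files sit on a local filesystem rather than HDFS, it writes the header line directly instead of using a separate header file, and `merge_part_files` with its parameters is a hypothetical name introduced for illustration.

```python
import csv
import shutil
from pathlib import Path
from typing import Sequence


def merge_part_files(source_dir: str, dest_file: str, header: Sequence[str]) -> None:
    """Concatenate the part-* files in source_dir into dest_file under one header line.

    Assumption: source_dir is a local directory written by df.write.csv(...)
    without its own header, so only the header passed in here is emitted.
    """
    # Sort by name so the rows keep the part-00000, part-00001, ... order;
    # marker files such as _SUCCESS do not match the pattern and are skipped
    parts = sorted(Path(source_dir).glob("part-*"))

    with open(dest_file, "w", newline="") as out:
        # Write the header once, at the top of the merged file
        csv.writer(out, lineterminator="\n").writerow(header)
        for part in parts:
            with open(part, "r", newline="") as src:
                shutil.copyfileobj(src, out)

    # Remove the temporary directory once the merge is complete
    shutil.rmtree(source_dir)
```

A call such as `merge_part_files("temp/Data1.csv", "Data1.csv", df.columns)` would mirror the Scala flow, with both paths being placeholders.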
    

    Here is a method that generates the output files. It wraps the df.write call that writes your large DataFrame out to HDFS:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def generateOutputFiles(processedDf: DataFrame, opPath: String, tempOutputFolder: String,
                            spark: SparkSession): String = {
      // Temporary output directory: <opPath><tempOutputFolder>NameofyourCsvFile.csv
      val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)

      // Write the DataFrame as CSV part files into that directory
      processedDf.write.mode("overwrite").csv(fileName)

      // Put the header file next to the part files so it is merged with them
      createHeaderFile(fileName, processedDf.columns)

      // If the output must be split into several files based on some parameter,
      // change the return type to Array[String], collect each fileName in a loop,
      // and return the array instead.
      fileName
    }
    

    With all the methods defined, here is how you can use them:

    def processyourlogic(/* your parameters, if any */): DataFrame = {
      // your logic to do whatever needs to be done to your data
      ???
    }
    

    Assuming the above method returns a dataframe, here is how you can put everything together:

    val yourbigDf = processyourlogic(/* your parameters */) // returns a DataFrame
    yourbigDf.cache() // cache it in case the DataFrame is reused
    val outputPathFinal = " location where you want your file to be saved"
    val tempOutputFolderLocation = "temp/"
    val partFiles = generateOutputFiles(yourbigDf, outputPathFinal, tempOutputFolderLocation, spark)
    mergeOutputFiles(partFiles, outputPathFinal)
    

    Let me know if you have any other questions about this. If the answer you are looking for is different, the original question should be rephrased.

  • 2021-01-07 00:02

    Judging by the output, if your application is not finishing with a FAILED status, this looks like a Livy timeout error. Your application is probably running longer than the timeout defined for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook receives this error.

    If that's the case, here's how to address it:

    1. edit the /etc/livy/conf/livy.conf file (in the cluster's master node)
    2. set the livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
    3. restart Livy to update the setting: sudo restart livy-server in the cluster's master
    4. test your code again
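
Step 2 can also be scripted. The following is a minimal Python sketch that rewrites the setting in the text of livy.conf; it assumes the file uses `key = value` lines, and `set_livy_session_timeout` is a hypothetical helper name, not part of Livy. Reading the file, writing it back, and restarting Livy are left out.

```python
import re


def set_livy_session_timeout(conf_text: str, timeout: str) -> str:
    """Return conf_text with livy.server.session.timeout set to `timeout`.

    Assumption: the config uses `key = value` lines, with `#` for comments.
    """
    key = "livy.server.session.timeout"
    new_line = f"{key} = {timeout}"

    # Match a line for exactly this key, whether active or commented out.
    # Requiring "=" right after the key keeps longer keys that merely start
    # with the same prefix from matching.
    pattern = re.compile(
        rf"^[ \t]*#?[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE
    )
    if pattern.search(conf_text):
        # Replace only the first matching line
        return pattern.sub(lambda _match: new_line, conf_text, count=1)

    # Key not present: append it on its own line
    separator = "" if conf_text == "" or conf_text.endswith("\n") else "\n"
    return f"{conf_text}{separator}{new_line}\n"
```

For example, passing the contents of /etc/livy/conf/livy.conf together with `"8h"` returns the text to write back before restarting Livy.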