How to Remove header and footer from Dataframe?

小蘑菇 2021-01-24 07:23

I am reading a text (not CSV) file that has a header, content and a footer using:

spark.read.format("text").option("delimiter","|")...load(file)
4 Answers
  • 2021-01-24 08:11

    Assuming your text file has a JSON header and footer, here is a Spark SQL approach.

    Sample Data

    {"":[{<field_name>:<field_value1>},{<field_name>:<field_value2>}]}
    

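    Here the header and footer can be stripped with the following three statements (assuming the data contains no tilde, since "~" is used as the delimiter so that each line is read into the single column _c0):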

    jsonToCsvDF=spark.read.format("com.databricks.spark.csv").option("delimiter", "~").load(<Blob Path1/ ADLS Path1>)
    
    jsonToCsvDF.createOrReplaceTempView("json_to_csv")
    
    spark.sql("SELECT SUBSTR(`_c0`,5,length(`_c0`)-5) FROM json_to_csv").coalesce(1).write.option("header",false).mode("overwrite").text(<Blob Path2/ ADLS Path2>)
    

    The output will then look like this:

    [{<field_name>:<field_value1>},{<field_name>:<field_value2>}]
    

    Hope it helps.
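
The SUBSTR arithmetic in the query above can be checked outside Spark. What follows is a minimal Python sketch, assuming the wrapper is exactly four leading characters (`{"":`) and one trailing character (`}`). The helper `strip_wrapper` and the sample line are hypothetical and only illustrate the offsets.

```python
def strip_wrapper(line: str) -> str:
    """Mirror SUBSTR(_c0, 5, length(_c0) - 5) on a plain Python string."""
    start = 5 - 1           # SQL SUBSTR is 1-based, Python slicing is 0-based
    length = len(line) - 5  # drop 4 leading characters and 1 trailing character
    return line[start:start + length]


# Hypothetical sample line shaped like the answer's sample data
sample = '{"":[{"a":1},{"a":2}]}'
print(strip_wrapper(sample))  # prints [{"a":1},{"a":2}]
```

If the wrapper key is not the empty string, the offsets 5 and length - 5 would have to change accordingly.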

  • 2021-01-24 08:16

    In addition to the above answer, the solution below works well for files with multiple header and footer lines:

    val data_delimiter = "|"
    val skipHeaderLines = 5
    val skipFooterLines = 3
    
    //-- Read file into Dataframe and convert to RDD (in_data_file is assumed to hold the input path)
    val dataframe = spark.read.option("multiLine", true).option("delimiter",data_delimiter).csv(s"hdfs://$in_data_file")
    
    val rdd = dataframe.rdd
    
    //-- Total number of rows, including header and footer lines
    val cnt = rdd.count()
    
    //-- RDD without header and footer: keep indices in [skipHeaderLines, cnt - skipFooterLines)
    val dfRdd = rdd.zipWithIndex().filter({case (line, index) => index >= skipHeaderLines && index < (cnt - skipFooterLines)}).map({case (line, index) => line})
    
    //-- Dataframe without header and footer
    val df = spark.createDataFrame(dfRdd, dataframe.schema)
    

    Hope this is helpful.
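
The index filter in this answer can also be sketched in Python. `strip_header_footer` is a hypothetical helper that shows the intended range check on a plain list. `strip_header_footer_rdd` applies the same check to a PySpark RDD and is an assumption about how the Scala code would translate; it needs a Spark runtime and is not exercised here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def strip_header_footer(rows: Sequence[T], skip_header: int, skip_footer: int) -> List[T]:
    """Keep rows whose index lies in [skip_header, len(rows) - skip_footer)."""
    total = len(rows)
    return [
        row
        for index, row in enumerate(rows)
        if skip_header <= index < total - skip_footer
    ]


def strip_header_footer_rdd(rdd, skip_header: int, skip_footer: int):
    """Same range check on a PySpark RDD (assumed translation of the Scala code)."""
    total = rdd.count()
    return (
        rdd.zipWithIndex()
        .filter(lambda pair: skip_header <= pair[1] < total - skip_footer)
        .map(lambda pair: pair[0])
    )


# Two header lines, two data lines, one footer line
print(strip_header_footer(["h1", "h2", "a", "b", "f1"], 2, 1))  # prints ['a', 'b']
```

Counting the rows first costs an extra pass over the data, but it is what makes the footer boundary known before filtering.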

  • 2021-01-24 08:24

    Sample data:

    col1|col2|col3
    100|hello|asdf
    300|hi|abc
    200|bye|xyz
    800|ciao|qwerty
    This is the footer line
    

    Processing logic:

    #load text file
    txt = sc.textFile("path_to_above_sample_data_text_file.txt")
    
    #remove header
    header = txt.first()
    txt = txt.filter(lambda line: line != header)
    
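    #remove footer: split each line on the delimiter and drop lines that
    #yield a single field (assumes the footer contains no "|")
    txt = txt.map(lambda line: line.split("|"))\
        .filter(lambda line: len(line)>1)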
    
    #convert to dataframe
    df=txt.toDF(header.split("|"))
    df.show()
    

    Output is:

    +----+-----+------+
    |col1| col2|  col3|
    +----+-----+------+
    | 100|hello|  asdf|
    | 300|   hi|   abc|
    | 200|  bye|   xyz|
    | 800| ciao|qwerty|
    +----+-----+------+
    


    Hope this helps!
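
The footer rule in this answer (drop any line that does not split into more than one field) can be made a little stricter. The sketch below is a plain-Python variant, not the answer's own method: it keeps only rows whose field count matches the header. The helper `parse_delimited` is hypothetical, and unlike the RDD code above it treats only the first line as the header.

```python
from typing import List, Tuple


def parse_delimited(lines: List[str], delimiter: str = "|") -> Tuple[List[str], List[List[str]]]:
    """Split lines on the delimiter and keep rows that match the header width."""
    header = lines[0].split(delimiter)
    rows = [line.split(delimiter) for line in lines[1:]]
    # A footer such as "This is the footer line" yields one field and is dropped
    data = [fields for fields in rows if len(fields) == len(header)]
    return header, data


sample_lines = [
    "col1|col2|col3",
    "100|hello|asdf",
    "300|hi|abc",
    "This is the footer line",
]
header, data = parse_delimited(sample_lines)
print(header)
print(data)
```

Matching on the header width also drops a footer that happens to contain the delimiter, as long as it does not split into exactly as many fields as the header.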

  • 2021-01-24 08:27

    Assuming the file is not too large, we can use collect to bring the DataFrame to the driver as a list of rows and then access the last element (the footer) as follows:

    footer = df.collect()[-1]
    

    Avoid using collect on large datasets.

    Alternatively, we can use take to cut off the last row. Note that take returns a list of Row objects, not a DataFrame:

    rows = df.take(df.count() - 1)
    