I am reading a text (not CSV) file that has a header, content, and a footer using
Assuming your text file has a JSON header and footer, here is a Spark SQL way:
Sample Data
The header can be stripped with the following 3 lines (assuming there is no tilde in the data):
jsonToCsvDF=spark.read.format("com.databricks.spark.csv").option("delimiter", "~").load(<Blob Path1/ ADLS Path1>)
jsonToCsvDF.createOrReplaceTempView("json_to_csv")
spark.sql("SELECT SUBSTR(`_c0`,5,length(`_c0`)-5) FROM json_to_csv").coalesce(1).write.option("header",false).mode("overwrite").text(<Blob Path2/ ADLS Path2>)
Now the output will look like,
Hope it helps.
In addition to the above answer, the solution below works well for files with multiple header and footer lines:
val data_delimiter = "|"
val skipHeaderLines = 5
val skipFooterLines = 3
//-- Read file into Dataframe and convert to RDD
val dataframe = spark.read.option("wholeFile", true).option("delimiter", data_delimiter).csv(s"hdfs://$in_data_file")
val rdd = dataframe.rdd
//-- Total number of rows, needed to locate the footer lines
val cnt = rdd.count()
//-- RDD without header and footer: keep only indices between the two skipped ranges
val dfRdd = rdd.zipWithIndex().filter({case (line, index) => index >= skipHeaderLines && index < (cnt - skipFooterLines)}).map({case (line, index) => line})
//-- Dataframe without header and footer
val df = spark.createDataFrame(dfRdd, dataframe.schema)
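The index arithmetic in that filter is easy to get wrong by one, so here is a minimal plain-Python sketch (no Spark needed) of the intended rule: drop the first `skipHeaderLines` rows and the last `skipFooterLines` rows. `enumerate` stands in for `zipWithIndex`, and the function name `strip_header_footer` and the sample lines are made up for illustration.

```python
from typing import List


def strip_header_footer(lines: List[str], skip_header: int, skip_footer: int) -> List[str]:
    """Drop the first skip_header and the last skip_footer lines."""
    total = len(lines)  # plays the role of rdd.count()
    return [
        line
        for index, line in enumerate(lines)  # enumerate stands in for zipWithIndex
        if skip_header <= index < total - skip_footer
    ]


# Hypothetical sample: 2 header lines, 3 data lines, 1 footer line.
sample = ["H1", "H2", "a|1", "b|2", "c|3", "F1"]
body = strip_header_footer(sample, skip_header=2, skip_footer=1)
print(body)
```

If the header and footer together are longer than the file, the condition is never true and the result is simply empty.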
Hope this is helpful.
Sample data:
This is the footer line
Processing logic:
#load text file
txt = sc.textFile("path_to_above_sample_data_text_file.txt")
#remove header
header = txt.first()
txt = txt.filter(lambda line: line != header)
#remove footer
txt = txt.map(lambda line: line.split("|"))\
.filter(lambda line: len(line)>1)
#convert to dataframe
Output is:
+----+-----+------+
|col1| col2|  col3|
+----+-----+------+
| 100|hello|  asdf|
| 300|   hi|   abc|
| 200|  bye|   xyz|
| 800| ciao|qwerty|
+----+-----+------+
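The filtering steps can be checked without a cluster. Below is a rough plain-Python sketch of the same three steps (drop lines equal to the header, split on `|`, drop anything that yields a single field). The input lines are an assumption: the header line is inferred from the output columns, and the data rows are taken from the output shown.

```python
# Hypothetical input; the header line is inferred from the output columns.
lines = [
    "col1|col2|col3",
    "100|hello|asdf",
    "300|hi|abc",
    "200|bye|xyz",
    "800|ciao|qwerty",
    "This is the footer line",
]

header = lines[0]                                       # like txt.first()
rows = [line for line in lines if line != header]       # remove header
rows = [line.split("|") for line in rows]               # split into fields
rows = [fields for fields in rows if len(fields) > 1]   # footer has no "|"

columns = header.split("|")
print(columns)
print(rows)
```

Two caveats of this approach: a data line identical to the header would also be dropped, and a footer that happens to contain the delimiter would survive the length check.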
Hope this helps!
Assuming the file is not too large, we can use collect to bring the DataFrame to the driver as a list of rows and then access the last element as follows:
last_row = df.collect()[df.count()-1]
Avoid using collect on large datasets.
Alternatively, we can use take to cut off the last row. Note that take also returns a list of rows to the driver, not a DataFrame.
rows = df.take(df.count()-1)
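The two calls select different things: indexing at `count - 1` gives the last row only, while taking `count - 1` rows gives everything before it. A minimal plain-Python sketch, with an ordinary list standing in for the rows on the driver (the values are made up):

```python
# A plain list stands in for the collected rows; the values are made up.
rows = ["header", "row1", "row2", "footer"]
count = len(rows)                 # plays the role of df.count()

last_row = rows[count - 1]        # analogous to df.collect()[df.count() - 1]
all_but_last = rows[:count - 1]   # analogous to df.take(df.count() - 1)

print(last_row)
print(all_but_last)
```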