Question
I am new to Spark. I want to remove the header and the last lines from a CSV file:
Notes xyz
"id","member_id"
"60045257","63989975",
"60981766","65023535",
Total amount:4444228900
Total amount: 133826689
I want to remove the lines Notes xyz, Total amount:4444228900 and Total amount: 133826689 from the file. I have already removed the first line:
val dfRetail = sc.textFile("file:////home/cloudera/Projects/Project3/test/test_3.csv");
var header=dfRetail.first();
var final_data=dfRetail.filter(row => row!=header);
How do I remove the last lines?
Answer 1:
Use zipWithIndex and then filter by index:
val total = dfRetail.count();
val withoutFooter = dfRetail.zipWithIndex()
.filter(x => x._2 < total - 3)
.map (x => x._1)
zipWithIndex maps each line to a (line, index) pair. The filter then keeps only the pairs whose index is lower than the total number of lines minus 3, which drops the footer. The final map keeps only the first element of each tuple, that is, the line itself. Replace 3 with the number of footer lines in your file; the sample in the question has two.
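The index filter can be tried out without a cluster. The following is a minimal sketch on plain Scala collections rather than an RDD, using the sample lines from the question; the object name FooterFilterDemo, the helper withoutFooter and the parameter footerLines are made up for illustration. On an RDD the index produced by zipWithIndex is a Long rather than an Int, but the filter has the same shape.

```scala
object FooterFilterDemo {
  // Hypothetical helper: keeps every line whose index is below total - footerLines.
  def withoutFooter(lines: Seq[String], footerLines: Int): Seq[String] = {
    val total = lines.size
    lines.zipWithIndex
      .filter { case (_, index) => index < total - footerLines } // drop the tail
      .map { case (line, _) => line }                            // keep only the text
  }

  def main(args: Array[String]): Unit = {
    // Sample file from the question: 1 note line, 1 header, 2 data rows, 2 footer lines.
    val sample = Seq(
      "Notes xyz",
      "\"id\",\"member_id\"",
      "\"60045257\",\"63989975\",",
      "\"60981766\",\"65023535\",",
      "Total amount:4444228900",
      "Total amount: 133826689"
    )
    // Prints the first four lines; both "Total amount" lines are gone.
    withoutFooter(sample, footerLines = 2).foreach(println)
  }
}
```

Passing the footer length as a parameter keeps the filter from silently dropping a data row when the number of trailing lines differs from file to file.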
You can also use mapPartitionsWithIndex:
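// Partition indices start at 0, so the last partition is numPartitions - 1.
val lastPartition = dfRetail.partitions.length - 1
val withoutFooter = dfRetail.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == lastPartition) {
    // Buffer the partition first: calling size on an Iterator consumes it.
    // 3 is the number of footer lines to drop; adjust it to your file.
    iter.toArray.dropRight(3).iterator
  }
  else iter
}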
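It works the same way, but may be faster, because it avoids the extra passes over the data made by count() and zipWithIndex. Note that it assumes all footer lines sit in the last partition.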
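One pitfall with the mapPartitionsWithIndex approach is that a partition is handed over as a Scala Iterator, and asking an Iterator for its size walks it to the end, so nothing is left to take afterwards. The sketch below shows this on a plain Scala iterator, with no Spark involved; the object name IteratorSizeDemo and the helper dropFooter are hypothetical names used only for illustration.

```scala
object IteratorSizeDemo {
  // Hypothetical helper: buffers the iterator, then drops its last n elements.
  def dropFooter(iter: Iterator[String], n: Int): Iterator[String] =
    iter.toVector.dropRight(n).iterator

  def main(args: Array[String]): Unit = {
    // Stand-in for the contents of the last partition.
    val partition = Vector(
      "\"60981766\",\"65023535\",",
      "Total amount:4444228900",
      "Total amount: 133826689"
    )

    val consumed = partition.iterator
    val size = consumed.size // counting walks the iterator to its end
    println(s"size = $size, hasNext afterwards = ${consumed.hasNext}")

    // Buffering first keeps the data available for trimming.
    dropFooter(partition.iterator, 2).foreach(println)
  }
}
```

Buffering one partition in memory is usually acceptable here because only the last partition is materialized; every other partition is passed through untouched.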
Source: https://stackoverflow.com/questions/42763806/spark-how-to-remove-last-line-in-a-csv-file