I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header line containing the column names.
How can I load the contained data into Spark while skipping the header line of each file?
You could do something like:
val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
Because gzip files are not splittable, each input file is loaded into its own partition. Mapping over the partitions and dropping the first line of each therefore removes the header line from every file.
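
For context, here is a fuller sketch of the same approach. The bucket/prefix is a hypothetical placeholder, and the comma split is naive (it does not handle quoted fields containing commas):

    import org.apache.spark.{SparkConf, SparkContext}

    object DropCsvHeaders {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("drop-csv-headers"))

        // Hypothetical path; replace with your actual S3 prefix of csv.gz files.
        val path = "s3://my-bucket/my-s3-path/*.csv.gz"

        val rows = sc.textFile(path)
          // Each gzipped file is non-splittable, so it occupies exactly one
          // partition; dropping the first element of the partition's iterator
          // removes that file's header line.
          .mapPartitions(_.drop(1))
          // Naive CSV split; use a proper CSV parser if fields can be quoted.
          .map(_.split(",", -1))

        rows.take(5).foreach(r => println(r.mkString(" | ")))

        sc.stop()
      }
    }

If you can use the DataFrame API instead of raw RDDs, spark.read.option("header", "true").csv(path) handles the per-file header for you, but the mapPartitions trick is useful when you need to stay at the RDD level.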