Bypass first line of each file in Spark (Scala)

温柔的废话 · 2021-01-06 17:09

I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names.

The way I load the contained data into Spark is with `sc.textFile`, and I want to skip the header line of each file.

1 Answer
  •  逝去的感伤
    2021-01-06 17:50

    You could do something like:

    // Drop the first line (the header) of each partition
    val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
    

    Because gzip files are not splittable, each input file is loaded into its own partition. Mapping across all partitions and dropping the first line therefore removes the header line from each file.
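    The semantics of `mapPartitions(_.drop(1))` can be sketched without a Spark cluster: below, each `Seq` stands in for one partition's iterator (assuming, as above, one gzipped file per partition), and the file names and CSV contents are made up for illustration.

    ```scala
    object DropHeaderDemo {
      def main(args: Array[String]): Unit = {
        // Simulate two csv.gz files, each loaded as its own partition.
        val partitions = Seq(
          Seq("id,name", "1,alice", "2,bob"),   // file1.csv.gz
          Seq("id,name", "3,carol")             // file2.csv.gz
        )
        // Equivalent of rdd.mapPartitions(_.drop(1)):
        // drop the first element (the header) of every partition's iterator.
        val withoutHeaders = partitions.flatMap(_.iterator.drop(1))
        println(withoutHeaders.mkString("\n"))
        // 1,alice
        // 2,bob
        // 3,carol
      }
    }
    ```

    Note this assumption is load-bearing: if Spark ever splits a file across partitions (e.g. uncompressed or splittable-codec input), `drop(1)` would discard a data row from partitions that do not start at a file boundary.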
