Bypass first line of each file in Spark (Scala)

温柔的废话 · 2021-01-06 17:09

I am processing an S3 folder containing csv.gz files in Spark. Each csv.gz file has a header that contains column names.

The way I load the contained data into Spark is with `sc.textFile`, and I want to skip the header line of each file.

1 Answer
  •  逝去的感伤
    2021-01-06 17:50

    You could do something like:

    // Drop the first line (the header) of each partition
    val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))
    

    Because gzip files are not splittable, each input file is loaded into its own partition. Mapping across all partitions and dropping the first line therefore removes the header line from each file.
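    The semantics of `mapPartitions(_.drop(1))` can be sketched without a Spark cluster: below, each `Seq` stands in for one partition's iterator (assuming, as above, one gzipped file per partition), and the file names and CSV contents are made up for illustration.

    ```scala
    object DropHeaderDemo {
      def main(args: Array[String]): Unit = {
        // Simulate two csv.gz files, each loaded as its own partition.
        val partitions = Seq(
          Seq("id,name", "1,alice", "2,bob"),   // file1.csv.gz
          Seq("id,name", "3,carol")             // file2.csv.gz
        )
        // Equivalent of rdd.mapPartitions(_.drop(1)):
        // drop the first element (the header) of every partition's iterator.
        val withoutHeaders = partitions.flatMap(_.iterator.drop(1))
        println(withoutHeaders.mkString("\n"))
        // 1,alice
        // 2,bob
        // 3,carol
      }
    }
    ```

    Note this assumption is load-bearing: if Spark ever splits a file across partitions (e.g. uncompressed or splittable-codec input), `drop(1)` would discard a data row from partitions that do not start at a file boundary.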
