Dealing with a large gzipped file in Spark

后端未结

关注

 3  1005

I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlar

相关标签:

3条回答

萌比男神i

2020-12-05 20:17
Spark can parallelize reading a single gzip file.

The best you can do split it in chunks that are gzipped.

However, Spark is really slow at reading gzip files. You can do this to speed it up:
```
file_names_rdd = sc.parallelize(list_of_files, 100)
lines_rdd = file_names_rdd.flatMap(lambda _: gzip.open(_).readlines())
```
Going through Python is twice has fast as reading the native Spark gzip reader.
0 讨论(0)
发布评论:

提交评论
- 加载中...
臣服心动

2020-12-05 20:26
I have faced this problem and here is the solution.

Best way to approach this problem is to unzip the .gz file before our Spark batch run. Then use this unzip file, after that we can use Spark parallelism.

Code to unzip the .gz file.
```
import gzip
import shutil
with open('file.txt.gz', 'rb') as f_in, gzip.open('file.txt', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
情书的邮戳

2020-12-05 20:35

If the file format is not splittable, then there's no way to avoid reading the file in its entirety on one core. In order to parallelize work, you have to know how to assign chunks of work to different computers. In the gzip case, suppose you divide it up into 128M chunks. The nth chunk depends on the n-1-th chunk's position information to know how to decompress, which depends on the n-2-nd chunk, and so on down to the first.

If you want to parallelize, you need to make this file splittable. One way is to unzip it and process it uncompressed, or you can unzip it, split it into several files (one file for each parallel task you want), and gzip each file.

0 讨论(0)
发布评论:

提交评论
- 加载中...