Using AWS Glue to convert very big csv.gz files (30-40 GB each) to Parquet

醉梦人生 2021-01-24 07:19

There are lots of such questions but nothing seems to help. I am trying to convert quite large csv.gz files to Parquet and keep getting various errors like

'C


        
2 Answers
  • 2021-01-24 07:51

    How many DPUs are you using? This article gives a nice overview of DPU capacity planning. Hope that helps. There is no definitive rulebook from AWS stating how many DPUs you need to process a particular data size.
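
    As a rough sketch only (the job name, IAM role, and script location below are placeholders, not taken from the question), the worker type and worker count for a Glue Spark job can be set when the job is created, for example with boto3:

        # Hypothetical sketch: allocating more capacity to a Glue job via boto3.
        # Job name, role ARN, and script path are placeholders.
        import boto3

        glue = boto3.client("glue")

        glue.create_job(
            Name="csv-gz-to-parquet",                                # hypothetical job name
            Role="arn:aws:iam::123456789012:role/GlueServiceRole",   # placeholder role
            Command={
                "Name": "glueetl",
                "ScriptLocation": "s3://my-bucket/scripts/convert.py",  # placeholder path
                "PythonVersion": "3",
            },
            GlueVersion="2.0",
            WorkerType="G.2X",      # larger workers for memory-heavy jobs
            NumberOfWorkers=20,     # each G.2X worker corresponds to 2 DPUs
        )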

  • 2021-01-24 07:55

    I think the problem isn't directly connected to the number of DPUs. You have large files and you are using the GZIP format, which is not splittable, so a single worker has to read each whole file; that is why you have this problem.

    I suggest converting your files from GZIP to bzip2 or lz4, which are splittable. Additionally, you should consider partitioning the output data for better performance in the future; see the sketch after the link below.

    http://comphadoop.weebly.com/
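
    As a minimal, hedged sketch of the conversion itself (the S3 paths, the repartition count, and the partition column "year" are assumptions made for illustration): the gzipped input is still read in one task per file, so repartitioning after the read is what restores parallelism for the Parquet write.

        # Sketch of a Glue (PySpark) job: read csv.gz, write partitioned Parquet.
        # All paths and the partition column are placeholders.
        import sys
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from awsglue.utils import getResolvedOptions
        from pyspark.context import SparkContext

        args = getResolvedOptions(sys.argv, ["JOB_NAME"])
        glue_context = GlueContext(SparkContext())
        spark = glue_context.spark_session
        job = Job(glue_context)
        job.init(args["JOB_NAME"], args)

        # Gzip is not splittable: each .csv.gz is read by a single task.
        # Repartition afterwards so the write runs in parallel.
        df = (
            spark.read
            .option("header", "true")
            .csv("s3://my-bucket/input/*.csv.gz")   # placeholder input path
            .repartition(200)                       # assumption: tune to worker count
        )

        (
            df.write
            .mode("overwrite")
            .partitionBy("year")                        # hypothetical partition column
            .parquet("s3://my-bucket/output/parquet/")  # placeholder output path
        )

        job.commit()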
