How to write spark dataframe in a single file in local system without using coalesce


Question


I want to generate an avro file from a pyspark dataframe and currently I am doing coalesce as below

df = df.coalesce(1)
df.write.format('avro').save('file:///mypath')

But this is now leading to memory issues, because all the data is pulled into memory before writing and my data size keeps growing every day. I would like to write the data partition by partition, so that it is flushed to disk in chunks and does not raise OOM errors. I found that toLocalIterator can help with this, but I am not sure how to use it. I tried the usage below and it returns all rows:

iter = df.toLocalIterator()
for i in iter:
    print('writing some data')
    # write the data into disk/file

The iterator walks over individual rows rather than partitions. How should I do this?


Answer 1:


When you do df = df.coalesce(1), all the data is collected onto one of the worker nodes. If that node cannot handle such a huge amount of data due to its resource constraints, the job will fail with an OOM error.
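For illustration, a quick way to see this effect in PySpark (the partition counts shown are just example values):

    print(df.rdd.getNumPartitions())              # e.g. 200 partitions spread across executors
    print(df.coalesce(1).rdd.getNumPartitions())  # 1 -- the whole dataset now lands on a single task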

As per the Spark documentation, toLocalIterator returns an iterator that contains all of the rows in this Dataset, and the maximum memory it can consume is equivalent to the largest partition in this Dataset.
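One way to gauge how large that largest partition actually is, is to count rows per partition with spark_partition_id, for example:

    from pyspark.sql.functions import spark_partition_id

    # count rows per partition; the biggest count indicates the memory bound for toLocalIterator
    df.groupBy(spark_partition_id().alias("partition_id")).count().show()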

How does toLocalIterator work?

The first partition is sent to the driver. If you continue to iterate and reach the end of the first partition, the second partition is sent to the driver node, and so on until the last partition. That is why the maximum memory it can occupy equals the largest partition, so make sure your driver node has sufficient RAM and disk.

The iterator's next() method only pulls the next partition's records once the previous partition has been processed.

What you can do is batch the rows, e.g. 1000 per batch, and flush each batch to local disk as it fills up. A Python sketch of that idea, where write_batch is a placeholder for your own serialization (e.g. an Avro or JSON writer):

    # collect rows into batches of e.g. 1000 and flush each batch to disk
    batch = []
    for row in df.toLocalIterator():
        batch.append(row.asDict())
        if len(batch) >= 1000:
            write_batch(batch)  # placeholder: serialize and append to a local file
            batch = []
    if batch:                   # flush the final partial batch
        write_batch(batch)

Note: make sure to cache your parent DataFrame, otherwise in some scenarios every partition may need to be recomputed.
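Putting this together for the Avro output in the question, here is a minimal sketch assuming the fastavro package is available; the schema and output path are hypothetical placeholders you would replace to match your dataframe:

    from fastavro import parse_schema, writer

    # hypothetical schema -- replace the fields with ones matching your dataframe
    schema = parse_schema({
        "name": "Record",
        "type": "record",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "value", "type": ["null", "string"]},
        ],
    })

    def rows(df):
        # toLocalIterator pulls one partition at a time to the driver
        for row in df.toLocalIterator():
            yield row.asDict()

    df.cache()  # avoid recomputing upstream partitions while iterating
    with open('/mypath/output.avro', 'wb') as out:
        writer(out, schema, rows(df))  # fastavro consumes the generator lazily
    df.unpersist()

Because fastavro's writer accepts any iterable of records, it streams rows to disk as they arrive, so memory stays bounded by the largest partition rather than the full dataset.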



Source: https://stackoverflow.com/questions/63574983/how-to-write-spark-dataframe-in-a-single-file-in-local-system-without-using-coal
