PySpark serialization EOFError

Submitted by 笑着哭i on 2019-11-30 17:01:06

The error appears to happen in the PySpark read_int function, whose code (from the Spark source) is as follows:

def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

This means that when read_int tries to read 4 bytes from the stream and gets back an empty result (i.e. the stream is exhausted), it raises EOFError.
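To see that behaviour concretely, here is a minimal standalone sketch that reproduces both paths, using io.BytesIO in place of the real worker stream (an assumption purely for illustration):

import struct
from io import BytesIO

def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

# Four bytes available: unpacks as a big-endian signed int.
print(read_int(BytesIO(struct.pack("!i", 42))))  # prints 42

# Exhausted stream: read(4) returns b'', so EOFError is raised.
try:
    read_int(BytesIO(b""))
except EOFError:
    print("EOFError: stream returned no bytes")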

Have you checked to see where in your code the EOFError is arising?

My guess would be that it's raised when you attempt to define df, since that's the only place in your code where the file is actually read.

df = sqlContext.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true').load('myfile.csv')

At every point after this line, your code is working with the variable df, not the file itself, so it would seem likely that this line is generating the error.
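As an aside, if you're on Spark 2.x, the built-in CSV reader expresses the same load without the external package; a sketch, assuming the same 'myfile.csv' path from the question:

# Spark 2.x built-in CSV reader, equivalent to the databricks package options.
df = sqlContext.read.csv('myfile.csv', header=True, inferSchema=True)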

A simple way to test whether this is the case would be to comment out the rest of your code and/or place a line like the one below right after the line above. Spark evaluates lazily and a DataFrame has no len(), so use an action such as count() to force the file to actually be read:

print(df.count())

Another way would be to use a try/except block, like:

try:
    df = sqlContext.read.format('com.databricks.spark.csv').options(
        header='true', inferschema='true').load('myfile.csv')
except Exception as e:
    print("Didn't load file into df!", e)
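Catching Exception as e and printing it (rather than using a bare except) keeps the original error visible, so you can see directly whether it is the EOFError or something else, such as a missing-file error.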

If it turns out that that line is the one generating the EOFError, then you're never getting the dataframes in the first place, so attempting to reduce them won't make a difference.

If that is the line generating the error, two possibilities come to mind:

1) Your code is opening one or both of the .csv files earlier on and isn't closing them before this line. If so, simply close them above this point, as in the sketch below.
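For example, if an earlier step peeked at the file by hand, a with block guarantees the handle is closed before Spark loads it. A minimal sketch; the peeking step itself is hypothetical:

# Hypothetical earlier step that opens the file directly.
with open('myfile.csv') as f:
    first_line = f.readline()  # the with block closes f on exit

# Only load with Spark once the plain-Python handle is closed.
df = sqlContext.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true').load('myfile.csv')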

2) There's something wrong with the .csv files themselves. Try loading them outside of this code and see if you can get them into memory properly in the first place, using something like csv.reader, and manipulate them in the ways you'd expect; a minimal check is sketched below.
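A minimal sanity check along those lines (assuming the same 'myfile.csv' path from the question):

import csv

# Parse the file with the standard library to confirm it's well-formed.
with open('myfile.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # raises StopIteration if the file is empty
    rows = list(reader)

print(header)
print('row count:', len(rows))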
