Apache Spark: pyspark crash for large dataset

清歌不尽 2021-01-01 20:48

I am new to Spark, and I have an input file with training data of size 4000x1800. When I try to train on this data (written in Python), I get the following error:

  1. 14/11/15 22:39:13

5 Answers
  • 2021-01-01 21:00

    Mrutynjay,

    Though I do not have a definitive answer, the issue looks like it is related to memory. I also encountered the same issue when trying to read a file of 5 MB; after I deleted a portion of the file and reduced it to less than 1 MB, the code worked.

    I also found something about the same issue on the site below:

    http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-td7691.html

  • 2021-01-01 21:01

    I got the same error, and then I found a related answer in pyspark process big datasets problems.

    The solution is to add some code to python/pyspark/worker.py.

    Add the following 2 lines to the end of the process function defined inside the main function:

    for obj in iterator:
        pass
    

    So the process function now looks like this (in Spark 1.5.2, at least):

    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        # drain any input the task did not consume
        for obj in iterator:
            pass
    

    This works for me.

  • 2021-01-01 21:03

    It's so simple.

    conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
    sc = SparkContext(conf=conf)
    # the second argument asks for at least 2000 partitions when reading the file
    lines = sc.textFile("file:///SparkCourse/filter_1.csv", 2000)
    print(lines.first())
    

    When using sc.textFile, pass a second parameter for the number of partitions and set it to a large value; the bigger the data, the larger this value should be.

  • 2021-01-01 21:07

    I had a similar problem, and I tried something like:

    numPartitions = 10  # or 100, for example
    data = sc.textFile("myfile.txt", numPartitions)

    Inspired by: How to repartition evenly in Spark? or here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
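
    Following those two links, here is a minimal sketch for checking and adjusting the partition count; the master, app name, file path, and partition numbers are placeholders, not values from the question:

    from pyspark import SparkConf, SparkContext

    # placeholders: adjust the master, app name, and file path to your setup
    conf = SparkConf().setMaster("local[*]").setAppName("PartitionCheck")
    sc = SparkContext(conf=conf)

    # ask for at least 100 partitions when reading the file
    data = sc.textFile("myfile.txt", 100)
    print(data.getNumPartitions())   # how many partitions Spark actually created
    data = data.repartition(200)     # split further if individual partitions are still too large

    The idea in both links is the same: more, smaller partitions mean each worker handles less data at a time.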

  • 2021-01-01 21:11
    1. One possibility is that there is an exception in parsePoint; wrap the code in a try/except block and print out the exception (a sketch follows this list).
    2. Check your --driver-memory parameter and increase it.
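
    For point 1, here is a minimal sketch of what that try/except might look like; parsePoint is only a stand-in for the question's parsing function, and the file path and comma-separated layout are assumptions:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("ParseWithLogging")
    sc = SparkContext(conf=conf)
    lines = sc.textFile("training_data.txt")   # placeholder path

    # stand-in for the question's parsing function; assumes comma-separated numbers
    def parsePoint(line):
        try:
            return [float(x) for x in line.split(",")]
        except Exception as e:
            # report the bad record instead of letting the whole task fail silently
            print("could not parse %r: %s" % (line, e))
            return None

    points = lines.map(parsePoint).filter(lambda p: p is not None)
    print(points.count())

    For point 2, driver memory is raised on the command line, for example spark-submit --driver-memory 4g your_script.py (the 4g is only an example).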