Apache Spark: pyspark crash for large dataset

清歌不尽 2021-01-01 20:48

I am new to Spark, and I have an input file of training data with dimensions 4000x1800. When I try to train on this data (the code is written in Python), I get the following error:

    14/11/15 22:39:13

5 answers
  • 2021-01-01 21:00

    Mrutynjay,

    I do not have a definitive answer, but the issue looks memory-related. I hit the same error when trying to read a 5 MB file. After I deleted part of the file and reduced it to less than 1 MB, the code worked.

    I also found a discussion of the same issue at the link below.

    http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-td7691.html

  • 2021-01-01 21:01

    I got the same error, then found a related answer in "pyspark process big datasets problems".

    The solution is to add some code to python/pyspark/worker.py.

    Add the following two lines to the end of the process function defined inside the main function:

    for obj in iterator:
        pass
    

    so the process function now looks like this (in Spark 1.5.2 at least):

    def process():
        iterator = deserializer.load_stream(infile)
        serializer.dump_stream(func(split_index, iterator), outfile)
        # Added lines: read any records the task left unconsumed
        for obj in iterator:
            pass
    

    and this works for me.
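
The effect of those two added lines can be illustrated in plain Python, without Spark. This is only a sketch of the mechanics: `fake_stream`, `take_first`, and this standalone `process` are hypothetical names, not part of PySpark, and the sketch does not claim to reproduce the actual worker internals. It shows that an action which stops after one record leaves the rest of the input unread, and that the extra loop reads it.

```python
# Plain-Python illustration; no Spark required.
# fake_stream, take_first and process are hypothetical stand-ins.

def fake_stream(n, consumed):
    """Yield n records, logging each one that is actually read."""
    for i in range(n):
        consumed.append(i)
        yield i


def take_first(iterator):
    """Mimic an action like first(): read one record, then stop."""
    for obj in iterator:
        yield obj
        break


def process(n, drain):
    """Return the task output and how many input records were read."""
    consumed = []
    iterator = fake_stream(n, consumed)
    output = list(take_first(iterator))
    if drain:
        # The two lines from the workaround: read whatever is left.
        for obj in iterator:
            pass
    return output, len(consumed)


print(process(1000, drain=False))  # output [0], 1 record read
print(process(1000, drain=True))   # output [0], 1000 records read
```

Note that editing worker.py changes the Spark installation itself, so the change would need to be reapplied after an upgrade.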

  • 2021-01-01 21:03

    It's so simple.

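    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
    sc = SparkContext(conf=conf)
    # The second argument requests at least 2000 partitions
    lines = sc.textFile("file:///SparkCourse/filter_1.csv", 2000)
    print(lines.first())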
    

    When calling sc.textFile, pass one more argument, the number of partitions, and set it to a large value. The bigger the data, the larger the value should be.
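
One way to make "the bigger the data, the larger the value" concrete is to derive the partition count from the number of records. The helper below is a rough sketch, not a Spark API: the name `partitions_for` and the records-per-partition figure are assumptions chosen for illustration, and the right value depends on row width and available memory.

```python
import math


def partitions_for(num_records, records_per_partition, minimum=1):
    """Hypothetical helper: partitions needed so that each one holds
    at most records_per_partition records."""
    if records_per_partition <= 0:
        raise ValueError("records_per_partition must be positive")
    return max(minimum, math.ceil(num_records / records_per_partition))


# 4000 rows, as in the question, at an assumed 100 rows per partition
print(partitions_for(4000, 100))  # 40

# With a SparkContext `sc`, the result would be passed as the second argument:
#   lines = sc.textFile("file:///SparkCourse/filter_1.csv", partitions_for(4000, 100))
```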

  • 2021-01-01 21:07

    I had a similar problem and tried something like:

    numPartitions = 10  # a number, for example 10 or 100
    data = sc.textFile("myfile.txt", numPartitions)

    Inspired by "How to repartition evenly in Spark?" and by: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html

  • 2021-01-01 21:11
    1. One possibility is that parsePoint is raising an exception. Wrap its code in a try/except block and print the exception.
    2. Check your --driver-memory setting and increase it.
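
The first suggestion could look roughly like the sketch below. The question's parsePoint is not shown, so `parse_point` here is a hypothetical stand-in that assumes comma-separated numeric fields; `safe_parse_point` is likewise an illustrative name. The wrapper reports the failing line to stderr and returns None so bad rows can be filtered out.

```python
import sys


def parse_point(line):
    """Hypothetical stand-in for the question's parsePoint;
    assumes comma-separated numeric fields."""
    return [float(x) for x in line.split(",")]


def safe_parse_point(line):
    """Call parse_point, report any exception, and return None on failure."""
    try:
        return parse_point(line)
    except Exception as exc:
        sys.stderr.write("parsePoint failed on %r: %s\n" % (line[:80], exc))
        return None


rows = ["1.0,2.0,3.0", "4.0,oops,6.0"]
parsed = [p for p in map(safe_parse_point, rows) if p is not None]
print(parsed)  # [[1.0, 2.0, 3.0]]

# With an RDD, the same wrapper would be applied as:
#   points = lines.map(safe_parse_point).filter(lambda p: p is not None)
```

When the code runs on executors rather than in local mode, the stderr output ends up in the worker logs, not in the driver console.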