I am new to Spark, and I have an input file with training data of size 4000x1800. When I try to train on this data (written in Python) I get the following error:
14/11/15 22:39:13
Mrutynjay,
Though I do not have a definitive answer, the issue looks like it is related to memory. I encountered the same issue when trying to read a file of 5 MB; after I deleted a portion of the file and reduced it to less than 1 MB, the code worked.
I also found a discussion of the same issue at the site below:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-td7691.html
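If you want to test the same idea without editing the file by hand, a minimal sketch (my own, not from the answer or the linked thread) is to sample the input down before training and see whether the smaller run succeeds:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("SubsetTest")
sc = SparkContext(conf=conf)

# "training_data.txt" is a placeholder for your own input file
full = sc.textFile("training_data.txt")

# keep roughly 10% of the lines; if this runs but the full file fails,
# memory is the likely culprit
subset = full.sample(False, 0.1, 42)
print(subset.count())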
I got the same error, and then I found a related answer about pyspark problems processing big datasets.
The solution is to add some code to python/pyspark/worker.py.
Add the following two lines to the end of the process function defined inside the main function:
for obj in iterator:
    pass
so the process function now looks like this (in spark 1.5.2 at least):
def process():
    iterator = deserializer.load_stream(infile)
    serializer.dump_stream(func(split_index, iterator), outfile)
    for obj in iterator:
        pass
and this works for me. (The extra loop appears to drain any part of the input stream that the task function did not consume, which is what was making the worker fail.)
It's so simple.
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
lines = sc.textFile("file:///SparkCourse/filter_1.csv", 2000)  # 2000 is the minimum number of partitions
print lines.first()
When using sc.textFile, add one more parameter for the number of partitions and set it to a large value; the bigger the data, the larger the value should be.
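To confirm the extra argument actually took effect, you can ask the RDD how many partitions it ended up with; getNumPartitions is a standard RDD method, and the path below is just the one from the snippet above:

lines = sc.textFile("file:///SparkCourse/filter_1.csv", 2000)
# the second argument is a minimum, so expect at least 2000 here
print(lines.getNumPartitions())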
I had a similar problem, I tried something like:
numPartitions = 10  # a number, for example 10 or 100
data = sc.textFile("myfile.txt", numPartitions)
Inspired by: How to repartition evenly in Spark? or here: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
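If the RDD is already loaded, you can also spread it over more partitions after the fact with repartition, which is what the linked question discusses; the numbers below are only examples, not tuned values:

numPartitions = 100  # tune to your data size and cluster
data = sc.textFile("myfile.txt", numPartitions)
data = data.repartition(numPartitions)  # reshuffles into evenly sized partitions
print(data.getNumPartitions())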
In the parsePoint function, wrap the code in a try/except block and print out the exception so you can see which record is failing. Also look at the --driver-memory parameter and make it greater.
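As a rough sketch of both suggestions (the body of parsePoint is a guess at a typical MLlib-style parser, not your actual code):

from pyspark.mllib.regression import LabeledPoint

def parsePoint(line):
    try:
        values = [float(x) for x in line.split(',')]
        return LabeledPoint(values[0], values[1:])
    except Exception as e:
        # print the offending record and the error instead of dying
        # with a cryptic stage failure
        print(line)
        print(e)
        raise

Then increase the driver memory when submitting the job, for example:

spark-submit --driver-memory 4g train.py

where train.py stands in for your own script.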