Question
I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)
How can I specify a checkpoint directory on a file system that is not HDFS, Cassandra, or any other data store?
I have thought of two possible solutions, but I do not know how to implement them:
mount one remote directory so that it appears local to both workers
specify a remote directory on both workers
Any suggestions ?
Answer 1:
Ok, so I was able to go ahead with the first option.
I mounted a remote directory on all the workers as the checkpoint directory, and it worked perfectly.
How to mount the remote checkpoint directory on the workers:
# install sshfs
sudo apt-get install sshfs
# load the FUSE module into the kernel
sudo modprobe fuse
# allow the user to use FUSE
sudo adduser username fuse
# create the local mount point and mount the remote directory over SSH
mkdir ~/checkpoint
sshfs ubuntu@xx.xx.x.xx:/home/ubuntu/checkpoint ~/checkpoint
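Once the same mount path exists on the driver and on every worker, the streaming application can point its checkpoint there. A minimal sketch, assuming the mounted path /home/ubuntu/checkpoint and a one-second batch interval (the app name and interval are placeholders, not from the original question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulNetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    // This path must resolve to the same (sshfs-mounted) directory
    // on the driver and on every worker node.
    ssc.checkpoint("/home/ubuntu/checkpoint")
    // ... define the DStream and updateStateByKey logic here ...
    ssc.start()
    ssc.awaitTermination()
  }
}

The key point is that checkpointing with updateStateByKey requires a directory that all executors can read and write at the same path, which the sshfs mount provides.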
Source: https://stackoverflow.com/questions/33238882/checkpoint-rdd-reliablecheckpointrdd-has-different-number-of-partitions-from-ori