Question
I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)
How can I specify a checkpoint directory on a file system that is not HDFS, Cassandra, or any other data store?
I have thought of two possible solutions, but I do not know how to implement them:
mount one remote directory so that it appears local to both workers
specify a remote directory on both workers
Any suggestions ?
Answer 1:
Ok, so I was able to go ahead with the first option.
I mounted a remote directory on all the workers as the checkpoint directory, and it worked perfectly.
How to mount the remote checkpoint directory on the workers:
# install sshfs
sudo apt-get install sshfs
# load the FUSE module into the kernel
sudo modprobe fuse
# allow the user to use FUSE
sudo adduser username fuse
# create the local mount point and mount the remote directory over SSH
mkdir ~/checkpoint
sshfs ubuntu@xx.xx.x.xx:/home/ubuntu/checkpoint ~/checkpoint
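Once the same mount path exists on the driver and on every worker, the streaming application can point its checkpoint there. A minimal sketch, assuming the mounted path /home/ubuntu/checkpoint and a one-second batch interval (the app name and interval are placeholders, not from the original question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulNetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    // This path must resolve to the same (sshfs-mounted) directory
    // on the driver and on every worker node.
    ssc.checkpoint("/home/ubuntu/checkpoint")
    // ... define the DStream and updateStateByKey logic here ...
    ssc.start()
    ssc.awaitTermination()
  }
}

The key point is that checkpointing with updateStateByKey requires a directory that all executors can read and write at the same path, which the sshfs mount provides.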
Source: https://stackoverflow.com/questions/33238882/checkpoint-rdd-reliablecheckpointrdd-has-different-number-of-partitions-from-ori