Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

Submitted by 和自甴很熟 on 2019-12-12 08:55:09

Question


I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2)
    at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
    at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)

How can I specify a checkpoint directory on a file system that is not HDFS, Cassandra, or any other data store?

I have thought of two possible solutions, but I do not know how to code them:

  1. Have one remote directory that appears as a local directory on both workers.

  2. Specify a remote directory to both workers.

Any suggestions?


Answer 1:


OK, so I was able to go ahead with the first option.

I mounted a remote directory on all the workers as the checkpoint directory, and it worked perfectly.

How to mount the remote checkpoint directory on the workers:

sudo apt-get install sshfs

Load the FUSE module into the kernel:

sudo modprobe fuse

sudo adduser username fuse

mkdir ~/checkpoint

sshfs ubuntu@xx.xx.x.xx:/home/ubuntu/checkpoint ~/checkpoint
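
Before starting the job, it may help to confirm on each worker that the mounted directory is actually present and writable. The following is only a minimal sketch: check_checkpoint_dir is a hypothetical helper name (not part of Spark or sshfs), and it only tests that a probe file can be written and read back in the given directory. It does not prove that all workers see the same remote directory.

```shell
#!/bin/sh
# Sketch under stated assumptions: check_checkpoint_dir is a hypothetical
# helper, not something provided by Spark or sshfs.
check_checkpoint_dir() {
    dir="$1"

    # The directory must exist; for sshfs the mount must already be in place.
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi

    # Write a probe file, read it back, then remove it again.
    probe="$dir/.probe-$$"
    if ! ( echo ok > "$probe" ) 2>/dev/null; then
        echo "cannot write to: $dir" >&2
        return 1
    fi
    content=$(cat "$probe")
    rm -f "$probe"

    # Succeed only if the content read back matches what was written.
    [ "$content" = "ok" ]
}
```

After sourcing the function, running check_checkpoint_dir ~/checkpoint on every worker returns 0 when the directory is usable. The application would then use that same path as its checkpoint directory (for example, the one passed to StreamingContext.checkpoint).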


Source: https://stackoverflow.com/questions/33238882/checkpoint-rdd-reliablecheckpointrdd-has-different-number-of-partitions-from-ori
