I have a job that can take up to several hours to run. It sometimes fails partway through, for reasons such as running out of memory or a cluster rebalance. The problem is that the job is usua