I have a cluster set up in Google Kubernetes Engine (GKE) with preemptible instances, TPU support, and one container per node.
When the container process errors out (e