Question
I'm having difficulty understanding which would be best for my situation and how to actually implement it.
In a nutshell, the problem is this:
- I'm spinning up my DB (Postgres), BE (Django), and FE (React) deployments with Skaffold
- About 50% of the time the BE spins up before the DB
- One of the first things Django tries to do is connect to the DB
- It only tries once (by design, and this can't be changed); if it can't connect, it fails and the application is broken
- Thus, I need to make sure every single time I spin up my deployments, the DB deployment is running before starting the BE deployment
I came across readiness, liveness, and startup probes. I've read the docs a couple of times and readiness probes sound like what I need: I don't want the BE deployment to start until the DB deployment is ready to accept connections.
I guess I'm not understanding how to set it up. This is what I've tried, but I still run into instances where one is being loaded before another.
postgres.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      component: postgres
  template:
    metadata:
      labels:
        component: postgres
    spec:
      containers:
        - name: postgres
          image: testappcontainers.azurecr.io/postgres
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: POSTGRES_INITDB_ARGS
              value: "-A md5"
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
              subPath: postgres
      volumes:
        - name: postgres-storage
          persistentVolumeClaim:
            claimName: postgres-storage
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: postgres
  ports:
    - port: 1423
      targetPort: 5432
api.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: api
  template:
    metadata:
      labels:
        component: api
    spec:
      containers:
        - name: api
          image: testappcontainers.azurecr.io/testapp-api
          ports:
            - containerPort: 5000
          env:
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: PGHOST
              value: postgres-cluster-ip-service
            - name: PGPORT
              value: "1423"
            - name: PGDATABASE
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: SECRET_KEY
            - name: DEBUG
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: DEBUG
          readinessProbe:
            httpGet:
              host: postgres-cluster-ip-service
              port: 1423
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: api
  ports:
    - port: 5000
      targetPort: 5000
client.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: client-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: client
  template:
    metadata:
      labels:
        component: client
    spec:
      containers:
        - name: client
          image: testappcontainers.azurecr.io/testapp-client
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: api-cluster-ip-service
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: client-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: client
  ports:
    - port: 3000
      targetPort: 3000
I don't think the ingress.yaml and the skaffold.yaml will be helpful, but let me know if I should add those.
So what am I doing wrong here?
Edit:
So I've tried out a few things based on David Maze's response. This helped me understand what is going on better, but I am still running into issues I'm not quite understanding how to resolve.
The first problem is that even with the default restartPolicy: Always, and even though Django fails, the Pods themselves don't fail. The Pods think they are perfectly healthy even though Django has failed.
The second problem is that apparently the Pods need to be made aware of Django's status. That is the part I'm not quite wrapping my brain around: in particular, should probes be checking the status of other deployments or of themselves?
Yesterday my thinking was the former, but today I'm thinking it is the latter: the Pod needs to know the program contained in it has failed. However, everything I've tried just results in a failed probe, connection refused, etc.:
# referring to itself
host: /health
port: 5000
host: /healthz
port: 5000
host: /api
port: 5000
host: /
port: 5000
host: /api-cluster-ip-service
port: 5000
host: /api-deployment
port: 5000
# referring to the DB deployment
host: /health
port: 1423 #or 5432
host: /healthz
port: 1423 #or 5432
host: /api
port: 1423 #or 5432
host: /
port: 1423 #or 5432
host: /postgres-cluster-ip-service
port: 1423 #or 5432
host: /postgres-deployment
port: 1423 #or 5432
So apparently I'm setting up the probe wrong, despite it being a "super-easy" implementation (as a few blogs have described it). For example, the /health and /healthz routes: are these built into Kubernetes or do they need to be set up? I'm rereading the docs to hopefully clarify this.
Answer 1:
You're just not waiting long enough.
The deployment artifacts you're showing here look pretty normal. It's even totally normal for your application to fail fast if it can't reach the database, say because it hasn't started up yet. Every pod has a restart policy, though, which defaults to Always. So, when the pod fails, Kubernetes will restart it; and when it fails again, it will get restarted again; and when it keeps failing, Kubernetes will pause tens of seconds between restarts (the dreaded CrashLoopBackOff state).
Eventually if you're in this wait-and-restart loop, the database will actually come up, and then Kubernetes will restart your application pods, at which point the application will start up normally.
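To make the default visible, here is a small sketch of where restartPolicy sits in the api Deployment from the question. It belongs to the pod template's spec, next to containers, not to an individual container. Writing it out is optional, since Always is already the default, and as far as I know it is the only value a Deployment accepts.

```yaml
# Fragment of api.yaml from the question (sketch, not a full manifest).
spec:
  template:
    spec:
      # Pod-level setting; applies to every container in the pod.
      # Omitting this line gives the same behavior.
      restartPolicy: Always
      containers:
        - name: api
          image: testappcontainers.azurecr.io/testapp-api
```

Note that the restart only happens when the container's main process actually exits (or a liveness probe fails). If Django fails but the process keeps running, Kubernetes has nothing to react to.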
The only thing that I'd change here is that your readiness probes for the two pods should probe the services themselves, not some other service. You probably want the path to be something like / or /healthz or something else that is an actual HTTP request path in the service. That can return 503 Service Unavailable if it detects its dependency isn't available, or you can just crash. Just crashing is fine.
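As a minimal sketch of that change, the readiness probe on the api container could look like the fragment below. It assumes the Django application itself serves a /healthz route on port 5000; that route is hypothetical and not built into Kubernetes, so it only works once the application implements it. The host field is left out on purpose, because without it the kubelet sends the request to the pod's own IP.

```yaml
# Fragment of the api container spec in api.yaml (sketch, not a full manifest).
containers:
  - name: api
    image: testappcontainers.azurecr.io/testapp-api
    ports:
      - containerPort: 5000
    readinessProbe:
      httpGet:
        # Assumption: the Django app exposes this route itself;
        # Kubernetes does not provide /healthz for your container.
        path: /healthz
        port: 5000
        # No "host" field: the probe targets this pod, not another service.
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 2
```

An httpGet probe counts any status from 200 up to (but not including) 400 as success, so a 503 from the route marks the pod as not ready without restarting it. One thing to watch with Django: the probe request arrives with the pod IP as its Host header, so a strict ALLOWED_HOSTS setting may answer 400 and fail the probe.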
This is a totally normal setup in Kubernetes; there's no way to more directly say that pod A can't start until service B is ready. The flip side of this is that the pattern is actually pretty generic: if your application crashes and restarts whenever it can't reach its database, it doesn't matter if the database is hosted outside the cluster, or if it crashes sometime well after startup time; the same logic will try to restart your application until it works again.
Answer 2:
Actually, I think I might have sorted it out.
Part of the problem is that even though restartPolicy: Always is the default, the Pods are not aware that Django has failed, so Kubernetes thinks they are healthy.
My thinking was wrong in that I originally assumed I needed to refer to the DB deployment to see if it had started before starting the API deployment. Instead I needed to check if Django had failed and redeploy it if it had.
Doing the following accomplished this for me:
livenessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2
readinessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2
I'm learning Kubernetes so please correct me if there is a better way to do this or if this is just plain wrong. I just know it accomplishes what I want.
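For reference, here is a sketch of the same two probes placed inside the api container, with the fields the snippet above leaves unset written out at what I understand to be the Kubernetes defaults (timeoutSeconds: 1, failureThreshold: 3). The probe values come from this answer; the container name, image, and port are taken from the question's api.yaml.

```yaml
# Fragment of the api container spec in api.yaml (sketch, not a full manifest).
containers:
  - name: api
    image: testappcontainers.azurecr.io/testapp-api
    ports:
      - containerPort: 5000
    livenessProbe:
      # Succeeds if a TCP connection to port 5000 can be opened.
      tcpSocket:
        port: 5000
      initialDelaySeconds: 2
      periodSeconds: 2
      timeoutSeconds: 1     # default, shown for clarity
      failureThreshold: 3   # default: restart after 3 consecutive failures
    readinessProbe:
      tcpSocket:
        port: 5000
      initialDelaySeconds: 2
      periodSeconds: 2
      timeoutSeconds: 1     # default, shown for clarity
      failureThreshold: 3   # default: mark not ready after 3 consecutive failures
```

With these numbers, a container that never opens port 5000 is restarted after roughly 6 seconds (probes at about 2, 4, and 6 seconds all fail). A tcpSocket probe only confirms that something is listening on the port; it cannot tell whether Django is able to reach the database.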
Source: https://stackoverflow.com/questions/59850959/setting-up-a-readiness-liveness-or-startup-probe