The importance of startup and liveness probes in Kubernetes

How to configure them properly: lessons learned from deployment problems on a real, live project

Przemysław Żochowski
10 min read · Jan 12, 2022
DevOps culture delivery circle

Once upon a time… the deployment story begins

The whole story began with a request to set up 3 separate environments for a new product being shipped for one of our clients. It was a classic application with a front-end and a back-end layer communicating over HTTP in the REST architectural style. The back-end was supposed to connect to the client’s database over the public internet. Additionally, a decision was taken to use Keycloak to cover the security and authentication management aspects. Further technicalities are irrelevant for this case, but counting other factors in, the team decided the following:

  • 3 separate environments (dev, staging, prod) would have to be established for the front-end, back-end and Keycloak services, resulting in a total of 9 pods to be deployed and maintained, each with its own Ingress and domain name created and assigned;
  • all environments would be deployed on a Google Kubernetes Engine cluster;

Moreover, it is worth mentioning other circumstances and factors that came into play during the work on the deployment:

  • initially the cluster consisted of 5 quite powerful nodes;
  • there were lots of other containerized applications in pods hosted on the nodes;
  • the above-mentioned environments were brought to life one after another over a couple of days;
  • the Keycloak and back-end pods were part of a single deployment, so during the deploy stage of the pipeline, if one of them was not deployed successfully, the helm upgrade/install failed;
  • the deployment to the dev environment was triggered automatically with every merge to the main branch, whereas deployments to staging and prod were triggered on demand from the same branch. Changes were simply promoted to the next environment once they had passed the tests in the previous one;

Along with the deployments, which I focused on, the project was under intense development due to the deadlines. This resulted in plenty of commits and merge requests to the main branch each day, consequently triggering plenty of deployments. Even the Keycloak service was heavily customised.

The beginning of unexpected problems

The problems started when a decision was taken to decrease the number of nodes in the cluster from 5 to 3. The main reasoning behind it was to use the nodes’ CPU and available memory in a more effective way. With 5 nodes, the average CPU and memory usage of every node did not exceed 20%, except for occasional peaks. With the decreased number of nodes, better usage of resources was expected. That was about the time when our team was ready to deploy the production environment for the first time. After changes on the staging environment were accepted by the client, we were happy to promote them to production and deploy it for the first time.

I didn’t expect any problems, as the Helm template for the production environment did not differ from the staging one when it comes to infrastructural aspects. Unluckily for me, the Kubernetes scheduler decided to place it on the same node where the dev environment had been placed. It is not surprising, though, since we now had only 3 nodes. Additionally, as mentioned earlier, there were other applications outside of our project which were using the cluster resources. I was surprised to be informed by GitLab CI/CD that my deployment to production had failed. I rushed to k9s (my favourite console tool for monitoring the state of a Kubernetes cluster) to see what had happened. I noticed that not only had the production environment not been deployed, but the Keycloak pod on the dev environment was now unhealthy and restarting.

Definitions of startup and liveness probes

Before we dive into the rest of the story, I would like to present a few definitions which I refer to a lot moving forward. You can skip this section if you are perfectly familiar with them.

A startup probe is a great choice for a container that takes a long time to start or… when you have a few pods starting at the same time on the same node as part of a deployment and they are fighting for the node’s resources. In such cases their startup might be slowed down.

A liveness probe, as the name suggests, is used to check whether our application is still working properly or whether it is stuck in some kind of deadlock and needs to be restarted.

They work sequentially, meaning that the liveness probe is activated only after the startup probe has succeeded. For reference, see the diagram below.

The process of performing startup and liveness probe by Kubernetes when starting the pod (readiness probe not included for brevity)

What is great about the startup probe is that as soon as it passes its first check, the liveness probe takes over to check the health of the container.

I didn’t mention, or include in the diagram, the readiness probe, which is used to suspend incoming traffic to our container until it is ready. It is a good choice for containers which load the database schema with initial data, or pull data from some external service after startup, in order to be able to work and respond to requests. One thing we have to keep in mind, which was kind of counter-intuitive for me at first, is that liveness probes do not wait for readiness probes to succeed.
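To make these definitions more tangible, below is a minimal sketch of how all three probes could be declared on a single container. The image name, port and the /health and /ready paths are purely illustrative assumptions, not taken from the project described here.

# Illustrative only: a container with startup, liveness and readiness probes.
# The image, port and endpoint paths are assumptions made for the example.
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: example/backend:latest   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:                   # runs first; the other probes start only after it succeeds
        httpGet:
          path: /health               # assumed health endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 30          # up to 150 s allowed for the application to start
      livenessProbe:                  # restarts the container when it gets stuck
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:                 # gates incoming traffic; does NOT block the liveness probe
        httpGet:
          path: /ready                # assumed readiness endpoint
          port: 8080
        periodSeconds: 10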

Monitoring, analysing, changing — a golden circle to overcome problems

I tried to deploy to production once again, this time constantly monitoring the situation with k9s. The deployment started on the same node as before, while the dev Keycloak was still restarting (my mistake that I did not try to fix that beforehand). I could see the 2 new pods being brought to life (back-end and Keycloak), and after a while I got some insight and a general overview of the situation:

  • the back-end pod, which was a relatively simple Spring Boot application, restarted for the first time;
  • both pods were fighting for the node’s CPU and memory resources, so their startup took a lot longer than expected;
  • the increased CPU and memory usage on the node did not allow the dev Keycloak to restart properly, as its startup was slowed down as well;

It became instantly obvious that the liveness probe for the back-end application was failing, which was causing an endless restart loop during the deployment. These were the liveness probe settings for the back-end application:

livenessProbe:
  initialDelaySeconds: 60
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 1
  successThreshold: 1

Maybe one minute was not enough for the app to start (even though on localhost it was up and running within a few seconds)? I justified the long startup process by 2 factors:

  • the connection over the public internet to the client’s database takes more time than on localhost;
  • the fight for resources between the pods;

OK, so how about 120 seconds of delay for the liveness probe? One minute might not be enough, but two should do the job.
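In practice that was a one-line change to the snippet above; a sketch of the adjusted settings, with the rest of the probe left as it was:

livenessProbe:
  initialDelaySeconds: 120   # was 60; give the app two minutes before the first check
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 1
  successThreshold: 1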

I fixed the problem with the Keycloak dev environment. Subsequently, I deployed the production environment once again, with 2 minutes of delay for the liveness probe. After the startup, the resource usage went back to normal (as seen below, CPU usage is low). All environments were up and running again.

The console view of a cluster state in a selected namespace seen in the k9s tool

I reckoned that this was the end of the problems with the deployments. Nothing could be further from the truth.

The next morning

The next morning I sat down at my desk with a cup of coffee and checked how my deployments were doing. I opened the terminal, typed k9s and ka-boooom…

The status of my deployment the next morning, production and staging backend apps are down

My chain of thought:

  • good that I logged in early, the client probably did not see the environment being down, I can fix this;
  • I need some kind of alerting for such events, so that I am the first to know when the environments are going down;
  • 27 restarts over a single night; what on earth is going on?!

After putting out the fire, I decided to address the alerting and monitoring issue, so that next time such a failure occurs I am the first one to know and am able to react quickly. A very handy and easy-to-set-up service is Google Cloud Monitoring uptime checks. Check out my other article, where I describe how to set up such a monitoring and alerting service with the use of Slack.

The last, but most important, thing to address was to investigate and tackle the constant restart problems with the pods. It quickly became obvious to me that the liveness probe delay I had set up the day before was not the only problem I had to deal with.

I decided to go with the following approach, which worked very well:

  1. I replaced all liveness and readiness probe command definitions with something that would always pass (like writing a file inside the container), so that Kubernetes would never get any signal to restart the pods (see the sketch after this list);
  2. After the deployment I observed how the 2 pods behaved and how they fought for the node’s resources (CPU and memory), and I watched the logs to see how much time it really took for the pods to start and be ready to work. The conclusion was that the Spring Boot application took around 2 minutes to start (sometimes a bit more), because the Keycloak pod was stealing a lot of resources, while itself taking around 5 minutes to start;
  3. Based on my observations, I defined the startup probe for both of the pods as follows:
startupProbe:
  initialDelaySeconds: 40
  periodSeconds: 6
  timeoutSeconds: 2
  failureThreshold: 60

With this setting I would give the pod an initial 40 seconds of handicap before the first startup probe is invoked. Additionally, I gave the pod a maximum of 360 seconds to pass the startup probe (failureThreshold * periodSeconds). Together with the initial delay, that is almost 2 minutes of handicap on top of the 5 minutes observed to be required for the Keycloak startup. As mentioned at the beginning, a great thing about the startup probe is that as soon as it passes its first check, the liveness probe takes over to check the health of the container. Thanks to that, it was safe to set a ridiculously large amount of time for the startup probe of the Spring Boot application as well. In case it spins up faster and passes the first startup probe, the liveness probe simply kicks in earlier and takes over.

Another thing I did was to raise the failure threshold for the liveness probe from 1 to 3. I did that so that, in case there was a temporary peak of CPU usage by any of the pods hosted on the node, my application would not be subject to a restart after the first failed check. That was basically my assumption of what had happened during the night: there must have been a temporary resource usage peak on the node, caused by other applications, resulting in a failed liveness probe of my containers.
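A sketch of what the adjusted liveness probe could look like after this change; only the raised failure threshold is taken from the story, keeping the original period and timeout here is my assumption:

livenessProbe:
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3   # raised from 1, so a single failed check no longer restarts the pod
  successThreshold: 1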

  4. I redeployed all the applications and waited for the Google Monitoring uptime checks to do their job and inform me on a Slack channel in case any of the applications failed. A few incidents did occur along the way, when a pod went down and was restarted, but thanks to the startup probe configuration it had all the time it needed to restart properly.
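For reference, the temporary “always pass” probe from step 1 could look roughly like the sketch below. The exec command is only an assumption for illustration; any command that reliably exits with status 0 (such as writing a file inside the container, as mentioned above) does the job while you observe the real startup behaviour.

livenessProbe:            # the readiness probe was replaced in the same way
  exec:
    command: ["/bin/sh", "-c", "touch /tmp/probe-ok"]   # always exits 0, so the probe never fails
  periodSeconds: 5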

Resource requests and limits

Of course, there were other things I tried when dealing with the restart problems, but for the sake of brevity I have excluded them from the main story.

It is worth mentioning, however, that there are many factors that come into play when crafting a successful, resilient Kubernetes deployment. For example, the resource requests and limits assigned to the container. I tampered with those as well. Especially the limits are quite tricky and need to be defined thoughtfully. I could surely devote another article to those, but a few things you have to remember are listed below (a configuration sketch follows the list):

  • CPU is treated as a compressible resource: if you impose limits on CPU usage and your application starts reaching them, Kubernetes will throttle the container, causing the application to slow down. In the worst-case scenario this can result in a failure of the liveness probe, consequently terminating the pod (or restarting it if it is part of a Deployment or StatefulSet);
  • memory is not compressible, and if the container exceeds its assigned memory limit, the pod is terminated (or restarted if it is part of a Deployment or StatefulSet);
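As a rough illustration, requests and limits are declared per container; the numbers below are made-up placeholders, not the values used in this project:

resources:
  requests:              # what the scheduler reserves for the container on a node
    cpu: "250m"
    memory: "512Mi"
  limits:                # exceeding the CPU limit throttles; exceeding the memory limit kills the container
    cpu: "1"
    memory: "1Gi"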

Conclusions

Liveness and startup probes, along with resource requests and limits, are fundamental tools for tuning our pods. They can also be used to squeeze more out of our nodes, or enable us to decrease the number of nodes in our cluster. Defining proper settings for those is, in some scenarios, not an easy task and requires patience, monitoring and adjusting after the release. The story presented in this article described the potential consequences of doing it the wrong way. It also presented some ways of debugging and dealing with such problems in case they arise.


Przemysław Żochowski

Passionate Java full-stack developer. Strong advocate of the motto: "If you can't explain something in simple words, you don't understand it well enough"