A node hangs, causing a subset of users to report issues

Published

February 28, 2018

Summary

On February 28, 2018, a handful of users reported on Piazza that their servers wouldn’t start. All of the problematic servers turned out to be running on the same node. After the node was cordoned and rebooted, the student servers started properly.

Timeline

2018-02-28 21:21

Three students report problems starting their servers on Piazza, and a GSI links to the reports on Slack. More reports come in by 21:27.

21:30

The infrastructure team is alerted to the problem. The command kubectl --namespace=prod get pod -o wide | egrep -v -e prepull -e Running shows that all non-running pods were scheduled on the same node. Most of the pods have an “Unknown” status while the rest are in “Terminating”. The oldest problematic pod is 29 minutes old.
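
The check above was done by eye. As a rough sketch, the same tally could be scripted by piping the unfiltered output of kubectl --namespace=prod get pod -o wide into a small parser. The function name non_running_by_node is hypothetical, and the sketch assumes the usual left-aligned column layout with STATUS and NODE headers. It matches "prepull" against the pod name and "Running" against the STATUS column, which is slightly stricter than the egrep filter.

```python
from collections import Counter


def non_running_by_node(wide_output: str) -> Counter:
    """Count non-running, non-prepull pods per node.

    `wide_output` is assumed to be the text printed by
    `kubectl get pod -o wide`, header line included.
    """
    lines = [ln for ln in wide_output.splitlines() if ln.strip()]
    if not lines:
        return Counter()

    header = lines[0]
    # Columns are left-aligned, so each value starts at its header's offset.
    # str.index raises ValueError if a header is missing.
    status_col = header.index("STATUS")
    node_col = header.index("NODE")

    counts = Counter()
    for line in lines[1:]:
        name = line.split()[0]
        status_field = line[status_col:].split()
        node_field = line[node_col:].split()
        if not status_field or not node_field:
            continue  # truncated or malformed row
        if "prepull" in name or status_field[0] == "Running":
            continue
        counts[node_field[0]] += 1
    return counts
```

A result with a single key, such as the node k8s-pool1-19522833-9 in this incident, would point at one misbehaving node rather than a cluster-wide problem.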

21:34

The node k8s-pool1-19522833-9 is cordoned. Its load is about 90, yet no process is consuming much CPU. The node is rebooted via a sysrq trigger. The hung pods remain stuck.

21:39

When the node comes back online, kubectl reports no more hung pods. Students are able to start their servers.

Conclusion

A problematic VM prevented the pods scheduled on that node from launching. Once the VM was cordoned and rebooted, pods launched without trouble.

Action items

Process

  1. Monitor the cluster for non-running pods and send an alert if the count exceeds a threshold or if the non-running pods are clustered on the same node(s).
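
One possible shape for the alerting rule in this action item is sketched below. It is only an illustration: the function name check_pods, the default thresholds, and the set of statuses treated as healthy are all assumptions that would need tuning against the real cluster. The input is a list of (status, node) pairs, however they are collected.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def check_pods(
    pods: Iterable[Tuple[str, str]],
    max_non_running: int = 5,   # hypothetical cluster-wide threshold
    max_per_node: int = 3,      # hypothetical per-node threshold
    ok_statuses: Tuple[str, ...] = ("Running", "Completed"),
) -> List[str]:
    """Return alert messages for a list of (status, node) pairs."""
    # Tally unhealthy pods per node.
    per_node = Counter(
        node for status, node in pods if status not in ok_statuses
    )
    total = sum(per_node.values())

    alerts = []
    if total > max_non_running:
        alerts.append(
            f"{total} non-running pods exceed threshold {max_non_running}"
        )
    # Flag nodes where unhealthy pods cluster, even if the total is low.
    for node, count in sorted(per_node.items()):
        if count > max_per_node:
            alerts.append(f"{count} non-running pods clustered on node {node}")
    return alerts
```

The per-node check is the one that would have fired in this incident: the cluster-wide count was small, but every affected pod sat on the same node.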