A node hangs, causing a subset of users to report issues
Summary
On February 28, 2018, a handful of users reported on piazza that there servers wouldn’t start. It was determined that all problematic servers were running on the same node. After the node was cordoned and rebooted, the student servers were able to start properly.
Timeline
2018-02-28 21:21
Three students report problems starting their server on piazza and a GSI links to the reports on slack. More reports come in by 21:27.
21:30
The infrastructure team is alerted to the problem. The command kubectl --namespace=prod get pod -o wide | egrep -v -e prepull -e Running
shows that all non-running pods were scheduled on the same node. Most of the pods have an “Unknown” status while the rest are in “Terminating”. The oldest problematic pod is 29m.
21:34
The node k8s-pool1-19522833-9 is cordoned. It has a load of about 90 with no processes consuming much CPU. The node is rebooted via sysrq trigger. The hung pods remain stuck.
21:39
When the node comes back online, kubectl reports no more hung pods. Students are able to start their servers.
Conclusion
A problematic VM prevented nodes from launching pods. Once the VM was cordoned and rebooted, pods launch without trouble.
Action items
Process
- Monitor the cluster for non-running pods and send an alert if the count exceeds a threshold or if the non-running pods are clustered on the same node(s).