Azure PD refuses to detach, causing downtime for data100
Summary
On February 5, 2018, a PR was merged into the production cluster for Data 100. The CI got as far as running helm upgrade but the hub’s persistent volume would not detach from the old hub. The new hub pod had to wait on the hub-db-dir volume and so would not start. The persistent volume claim was ultimately deleted. The subsequent helm upgrade created a new volume and a new hub pod was able to start.
Timeline
2018-02-05 20:57
A PR for data100 is merged.
21:20
Towards the end of the build, the upgrade fails because the new hub pod does not start up. The hub-db-dir volume remains bound to the old hub pod which is stuck in a Terminating state. The hub pod only completes termination when delete is passed a grace period of 0. The hub volume remains bound however.
21:30
CI is restarted but by the time helm is run, the hub-db-dir volume remains bound and cannot be attached to the new hub pod. Additionally, helm errors because the jupyterhub-internal ingress object cannot be found even though it does exist.
21:45
Since it cannot be determined what node the volume is bound to, the volume is deleted. The jupyterhub-internal ingress object is also deleted prior to restarting the CI build.
22:05
The hub comes up with a new hub-db-dir volume. CI fails due to the same jupyterhub-internal object error.
Conclusion
Azure was not able to detach the hub-db-dir azure disk from the hub pod. The pvc is deleted and the hub comes up on the next CI run.
Action items
Process
- Store the hub db in a cloud database to eliminate reliance on hub volume.
- Downgrade helm to 2.6.x to see if this fixes the helm upgrades.