Incident reports
Blameless incident reports are essential to the long-term sustainability of resilient infrastructure. We publish them here for transparency, and so that we can learn from them when handling future incidents.
- 2017-02-09 - JupyterHub db manual overwrite
- 2017-02-24 - Custom Autoscaler gone haywire
- 2017-02-24 - Proxy eviction strands user
- 2017-03-06 - Non-matching hub image tags cause downtime
- 2017-03-20 - Too many volumes per disk leave students stuck
- 2017-03-23 - Weird upstream ipython bug kills kernels
- 2017-04-03 - Custom autoscaler does not scale up when it should
- 2017-05-09 - Oops we forgot to pay the bill
- 2017-10-10 - Docker dies on a few Azure nodes
- 2017-10-19 - Billing confusion with Azure portal causes summer hub to be lost
- 2018-01-25 - Accidental merge to prod brings things down
- 2018-01-26 - Hub starts up very slow, causing outage for users
- 2018-02-06 - Azure PD refuses to detach, causing downtime for data100
- 2018-02-28 - A node hangs, causing a subset of users to report issues
- 2018-06-11 - Azure billing issue causes downtime
- 2019-02-25 - Azure Kubernetes API Server outage causes downtime
- 2019-05-01 - Service Account key leak incident