UC Berkeley JupyterHubs

Incident reports

Blameless incident reports are important for the long-term sustainability of resilient infrastructure. We publish them here for transparency, and so that we can learn from them when handling future incidents.

  • 2017-02-09 - JupyterHub db manual overwrite
  • 2017-02-24 - Custom Autoscaler gone haywire
  • 2017-02-24 - Proxy eviction strands user
  • 2017-03-06 - Non-matching hub image tags cause downtime
  • 2017-03-20 - Too many volumes per disk leave students stuck
  • 2017-03-23 - Weird upstream ipython bug kills kernels
  • 2017-04-03 - Custom autoscaler does not scale up when it should
  • 2017-05-09 - Oops we forgot to pay the bill
  • 2017-10-10 - Docker dies on a few Azure nodes
  • 2017-10-19 - Billing confusion with Azure portal causes summer hub to be lost
  • 2018-01-25 - Accidental merge to prod brings things down
  • 2018-01-26 - Hub starts up very slow, causing outage for users
  • 2018-02-06 - Azure PD refuses to detach, causing downtime for data100
  • 2018-02-28 - A node hangs, causing a subset of users to report issues
  • 2018-06-11 - Azure billing issue causes downtime
  • 2019-02-25 - Azure Kubernetes API Server outage causes downtime
  • 2019-05-01 - Service Account key leak incident

© Copyright 2019, Division of Data Sciences Technical Staff.
Created using Sphinx 3.5.1.