# 2017-03-23 - Weird upstream IPython bug kills kernels

## Summary ##

A seemingly unrelated change caused user kernels to die on start (making notebook execution impossible) for newly started user servers from about Mar 22 19:30 to Mar 23 09:45. Most users didn't see any errors until the start of class at about 9 AM, since they were running servers that had been started earlier.

## Timeline ##

### March 22, around 19:30 ###

A deployment is performed, finally deploying https://github.com/data-8/jupyterhub-k8s/pull/146 to production. It seemed to work fine on dev, and on prod as well. However, the testing regimen only checked that a notebook server would start, not that a kernel would spawn.

### Mar 23, 09:08 ###

Students report that their kernels keep dying. This is confirmed to be a problem for all newly launched notebooks, in both prod and dev.

### 09:16 ###

The last change to the repo (an update of the single-user image) is reverted, to check if that was causing the problem. This does not improve the situation. Debugging continues, but with no obvious angles of attack.

### 09:41 ###

After debugging produces no obvious culprits, the state of the entire infrastructure for prod is reverted to a known good state from a few days ago. This is done with:

```bash
./deploy.py prod data8 25abea764121953538713134e8a08e0291813834
```

`25abea764121953538713134e8a08e0291813834` is the hash of a known good commit from March 19. Our disciplined adherence to immutable & reproducible deployment paid off, and we were able to restore new servers to working order with this! Students are now able to resume working after restarting their servers. A mass restart is also performed to help with this.

Dev is left in the broken state to aid debugging.

### 09:48 ###

A core Jupyter Notebook dev at BIDS attempts to debug the problem, since it seems to be with the notebook itself and not with JupyterHub.

### 11:08 ###

The core Jupyter Notebook dev confirms that this makes no sense.

### 14:55 ###

Attempts to isolate the bug start again, mostly by using `git bisect` to deploy different versions of our infrastructure to dev until we find what broke (a sketch of this workflow appears at the end of this document).

### 15:30 ###

https://github.com/data-8/jupyterhub-k8s/pull/146 is identified as the culprit. It continues to not make sense.

### 17:25 ###

A very involved and laborious revert of the offending part of the patch is done in https://github.com/jupyterhub/kubespawner/pull/37. The core Jupyter Notebook dev continues to confirm that this makes no sense. https://github.com/data-8/jupyterhub-k8s/pull/152 is also merged, and deployed shortly after verifying that everything (including starting kernels & executing code) works fine on dev. It is then deployed to prod, and everything is fine.

## Conclusion ##

Insufficient testing procedures caused a new kind of outage (kernel dying) that we had not seen before. However, since our infrastructure was immutable & reproducible, our outage really only lasted about 40 minutes (from the start of lab, when students were starting containers, until the revert). Deeper debugging produced a fix, but attempts to understand why the fix works are ongoing.

**Update**: We have found and fixed the [underlying issue](https://github.com/ipython/ipykernel/pull/233).

## Action items ##

### Process ###

1. Document and formalize the testing process for post-deployment checks.
2. Set a short timeout (maybe ten minutes?) after which investigation temporarily stops and we revert our deployment to a known good state.

### Upstream KubeSpawner ###

1. Continue investigating https://github.com/jupyterhub/kubespawner/issues/31, which was the core issue that prompted the changes that eventually led to the outage.
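
For reference, here is a minimal sketch of the bisect-driven debugging workflow mentioned at 14:55. The `dev` argument to `./deploy.py` and the manual kernel check are assumptions for illustration; the actual testing at each step was done by hand against the dev deployment.

```bash
# Hypothetical sketch of the bisect workflow against dev (not the exact
# commands from the incident). Assumes ./deploy.py accepts the same
# arguments for dev as the prod revert shown above.
git bisect start
git bisect bad HEAD                                       # current state kills kernels
git bisect good 25abea764121953538713134e8a08e0291813834  # known good commit from March 19

# For each commit git bisect checks out: deploy it to dev, start a server,
# open a notebook, and run a cell to see whether the kernel survives.
./deploy.py dev data8 "$(git rev-parse HEAD)"
# ...manually test kernel startup on dev...
git bisect good    # or `git bisect bad`, depending on the result

# Repeat until git bisect reports the first bad commit, then clean up.
git bisect reset
```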