Weird upstream ipython bug kills kernels
Summary
A seemingly unrelated change caused user kernels to die on start (making notebook execution impossible) for newly started user servers from about Mar 22 19:30 to Mar 23 09:45. Most users didn’t see any errors until start of class at about 9AM, since they were running servers that were previously started.
Timeline
March 22, around 19:30
A deployment is performed, finally deploying https://github.com/data-8/jupyterhub-k8s/pull/146 to production. It seemed to work fine on -dev, and on prod as well. However, the testing regimen was only to see if a notebook server would show up - not if a kernel would spawn.
Mar 23, 09:08
Students report that their kernels keep dying. This is confirmed to be a problem for all newly launched notebooks, in both prod and dev.
09:16
The last change to the repo (an update of the single-user image) is reverted, to check if that was causing the problem. This does not improve the situation. Debugging continues, but with no obvious angles of attack.
09:41
After debugging produces no obvious culprits, the state of the entire infrastructure for prod is reverted to a known good state from a few days ago. This was done with:
./deploy.py prod data8 25abea764121953538713134e8a08e0291813834
25abea764121953538713134e8a08e0291813834
is the commit hash of a known good commit from March 19. Our disciplined adherence to immutable & reproducible deployment paid off, and we were able to restore new servers to working order with this!
Students are now able to resume working after a server restart. A mass restart is also performed to aid this.
Dev is left in a broken state in an attempt to debug.
09:48
A core Jupyter Notebook dev at BIDS attempts to debug the problem, since it seems to be with the notebook itself and not with JupyterHub.
11:08
Core Jupyter Notebook dev confirms that this makes no sense.
14:55
Attempts to isolate the bug start again, mostly by using git bisect
to deploy different versions of our infrastructure to dev until we find what broke.
15:30
https://github.com/data-8/jupyterhub-k8s/pull/146 is identified as the culprit. It continues to not make sense.
17:25
A very involved and laborious revert of the offending part of the patch is done in https://github.com/jupyterhub/kubespawner/pull/37. Core Jupyter Notebook dev continues to confirm this makes no sense.
https://github.com/data-8/jupyterhub-k8s/pull/152 is also merged, and deployed shortly after verifying that everything (including starting kernels & executing code) works fine on dev. Deployed to prod and everything is fine.
Conclusion
Insufficient testing procedures caused a new kind of outage (kernel dying) that we had not seen before. However, since our infrastructure was immutable & reproducible, our outage really only lasted about 40 minutes (from start of lab when students were starting containers until the revert). Deeper debugging produced a fix, but attempts to understand why the fix works are ongoing.
Update: We have found and fixed the underlying issue
Action items
Process
- Document and formalize the testing process for post-deployment checks.
- Set a short timeout (maybe ten minutes?) after which investigation temporarily stops and we revert our deployment to a known good state.
Upstream KubeSpawner
- Continue investigating https://github.com/jupyterhub/kubespawner/issues/31, which was the core issue that prompted the changes that eventually led to the outage.