UC Berkeley JupyterHubs
We use Kubernetes to run our JupyterHubs. It has
a healthy open source community, managed offerings from multiple vendors, and
a fast pace of development. Running on top of Kubernetes lets us run on many
different cloud providers with similar configuration, so it is also
our cloud-agnostic abstraction layer.
We prefer using a managed Kubernetes service (such as Google Kubernetes Engine). This document lays out our
preferred cluster configuration on various cloud providers.
In our experience, Google Kubernetes Engine (GKE) has been the most stable,
performant, and reliable managed Kubernetes service. We prefer running on it.
A gcloud container clusters create command can succinctly express the
configuration of our Kubernetes cluster. The following command represents
the currently favored configuration.
gcloud container clusters create \
--enable-ip-alias \
--enable-autoscaling \
--max-nodes=20 --min-nodes=1 \
--region=us-central1 --node-locations=us-central1-b \
--disk-size=100 --disk-type=pd-balanced \
--machine-type=n1-highmem-8 \
--cluster-version latest \
<cluster-name>

gcloud container node-pools create \
--machine-type n1-highmem-8 \
--num-nodes 2 \
--enable-autoscaling \
--min-nodes 1 --max-nodes 20 \
--node-labels hub.jupyter.org/pool-name=beta-pool \
--node-taints hub.jupyter.org_dedicated=user:NoSchedule \
--region=us-central1 \
--disk-size=200 --disk-type=pd-balanced \
--cluster=<cluster-name> \
<pool-name>
--enable-ip-alias creates a VPC-native cluster.
This is expected to become the default, at which point the flag can be removed.
We use the Kubernetes cluster autoscaler
to scale our node count up and down based on demand. It waits until the cluster is completely full
before triggering creation of a new node - but that’s ok, since new node creation time on GKE is short.
--enable-autoscaling turns the cluster autoscaler on.
--min-nodes sets the minimum number of nodes that will be maintained
regardless of demand. This should ideally be 2, to give us some headroom for
quick starts without requiring scale ups when the cluster is completely empty.
--max-nodes sets the maximum number of nodes that the cluster autoscaler
will use - this sets the maximum number of concurrent users we can support.
This should be set to a reasonably high number, but not too high - to protect
against runaway creation of hundreds of VMs that might drain all our credits
due to accident or security breach.
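As a rough illustration of how --max-nodes caps concurrent users, the sketch below multiplies the node ceiling by a per-node user count. The function name and the figure of 50 users per node are assumptions made for the example, not values from our configuration.

```shell
# Upper bound on concurrent users implied by --max-nodes.
# users_per_node is an assumed planning figure (hypothetical), not a measured value.
max_concurrent_users() {
    max_nodes=$1
    users_per_node=$2
    echo $(( max_nodes * users_per_node ))
}

# With --max-nodes=20 and an assumed 50 users per node:
max_concurrent_users 20 50   # prints 1000
```

Running the same calculation with a candidate --max-nodes value is a quick sanity check that the ceiling is high enough for the expected class sizes without being needlessly large.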
The Kubernetes cluster’s master nodes are managed by Google Cloud automatically.
By default, the master is deployed in a non-highly-available configuration - only one
node. This means that upgrades and master configuration changes cause a few minutes
of downtime for the Kubernetes API, causing new user server starts / stops to fail.
We request highly available masters
with the --region parameter. This specifies the region whose zones our 3 master nodes
will be spread across. It costs us nothing extra, so we should
always do it.
By default, asking for highly available masters also asks for 3x the node count,
spread across multiple zones. We don’t want that, since all our user pods have
in-memory state & can’t be relocated. Specifying --node-locations explicitly
lets us control how many and which zones the nodes are located in.
We generally use the us-central1 region and a zone in it for our clusters -
simply because that is where we have asked for quota.
There are regions closer to us, but latency hasn’t really mattered so we are
currently still in us-central1. There are also unsubstantiated rumors that us-central1 is Google’s
biggest data center and hence less likely to run out of quota.
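Since --region and --node-locations must agree (the node zone has to belong to the master region), a small check can catch mismatches before a cluster is created. The sketch below is only an illustration: the function name is hypothetical, and it assumes zone names follow the usual region-plus-letter pattern.

```shell
# Print the location flags for a regional master with nodes in a single zone.
# Fails if the zone does not look like it belongs to the region
# (assumes zones are named <region>-<letter>).
location_flags() {
    region=$1
    zone=$2
    case "$zone" in
        "$region"-?) ;;
        *) echo "zone $zone is not in region $region" >&2; return 1 ;;
    esac
    echo "--region=$region --node-locations=$zone"
}

location_flags us-central1 us-central1-b
# prints: --region=us-central1 --node-locations=us-central1-b
```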
--disk-size sets the size of the root disk on all the Kubernetes nodes. This
isn’t used for any persistent storage such as user home directories. It is only
used ephemerally for the operations of the cluster - primarily storing Docker
images and other temporary storage. We can make this larger if we use a large number
of big images, or if we want our image pulls to be faster (since disk performance
increases with disk size).
--disk-type=pd-standard gives us standard spinning disks, which are cheaper. We
can also request SSDs instead with --disk-type=pd-ssd - they are much faster,
but also much more expensive. We compromise with --disk-type=pd-balanced, which is faster
than spinning disks but not always as fast as SSDs.
--machine-type lets us select how much RAM and CPU
each of our nodes has. For non-trivial hubs, we generally pick n1-highmem-8, with 52G
of RAM and 8 cores. This is based on the following heuristics:
Students are generally more memory limited than CPU limited. In fact, while we
have a hard limit on memory use per-user pod, we do not have a CPU limit -
it hasn’t proven necessary.
We try to overprovision clusters by about 2x - so we try to fit about 100G of total RAM
use in a node with about 50G of RAM. This is accomplished by setting the memory
request to be about half of the memory limit on user pods. This leads to massive
cost savings, and works out ok.
Kubernetes limits the number of pods per node to roughly 100 (GKE's default maximum is 110).
Based on these heuristics, n1-highmem-8 seems to offer the most bang for the buck
currently. We should revisit this for every cluster creation.
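The heuristics above can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the scheduler packs pods by memory request, that the request is half the memory limit, and that all node RAM is schedulable (in practice some is reserved for system use). The function name and the example memory limits are hypothetical.

```shell
# pods_per_node NODE_RAM_GB MEM_LIMIT_MB POD_CAP
# Estimates how many user pods fit on one node when the memory request
# is half of the memory limit (about 2x overprovisioning), capped at
# the per-node pod limit.
pods_per_node() {
    node_ram_mb=$(( $1 * 1024 ))
    request_mb=$(( $2 / 2 ))
    by_memory=$(( node_ram_mb / request_mb ))
    if [ "$by_memory" -gt "$3" ]; then
        echo "$3"
    else
        echo "$by_memory"
    fi
}

pods_per_node 52 2048 100   # 2G limit, 1G request: prints 52
pods_per_node 52 512 100    # 512M limit, 256M request: 208 by memory, prints 100
```

With small memory limits the pod cap, not RAM, becomes the binding constraint, which is one reason to revisit the machine type whenever the per-user limits change.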
GKE automatically upgrades cluster masters, so there is generally no harm in being
on the latest version available.
When node autoupgrades are enabled, GKE will automatically try to
upgrade our nodes whenever needed (our GKE version falling off the
support window, security issues, etc). However, since we run stateful
workloads, we disable this right now so we can do the upgrades manually.
Kubernetes Network Policy
lets you firewall internal access inside a Kubernetes cluster, whitelisting
only the flows you want. The JupyterHub chart we use supports setting up
the NetworkPolicy objects it needs, so we should turn it on for
additional security depth. Note that any extra in-cluster services we run
must have a NetworkPolicy set up for them to work reliably.
We put each cluster in its own subnetwork, since there seems to be a limit on how
many clusters you can create in the same network with IP aliasing on - you
just run out of addresses. This also gives us some isolation - subnetworks
are isolated by default and can’t reach other resources. You must add
firewall rules to
provide access, including access to any manually run NFS servers.
We add tags for this.
To help with firewalling, we add network tags
to all our cluster nodes. This lets us add firewall rules to control traffic.
We try to use a descriptive name as much as possible.
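To show how network tags and firewall rules fit together, here is a dry-run sketch that prints a gcloud compute firewall-rules create command for review rather than running it. The rule name, network name and both tags are hypothetical placeholders, and the sketch assumes the NFS server speaks NFSv4 on TCP port 2049 and sits in the same VPC network as the cluster nodes.

```shell
# Dry run: prints a firewall rule command instead of executing it.
# All argument values used below are hypothetical placeholders.
nfs_firewall_rule_cmd() {
    rule_name=$1    # name for the new firewall rule
    network=$2      # VPC network shared by the NFS server and the cluster nodes
    cluster_tag=$3  # network tag applied to the cluster nodes
    nfs_tag=$4      # network tag applied to the NFS server VM
    echo "gcloud compute firewall-rules create $rule_name" \
         "--network=$network --allow=tcp:2049" \
         "--source-tags=$cluster_tag --target-tags=$nfs_tag"
}

nfs_firewall_rule_cmd allow-hub-nfs example-network example-hub-cluster example-nfs-server
```

Printing the command first makes it easy to check the tags against the ones actually set on the nodes before any rule is created.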