Kubernetes Cluster Configuration

We use kubernetes to run our JupyterHubs. It has a healthy open source community, managed offerings from multiple vendors & a fast pace of development. We can run easily on many different cloud providers with similar config by running on top of Kubernetes, so it is also our cloud agnostic abstraction layer.

We prefer using a managed Kubernetes service (such as Google Kubernetes Engine). This document lays out our preferred cluster configuration on various cloud providers.

Google Kubernetes Engine

In our experience, Google Kubernetes Engine (GKE) has been the most stable, performant, and reliable managed kubernetes service. We prefer running on this when possible.

A gcloud container clusters create command can succintly express the configuration of our kubernetes cluster. The following command represents the currently favored configuration.

gcloud container clusters create \
     --enable-ip-alias \
     --enable-autoscaling \
     --max-nodes=20 --min-nodes=1 \
     --region=us-central1 --node-locations=us-central1-b \
     --image-type=cos \
     --disk-size=100 --disk-type=pd-balanced \
     --machine-type=n1-highmem-8 \
     --cluster-version latest \
     --no-enable-autoupgrade \
     --enable-network-policy \
     --create-subnetwork="" \
     --tags=hub-cluster \
     <cluster-name>
gcloud container node-pools create  \
    --machine-type n1-highmem-8 \
    --num-nodes 2 \
    --enable-autoscaling \
    --min-nodes 1 --max-nodes 20 \
    --node-labels hub.jupyter.org/pool-name=beta-pool \
    --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
    --region=us-central1 \
    --image-type=cos \
    --disk-size=200 --disk-type=pd-balanced \
    --no-enable-autoupgrade \
    --tags=hub-cluster \
    --cluster=fall-2019 \
    user-pool-<pool-name>-<yyyy>-<mm>-<dd>

IP Aliasing

--enable-ip-alias creates VPC Native Clusters.

This becomes the default soon, and can be removed once it is the default.

Autoscaling

We use the kubernetes cluster autoscaler to scale our node count up and down based on demand. It waits until the cluster is completely full before triggering creation of a new node - but that’s ok, since new node creation time on GKE is pretty quick.

--enable-autoscaling turns the cluster autoscaler on.

--min-nodes sets the minimum number of nodes that will be maintained regardless of demand. This should ideally be 2, to give us some headroom for quick starts without requiring scale ups when the cluster is completely empty.

--max-nodes sets the maximum number of nodes that the cluster autoscaler will use - this sets the maximum number of concurrent users we can support. This should be set to a reasonably high number, but not too high - to protect against runaway creation of hundreds of VMs that might drain all our credits due to accident or security breach.

Highly available master

The kubernetes cluster’s master nodes are managed by Google Cloud automatically. By default, it is deployed in a non-highly-available configuration - only one node. This means that upgrades and master configuration changes cause a few minutes of downtime for the kubernetes API, causing new user server starts / stops to fail.

We request our cluster masters to have highly available masters with --region parameter. This specifies the region where our 3 master nodes will be spread across in different zones. It costs us nothing extra, so we should always do it.

By default, asking for highly available masters also asks for 3x the node count, spread across multiple zones. We don’t want that, since all our user pods have in-memory state & can’t be relocated. Specifying --node-locations explicitly lets us control how many and which zones the nodes are located in.

Region / Zone selection

We generally use the us-central1 region and a zone in it for our clusters - simply because that is where we have asked for quota.

There are regions closer to us, but latency hasn’t really mattered so we are currently still in us-central1. There are also unsubstantiated rumors that us-central1 is their biggest data center and hence less likely to run out of quota.

Disk Size

--disk-size sets the size of the root disk on all the kubernetes nodes. This isn’t used for any persistent storage such as user home directories. It is only used ephemerally for the operations of the cluster - primarily storing docker images and other temporary storage. We can make this larger if we use a large number of big images, or if we want our image pulls to be faster (since disk performance increases with disk size ).

--disk-type=pd-standard gives us standard spinning disks, which are cheaper. We can also request SSDs instead with --disk-type=pd-ssd - it is much faster, but also much more expensive. We compromise with --disk-type=pd-balanced, faster than spinning disks but not as fast as ssds all the time.

Node size

--machine-type lets us select how much RAM and CPU each of our nodes have. For non-trivial hubs, we generally pick n1-highmem-8, with 52G of RAM and 8 cores. This is based on the following heuristics:

  1. Students generally are memory limited than CPU limited. In fact, while we have a hard limit on memory use per-user pod, we do not have a CPU limit - it hasn’t proven necessary.

  2. We try overprovision clusters by about 2x - so we try to fit about 100G of total RAM use in a node with about 50G of RAM. This is accomplished by setting the memory request to be about half of the memory limit on user pods. This leads to massive cost savings, and works out ok.

  3. There is a kubernetes limit on 100 pods per node.

Based on these heuristics, n1-highmem-8 seems to be most bang for the buck currently. We should revisit this for every cluster creation.

Cluster version

GKE automatically upgrades cluster masters, so there is generally no harm in being on the latest version available.

Node autoupgrades

When node autoupgrades are enabled, GKE will automatically try to upgrade our nodes whenever needed (our GKE version falling off the support window, security issues, etc). However, since we run stateful workloads, we disable this right now so we can do the upgrades manually.

Network Policy

Kubernetes Network Policy lets you firewall internal access inside a kubernetes cluster, whitelisting only the flows you want. The JupyterHub chart we use supports setting up appropriate NetworkPolicy objects it needs, so we should turn it on for additional security depth. Note that any extra in-cluster services we run must have a NetworkPolicy set up for them to work reliabliy.

Subnetwork

We put each cluster in its own subnetwork, since seems to be a limit on how many clusters you can create in the same network with IP aliasing on - you just run out of addresses. This also gives us some isolation - subnetworks are isolated by default and can’t reach other resources. You must add firewall rules to provide access, including access to any manually run NFS servers. We add tags for this.

Tags

To help with firewalling, we add network tags to all our cluster nodes. This lets us add firewall rules to control traffic between subnetworks.

Cluster name

We try use a descriptive name as much as possible.