Create a New Hub
Why create a new hub?
The major reasons for making a new hub are:
- A new course wants to join the Berkeley DataHub community.
- One of your students is course staff in another course and has elevated access, enabling them to see other students’ work.
- You want to use a different kind of authenticator.
- You are running in a different cloud, or using a different billing account.
- Your environment is different enough and specialized enough that a different hub is a good idea. By default, everyone uses the same image as datahub.berkeley.edu.
- You want a different URL (X.datahub.berkeley.edu vs just datahub.berkeley.edu).
Please let us know if you have some other justification for creating a new hub.
Prerequisites
Working installs of the following utilities:
- chartpress
- cookiecutter
- hubploy

The easiest way to install chartpress, cookiecutter, and hubploy is to run pip install -r dev-requirements.txt from the root of the datahub repo.
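For example, a typical setup might look like the following. This is a sketch, not an exact prescription: the repository URL is assumed to be github.com/berkeley-dsep-infra/datahub, and the virtual environment is optional.

```bash
# Clone the repo (URL assumed) and install the helper utilities into a virtualenv.
git clone https://github.com/berkeley-dsep-infra/datahub.git
cd datahub
python3 -m venv .venv && source .venv/bin/activate
pip install -r dev-requirements.txt

# Sanity-check that the tools are on the PATH.
cookiecutter --version
chartpress --help
hubploy --help
```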
Proper access to the following systems:
- Google Cloud IAM: owner
- Write access to the datahub repo
- Owner or admin access to the berkeley-dsep-infra organization
Configuring a New Hub
Name the hub
Choose the hub name, e.g. data8, stat20, biology, julia, which is typically the name of the course or department. This is permanent.
Determine deployment needs
Before creating a new hub, have a discussion with the instructor about the system requirements, the frequency of assignments, and how much storage the course will require. Typically, there are three general “types” of hub: heavy usage, general, and small courses.
- Small courses will usually have one or two assignments per semester, and may only have 20 or fewer users.
- General courses have up to ~500 users, but don’t have large amounts of data or require upgraded compute resources.
- Heavy usage courses can potentially have thousands of users, require upgraded node specs, and/or have terabytes of data each semester.

Both general and heavy usage courses typically have weekly assignments.
Small courses (and some general usage courses) can use either or both of a shared node pool and a shared filestore to save money (Basic HDD filestore instances start at 1TiB).
This is also a good time to determine if there are any specific software packages/libraries that need to be installed, as well as what language(s) the course will be using. This will determine which image to use, and if we will need to add additional packages to the image build.
If you’re going to use an existing node pool and/or filestore instance, you can skip either or both of the following steps and pick back up at the cookiecutter step.
When creating a new hub, we also make sure to label the filestore and GKE/node pool resources with both hub and <nodepool|filestore>-deployment. 99.999% of the time, the values for all three of these labels will be <hubname>.
Creating a new node pool
Create the node pool:
gcloud container node-pools create "user-<hubname>-<YYYY-MM-DD>" \
--labels=hub=<hubname>,nodepool-deployment=<hubname> \
--node-labels hub.jupyter.org/pool-name=<hubname>-pool \
--machine-type "n2-highmem-8" \
--enable-autoscaling --min-nodes "0" --max-nodes "20" \
--project "ucb-datahub-2018" --cluster "spring-2024" \
--region "us-central1" --node-locations "us-central1-b" \
--node-taints hub.jupyter.org_dedicated=user:NoSchedule --tags hub-cluster \
--image-type "COS_CONTAINERD" --disk-type "pd-balanced" --disk-size "200" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--no-enable-autoupgrade --enable-autorepair \
--max-surge-upgrade 1 --max-unavailable-upgrade 0 --max-pods-per-node "110"
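To double-check the pool after creation, a describe call should show the labels, taints, and autoscaling settings you just set. A quick verification sketch, using the same project/cluster/region values as above:

```bash
# Verify the new node pool exists; inspect its labels, taints, and autoscaling settings.
gcloud container node-pools describe "user-<hubname>-<YYYY-MM-DD>" \
  --project "ucb-datahub-2018" --cluster "spring-2024" --region "us-central1"
```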
Creating a new filestore instance
Before you create a new filestore instance, be sure you know the capacity required. The smallest capacity you can allocate is 1TiB, but larger hubs may require more. Confer with the admins and the course instructors to determine how much they think they will need.
We can easily scale capacity up, but not down.
From the command line, first fill in the instance name (<hubname>-<YYYY-MM-DD>) and the capacity (shown here as 1TiB), and then execute the following command:
gcloud filestore instances create <hubname>-<YYYY-MM-DD> \
--zone "us-central1-b" --tier="BASIC_HDD" \
--file-share=capacity=1TiB,name=shares \
--network=name=default,connect-mode=DIRECT_PEERING
Or, from the web console, click on the three horizontal bar icon at the top left corner, then:
- Access “Filestore” > “Instances” and click on “Create Instance”.
- Name the instance <hubname>-<YYYY-MM-DD>.
- Set the Instance Type to Basic and the Storage Type to HDD.
- Allocate capacity.
- Set the region to us-central1 and the zone to us-central1-b.
- Set the VPC network to default.
- Set the File share name to shares.
- Click “Create” and wait for it to be deployed.
- Once it’s deployed, select the instance and copy the “NFS mount point”.
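If you created the instance from the command line instead, the same details (including the IP address you’ll need below) are available via a describe call. A sketch:

```bash
# Shows the file share, capacity, labels, and the IP address used as the NFS mount point.
gcloud filestore instances describe <hubname>-<YYYY-MM-DD> --zone "us-central1-b"
```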
Your new (but empty) NFS filestore must be seeded with a pair of directories. We run a utility VM for NFS filestore management; follow the steps below to connect to this utility VM, mount your new filestore, and create & configure the required directories.
You can run the following gcloud command in a terminal to log in to the NFS utility VM:
gcloud compute ssh nfsserver-01 --zone=us-central1-b --tunnel-through-iap
Alternatively, go to console.cloud.google.com and select ucb-datahub-2018 as the project, then:
- Click on the three horizontal bar icon at the top left corner.
- Access “Compute Engine” > “VM instances” and search for “nfsserver-01”.
- Select the “Open in browser window” option to access the NFS server from your browser.
Back in the NFS utility VM shell, mount the new share:
mkdir /export/<hubname>-filestore
mount <filestore share IP>:/shares /export/<hubname>-filestore
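A quick sanity check that the mount succeeded (sketch):

```bash
# The new share should appear with its allocated capacity.
df -h /export/<hubname>-filestore
```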
Create staging and prod directories, owned by 1000:1000, under /export/<hubname>-filestore/<hubname>. The path might differ if your hub has special home directory storage needs; consult the admins if that’s the case. Here is the command to create the directories with the appropriate ownership:
install -d -o 1000 -g 1000 \
  /export/<hubname>-filestore/<hubname>/staging \
  /export/<hubname>-filestore/<hubname>/prod
Check whether the directories have permissions similar to the directories below:
drwxr-xr-x 4 ubuntu ubuntu 45 Nov 3 20:33 a11y-filestore
drwxr-xr-x 4 ubuntu ubuntu 33 Jan 4 2022 astro-filestore
drwxr-xr-x 4 ubuntu ubuntu 16384 Aug 16 18:45 biology-filestore
Create the hub deployment locally
In the datahub/deployments directory, run cookiecutter. This sets up the hub’s configuration directory:
cookiecutter template/
The cookiecutter template will prompt you to provide the following information:
- <hub_name>: Enter the chosen name of the hub.
- <project_name>: Default is ucb-datahub-2018, do not change.
- <cluster_name>: Default is spring-2024, do not change.
- <pool_name>: Name of the node pool (shared or individual) to deploy on.
- hub_filestore_share: Default is shares, do not change.
- hub_filestore_ip: Enter the IP address of the filestore instance. This is available from the web console.
- hub_filestore_capacity: Enter the allocated storage capacity. This is available from the web console.
This will generate a directory with the name of the hub you provided, containing a skeleton configuration and all the necessary secrets.
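The exact layout comes from the template, but based on the files referenced later in this guide you should see at least something like the following (illustrative, not exhaustive):

```bash
ls deployments/<hubname>/
# hubploy.yaml                        image and deployment settings (reviewed below)
# config/filestore/squash-flags.json  used when applying the ROOT_SQUASH settings
# secrets/staging.yaml                sops-encrypted secrets for the staging hub
# secrets/prod.yaml                   sops-encrypted secrets for the production hub
```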
Configure filestore security settings and GCP billing labels
If you have created a new filestore instance, you will now need to apply the ROOT_SQUASH settings. Please ensure that you’ve already created the hub’s root directory and both staging and prod directories, otherwise you will lose write access to the share. We also attach labels to a new filestore instance for tracking individual and full hub costs.
Skip this step if you are using an existing/shared filestore.
gcloud filestore instances update <filestore-instance-name> --zone=us-central1-b \
--update-labels=hub=<hubname>,filestore-deployment=<hubname> \
--flags-file=<hubname>/config/filestore/squash-flags.json
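To confirm the update took effect, a describe call should show the new labels and export options. This is a sketch; the output field names may differ slightly across gcloud versions:

```bash
# Check the labels and NFS export options after applying ROOT_SQUASH.
gcloud filestore instances describe <filestore-instance-name> \
  --zone "us-central1-b" --format="yaml(labels, fileShares)"
```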
Authentication
Set up authentication via bcourses. We have two Canvas OAuth2 clients set up in bcourses for us: one for all production hubs and one for all staging hubs. The configuration and secrets for these are provided by the cookiecutter template; however, new hubs need to be added to the authorized callback list maintained in bcourses.
- Use sops to edit secrets/staging.yaml and secrets/prod.yaml, replacing the cookiecutter hub_name. cookiecutter can’t do this for you since the values are encrypted.
- Add <hub_name>-staging.datahub.berkeley.edu/hub/oauth_callback to the staging hub client (id 10720000000000594).
- Add <hub_name>.datahub.berkeley.edu/hub/oauth_callback to the production hub client (id 10720000000000472).
- Copy gke-key.json from any other hub’s secrets to the new hub’s secrets/ directory.
Please reach out to Jonathan Felder to set this up, or bcourseshelp@berkeley.edu if he is not available.
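For the local edits (the sops changes and copying gke-key.json), a minimal sketch, assuming sops is already configured with access to the relevant KMS key and using an arbitrary existing hub as the source:

```bash
cd deployments/<hubname>

# sops decrypts the file into your $EDITOR and re-encrypts it on save;
# replace the cookiecutter hub_name value in each file.
sops secrets/staging.yaml
sops secrets/prod.yaml

# Copy the GKE service account key from any existing hub's secrets.
cp ../<existing-hub>/secrets/gke-key.json secrets/gke-key.json
```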
CI/CD and single-user server image
CI/CD is managed through GitHub Actions, and the relevant workflows are located in .github/workflows/. Deployment of all hubs is managed via pull request labels, which are applied automatically on PR creation.
To ensure the new hub is deployed, all that needs to be done is to add a new entry (alphabetically) in .github/labeler.yml under the # add hub-specific labels for deployment changes stanza:
"hub: <hubname>":
- "deployments/<hubname>/**"
Hubs using a custom single-user server image
If this hub will be using its own image, then follow the instructions here to create the new image and repository. In this case, the image tag will be PLACEHOLDER and will be updated AFTER your PR to datahub is merged.
NOTE: The changes to the datahub repo must be merged BEFORE the new image configuration is pushed to main in the image repo. This is because the image building/pushing workflow requires this deployment’s hubploy.yaml to be present in the deployments/<hubname>/ subdirectory, as it updates the image tag.
Hubs inheriting an existing single-user server image
If this hub will inherit an existing image, you can just copy hubploy.yaml from an existing deployment, which will contain the latest image hash.
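For example (a sketch; any current deployment that uses the image you want works as the source):

```bash
# Reuse the image pin from an existing deployment.
cp deployments/<existing-hub>/hubploy.yaml deployments/<hubname>/hubploy.yaml
```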
Review the deployment’s hubploy.yaml
Next, review hubploy.yaml inside your project directory to confirm that it looks cromulent. An example from the a11y hub:
images:
  images:
    - name: us-central1-docker.pkg.dev/ucb-datahub-2018/user-images/a11y-user-image:<image tag OR "PLACEHOLDER">
Create placeholder node pool
Node pools have a configured minimum size, but our cluster has the ability to set aside additional placeholder nodes. These are nodes that get spun up in anticipation of the pool needing to suddenly grow in size, for example when large classes begin.
If you are deploying to a shared node pool, there is no need to perform this step.
Otherwise, you’ll need to add the placeholder settings in node-placeholder/values.yaml.
The node placeholder pod should have enough RAM allocated to it that it needs to be kicked out to get even a single user pod on the node - but not so big that it can’t run on a node where other system pods are running! To do this, we’ll find out how much memory is allocatable to pods on that node, then subtract the sum of all non-user pod memory requests and an additional 277872640 bytes (~265Mi) of “wiggle room”. This final number will be used to allocate RAM for the node placeholder.
- Launch a server on https://<hubname>.datahub.berkeley.edu.
- Get the node name (it will look something like gke-spring-2024-user-datahub-2023-01-04-fc70ea5b-67zs):
  kubectl get nodes | grep <hubname> | awk '{print $1}'
- Get the total amount of memory allocatable to pods on this node and convert to bytes:
  kubectl get node <nodename> -o jsonpath='{.status.allocatable.memory}'
- Get the total memory used by non-user pods/containers on this node. We explicitly ignore notebook and pause. Convert to bytes and get the sum:
  kubectl get -A pod -l 'component!=user-placeholder' \
    --field-selector spec.nodeName=<nodename> \
    -o jsonpath='{range .items[*].spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\n"}{end}' \
    | egrep -v 'pause|notebook'
- Subtract the second number from the first, and then subtract another 277872640 bytes (~265Mi) for “wiggle room”.
- Add an entry for the new placeholder node config in values.yaml:
data102:
  nodeSelector:
    hub.jupyter.org/pool-name: data102-pool
  resources:
    requests:
      # Some value slightly lower than allocatable RAM on the node pool
      memory: 60929654784
  replicas: 1
For reference, here’s example output from collecting and calculating the values for data102:
(gcpdev) ➜ ~ kubectl get nodes | grep data102 | awk '{print$1}'
gke-spring-2024-user-data102-2023-01-05-e02d4850-t478
(gcpdev) ➜ ~ kubectl get node gke-spring-2024-user-data102-2023-01-05-e02d4850-t478 -o jsonpath='{.status.allocatable.memory}' # convert to bytes
60055600Ki%
(gcpdev) ➜ ~ kubectl get -A pod -l 'component!=user-placeholder' \
  --field-selector spec.nodeName=gke-spring-2024-user-data102-2023-01-05-e02d4850-t478 \
  -o jsonpath='{range .items[*].spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\n"}{end}' \
  | egrep -v 'pause|notebook' # convert all values to bytes, sum them
calico-node
fluentbit 100Mi
fluentbit-gke 100Mi
gke-metrics-agent 60Mi
ip-masq-agent 16Mi
kube-proxy
prometheus-node-exporter
(gcpdev) ➜ ~ # subtract the sum of the second command's values from the first value, then subtract another 277872640 bytes for wiggle room
(gcpdev) ➜ ~ # in this case: (60055600Ki - (100Mi + 100Mi + 60Mi + 16Mi)) - 265Mi
(gcpdev) ➜ ~ # (61496934400 - (104857600 + 104857600 + 16777216 + 62914560)) - 277872640 == 60929654784
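If you do this often, the arithmetic can be scripted. The following is a hypothetical helper (not part of the repo), assuming GNU numfmt is available and kubectl is pointed at the cluster:

```bash
#!/usr/bin/env bash
# Compute a placeholder memory request for a node:
# allocatable memory minus non-user pod requests minus wiggle room.
set -euo pipefail

NODE="$1"                 # e.g. gke-spring-2024-user-data102-2023-01-05-e02d4850-t478
WIGGLE=277872640          # same wiggle-room value used in the example above

# Allocatable memory on the node, converted to bytes.
ALLOCATABLE=$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.memory}' \
  | numfmt --from=auto)

# Sum of memory requests from non-user, non-notebook, non-pause containers, in bytes.
REQUESTED=$(kubectl get -A pod -l 'component!=user-placeholder' \
  --field-selector spec.nodeName="$NODE" \
  -o jsonpath='{range .items[*].spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\n"}{end}' \
  | egrep -v 'pause|notebook' \
  | awk '{print $2}' | grep . \
  | numfmt --from=auto \
  | awk '{sum += $1} END {print sum}')

echo $(( ALLOCATABLE - REQUESTED - WIGGLE ))
```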
Besides setting defaults, we can dynamically change the placeholder counts by adding new calendar events or editing existing ones. This is useful for large courses, which can have placeholder nodes set aside for predictable periods of heavy ramp up.
Commit and deploy to staging
Commit the hub directory, and make a PR to the staging branch in the GitHub repo.
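The git workflow is the usual PR flow against staging; a sketch (the branch name is arbitrary):

```bash
# From the root of your datahub checkout.
git checkout staging && git pull
git checkout -b add-<hubname>-hub
git add deployments/<hubname>/ .github/labeler.yml
git commit -m "Add <hubname> hub"
git push origin add-<hubname>-hub
# Then open a pull request against the staging branch on GitHub.
```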
Hubs using a custom single-user server image
If this hub is using a custom image, and you’re using PLACEHOLDER for the image tag in hubploy.yaml, be sure to remove the hub-specific GitHub label that is automatically attached to this pull request. It will look something like hub: <hubname>. If you don’t do this, the deployment will fail, as the image sha PLACEHOLDER doesn’t exist.
After this PR is merged, perform the git push in your image repo. This will trigger the workflow that builds the image, pushes it to the Artifact Registry, and finally creates a commit that updates the image hash in hubploy.yaml and pushes it to the datahub repo. Once this is merged into staging, the deployment pipeline will run and your hub will finally be deployed.
Hubs inheriting an existing single-user server image
Your hub’s deployment will proceed automatically through the CI/CD pipeline.
It might take a few minutes for HTTPS to work, but after that you can log into it at https://<hub_name>-staging.datahub.berkeley.edu. Test it out and make sure things work as you think they should.
Commit and deploy to prod
Make a PR from the staging branch to the prod branch. When this PR is merged, it’ll deploy the production hub. It might take a few minutes for HTTPS to work, but after that you can log into it at https://<hub_name>.datahub.berkeley.edu. Test it out and make sure things work as you think they should.
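Once the production deployment finishes, a couple of quick checks can confirm the hub is healthy. This is a sketch; the namespace name is an assumption, so adjust it to whatever convention your cluster uses:

```bash
# Hypothetical namespace; substitute your cluster's actual naming convention.
kubectl -n <hubname>-prod get pods
kubectl -n <hubname>-prod get ingress

# Confirm the hub answers over HTTPS (expect an HTTP 200 or a redirect).
curl -sI https://<hub_name>.datahub.berkeley.edu/hub/login | head -n 1
```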