Continuous Integration and Deployment

Overview

DataHub’s continuous integration and deployment system uses both Github Actions and workflows.

These workflows are stored in the DataHub repo in the .github/workflows/ directory.

The basic order of operations is as follows:

A pull request is created in the datahub repo.
The labeler workflow applies labels based on the file type and/or location.
When the pull request is merged to staging, if the labels match any hub, support or node placeholder deployments those specific systems are deployed.
When the pull request is merged to prod, only the hubs that have been modified are deployed (again based on labels).

The hubs are deployed via hubploy, which is our custom wrapper for gcloud, sops and helm.

Github Actions Architecture

Secrets and Variables

All of these workflows depend on a few Actions secrets and variables, with some at the organization level, and others at the repository level.

Organization secrets and variables

GitHub Actions settings contain all of the organizational secrets and variables.

Organization Secrets

DATAHUB_CREATE_PR: This secret is a fine-grained personal access token, and has the following permissions defined:

Select repositories (only berkeley-dsep-infra/datahub)
Repository permissions: Contents (read/write), Metadata (read only), Pull requests (read/write)

When adding a new image repository in the berkeley-dsep-infra org, you must edit this secret and manually add this repository to the access list.

Important

This PAT has an lifetime of 366 days, and should be rotated at the beginning of every maintenance window.

GAR_SECRET_KEY and GAR_SECRET_KEY_EDX: These secrets are for the GCP IAM roles for each GCP project given roles/storage.admin permissions. This allows us to push the built images to the Artifact Registry.

When adding a new image repository in the berkeley-dsep-infra org, you must edit this secret and manually add this repository to the access list.

Organization Variables

IMAGE_BUILDER_BOT_EMAIL and IMAGE_BUILDER_BOT_NAME: These are used to set the git identity in the image build workflow step that pushes a commit and creates a PR in the datahub repo.

DataHub repository secrets

GCP_PROJECT_ID: This is the name of our GCP project.
GKE_KEY: This key is used in the workflows that deploy the support and node-placeholder namespaces. It’s attached to the hubploy service account, and has the assigned roles of roles/container.clusterViewer and roles/container.developer.
SOPS_KEY: This key is used to decrypt our secrets using sops, and is attached to the sopsaccount service account and provides KMS access.

User Image Repository Variables

Each image repository contains two variables, which are used to identify the name of the hub, and the path within the Artifact Registry that it’s published to.

HUB: The name of the hub, natch! datahub, data100, etc.
IMAGE: The path within the artifact registry: ucb-datahub-2018/user-images/<hubname>-user-image

Single user server image modification workflow

Each hub’s user image is located in the berkeley-dsep-infra’s organization. When a pull request is submitted, there are two workflows that run:

When both tests pass, and the pull request is merged in to the main branch, a third and final workflow is run:

Build push and create PR

This builds the image again, and when successful pushes it to our Google Artifact Registry and creates a pull request in the datahub repository with the updated image tag for that hub’s deployment.

Updating the datahub repository

Single user server image tag updates

When a pull request is opened to update one or more image tags for deployments, the labeler will apply the hub: <hubname> label upon creation. When this pull request is merged, the deploy-hubs workflow is triggered.

This workflow will then grab the labels from the merged pull request, see if any hubs need to be deployed and if so, execute a python script that checks the environment variables within that workflow for hubs, and emits a list of what’s to be deployed.

That list is iterated over, and hubploy is used to deploy only the flagged hubs.

%% State diagram documentation at
%% https://mermaid.js.org/syntax/stateDiagram.html

stateDiagram-v2
    image_repo: github.com/berkeley-dsep-infra/hubname-user-image
    user_repo: github.com/username/hubname-user-image
    image_test_build: Image is built and tested
    image_push_build: Image is built and pushed to registry
    pr_created: A pull request is automatically<br/>created in the Datahub repo
    deploy_to_staging: Hub is deployed to staging
    contributor_tests: The contributor logs into the<br/>staging hub and tests the image.
    deploy_to_prod: Hub is deployed to prod

    image_repo --> user_repo: Contributor forks the image repo.
    user_repo --> image_repo: Contributor creates a PR.
    image_repo --> image_test_build
    image_test_build --> image_push_build: Test build passes and Datahub staff merge pull request
    image_push_build --> pr_created
    pr_created --> deploy_to_staging: Datahub staff review and merge to staging
    deploy_to_staging --> contributor_tests
    contributor_tests --> deploy_to_prod: Datahub staff create a PR to merge to prod

Support and node-placeholder charts

Each of these deployments has their own workflow, which only runs on pushes to staging:

If the correct labels are found, it will use the GKE_KEY secret to run helm upgrade for the necessary deployments.

Miscellaneous workflows

There are also a couple of other workflows in the datahub repository:

prevent-prod-merges.yml: This workflow will only allow us to merge to prod from staging.
quarto-docs.yml: This builds, renders and pushes our docs to Github Pages.

Documentation’s Workflow

This documentation is also deployed by GitHub Actions.

Original Design Document

Slides describe the process in some more detail.