Calendar Node Pool Autoscaler
Why scale node pools with Google Calendar?
The scheduler isn’t perfect for us, especially when large classes have assignments due and a hub is flooded with students. This “hack” was introduced to improve cluster scaling ahead of known events.
It works by keeping ‘placeholder’ nodes running, which minimize the delay that occurs while GCP spins up new nodes during mass user logins. This is common, especially for larger classes.
Structure
There is a Google Calendar, DataHub Scaling Events, shared with all infrastructure staff. Each event description should contain a YAML fragment of the form pool_name: count, where pool_name is the corresponding hub name (e.g. data100, stat20) and count is the number of extra nodes you want. Several pools can be defined, one per line.
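For example, a hypothetical event description requesting extra capacity on two hubs might look like:

data100: 2
stat20: 3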
By default, we usually have one spare node ready to go, so if the count in the calendar event is set to 0 or 1, there will be no change to the cluster. If the value is set to >=2, additional hot spares will be created. If a value is set more than once, the entry with the greater value will be used.
You can determine how many placeholder nodes to have up based on how many people you expect to log in at once. Some of the bigger courses may require 2 or more placeholder nodes, but during “regular” hours, 1 is usually sufficient.
The scaling mechanism is implemented as the node-placeholder-node-placeholder-scaler deployment in the node-placeholder namespace. The source code lives at https://github.com/berkeley-dsep-infra/datahub/tree/staging/images/node-placeholder-scaler.
Calendar Autoscaler
The code for the calendar autoscaler is a python 3.11 script, located here: https://github.com/berkeley-dsep-infra/datahub/tree/staging/images/node-placeholder-scaler/scaler
How the scaler works
There is a k8s pod running in the node-placeholder namespace, which simply runs python3 -m scaler. This script runs in an infinite loop, and every 60 seconds it checks the scaler config and the calendar for entries. For any given hub, it uses the highest value found as the number of placeholder replicas. This means that if there’s a daily evening event to ‘cool down’ the number of replicas for all hubs to 0, and a simultaneous event setting one or more hubs to a higher number, the scaler will keep however many node placeholders were specified up and ready to go.
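A minimal sketch of this ‘highest value wins’ resolution (illustrative only; the real logic lives in scaler/scaler.py, e.g. get_replica_counts()):

# Illustrative sketch, not the actual scaler code: take the maximum of the
# config defaults and any calendar overrides for each pool.
def resolve_replicas(defaults, overrides):
    counts = dict(defaults)
    for override in overrides:  # one dict per calendar event active right now
        for pool, count in override.items():
            counts[pool] = max(counts.get(pool, 0), count)
    return counts

# e.g. a default of {'data100': 1}, an active 'cool down to 0' event, and an
# active 'data100: 3' event resolve to {'data100': 3}
print(resolve_replicas({'data100': 1}, [{'data100': 0}, {'data100': 3}]))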
After determining the number of replicas needed for each hub, the scaler will create a k8s template and run kubectl in the pod.
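Below is a rough, illustrative sketch of that step. The manifest fields, pause image, and priority class name shown here are assumptions for demonstration, not the scaler’s actual template:

# Illustrative sketch of "render a manifest, then shell out to kubectl".
# Requires PyYAML; all manifest contents below are assumed for demonstration.
import subprocess
import yaml

def make_placeholder_deployment(pool_name, replicas):
    # A Deployment of low-priority pause pods that hold capacity on the pool's
    # nodes until real user pods preempt them.
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{pool_name}-placeholder", "namespace": "node-placeholder"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": f"{pool_name}-placeholder"}},
            "template": {
                "metadata": {"labels": {"app": f"{pool_name}-placeholder"}},
                "spec": {
                    "priorityClassName": "placeholder",  # assumed priority class name
                    "containers": [
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"},
                    ],
                },
            },
        },
    }

def apply_manifest(manifest):
    # Pipe the rendered manifest to kubectl, roughly what the scaler pod does.
    subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=yaml.dump(manifest).encode(),
        check=True,
    )

apply_manifest(make_placeholder_deployment("data100", 2))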
Updating the scaler config
The scaler config sets the default number of node-placeholders that are running at any given time. These values can be overridden by creating events in the DataHub Scaling Events calendar.
When classes are in session, these defaults are all typically set to 1, and during breaks (or when a hub is not expected to be in use) they can be set to 0.
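As an illustration only (the exact keys here are assumptions; consult the real node-placeholder/values.yaml for the actual structure), the per-hub defaults might look something like:

data100:
  replicas: 1
stat20:
  replicas: 0  # set to 0 during breaks, or when the hub is not expected to be in use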
After making changes to values.yaml, create a PR normally and our CI will push the new config out to the node-placeholder pod. There is no need to manually restart the node-placeholder pod as the changes will be picked up automatically.
Working on, testing and deploying the calendar scaler
All file locations in this section will assume that you are in the datahub/images/node-placeholder-scaler/ directory.
It is strongly recommended that you create a new python 3.11 environment before doing any dev work on the scaler. With conda, you can run the following commands to create one:
conda create -y -n scalertest python=3.11
conda activate scalertest
pip install -r images/node-placeholder-scaler/requirements.txt

Any changes to the scaler code will require you to run chartpress to redeploy the scaler to GCP.
Here is an example of how you can test any changes to scaler/calendar.py locally in the python interpreter:
# these tests will use some dates culled from the calendar with varying numbers of events.
import scaler.calendar
import datetime
import zoneinfo
tz = zoneinfo.ZoneInfo(key='America/Los_Angeles')
zero_events_noon_june = datetime.datetime(2023, 6, 14, 12, 0, 0, tzinfo=tz)
one_event_five_pm_april = datetime.datetime(2023, 4, 27, 17, 0, 0, tzinfo=tz)
three_events_eight_thirty_pm_march = datetime.datetime(2023, 3, 6, 20, 30, 0, tzinfo=tz)
calendar = scaler.calendar.get_calendar('https://calendar.google.com/calendar/ical/c_s47m3m1nuj3s81187k3b2b5s5o%40group.calendar.google.com/public/basic.ics')
zero_events = scaler.calendar.get_events(calendar, time=zero_events_noon_june)
one_event = scaler.calendar.get_events(calendar, time=one_event_five_pm_april)
three_events = scaler.calendar.get_events(calendar, time=three_events_eight_thirty_pm_march)
assert len(zero_events) == 0
assert len(one_event) == 1
assert len(three_events) == 3

get_events returns a list of ical.event.Event class objects.
The method for testing scaler/scaler.py is similar to the above, but the only things you’ll be able to test locally are the make_deployment() and get_replica_counts() functions.
When you’re ready, create a PR. The deployment workflow is as follows:
- Get all authed-up for chartpress by performing the documented steps.
- Run chartpress --push from the root datahub/ directory. If this succeeds, check your git status and add datahub/node-placeholder/Chart.yaml and datahub/node-placeholder/values.yaml to your PR.
- Merge to staging and then prod.
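For reference, that push step might look roughly like this when run from the root datahub/ directory:

chartpress --push
git status
# stage the regenerated chart files for your PR
git add node-placeholder/Chart.yaml node-placeholder/values.yaml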
Changing python imports
The python requirements file is generated using requirements.in and pip-compile. If you need to change/add/update any packages, you’ll need to do the following:
- Ensure you have the correct python environment activated (see above).
- Pip install pip-tools.
- Edit requirements.in and save your changes.
- Execute pip-compile requirements.in, which will update requirements.txt (see the sketch after this list).
- Check your git status and diffs, and create a pull request if necessary.
- Get all authed-up for chartpress by performing the documented steps.
- Run chartpress --push from the root datahub/ directory. If this succeeds, check your git status and add datahub/node-placeholder/Chart.yaml and datahub/node-placeholder/values.yaml to your PR.
- Merge to staging and then prod.
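As referenced above, the pip-tools portion of the workflow boils down to a few commands, run in your activated dev environment from the images/node-placeholder-scaler/ directory:

pip install pip-tools
# edit requirements.in, then regenerate the pinned requirements.txt
pip-compile requirements.in
git status
git diff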
Monitoring
You can monitor the scaling by watching for events:
kubectl -n node-placeholder get events -w

And by tailing the logs of the pod with the scaler process:

kubectl -n node-placeholder logs -l app.kubernetes.io/name=node-placeholder-scaler -f

For example, if you set epsilon: 2, you might see in the pod logs:
2022-10-17 21:36:45,440 Found event Stat20/Epsilon test 2 2022-10-17 14:21 PDT to 15:00 PDT
2022-10-17 21:36:45,441 Overrides: {'epsilon': 2}
2022-10-17 21:36:46,475 Setting epsilon to have 2 replicas