feat(helm): GCP support #2666

Merged: 8 commits, Sep 27, 2022

89 changes: 19 additions & 70 deletions helm/README.md
@@ -4,76 +4,55 @@ name: Kubernetes Deployment

# OpenNeuro Kubernetes Deployment

This chart is used to deploy a copy of OpenNeuro and all required services excluding any CDN, MongoDB, and ElasticSearch.
This chart is used to deploy a copy of OpenNeuro and all required services excluding MongoDB and ElasticSearch.

On AWS, this chart is deployed using Amazon's managed Kubernetes service (EKS). An ingress creates the load balancer routing to backend services and this is fronted by CloudFront for caching.
On GCP, this chart is designed to support GKE with Autopilot for deployment. Only worker disks and cluster creation are configured outside of this chart.

Written for Helm 3.0.0 or later

## Major components

- API deployment - GraphQL service (@openneuro/server npm package)
- DataLad service deployment - Falcon server for microservice operations on datasets
- Web deployment - SSR and static resources including the React application (@openneuro/app npm package)
- Web deployment - Nginx serving static resources including the React application (@openneuro/app npm package)

## Pre-requisites

Install [Helm](https://helm.sh/), [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), and (optionally) [eksctl](https://eksctl.io/).
Install [Helm](https://helm.sh/), [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/), and [gcloud](https://cloud.google.com/sdk/docs/install).

Helm manages configuration templates. Kubectl makes API calls to Kubernetes on your behalf. eksctl configures AWS specific EKS resources to simplify control plane setup and is most useful when creating a new cluster or changing node groups.
Helm manages configuration templates. Kubectl makes API calls to Kubernetes on your behalf. gcloud is used to create and authenticate with the cluster.
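
If kubectl ever needs to be pointed back at the cluster (for example, from a second machine), credentials can be refreshed with a command along these lines, using the cluster name and region created below:

```bash
# Refresh kubectl credentials and context for the Autopilot cluster
# ("openneuro-dev" and "us-west1" match the cluster created in this guide)
gcloud container clusters get-credentials openneuro-dev --region=us-west1
```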

## Cluster Setup

### Create a cluster on AWS
### gcloud setup

```bash
eksctl create cluster --name=my-cluster-name --nodegroup-name=general --nodes=2 --instance-type=c5a.xlarge --node-ami-family=Ubuntu1804
```

This should configure the cluster and setup credentials and command context for later kubectl and helm commands. If you encounter errors here, your user likely lacks access to manage EC2, EKS, or CloudFormation resources on the AWS account.

OpenNeuro uses at least two node groups to run. A general node group created as above and a secondary node group assigned to storage resources only.
Set the default project to use for gcloud commands.

```bash
eksctl create nodegroup --cluster=my-cluster-name --nodes=2 --instance-type=m5ad.xlarge --name=storage
gcloud config set project hs-openneuro
```

Example eksctl configurations from the main OpenNeuro instance are available in [staging](eksctl-cluster-prod.yaml) and [production](eksctl-cluster-staging.yaml) configurations.

### Storage setup

OpenEBS is used to manage volume allocation for worker nodes. Your Kubernetes nodes requires OpenZFS configuration. See [OpenEBS for supported versions](https://github.com/openebs/zfs-localpv#prerequisites). This can be built into the AMI on EKS or installed at node creation by eksctl as in the above example cluster configuration files.

Storage pool nodes should be labeled to allow migration of the EBS disks on EKS updates. Label each node like so - this must be done before installing zfs-localpv the first time.
### Create a cluster

```bash
kubectl label node node-1 openebs.io/nodeid=pool-a
kubectl label node node-2 openebs.io/nodeid=pool-b
gcloud container clusters create-auto openneuro-dev --region=us-west1
```

Once the cluster is running, initialize the CSI driver for OpenEBS ZFS LocalPV following the [install instructions](https://github.com/openebs/zfs-localpv#setup).

### Setup and access Kubernetes dashboard
This will configure the cluster and set up credentials and command context for later kubectl and helm commands. This requires IAM permissions for Kubernetes Engine.
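
To confirm the credentials and context landed correctly, a quick sanity check:

```bash
# Show the active kubeconfig context and confirm the control plane responds
kubectl config current-context
kubectl cluster-info
```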

To install:
OpenNeuro runs on GKE Autopilot, which automatically allocates node resources based on each container's `resources.requests` field.
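
The request amounts are exposed as chart values (the names below come from the templates in this change); illustrative `values.yaml` overrides might look like:

```yaml
# Illustrative Autopilot resource requests - tune per environment,
# these are not recommended production values
apiCpuRequests: "0.5"
apiMemoryRequests: 1Gi
workerCpuRequests: "2"
workerMemoryRequests: 4Gi
```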

```bash
helm install dashboard stable/kubernetes-dashboard
```
### Storage setup

To access:
pd-standard disks provide sufficient git operation performance for interactive use of multiple datasets sharing one worker.

```bash
# Setup a port forward to the Dashboard pod
export POD_NAME=$(kubectl get pods -n default -l "app=kubernetes-dashboard,release=dashboard" -o jsonpath="{.items[0].metadata.name}")
kubectl -n default port-forward $POD_NAME 8443:8443
# Obtain an admin token
kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep eks-admin | awk '{print $1}')
gcloud compute disks create openneuro-staging-datasets-0 --zone us-west1-b --size 256Gi --type pd-standard
```
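
The chart references each disk by its full resource path (the `projects/<project>/zones/<zone>/disks/<name>` form used for `workerDiskSize.id` below); one way to look it up, assuming the example disk name above:

```bash
# Print the disk's self link; the trailing projects/.../zones/.../disks/...
# portion is the value the chart expects
gcloud compute disks describe openneuro-staging-datasets-0 \
  --zone us-west1-b --format='value(selfLink)'
```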

### Configuration

This chart is AWS specific at the moment, as OpenNeuro requires EC2 EBS and ALB resources to run as configured. Pull requests welcome if you add support for other hosting environments.
This chart is GCP specific at the moment, mainly due to the disk configuration and load balancer ingress setup. Minimal changes are required to run in other Kubernetes environments: override the ingress and allocate moderately performant disks for the dataset worker containers.

To get started, create a `values.yaml` and `secrets.yaml` file. In `values.yaml` you will override any chart settings necessary for your target environment. For a minimal dev environment it may look like this:

@@ -82,41 +61,11 @@ hostname: my.dev.site.domain
url: https://my.dev.site.domain
environment: any-unique-string
googleTrackingIds: ''
storagePools:
stripeSize: 1099511627776 # 1TB EBS disks
pools:
- name: a
size: 2199023255552 # 2TB per pool
- name: b
size: 2199023255552
workerDiskSize:
- id: projects/my-dev-project/zones/us-west1-b/disks/openneuro-dev-datasets-0
size: 256Gi
```

Storage pools are local to a specific node. Generally you should add one pool for each node assigned to the storage node group. It is possible to assign multiple pools to one node but this will prevent even load distribution across volumes.

Disks are automatically allocated by the pool size divided by stripe size. Each "stripe" is one block persistent volume backing the pool. Multiple volumes are sparsely allocated from the pool. The pool can be much smaller than the quota size for the volumes within it as long as the total requested storage is below the pool's real available size.

The pool size can be adjusted automatically when increasing the size. To scale down a pool, the underlying EBS disks need to be removed from the pool first, and then manually removed.

```bash
# Locate the correct node, then run
zpool remove nvme-Amazon_Elastic_Block_Store_vol0123457908104
# Wait until the disk leaves the removing state, takes a while
watch -n 60 zpool list -v
```

Remove the PVCs once the pool is no longer using them.

```bash
# Save the volume ID
kubectl get pvc storage-pool-release-name-storage-pool-a-1
# Delete PVC first
kubectl delete pvc storage-pool-release-name-storage-pool-a-1
# Delete the volume once freed (this will delete the EBS disk, so be sure here!)
kubectl delete pv pvc-3a29528a-7b17-40b6-96a7-6385316fb401
```

Other values which can be overridden are found in the chart version of [values.yaml](charts/values.yaml).
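
Two values introduced by this change are worth calling out: `dataladWorkers` sets the number of worker pods (each generally needs a matching `workerDiskSize` entry and disk), and `workerZone` pins workers to the zone holding the pre-created disks. For example:

```yaml
# Illustrative values matching the example disk created above
workerZone: us-west1-b
dataladWorkers: 1
```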

`secrets.yaml` contains any privileged configuration, like database connection strings or oauth secrets. Start with [secrets.yaml.example](secrets.yaml.example) and fill in each value. Most values are required, but you only need one authentication provider and mail, doi, and flower configuration is optional.

### Installing
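
The install steps are collapsed in this diff view; as a rough sketch (release name and working directory are assumptions, run from the `helm/` directory):

```bash
# Fetch chart dependencies (redis), then install with your overrides
helm dependency update openneuro
helm install openneuro-dev ./openneuro -f values.yaml -f secrets.yaml
```
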
5 changes: 1 addition & 4 deletions helm/openneuro/Chart.yaml
@@ -8,8 +8,5 @@ sources:
appVersion: 4.10.0
dependencies:
- name: redis
version: 10.6.17
version: 17.1.4
repository: https://charts.bitnami.com/bitnami
- name: aws-alb-ingress-controller
version: 1.0.0
repository: https://kubernetes-charts-incubator.storage.googleapis.com/
12 changes: 3 additions & 9 deletions helm/openneuro/requirements.lock
@@ -1,12 +1,6 @@
dependencies:
- name: redis
repository: https://charts.bitnami.com/bitnami
version: 10.6.17
- name: aws-alb-ingress-controller
repository: https://kubernetes-charts-incubator.storage.googleapis.com/
version: 1.0.0
- name: apm-server
repository: https://helm.elastic.co
version: 7.9.0
digest: sha256:f90c303d40dcf002907179598ef0f1fda76b2edbc560f838910051aa9a2c21ab
generated: "2020-08-21T14:22:28.124302759-07:00"
version: 17.1.4
digest: sha256:7e7d63886296a858981054168160d78ed785ea99c89f2f04ad0f22b9447268fb
generated: "2022-09-13T14:04:03.820425553-07:00"
13 changes: 8 additions & 5 deletions helm/openneuro/templates/api-deployment.yaml
@@ -24,19 +24,22 @@ spec:
- name: {{ .Release.Name }}-api
image: 'openneuro/server:v{{ .Chart.AppVersion }}'
resources:
limits:
cpu: "1.2"
memory: "2Gi"
requests:
cpu: ".3"
memory: "768Mi"
cpu: {{ .Values.apiCpuRequests }}
memory: {{ .Values.apiMemoryRequests }}
ports:
- containerPort: 8111
envFrom:
- configMapRef:
name: {{ .Release.Name }}-configmap
- secretRef:
name: {{ .Release.Name }}-secret
readinessProbe:
initialDelaySeconds: 15
periodSeconds: 30
httpGet:
path: '/crn/'
port: 8111
livenessProbe:
initialDelaySeconds: 60
periodSeconds: 30
2 changes: 0 additions & 2 deletions helm/openneuro/templates/api-service.yaml
@@ -7,8 +7,6 @@ metadata:
chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
release: "{{ .Release.Name }}"
heritage: "{{ .Release.Service }}"
annotations:
alb.ingress.kubernetes.io/healthcheck-path: /crn/
spec:
ports:
- port: 8111
2 changes: 0 additions & 2 deletions helm/openneuro/templates/configmap.yaml
@@ -14,7 +14,5 @@ data:
REDIS_PORT: "6379"
GRAPHQL_ENDPOINT: http://{{ .Release.Name }}-api:8111/crn/graphql
DATALAD_SERVICE_URI: {{ .Release.Name }}-dataset-worker
DATALAD_S3_PUBLIC_ON_EXPORT: "yes"
LOCPATH: ""
SENTRY_DSN: {{ .Values.sentryDsn | quote }}
ELASTIC_APM_SERVER_URL: {{ .Values.apmServerUrl }}
2 changes: 0 additions & 2 deletions helm/openneuro/templates/datalad-worker-service.yaml
@@ -4,8 +4,6 @@ apiVersion: v1
kind: Service
metadata:
name: {{ $relname }}-dataset-worker-{{ . }}
annotations:
alb.ingress.kubernetes.io/healthcheck-path: /heartbeat
spec:
selector:
statefulset.kubernetes.io/pod-name: {{ $relname }}-dataset-worker-{{ . }}
22 changes: 22 additions & 0 deletions helm/openneuro/templates/dataset-worker-pv.yaml
@@ -0,0 +1,22 @@
# The Worker PV matches GCP disks to statefulset claims (see dataset-worker-stateful-set)
{{- $relname := .Release.Name -}}
{{- range $index, $config := .Values.workerDiskSize }}
apiVersion: v1
kind: PersistentVolume
metadata:
name: datasets-{{ $relname }}-dataset-worker-{{ $index }}
spec:
storageClassName: {{ $relname }}-datasets
capacity:
storage: {{ $config.size }}
accessModes:
- ReadWriteOnce
claimRef:
namespace: default
name: datasets-{{ $relname }}-dataset-worker-{{ $index }}
csi:
driver: pd.csi.storage.gke.io
volumeHandle: {{ $config.id }}
fsType: ext4
---
{{- end }}
29 changes: 16 additions & 13 deletions helm/openneuro/templates/dataset-worker-stateful-set.yaml
@@ -10,14 +10,13 @@ spec:
replicas: {{ .Values.dataladWorkers }}
volumeClaimTemplates:
- metadata:
name: datasets-ebs
name: datasets
spec:
storageClassName: {{ .Release.Name }}-datasets
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Pi
storage: 10Gi
template:
metadata:
labels:
@@ -26,12 +25,8 @@
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
spec:
tolerations:
- key: "storage"
operator: "Exists"
effect: "NoSchedule"
nodeSelector:
role: storage
topology.kubernetes.io/zone: {{ .Values.workerZone }}
volumes:
- name: ssh-key
secret:
@@ -45,12 +40,20 @@
image: 'openneuro/datalad-service:v{{ .Chart.AppVersion }}'
command: ["gunicorn", "--bind", "0.0.0.0:80", "--reload", "datalad_service.app:create_app('/datasets')", "--workers", "8", "--worker-class", "gevent", "--timeout", "60", "--keep-alive", "30"]
resources:
limits:
cpu: "8"
memory: "12Gi"
requests:
cpu: {{ .Values.workerCpuRequests }}
memory: "4Gi"
memory: {{ .Values.workerMemoryRequests }}
readinessProbe:
periodSeconds: 5
initialDelaySeconds: 15
httpGet:
path: /heartbeat
port: 80
livenessProbe:
periodSeconds: 60
httpGet:
path: /heartbeat
port: 80
ports:
- containerPort: 80
envFrom:
@@ -59,7 +62,7 @@
- secretRef:
name: {{ .Release.Name }}-secret
volumeMounts:
- name: datasets-ebs
- name: datasets
mountPath: /datasets
- name: ssh-key
mountPath: /datalad-key
10 changes: 10 additions & 0 deletions helm/openneuro/templates/dataset-worker-storage-class.yaml
@@ -0,0 +1,10 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: {{ .Release.Name }}-datasets
provisioner: pd.csi.storage.gke.io
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
type: pd-balanced
csi.storage.k8s.io/fstype: ext4
9 changes: 3 additions & 6 deletions helm/openneuro/templates/indexer-job.yaml
@@ -1,4 +1,4 @@
apiVersion: batch/v1beta1
apiVersion: batch/v1
kind: CronJob
metadata:
name: {{ .Release.Name }}-indexer
@@ -17,12 +17,9 @@ spec:
- name: openneuro-indexer
image: 'openneuro/indexer:v{{ .Chart.AppVersion }}'
resources:
limits:
cpu: ".5"
memory: "512Mi"
requests:
cpu: ".1"
memory: "256Mi"
cpu: ".25"
memory: "512Mi"
envFrom:
- configMapRef:
name: {{ .Release.Name }}-configmap