Commit 67bd783: Update docs

Signed-off-by: Yi Chen <[email protected]>
ChenYi015 committed Jun 25, 2024 (1 parent 012b52a)
Showing 16 changed files with 474 additions and 539 deletions.

README.md (29 additions, 44 deletions)
# Kubeflow Spark Operator

[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/spark-operator)](https://goreportcard.com/report/github.com/kubeflow/spark-operator)

## What is Kubeflow Spark Operator?

The Kubernetes Operator for Apache Spark aims to make specifying and running [Spark](https://github.com/apache/spark) applications as easy and idiomatic as running other workloads on Kubernetes. It uses [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) for specifying, running, and surfacing status of Spark applications.

## Overview

For a complete reference of the custom resource definitions, please refer to the [API Definition](docs/api-docs.md). For details on its design, please refer to the [design doc](docs/design.md). The operator requires Spark 2.3 or above, which supports Kubernetes as a native scheduler backend.

The Kubernetes Operator for Apache Spark currently supports the following list of features:

* Customization of Spark pods, e.g., mounting arbitrary volumes and setting pod affinity

* Version >= 1.16 of Kubernetes to use the `MutatingWebhook` and `ValidatingWebhook` of `apiVersion: admissionregistration.k8s.io/v1`.

## Getting Started

For getting started with Spark operator, please refer to [Getting Started](https://www.kubeflow.org/docs/components/spark-operator/getting-started/).
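The hosted guide walks through installation in detail; as a minimal sketch, the operator can be installed from the project's Helm chart (the release name `my-release` and the `spark-operator` namespace here are arbitrary choices):

```bash
# Add the Kubeflow Spark operator chart repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator

# Install the operator into its own namespace
helm install my-release spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace
```

By default the operator watches and handles `SparkApplication`s in every namespace; chart options such as `sparkJobNamespaces` can restrict it to specific namespaces.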
## User Guide

For a detailed user guide, please refer to the [User Guide](https://www.kubeflow.org/docs/components/spark-operator/user-guide/).

For API documentation, please refer to the [API Specification](docs/api-docs.md).

If you are running the Spark operator on Google Kubernetes Engine (GKE) and want to use Google Cloud Storage (GCS) and/or BigQuery for reading/writing data, also refer to the [GCP guide](https://www.kubeflow.org/docs/components/spark-operator/user-guide/gcp/).
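As an illustration of the custom resource model described above, a minimal `SparkApplication` manifest might look like the following sketch; the image, jar path, and service account are placeholder values, and the API Specification is the authoritative schema:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0  # placeholder: any compatible Spark image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar  # placeholder jar path
  sparkVersion: 3.5.0
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark  # placeholder service account
  executor:
    instances: 1
    cores: 1
    memory: 512m
```

Applying the manifest with `kubectl apply -f` hands the application to the operator, which creates the driver pod and surfaces status back on the `SparkApplication` object.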

## Version Matrix

The following table lists the most recent few versions of the operator.

| Operator Version | API Version | Kubernetes Version | Base Spark Version |
| --------------------- | ----------- | ------------------ | ------------------ |
| `v1beta2-1.6.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.5.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.4.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.3.x-3.1.1` | `v1beta2` | 1.16+ | `3.1.1` |
| `v1beta2-1.2.3-3.1.1` | `v1beta2` | 1.13+ | `3.1.1` |
| `v1beta2-1.2.2-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.2.1-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.2.0-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.1.x-2.4.5` | `v1beta2` | 1.13+ | `2.4.5` |
| `v1beta2-1.0.x-2.4.4` | `v1beta2` | 1.13+ | `2.4.4` |
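When installing with the Helm chart, a specific operator image tag from the matrix above can be pinned instead of the chart default; the exact tag below is illustrative only:

```bash
# Pin the operator image to a specific release from the version matrix
helm install my-release spark-operator/spark-operator \
  --namespace spark-operator --create-namespace \
  --set image.tag=v1beta2-1.6.x-3.5.0  # substitute a real tag from the matrix
```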

## Contributing

For contributing, please refer to [CONTRIBUTING.md](CONTRIBUTING.md) and the [Developer Guide](https://www.kubeflow.org/docs/components/spark-operator/developer-guide/).

## Community

* Join the [CNCF Slack Channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) and then join the `#kubeflow-spark-operator` channel.
* Check out our blog post [Announcing the Kubeflow Spark Operator: Building a Stronger Spark on Kubernetes Community](https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html).
* Check out [who is using the Spark Operator](docs/adopters.md).
docs/who-is-using.md → docs/adopters.md (36 additions, 34 deletions)
# Adopters of Kubeflow Spark Operator

Below are the adopters of the Spark Operator project. If you are using the Spark Operator, please add yourself to the list via a pull request. Please keep the list in alphabetical order.

| Organization | Contact (GitHub User Name) | Environment | Description of Use |
| ------------- | ------------- | ------------- | ------------- |
| [Beeline](https://beeline.ru) | @spestua | Evaluation | ML & Data Infrastructure |
| Bringg | @EladDolev | Production | ML & Analytics Data Platform |
| [C2FO](https://www.c2fo.com/) | @vanhoale | Production | Data Platform / Data Infrastructure |
| [Caicloud](https://intl.caicloud.io/) | @gaocegege | Production | Cloud-Native AI Platform |
| Carrefour | @AliGouta | Production | Data Platform |
| CERN | @mrow4a | Evaluation | Data Mining & Analytics |
| [CloudPhysics](https://www.cloudphysics.com) | @jkleckner | Production | ML/AI & Analytics |
| CloudZone | @iftachsc | Evaluation | Big Data Analytics Consultancy |
| Cyren | @avnerl | Evaluation | Data pipelines |
| [Data Mechanics](https://www.datamechanics.co) | @jrj-d | Production | Managed Spark Platform |
| [DeepCure](https://www.deepcure.ai) | @mschroering | Production | Spark / ML |
| [DiDi](https://www.didiglobal.com) | @Run-Lin | Evaluation | Data Infrastructure |
| Exacaster | @minutis | Evaluation | Data pipelines |
| Fossil | @duyet | Production | Data Platform |
| [Gojek](https://www.gojek.io/) | @pradithya | Production | Machine Learning Platform |
| HashmapInc | @prem0132 | Evaluation | Analytics Data Platform |
| [incrmntal](https://incrmntal.com/) | @scravy | Production | ML & Data Infrastructure |
| [Inter&Co](https://inter.co/) | @ignitz | Production | Data pipelines |
| [Kognita](https://kognita.com.br/) | @andreclaudino | Production | MLOps, Data Platform / Data Infrastructure, ML/AI |
| Lightbend | @yuchaoran2011 | Production | Data Infrastructure & Operations |
| Lyft | @kumare3 | Evaluation | ML & Data Infrastructure |
| MapR Technologies | @sarjeet2013 | Evaluation | ML/AI & Analytics Data Platform |
| [MavenCode](https://www.mavencode.com) | @charlesa101 | Production | MLOps & Data Infrastructure |
| Microsoft (MileIQ) | @dharmeshkakadia | Production | AI & Analytics |
| [Molex](https://www.molex.com/) | @AshishPushpSingh | Evaluation/Production | Data Platform |
| [MongoDB](https://www.mongodb.com) | @chickenpopcorn | Production | Data Infrastructure |
| Nielsen Identity Engine | @roitvt | Evaluation | Data pipelines |
| [PUBG](https://careers.pubg.com/#/en/) | @jacobhjkim | Production | ML & Data Infrastructure |
| [Qualytics](https://www.qualytics.co/) | @josecsotomorales | Production | Data Quality Platform |
| Riskified | @henbh | Evaluation | Analytics Data Platform |
| [Roblox](https://www.roblox.com/) | @matschaffer-roblox | Evaluation | Data Infrastructure |
| [Rokt](https://www.rokt.com) | @jacobsalway | Production | Data Infrastructure |
| Salesforce | @khogeland | Production | Data transformation |
| Scaling Smart | @tarek-izemrane | Evaluation | Data Platform |
| Shell (Agile Hub) | @TomLous | Production | Data pipelines |
| [Siigo](https://www.siigo.com) | @Juandavi1 | Production | Data Migrations & Analytics Data Platform |
| StackTome | @emiliauskas-fuzzy | Production | Data pipelines |
| [Stitch Fix](https://multithreaded.stitchfix.com/) | @nssalian | Evaluation | Data pipelines |
| Tencent | @runzhliu | Evaluation | ML Analytics Platform |
| [Timo](https://timo.vn) | @vanducng | Production | Data Platform |
| [Tongdun](https://www.tongdun.net/) | @lomoJG | Production | AI/ML & Analytics |
| [Totvs Labs](https://www.totvslabs.com) | @luizm | Production | Data Platform |
| [Typeform](https://typeform.com/) | @afranzi | Production | Data & ML pipelines |
| Uber | @chenqin | Evaluation | Spark / ML |
docs/developer-guide.md (1 addition, 3 deletions)

If you want to build the operator from the source code, e.g., to test a fix or a feature you have written, you can do so by following the instructions below.

The easiest way to build the operator without worrying about its dependencies is to build an image using the [Dockerfile](https://github.com/kubeflow/spark-operator/Dockerfile).

```bash
docker build -t <image-tag> .
```

The operator image is built upon a base Spark image that defaults to `spark:3.5.0`. To use your own Spark image instead:

```bash
docker build --build-arg SPARK_IMAGE=<your Spark image> -t <image-tag> .
```

If you want to use the operator on OpenShift clusters, first make sure you have Docker version 18.09.3 or above, then build your operator image using the [OpenShift-specific Dockerfile](../Dockerfile.rh).

```bash
export DOCKER_BUILDKIT=1
docker build -t <image-tag> -f Dockerfile.rh .