Update README and documentation #2047

Merged 3 commits on Jun 27, 2024
70 changes: 36 additions & 34 deletions docs/who-is-using.md → ADOPTERS.md
@@ -1,48 +1,50 @@
# Adopters of Kubeflow Spark Operator

Below are the adopters of the Spark Operator project. If you are using the Spark Operator, please add yourself to this list via a pull request, and keep the list in alphabetical order.

| Organization | Contact (GitHub User Name) | Environment | Description of Use |
| ------------- | ------------- | ------------- | ------------- |
| [Beeline](https://beeline.ru) | @spestua | Evaluation | ML & Data Infrastructure |
| Bringg | @EladDolev | Production | ML & Analytics Data Platform |
| [C2FO](https://www.c2fo.com/) | @vanhoale | Production | Data Platform / Data Infrastructure |
| [Caicloud](https://intl.caicloud.io/) | @gaocegege | Production | Cloud-Native AI Platform |
| Carrefour | @AliGouta | Production | Data Platform |
| CERN | @mrow4a | Evaluation | Data Mining & Analytics |
| [CloudPhysics](https://www.cloudphysics.com) | @jkleckner | Production | ML/AI & Analytics |
| CloudZone | @iftachsc | Evaluation | Big Data Analytics Consultancy |
| Cyren | @avnerl | Evaluation | Data pipelines |
| [Data Mechanics](https://www.datamechanics.co) | @jrj-d | Production | Managed Spark Platform |
| [DeepCure](https://www.deepcure.ai) | @mschroering | Production | Spark / ML |
| [DiDi](https://www.didiglobal.com) | @Run-Lin | Evaluation | Data Infrastructure |
| Exacaster | @minutis | Evaluation | Data pipelines |
| Fossil | @duyet | Production | Data Platform |
| [Gojek](https://www.gojek.io/) | @pradithya | Production | Machine Learning Platform |
| HashmapInc | @prem0132 | Evaluation | Analytics Data Platform |
| [incrmntal](https://incrmntal.com/) | @scravy | Production | ML & Data Infrastructure |
| [Inter&Co](https://inter.co/) | @ignitz | Production | Data pipelines |
| [Kognita](https://kognita.com.br/) | @andreclaudino | Production | MLOps, Data Platform / Data Infrastructure, ML/AI |
| Lightbend | @yuchaoran2011 | Production | Data Infrastructure & Operations |
| Lyft | @kumare3 | Evaluation | ML & Data Infrastructure |
| MapR Technologies | @sarjeet2013 | Evaluation | ML/AI & Analytics Data Platform |
| [MavenCode](https://www.mavencode.com) | @charlesa101 | Production | MLOps & Data Infrastructure |
| Microsoft (MileIQ) | @dharmeshkakadia | Production | AI & Analytics |
| [Molex](https://www.molex.com/) | @AshishPushpSingh | Evaluation/Production | Data Platform |
| [MongoDB](https://www.mongodb.com) | @chickenpopcorn | Production | Data Infrastructure |
| Nielsen Identity Engine | @roitvt | Evaluation | Data pipelines |
| [PUBG](https://careers.pubg.com/#/en/) | @jacobhjkim | Production | ML & Data Infrastructure |
| [Qualytics](https://www.qualytics.co/) | @josecsotomorales | Production | Data Quality Platform |
| Riskified | @henbh | Evaluation | Analytics Data Platform |
| [Roblox](https://www.roblox.com/) | @matschaffer-roblox | Evaluation | Data Infrastructure |
| [Rokt](https://www.rokt.com) | @jacobsalway | Production | Data Infrastructure |
| Salesforce | @khogeland | Production | Data transformation |
| Scaling Smart | @tarek-izemrane | Evaluation | Data Platform |
| Shell (Agile Hub) | @TomLous | Production | Data pipelines |
| [Siigo](https://www.siigo.com) | @Juandavi1 | Production | Data Migrations & Analytics Data Platform |
| StackTome | @emiliauskas-fuzzy | Production | Data pipelines |
| [Stitch Fix](https://multithreaded.stitchfix.com/) | @nssalian | Evaluation | Data pipelines |
| Tencent | @runzhliu | Evaluation | ML Analytics Platform |
| [Timo](https://timo.vn) | @vanducng | Production | Data Platform |
| [Tongdun](https://www.tongdun.net/) | @lomoJG | Production | AI/ML & Analytics |
| [Totvs Labs](https://www.totvslabs.com) | @luizm | Production | Data Platform |
| [Typeform](https://typeform.com/) | @afranzi | Production | Data & ML pipelines |
| Uber | @chenqin | Evaluation | Spark / ML |
81 changes: 35 additions & 46 deletions README.md
@@ -1,10 +1,15 @@
# Kubeflow Spark Operator

[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/spark-operator)](https://goreportcard.com/report/github.com/kubeflow/spark-operator)

## What is Spark Operator?

The Kubernetes Operator for Apache Spark aims to make specifying and running [Spark](https://github.com/apache/spark) applications as easy and idiomatic as running other workloads on Kubernetes. It uses [Kubernetes custom resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) for specifying, running, and surfacing status of Spark applications.

## Overview

For a complete reference of the custom resource definitions, please refer to the [API Definition](docs/api-docs.md). For details on its design, please refer to the [Architecture](https://www.kubeflow.org/docs/components/spark-operator/overview/#architecture). It requires Spark 2.3 or above, which supports Kubernetes as a native scheduler backend.
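To make the custom-resource model concrete, here is a minimal `SparkApplication` sketch. The image tag, jar path, resource sizes, and service account name are illustrative assumptions, not values specified anywhere in this PR; consult the API Definition for the authoritative field list.

```yaml
# Hypothetical example -- field values are illustrative assumptions.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "spark:3.5.0"                 # assumed image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar"
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark-operator-spark   # assumed service account
  executor:
    instances: 2
    cores: 1
    memory: "512m"
```

Once applied with `kubectl apply -f`, the operator runs the application and reports progress in the resource's `status` field.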

The Kubernetes Operator for Apache Spark currently supports the following list of features:

@@ -28,69 +33,53 @@

**If you are currently using the `v1beta1` version of the APIs in your manifests, please update them to use the `v1beta2` version by changing `apiVersion: "sparkoperator.k8s.io/<version>"` to `apiVersion: "sparkoperator.k8s.io/v1beta2"`. You will also need to delete the `previous` version of the CustomResourceDefinitions named `sparkapplications.sparkoperator.k8s.io` and `scheduledsparkapplications.sparkoperator.k8s.io`, and replace them with the `v1beta2` version either by installing the latest version of the operator or by running `kubectl create -f manifest/crds`.**
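A sketch of that migration, assuming cluster-admin access and a local checkout of the repository (so that `manifest/crds` exists):

```bash
# Back up your SparkApplication manifests first: deleting a CRD also
# deletes all existing custom resources of that kind.

# Delete the previous v1beta1 CRDs named in the note above.
kubectl delete crd sparkapplications.sparkoperator.k8s.io
kubectl delete crd scheduledsparkapplications.sparkoperator.k8s.io

# Recreate them at v1beta2 from the repository's manifests.
kubectl create -f manifest/crds
```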

Customization of Spark pods, e.g., mounting arbitrary volumes and setting pod affinity, is implemented using a Kubernetes [Mutating Admission Webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/), which became beta in Kubernetes 1.9. The mutating admission webhook is disabled by default if you install the operator using the Helm [chart](charts/spark-operator-chart). Check out the [Quick Start Guide](docs/quick-start-guide.md#using-the-mutating-admission-webhook) on how to enable the webhook.
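For instance, assuming the chart exposes a `webhook.enable` value (verify the exact flag against the chart's README for your chart version), the webhook can be turned on at install time:

```bash
# webhook.enable is an assumed chart value -- check the chart README.
helm install my-release spark-operator/spark-operator \
  --namespace spark-operator --create-namespace \
  --set webhook.enable=true
```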

## Prerequisites

* Version >= 1.13 of Kubernetes to use the [`subresource` support for CustomResourceDefinitions](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#subresources), which became beta in 1.13 and is enabled by default in 1.13 and higher.

* Version >= 1.16 of Kubernetes to use the `MutatingWebhook` and `ValidatingWebhook` of `apiVersion: admissionregistration.k8s.io/v1`.

## Getting Started

For getting started with the Spark operator, please refer to [Getting Started](https://www.kubeflow.org/docs/components/spark-operator/getting-started/).

The easiest way to install the operator is to use the Helm [chart](charts/spark-operator-chart/):

```bash
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator

$ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace
```

This installs the operator into the namespace `spark-operator`. By default, the operator watches and handles `SparkApplication`s in every namespace. To limit it to a single namespace, e.g. `default`, add the following option to the `helm install` command:

```
--set "sparkJobNamespaces={default}"
```

For configuration options available in the Helm chart, please refer to the chart's [README](charts/spark-operator-chart/README.md).

## User Guide

For a detailed user guide and API documentation, please refer to the [User Guide](https://www.kubeflow.org/docs/components/spark-operator/user-guide/) and the [API Specification](docs/api-docs.md).

If you are running the Spark operator on Google Kubernetes Engine (GKE) and want to use Google Cloud Storage (GCS) and/or BigQuery for reading/writing data, also refer to the [GCP guide](https://www.kubeflow.org/docs/components/spark-operator/user-guide/gcp/).

## Version Matrix

The following table lists the most recent few versions of the operator.

| Operator Version | API Version | Kubernetes Version | Base Spark Version |
| ------------- | ------------- | ------------- | ------------- |
| `v1beta2-1.6.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.5.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.4.x-3.5.0` | `v1beta2` | 1.16+ | `3.5.0` |
| `v1beta2-1.3.x-3.1.1` | `v1beta2` | 1.16+ | `3.1.1` |
| `v1beta2-1.2.3-3.1.1` | `v1beta2` | 1.13+ | `3.1.1` |
| `v1beta2-1.2.2-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.2.1-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.2.0-3.0.0` | `v1beta2` | 1.13+ | `3.0.0` |
| `v1beta2-1.1.x-2.4.5` | `v1beta2` | 1.13+ | `2.4.5` |
| `v1beta2-1.0.x-2.4.4` | `v1beta2` | 1.13+ | `2.4.4` |

When installing using the Helm chart, you can choose a specific image tag instead of the default one with the following option:

```
--set image.tag=<operator image tag>
```
## Developer Guide

For developing with the Spark Operator, please refer to the [Developer Guide](https://www.kubeflow.org/docs/components/spark-operator/developer-guide/).

## Contributor Guide

For contributing to the Spark Operator, please refer to the [Contributor Guide](CONTRIBUTING.md).

## Community
* Join the [CNCF Slack Channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) and then join the `#kubeflow-spark-operator` channel.
* Check out our blog post [Announcing the Kubeflow Spark Operator: Building a Stronger Spark on Kubernetes Community](https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html).
* Join our monthly community meeting [Kubeflow Spark Operator Meeting Notes](https://bit.ly/3VGzP4n).

## Adopters

Check out the [adopters of Spark Operator](ADOPTERS.md).
1 change: 0 additions & 1 deletion docs/_config.yml

This file was deleted.

Binary file removed docs/architecture-diagram.png
Binary file not shown.