
Adoption of Spark-on-k8s-operator #648

Closed
mwielgus opened this issue Sep 5, 2023 · 38 comments

@mwielgus

mwielgus commented Sep 5, 2023

We are looking for a new home for Spark-on-k8s-operator. The project was quite active for years, delivering a convenient way of running Spark in the Kubernetes environment. Unfortunately, due to some org changes, the previous maintainers are unable to provide the time and love that the project and its users deserve. So, GoogleCloudPlatform would like to transfer ownership of the code (already under the Apache license) to an organisation that would help bring more life to the project and continue to help users run Spark on K8s. Given that you support a wide variety of ML/batch frameworks (MPI, TF, PyTorch, etc.), we think that Kubeflow would be a good place for the Spark operator.

@mwielgus
Author

mwielgus commented Sep 5, 2023

cc: @terrytangyuan

@terrytangyuan
Member

terrytangyuan commented Sep 5, 2023

+1 happy to sponsor this. This would be a great addition to the Kubeflow community. cc @james-jwu @theadactyl

cc @kubeflow/wg-training-leads

@andreyvelich
Member

Thank you for proposing this @mwielgus!

I agree that the Spark operator might be useful for Kubeflow users who want to do data preparation, feature extraction, data validation, etc. before building and training their ML models. Currently, Kubeflow doesn't offer such functionality.

It would be nice if you could join our upcoming AutoML and Training WG Community call today (September 6th) at 6pm UTC (10am PST) to discuss the details and potential use-cases.

cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing

@johnugeorge
Member

Is this proposal to have the Spark operator be an independent operator in Kubeflow?

@mwielgus
Author

mwielgus commented Sep 6, 2023

Thanks, I will join the meeting today :).

@tenzen-y
Member

tenzen-y commented Sep 6, 2023

Basically, SGTM. However, I have the same question that @johnugeorge raised.

@jbottum
Contributor

jbottum commented Sep 6, 2023

FYI, the Kubeflow user survey(s) have consistently shown that users would like a Spark / Kubeflow integration.

@jbottum
Contributor

jbottum commented Sep 6, 2023

We will discuss if and how Kubeflow will support a Spark K8s operator in our Community Meeting on Tuesday; please find the bridge in these meeting notes. I suspect there may be several operators or implementations, and we need to decide whether we are going to pick one, how it will be supported, whether it becomes part of a (new) Kubeflow Working Group, how it is installed, etc. @kimwnasptd @mwielgus Kubeflow community meeting notes: https://docs.google.com/document/d/1Wdxt1xedAj7qF_Rjmxy1R0NRdfv7UWs-r2PItewxHpE/edit.

@thesuperzapper

@juliusvonkohout
Member

@mwielgus we had a Spark operator before. Are they using the modern Spark Connect? https://spark.apache.org/docs/latest/spark-connect-overview.html

You can already use the Kubernetes API server as the Spark master, so I am wondering whether that plus Spark Connect is already enough. Anyway, I am open to contributions in manifests/contrib.
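
For reference, here is a minimal sketch of what both options look like from PySpark, assuming Spark 3.4+; the endpoint addresses and container image below are placeholders, not an existing deployment:

from pyspark.sql import SparkSession

# Option 1: Spark Connect (Spark 3.4+, `pip install "pyspark[connect]"`).
# The client only needs the Connect endpoint; the server itself can run on Kubernetes.
spark = SparkSession.builder.remote("sc://spark-connect.example.svc:15002").getOrCreate()

# Option 2: point the driver at the Kubernetes API server directly (client mode).
# Extra driver networking config (e.g. spark.driver.host) is usually needed and is omitted here.
# spark = (
#     SparkSession.builder
#     .master("k8s://https://kubernetes.default.svc:443")
#     .config("spark.kubernetes.namespace", "spark")
#     .config("spark.kubernetes.container.image", "apache/spark:3.5.0")
#     .getOrCreate()
# )

spark.range(10).count()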

@andreyvelich
Member

Here is the recording for our initial discussion on Sep 9th around Spark Operator in Kubeflow: https://youtu.be/3D2h5OUNCQo.

@mwielgus Please can you attend the Kubeflow Community call today at 8:00am PST, so we can have a follow-up discussion around the Spark Operator: https://docs.google.com/document/d/1Wdxt1xedAj7qF_Rjmxy1R0NRdfv7UWs-r2PItewxHpE/edit#heading=h.xtqde2br5mh4.

cc @kubeflow/wg-training-leads

@mwielgus
Author

@andreyvelich I will be there.

@andreyvelich
Member

Thank you, Marcin!

@jbottum
Contributor

jbottum commented Sep 12, 2023

As a follow-up to our recent Apache Spark discussions in the Kubeflow Community meetings, we are requesting some user input... If you are a Spark user or contributor, the Kubeflow Community would like to know if you need active support for a Spark Kubernetes operator. If so, please comment or +1 on this GitHub issue. We need at least 10 users and would appreciate any ideas on use cases, e.g. integration with notebooks or Kubeflow Pipelines. Thanks! Josh

@droctothorpe

IMO, the fundamental gap is the lack of an SDK. Data scientists would rather write Python than YAML (for good reason). There needs to be (a) some clarification (and documentation) about the benefits of the Spark operator over PySpark, and (b) development of an SDK (perhaps an extension to the training operator SDK).

@charlesa101

@droctothorpe, we currently use the Spark operator in a few of our projects. It makes it easy for us to deploy Spark jobs "natively" on K8s, much like how the training operators currently work, so I am not sure what you mean by the lack of an SDK here.

We use it with the Kubeflow Pipelines DSL:

from string import Template
import json

from kfp import dsl

# jar_location comes from earlier pipeline code that is elided here.
spark_json_template = Template("""
{
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {
      "name": "hello-pipeline",
      "namespace": "kubeflow"},
    "spec": {
      "type": "Scala",
      "mode": "cluster",
      "mainApplicationFile": "$jar_location"
    }
}""")
spark_json = spark_json_template.substitute({'jar_location': jar_location})
spark_job = json.loads(spark_json)
# Create the SparkApplication custom resource from the pipeline and wait for it to succeed.
spark_resource = dsl.ResourceOp(
    name='spark-job',
    k8s_resource=spark_job,
    success_condition='status.state == Succeeded')
...

+1 on this issue. It will be great for the Spark operator to find a new home here.

@droctothorpe

droctothorpe commented Sep 13, 2023

@charlesa101 that's JSON with no customization, and the configuration options are abundant. It's nice to be able to just use ResourceOp, though. Thanks for sharing.

Our platform provides both PySpark and Spark operator support, and the overwhelming majority of users prefer PySpark. That's just one data point, though. IMO, a proper Python interface à la the training operator SDK (or PySpark) would promote adoption.

@charlesa101

@droctothorpe This is based on the CRD for the Spark operator; it will work the same way for PySpark. I'm curious to know more about how your PySpark operator implementation works. The configurations are abundant, but I am not sure there is a use case where you would have to load up all the configs.

I agree with you that it would be great to eventually align the behavior of this operator with the training operators to make it easy to use, but I am not sure what you still mean by SDK in this context! Once you have the YAML and CRDs well defined, you can easily use them in your KFP as a component.

@terrytangyuan
Member

terrytangyuan commented Sep 13, 2023

Here's the Python SDK for the training operator. Basically, instead of writing YAML and using it in your KFP component, you can use Python to define and submit jobs directly.

https://github.com/kubeflow/training-operator/tree/master/sdk/python
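
For illustration only, here is a rough sketch of what an SDK-style call for Spark could look like, using the standard Kubernetes Python client to submit the same SparkApplication custom resource shown above. The SparkClient class and its submit method are hypothetical, not an existing Kubeflow or Spark operator API:

from kubernetes import client, config

# Hypothetical wrapper: a real SDK would hide the manifest handling entirely.
class SparkClient:
    def __init__(self, namespace="kubeflow"):
        config.load_kube_config()  # use config.load_incluster_config() when running in a pod
        self.api = client.CustomObjectsApi()
        self.namespace = namespace

    def submit(self, name, main_application_file, app_type="Scala", mode="cluster"):
        body = {
            "apiVersion": "sparkoperator.k8s.io/v1beta2",
            "kind": "SparkApplication",
            "metadata": {"name": name, "namespace": self.namespace},
            "spec": {
                "type": app_type,
                "mode": mode,
                "mainApplicationFile": main_application_file,
            },
        }
        # Create the SparkApplication custom resource; the Spark operator takes it from there.
        return self.api.create_namespaced_custom_object(
            group="sparkoperator.k8s.io",
            version="v1beta2",
            namespace=self.namespace,
            plural="sparkapplications",
            body=body,
        )

# Example (hypothetical jar path):
# SparkClient("kubeflow").submit("hello-pipeline", "local:///opt/spark/examples/jars/app.jar")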

@charlesa101

Oh I see what you mean, thanks @terrytangyuan 👍

@mwielgus
Author

What should be the next steps? Do you have enough data points about Spark in Kubeflow?

@andreyvelich
Member

Hi @mwielgus, please can you join the Kubeflow Community call next Tuesday, October 31st, at 8:00am PST?
We can discuss the next steps and possibilities to move this forward.

Also @thesuperzapper can share some details around using Spark with Kubeflow Notebooks 2.0 (e.g. Kubeflow Workspaces).

@mwielgus
Author

@andreyvelich Yes, I will be there.

@andreyvelich
Member

We had a great discussion around the adoption of the Spark Operator during KubeCon with @mwielgus and @vara-bonthu.
We might be able to find folks who can maintain this project moving forward.
Let's have a chat tomorrow during the Kubeflow Community Call (November 14th at 8:00am PST).

@jbottum We will provide more updates during the call and discuss the next steps.

@andreyvelich
Member

andreyvelich commented Nov 21, 2023

Hi everyone, as we discussed on the latest Kubeflow community call, we have started this doc to donate the Spark Operator to Kubeflow:
https://docs.google.com/document/d/1rCPEBQZPKnk0m7kcA5aHPf0fISl0MTAzsa4Wg3dfs5M/edit#heading=h.z7wqs2ebrwra
Please take a look and provide your comments.
It would be great if we could quickly discuss it during today's Kubeflow Community Call at 8am PST (cc @mwielgus @vara-bonthu).

cc @kubeflow/project-steering-group @kubeflow/wg-pipeline-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads

@vara-bonthu
Contributor

I am looking forward to the adoption of Google's Spark K8s operator, which will contribute to building a larger community and could potentially become the official Spark operator for Apache Spark.

As part of this effort, it is crucial to establish support for a single official Spark Kubernetes operator within the Apache Spark community. Collaboration with Apache Spark and gaining their endorsement is of utmost importance in this context.

This collaboration will help prevent the Apache Spark community from introducing an entirely new Spark operator, as happened with Apache Flink, which offers an official Flink operator for Kubernetes. This approach helps avoid potential confusion within the community and ensures that users gravitate toward the approved Apache Spark operator tool.

@andreyvelich
Member

cc @yuchaoran2011

@lfrancke

If you want the operator to become even semi-"official" it should be donated to the ASF instead.
The ASF - in general - does not give any product the recognition of being the "official X for Y" or the "approved".
(I say this as a member of the ASF but not with any special knowledge or any special powers, just from my knowledge of the policies - especially around trademarks). https://www.apache.org/foundation/marks/

While we're at it: The current name "Google's Spark K8s Operator" might be a violation of the trademark policy already.
I suggest clarifying with the ASF before adopting the name.
The usual "approved" naming scheme is "XYZ for Apache Foo". In this case: "Google's Kubernetes operator for Apache Spark" (or similar).
It needs to be made clear, in naming, documentation, and communication, that this is in no way officially affiliated with the ASF.

With my other hat - as a co-founder of Stackable I'd like to point to another operator for Apache Spark which already exists (built by us): https://github.com/stackabletech/spark-k8s-operator/ and which we recently compared to the Google one.

Happy to help with any ASF related communication.

@wilfred-s

I agree with @lfrancke on the point of donating to the ASF if you want to make it even semi-official.
In the Apache YuniKorn community we see a number of groups using the operator. Most of them have made changes to the operator to fix issues or integrate with newer versions of Apache Spark.

@terrytangyuan
Member

If you want the operator to become even semi-"official" it should be donated to the ASF instead.

IMO, "official" should only be earned by merit and community adoption. Although donating to ASF helps the legal side, CNCF provides a good community around K8s and cloud-native technologies.

With my other hat - as a co-founder of Stackable I'd like to point to another operator for Apache Spark which already exists (built by us): https://github.com/stackabletech/spark-k8s-operator/ and which we recently compared to the Google one.

Out of curiosity, why not join the effort of maintaining the existing Spark Operator that's already widely adopted?

@thesuperzapper
Member

I don't think this discussion is about trying to present the Google Spark operator as an "official" option (from a Spark or even a Kubeflow perspective); it's simply about giving a new home to the existing users and contributors of GoogleCloudPlatform/spark-on-k8s-operator under the Kubeflow org, so they can continue working on it in a neutral place rather than continue struggling under their current home.

It's up to the maintainers of GoogleCloudPlatform/spark-on-k8s-operator to decide where they want to live, and in this specific case, it seems like they need a short-term solution to prevent those contributors/users from being stuck and unable to continue development.

Longer term, there is a strategic question about whether all three operators can be merged (including the Stackable one and the one that Apple was proposing to donate to the ASF), but I don't think that needs to block this donation, if all parties are willing.

@Jeffwan
Member

Jeffwan commented Nov 23, 2023

Hi folks, long time no see due to busy internal work. I happened to see this thread. A few things to note:

  1. The spark-operator collaboration has been around for a long time. If they need sponsorship, Kubeflow would be a perfect umbrella, and it would gradually extend the scope to Data + AI.

  2. About "official": I am on the Spark dev list and noticed a recent proposal there, SPIP: Spark Kubernetes Operator. Honestly, I think the GCP version is pretty good and widely used by numerous users and orgs. If the Kubeflow community can drive its evolution, that would help a lot of Spark users and may avoid reinventing wheels.

@vara-bonthu
Contributor

vara-bonthu commented Nov 23, 2023

Matthew (@thesuperzapper) makes a good point - we are looking for a new home for Google's Spark Operator, and CNCF projects like Kubeflow seem like a good fit because they have a bigger community. But our main goal is to prevent Apache Spark from making another Java Spark Operator. Instead, we think it's important for everyone to work together on one Spark Operator.

@Jeffwan, you're right. We found a proposal that already has votes from Spark maintainers. But we added our comments to the proposal, saying that Google's Spark Operator is widely adopted by hundreds of organizations in production today. Salesforce and a few others also added a "+" and said they think Google's Spark Operator is a good idea.

To make sure the Apache Spark community knows what we're thinking, we started a new proposal (SPIP) inside Apache Spark. You can find it here SPARK-46054.

Please share your thoughts and vote on the proposal. We want to work together on one Spark Operator, no matter if it ends up under Apache or Kubeflow. This will make the community bigger and stronger.

@wilfred-s @terrytangyuan with support from your folks, we can work on endorsing one tool to build a bigger community.

@lfrancke

With my other hat - as a co-founder of Stackable I'd like to point to another operator for Apache Spark which already exists (built by us): stackabletech/spark-k8s-operator and which we recently compared to the Google one.

Out of curiosity, why not join the effort of maintaining the existing Spark Operator that's already widely adopted?

I don't want to derail this issue, so I'll try to keep it short.
Our use case is different: we are building a platform that includes 10+ tools and operators (I recently gave a talk on our experience building a lot of operators). And for us it's important that all operators support the same features, consolidated documentation, CRD docs, vulnerability management, supply-chain security, Cyber Resilience Act compliance, etc.
For that reason we decided to build our own operators, to make sure they are all... similar. I hope that makes sense?

@vara-bonthu As mentioned before: It would be against the ASF rules for a project to "endorse" a project. So that is never going to happen if the project is not part of the ASF itself and even then the term "endorse" would almost certainly not be used.

@vara-bonthu
Contributor

@vara-bonthu As mentioned before: It would be against the ASF rules for a project to "endorse" a project. So that is never going to happen if the project is not part of the ASF itself and even then the term "endorse" would almost certainly not be used.

@lfrancke Thank you for the clarification regarding the term "endorse."

To clarify our intent, we have a straightforward goal here. We are interested in investigating the potential donation of the Spark Operator to either the Apache or Kubeflow projects. Once such a donation is agreed upon, we are committed to aligning with and adhering to the governance policies and guidelines of the chosen organization.

We also aim to prevent the unnecessary duplication of efforts in building multiple Spark Operators, which can potentially lead to confusion among users and organizations.

@vikas-saxena02

I support this proposal, as I have worked on a variety of use cases that require SparkML due to the sheer volume of data, including near-real-time scenarios using Spark Streaming. @andreyvelich @jbottum @akgraner I am more than happy to be part of this initiative, as I have the right skills for it.

@andreyvelich
Member

It's great to hear, @vikas-saxena02.
If you are available, please attend one of the upcoming Kubeflow Community Calls on Tuesday at 8am PST, so we can discuss the Spark Operator adoption updates.

@terrytangyuan
Member

See https://github.com/kubeflow/spark-operator

/close


@terrytangyuan: Closing this issue.

In response to this:

See https://github.com/kubeflow/spark-operator

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
