Adoption of Spark-on-k8s-operator #648
cc: @terrytangyuan
+1, happy to sponsor this. This would be a great addition to the Kubeflow community. cc @james-jwu @theadactyl cc @kubeflow/wg-training-leads
Thank you for proposing this @mwielgus! I agree that the Spark operator might be useful for Kubeflow users who want to do data preparation, feature extraction, data validation, etc. before building and training their ML models. Currently, Kubeflow doesn't offer such functionality. It would be nice if you could join our upcoming AutoML and Training WG community call today (September 6th) at 6pm UTC (10am PST) to discuss the details and potential use cases. cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing
Is this proposal to have the Spark operator be an independent operator in Kubeflow?
Thanks, I will join the meeting today :).
Basically, SGTM. However, I have the same question that @johnugeorge asked.
FYI, the Kubeflow User Survey(s) have consistently shown that users would like a Spark / Kubeflow integration.
We will discuss if and how Kubeflow will support a Spark K8s operator in our Community Meeting on Tuesday; please find the bridge in these meeting notes. I suspect there may be several operators or implementations, and we need to decide if we are going to pick one, how it will be supported, if it is part of a (new) Kubeflow Working Group, how it is installed, etc. @kimwnasptd @mwielgus Kubeflow community meeting notes: https://docs.google.com/document/d/1Wdxt1xedAj7qF_Rjmxy1R0NRdfv7UWs-r2PItewxHpE/edit.
@mwielgus we had a Spark operator before. Are they using the modern Spark Connect? https://spark.apache.org/docs/latest/spark-connect-overview.html You can already use the Kubernetes apiserver as Spark master, so I am wondering whether that plus Spark Connect is already enough. Anyway, I am open to contributions in manifests/contrib.
Here is the recording for our initial discussion on Sep 9th around Spark Operator in Kubeflow: https://youtu.be/3D2h5OUNCQo. @mwielgus Please can you attend the Kubeflow Community call today at 8:00am PST, so we can have a followup discussion around Spark Operator: https://docs.google.com/document/d/1Wdxt1xedAj7qF_Rjmxy1R0NRdfv7UWs-r2PItewxHpE/edit#heading=h.xtqde2br5mh4. cc @kubeflow/wg-training-leads
@andreyvelich I will be there.
Thank you, Marcin!
As a follow-up to our recent Apache Spark discussions in the Kubeflow Community meetings, we are requesting some user input... If you are a Spark user or contributor, the Kubeflow Community would like to know if you need active support for a Spark Kubernetes Operator. If so, would you please comment or +1 on this GitHub issue. We need at least 10 users and would appreciate any ideas on use cases, e.g., integration with notebooks or Kubeflow Pipelines. Thanks! Josh
IMO, the fundamental gap is the lack of an SDK. Data scientists would rather write Python than YAML (for good reason). There needs to be (a) some clarification (and documentation) about the benefits of the Spark operator over pyspark, and (b) development of an SDK (perhaps an extension to the training operator SDK).
@droctothorpe, we currently use the SparkOperator in a few of our projects. It makes it easy for us to deploy Spark jobs "natively" on K8s, much like how the training operators currently work, so I am not sure what you mean by lack of SDK here. We use it with the Kubeflow Pipelines DSL.
+1 on this issue. It would be great for SparkOperator to find a new home here.
@charlesa101 that's JSON with no customization, and the configuration options are abundant. It's nice to be able to just use ResourceOp, though. Thanks for sharing. Our platform provides both pyspark and Spark Operator support, and the overwhelming majority of users prefer pyspark. That's just one data point, though. IMO, a proper Python interface a la the training operator SDK (or pyspark) would promote adoption.
@droctothorpe This is based on the CRD for SparkOperator, and it will work the same way for PySpark. I'm curious to know more about how your PySpark Operator implementation works. The configurations are abundant, but I am not sure there will be a use case where you have to load up all the configs. I agree with you that it would be great to eventually align the behavior of this operator with the training operators to make it easy to use, but I am not sure what you mean by SDK in this context. Once you have the YAML and CRDs well defined, you can easily use them in your KFP as a component.
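To ground the CRD-in-KFP workflow being described, here is a minimal, hypothetical sketch of a `SparkApplication` manifest assembled as a Python dict, which could then be serialized to YAML or handed to a KFP resource-based component. The `apiVersion` matches the Google operator's `v1beta2` CRD; the application name, image, file path, and resource values are placeholder assumptions, not taken from this thread.

```python
# Sketch: build a minimal SparkApplication custom resource as a dict.
# Placeholder values throughout; real jobs would set images, volumes,
# service accounts, and many more of the "abundant" config options.

def spark_application(name: str, image: str, main_file: str,
                      executor_instances: int = 2) -> dict:
    """Assemble a minimal SparkApplication manifest for the operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "driver": {"cores": 1, "memory": "512m"},
            "executor": {
                "instances": executor_instances,
                "cores": 1,
                "memory": "512m",
            },
        },
    }

# A KFP component would serialize this dict to YAML and apply it.
manifest = spark_application("pi", "spark:3.5.0", "local:///opt/app/pi.py")
```

The point of keeping this as plain data is that a pipeline step only needs to template a handful of fields; the rest of the CRD surface stays at its defaults.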
Here's the Python SDK for the training operator. Basically, instead of writing YAML and using it in your KFP component, you can use Python to define and submit jobs directly: https://github.com/kubeflow/training-operator/tree/master/sdk/python
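For comparison, here is a hypothetical sketch of what a Spark client with training-operator-SDK-style ergonomics might look like. `SparkClient` and `create_job` are invented names for illustration only (no such released library is implied), and the injectable `submit` hook stands in for a real Kubernetes API call so the sketch runs without a cluster.

```python
# Hypothetical SDK sketch: a thin Python client that hides the CRD YAML,
# mirroring the ergonomics of the training-operator SDK. Not a real API.

class SparkClient:
    def __init__(self, submit=None):
        # `submit` would normally wrap the Kubernetes custom-objects API;
        # it is injectable here so the example is self-contained.
        self._submit = submit or (lambda manifest: manifest)

    def create_job(self, name: str, image: str, main_file: str) -> dict:
        """Build a SparkApplication manifest and hand it to the submitter."""
        manifest = {
            "apiVersion": "sparkoperator.k8s.io/v1beta2",
            "kind": "SparkApplication",
            "metadata": {"name": name},
            "spec": {
                "type": "Python",
                "mode": "cluster",
                "image": image,
                "mainApplicationFile": main_file,
            },
        }
        return self._submit(manifest)

client = SparkClient()
job = client.create_job("word-count", "spark:3.5.0", "local:///opt/app/wc.py")
```

The design choice here is the same one the training-operator SDK makes: the user supplies a few Python arguments, and the client owns the mapping onto the custom resource.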
Oh I see what you mean, thanks @terrytangyuan 👍
What should be the next steps? Do you have enough data points about Spark in Kubeflow?
Hi @mwielgus, please can you join the Kubeflow Community call next Tuesday, October 31st, at 8:00am PST? Also, @thesuperzapper can share some details around using Spark with Kubeflow Notebooks 2.0 (e.g. Kubeflow Workspaces).
@andreyvelich Yes, I will be there.
We had a great discussion around adoption of Spark Operator during KubeCon with @mwielgus and @vara-bonthu. @jbottum We will provide more updates during the call and discuss the next steps.
Hi everyone, as we discussed on the latest Kubeflow community call, we started this doc to donate Spark Operator to Kubeflow: cc @kubeflow/project-steering-group @kubeflow/wg-pipeline-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads
I am looking forward to the adoption of Google's Spark K8s Operator, which will contribute to building a larger community and could potentially become the official Spark Operator for Apache Spark. As part of this effort, it is crucial to establish support for a single official Spark Kubernetes Operator within the Apache Spark community. Collaboration with Apache Spark and gaining their endorsement is of utmost importance in this context. This collaboration will serve to prevent the Apache Spark community from introducing an entirely new Spark Operator, akin to Apache Flink, which offers an official Flink Operator for Kubernetes. This approach helps avoid potential confusion within the community and ensures that users gravitate toward the approved Apache Spark Operator tool.
If you want the operator to become even semi-"official", it should be donated to the ASF instead. While we're at it: the current name "Google's Spark K8s Operator" might already be a violation of the trademark policy. Wearing my other hat, as a co-founder of Stackable, I'd like to point to another operator for Apache Spark which already exists (built by us): https://github.com/stackabletech/spark-k8s-operator/ and which we recently compared to the Google one. Happy to help with any ASF-related communication.
I agree with @lfrancke on the point of donating to the ASF if you want to make it even semi official.
IMO, "official" should only be earned by merit and community adoption. Although donating to ASF helps the legal side, CNCF provides a good community around K8s and cloud-native technologies.
Out of curiosity, why not join the effort of maintaining the existing Spark Operator that's already widely adopted?
I don't think this discussion is about trying to present the Google Spark operator as an "official" option (from a Spark or even a Kubeflow perspective); it's simply about giving a new home to the existing users and contributors of the project, and that decision is up to its maintainers. Longer term, there is a strategic question about whether all three operators can be merged (including the Stackable one and the one that Apple was proposing to donate to the ASF), but I don't think that needs to block this donation, if all parties are willing.
Hi folks, long time no see due to busy internal work. I happened to see this thread. A few things to note:
Matthew (@thesuperzapper) makes a good point: we are looking for a new home for Google's Spark Operator, and CNCF projects like Kubeflow seem like a good fit because they have a bigger community. But our main goal is to prevent Apache Spark from making another Java Spark Operator; instead, we think it's important for everyone to work together on one Spark Operator. @Jeffwan, you're right. We found a proposal that already has votes from Spark maintainers, but we added our comments to the proposal, saying that Google's Spark Operator is widely adopted by hundreds of organizations in production today. Salesforce and a few others also added a "+1" and said they think Google's Spark Operator is a good idea. To make sure the Apache Spark community knows what we're thinking, we started a new proposal (SPIP) inside Apache Spark. You can find it here: SPARK-46054. Please share your thoughts and vote on the proposal. We want to work together on one Spark Operator, no matter whether it ends up under Apache or Kubeflow. This will make the community bigger and stronger. @wilfred-s @terrytangyuan, with support from your folks, we can work on endorsing one tool to build a bigger community.
I don't want to derail this issue, so I'll try to keep it short. @vara-bonthu As mentioned before: it would be against the ASF rules for a project to "endorse" a project. So that is never going to happen if the project is not part of the ASF itself, and even then the term "endorse" would almost certainly not be used.
@lfrancke Thank you for the clarification regarding the term "endorse." To clarify our intent, we have a straightforward goal here. We are interested in investigating the potential donation of the Spark Operator to either the Apache or Kubeflow projects. Once such a donation is agreed upon, we are committed to aligning with and adhering to the governance policies and guidelines of the chosen organization. We also aim to prevent the unnecessary duplication of efforts in building multiple Spark Operators, which can potentially lead to confusion among users and organizations.
I am in support of this proposal, as I have worked on a variety of use cases that require SparkML due to the sheer volume of data, including near-real-time scenarios using Spark Streaming. @andreyvelich @jbottum @akgraner, I am more than happy to be part of this initiative as I have the right skills for the same.
It's great to hear, @vikas-saxena02.
@terrytangyuan: Closing this issue. In response to this:
We are looking for a new home for Spark-on-k8s-operator. The project has been quite active for years, delivering a convenient way of running Spark in the Kubernetes environment. Unfortunately, due to some org changes, the previous maintainers are unable to provide the time and love that the project and its users deserve. So, GoogleCloudPlatform would like to transfer ownership of the code (already under the Apache license) to an organisation that would help bring more life to the project and continue to help users run Spark on K8s. Given that you support a wide variety of ML/batch frameworks (MPI, TF, PyTorch, etc.), we think that Kubeflow would be a good place for the Spark operator.