Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proto] Add core api definitions as protobuf message #93

Merged
merged 2 commits into from
Dec 6, 2021

Conversation

Jeffwan
Copy link
Collaborator

@Jeffwan Jeffwan commented Nov 2, 2021

Why are these changes needed?

Related issue number

Part of #53 #82

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

This is just api definition and we will test changes later with business logic together

- Add cluster, config and error api definition
@Jeffwan Jeffwan changed the title Add core api definitions in protobuf [proto] Add core api definitions in protobuf Nov 2, 2021
@Jeffwan Jeffwan changed the title [proto] Add core api definitions in protobuf [proto] Add core api definitions as protobuf message Nov 2, 2021
@Jeffwan
Copy link
Collaborator Author

Jeffwan commented Nov 2, 2021

/cc @chenk008 @akanso This is to support the backend service on top of ray-operator. We need some definition to generate gPRC services. Please have a check

proto/cluster.proto Outdated Show resolved Hide resolved
// Required. Group name of the current worker group
string group_name = 1;
// Optional. The computeTemplate of head node group
string compute_template = 2;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenk008 I change to string instead. WDYT? We can make this concept persisted and resuable. If we use message ComputeTemplate, user need to fetch the object as well and they may modified the fields. I feel using reference key is easier

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me.

google.protobuf.Timestamp deleted_at = 8;
}

message ClusterSpec {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There're only HeadGroupSpec and WorkerGroupSpec object under ClusterSpec, do you think this is helpful? We can also move HeadGroupSpec and WorkerGroupSpec directly to message Cluster. I am ok with both way @chenk008

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ,we can simplify it to move HeadGroupSpec and WorkerGroupSpec into Cluster.

// Required field. This field indicates ray cluster configuration
ClusterSpec cluster_spec = 6;

// Output. The time that the Cluster created.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our downsteam version, we expose a service endpoint here. However, I feel it's kind of hard to write an endpoint since service expose dashboard, client, redis port. What's your opinion? @chenk008

Copy link
Contributor

@chenk008 chenk008 Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is required to expose head service endpoint. The devops platform will show the information.

@akanso
Copy link
Collaborator

akanso commented Nov 3, 2021

this PR only shows the GRPC part, we need to discuss the GRPC interface, and whether it will be optional to expose and how? (via a flag, --grpc-interface=false)?

The second point I have is about the image template, it seems to me we are mixing pipeline builds with operations. Can you please clarify why image template is part of an operator API?

@Jeffwan
Copy link
Collaborator Author

Jeffwan commented Nov 3, 2021

this PR only shows the GRPC part, we need to discuss the GRPC interface, and whether it will be optional to expose and how? (via a flag, --grpc-interface=false)?

This would a separate backend service deployment. For people who are not familiar with kubernetes, they can use this service. the gprc service definition and gateway part would be a separate PR. @chenk008 gave some suggestion to split story into smaller chunks so people can pick them up from the issue list

Can you please clarify why image template is part of an operator API?

@akanso I think this won't be operator API, but API on top of operator. We have a use case that data scientist doesn't like to build docker images by themselves, they like a tool/api to build the images so that don't have to write Dockerfile and manage container registry credential etc.

The proto message here is to define API. Technically, we can build a gPRC service on top of it. the service can kick off a job to build images inside Kubernetes and automatically handles image uploads. In that case, user just need to let the service know which python package they need and it simplifies the user experiences.

@akanso
Copy link
Collaborator

akanso commented Nov 3, 2021

I think we need a design doc for this feature.
user API -->[utility service] --> docker image. --> /used in/ --> KubeRay Custom resource yaml --> /consumed by/ --> KubeRay
just a couple of pages to motivate the work and some diagrams indicating the flow of events

@Jeffwan
Copy link
Collaborator Author

Jeffwan commented Nov 4, 2021

@akanso That's a good idea. I will draft one and share with you guys.

@Jeffwan
Copy link
Collaborator Author

Jeffwan commented Nov 25, 2021

I think cluster related definition has corresponding design doc in #98

For the image part, it may worth creating a separate doc later. Currently, what I am thinking is to leverage https://github.com/GoogleContainerTools/kaniko + Docker Template to build images and managed tags/versions/ registry credential etc so user just need to fill the template without worrying about other stuff.

  • REST client -> gRPC service -> Kubernetes Job (Kaniko Job) -- Async way
  • CLI -> gPRC service -> Kubernetes Job (Kaniko Job) -- Sync way

Steps

  1. User inputs payload
        {
            "name": "ray1.7-torch1.9-tensorflow2.3-gpu",
            "base_image": "ray-project/ray:1.7.0-gpu",
            "pip_packages": [
                "tensorflow==2.3.0",
                "torch==1.9.0"
            ],
            "system_packages": ["vim", "git"],
            "environment_variables": {
                "CLASSPATH": "/opt/lib/yarn_deploy/hadoop/bin/hadoop classpath --glob",
                "HADOOP_HOME": "/opt/lib/yarn_deploy/hadoop",
                "LD_LIBRARY_PATH": "xxx",
             },
            "custom_commands": "",
        },
  1. backend construct payload and template dockerfile to a final Dockerfile to build

Template:

	tmpl := `FROM {{ .BaseImage}}
RUN apt update && apt install -y {{ StringsJoin .SystemPackages " " }}
{{ if .PipPackages }}
RUN pip install {{ StringsJoin .PipPackages " " }}
{{ end }}
{{ if .CondaPackages }}
RUN conda install {{ StringsJoin .CondaPackages " " }}
{{ end }}
{{ range $key, $value := .EnvironmentVariables }}
ENV {{ $key}} {{ $value }}
{{ end }}

RUN {{ .CustomCommands }}
`
  1. Run Jobs
 Name:  "image-builder",
 Image: "hub.byted.org/kuberay/kaniko-executor:v1.7.0",
 Command: []string{
         "/kaniko/executor",
 },

 Args: []string{
         "--dockerfile=/build/images/cluster-image/Dockerfile",
         "--context=dir:///build/images/cluster-image/Dockerfile",
        # "--destination=${REGISTRY}/${REPO}:${TAG}",
        "--destination=seedjeffwan/kuberay-custom-image-ray1.7-torch1.8-tf.23:0.1",

@Jeffwan Jeffwan merged commit 3f17003 into ray-project:master Dec 6, 2021
@Jeffwan Jeffwan deleted the proto_api branch December 6, 2021 09:49
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* Add protobuf core definitions

- Add cluster, config and error api definition

* Update protobuf definitions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants