feat: add an interface for migratable nodes #42

Open
wants to merge 1 commit into
base: master
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
node_modules
node_modules
.idea
27 changes: 27 additions & 0 deletions protos/config.proto
@@ -135,6 +135,29 @@ message GetClusterNodesInfoResponse {
  repeated NodeInfo nodes = 1;
}

message MigrateNodeInfo {
Member:

This message only indicates whether a node can be migrated. Since SCOW is responsible for implementing the migration, SCOW really only needs to know whether a node can be removed from its cluster. That information can be added directly to NodeInfo as a bool removable field.

Author:

Yes, that implementation is more elegant. But it means SCOW has to determine which nodes can migrate between which clusters.
For example, suppose SCOW manages four clusters: slurm01, slurm02, ai01, and ai02, with the following nodes:

slurm01  [node01, node02, node03]
slurm02  [node04, node05, node06]
ai01     [node01, node02, node07]
ai02     [node04, node05, node08]

SCOW then has to work out that node01 and node02 can migrate between slurm01 and ai01, and that node04 and node05 can migrate between slurm02 and ai02.
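
A minimal sketch of how SCOW might derive these pairs, assuming it has already collected each cluster's node names from the corresponding adapter (for instance via GetClusterNodesInfo). The function name find_shared_nodes and the dict layout are hypothetical. Appearing in two clusters' node lists is treated here as the only condition; a real check would presumably also consult per-node state such as the removable flag suggested above.

```python
from itertools import combinations


def find_shared_nodes(cluster_nodes):
    """Map each pair of clusters to the sorted node names they both list.

    cluster_nodes: dict of cluster name -> iterable of node names
    (hypothetical shape; SCOW would build it from per-adapter responses).
    """
    shared = {}
    # Compare every unordered pair of clusters exactly once.
    for a, b in combinations(sorted(cluster_nodes), 2):
        common = sorted(set(cluster_nodes[a]) & set(cluster_nodes[b]))
        if common:
            shared[(a, b)] = common
    return shared


clusters = {
    "slurm01": ["node01", "node02", "node03"],
    "slurm02": ["node04", "node05", "node06"],
    "ai01": ["node01", "node02", "node07"],
    "ai02": ["node04", "node05", "node08"],
}

for pair, nodes in find_shared_nodes(clusters).items():
    print(pair, nodes)
```

For the four clusters above this yields two entries: node01 and node02 for the ai01/slurm01 pair, and node04 and node05 for the ai02/slurm02 pair. No adapter needs to know about any other cluster for this to work.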

ddadaal (Member), Jan 30, 2025:

SCOW manages all clusters, knows the state of every cluster, and initiates the migration, so it should indeed be SCOW that decides which nodes can migrate between which clusters. The interface only needs to return enough information for SCOW to make that decision.


  enum NodeState {
    UNKNOWN = 0;
    MIGRATABLE = 1;
    NOT_MIGRATABLE = 2;
  }

  string node_name = 1;
  repeated string partitions = 2;
  NodeState state = 3;
  string cluster_name = 4;
Member:

An adapter is deployed in exactly one cluster, so a caller sending a request to an adapter always knows which cluster it is addressing. There is no need to return the cluster's own name.

  repeated string migratable_clusters = 5;
Member:

The description does not explain what this field means. Going by the name, is it the list of clusters this node can be migrated to? The field implies that each adapter must know about other clusters; dependencies between adapters should be avoided.

}

message GetClusterMigrateNodesInfoRequest {

}

message GetClusterMigrateNodesInfoResponse {
  repeated MigrateNodeInfo nodes = 1;
}

message ListImplementedOptionalFeaturesRequest {}

enum OptionalFeatures {
@@ -163,6 +186,10 @@ service ConfigService {
   * description: get cluster nodes information
   */
  rpc GetClusterNodesInfo(GetClusterNodesInfoRequest) returns (GetClusterNodesInfoResponse);
  /*
   * description: get cluster migrate nodes information
   */
  rpc GetClusterMigrateNodesInfo(GetClusterMigrateNodesInfoRequest) returns (GetClusterMigrateNodesInfoResponse);
  /*
   * description: List optional features implemented by this scheduler adapter
   */
36 changes: 36 additions & 0 deletions protos/node.proto
@@ -0,0 +1,36 @@
/**
* Copyright (c) 2022 Peking University and Peking University Institute for Computing and Digital Economy
* SCOW is licensed under Mulan PSL v2.
* You can use this software according to the terms and conditions of the Mulan PSL v2.
* You may obtain a copy of Mulan PSL v2 at:
* http://license.coscl.org.cn/MulanPSL2
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
* EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT,
* MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE.
* See the Mulan PSL v2 for more details.
*/

syntax = "proto3";

package scow.scheduler_adapter;

message MigrateNodeRequest {
  string node_name = 1;
  repeated string destination_partitions = 3;
  string origin_cluster_name = 4;
  string destination_cluster_name = 5;
Comment on lines +20 to +21
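ddadaal (Member), Dec 26, 2024:

What exactly happens when the MigrateNode RPC is called on a cluster? An adapter should manage only its own cluster, so why does it need to know the destination cluster? This creates dependencies between adapters, which is not good practice. Interaction between clusters should be implemented by the caller of the adapters (SCOW).

For example, suppose each node stores which cluster it belongs to, and migrating a node means changing that stored cluster to a different one. The correct design would then be:

  • Add two RPCs
    • RemoveNodeFromCluster: removes a node from the current cluster, implemented by clearing the cluster information the node stores about itself
    • AddNodeToCluster: adds a node to the current cluster, implemented by recording the current cluster in the node's stored cluster information
  • Migration logic: SCOW first calls RemoveNodeFromCluster on one cluster, then calls AddNodeToCluster on the other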
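
The caller-side flow can be sketched roughly as follows. This is an illustration only: FakeAdapter is a hypothetical in-memory stand-in for a cluster's adapter client, its two methods merely mirror the proposed RemoveNodeFromCluster and AddNodeToCluster RPCs, and the rollback on failure is an assumption of this sketch rather than part of the proposal.

```python
class FakeAdapter:
    """In-memory stand-in for one cluster's scheduler adapter (hypothetical)."""

    def __init__(self, cluster_name, nodes):
        self.cluster_name = cluster_name
        self.nodes = set(nodes)

    def remove_node_from_cluster(self, node_name):
        # Mirrors the proposed RemoveNodeFromCluster RPC.
        if node_name not in self.nodes:
            raise KeyError(f"{node_name} not found in {self.cluster_name}")
        self.nodes.remove(node_name)

    def add_node_to_cluster(self, node_name):
        # Mirrors the proposed AddNodeToCluster RPC.
        self.nodes.add(node_name)


def migrate_node(node_name, origin, destination):
    """Caller-side (SCOW) migration: remove from origin, then add to destination.

    If the add fails, the node is put back into the origin cluster so it
    does not end up belonging to no cluster (an assumption of this sketch).
    """
    origin.remove_node_from_cluster(node_name)
    try:
        destination.add_node_to_cluster(node_name)
    except Exception:
        origin.add_node_to_cluster(node_name)
        raise


slurm01 = FakeAdapter("slurm01", ["node01", "node02", "node03"])
ai01 = FakeAdapter("ai01", ["node07"])
migrate_node("node01", slurm01, ai01)
print(sorted(slurm01.nodes), sorted(ai01.nodes))
```

Each adapter only ever touches its own cluster, and the only component that knows about both clusters is the caller.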

}

message MigrateNodeResponse {

}

service NodeService {
  /*
   * description: migrate node
   * errors:
   * - node not found
   *   NOT_FOUND, NODE_NOT_FOUND, {}
   */
  rpc MigrateNode(MigrateNodeRequest) returns (MigrateNodeResponse);
}