feat: add an interface for node migratability #42
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
node_modules | ||
node_modules | ||
.idea |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -135,6 +135,29 @@ message GetClusterNodesInfoResponse { | |
repeated NodeInfo nodes = 1; | ||
} | ||
|
||
message MigrateNodeInfo { | ||
|
||
enum NodeState { | ||
UNKNOWN = 0; | ||
MIGRATABLE = 1; | ||
NOT_MIGRATABLE = 2; | ||
} | ||
|
||
string node_name = 1; | ||
repeated string partitions = 2; | ||
NodeState state = 3; | ||
string cluster_name = 4; | ||
Review comment: An adapter is deployed inside a single cluster, so a caller that sends a request to an adapter already knows which cluster it is addressing. There is no need to return the cluster's own name. |
||
repeated string migratable_clusters = 5; | ||
Review comment: The description does not explain what this field means. Going by the name, is it the list of clusters this node can be migrated to? The field implies that each adapter has to know about other clusters; dependencies between adapters should be avoided. |
||
} | ||
|
||
message GetClusterMigrateNodesInfoRequest { | ||
|
||
} | ||
|
||
message GetClusterMigrateNodesInfoResponse { | ||
repeated MigrateNodeInfo nodes = 1; | ||
} | ||
|
||
message ListImplementedOptionalFeaturesRequest {} | ||
|
||
enum OptionalFeatures { | ||
|
@@ -163,6 +186,10 @@ service ConfigService { | |
* description: get cluster nodes information | ||
*/ | ||
rpc GetClusterNodesInfo(GetClusterNodesInfoRequest) returns (GetClusterNodesInfoResponse); | ||
/* | ||
* description: get cluster migrate nodes information | ||
*/ | ||
rpc GetClusterMigrateNodesInfo(GetClusterMigrateNodesInfoRequest) returns (GetClusterMigrateNodesInfoResponse); | ||
/* | ||
* description: List optional features implemented by this scheduler adapter | ||
*/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
/** | ||
* Copyright (c) 2022 Peking University and Peking University Institute for Computing and Digital Economy | ||
* SCOW is licensed under Mulan PSL v2. | ||
* You can use this software according to the terms and conditions of the Mulan PSL v2. | ||
* You may obtain a copy of Mulan PSL v2 at: | ||
* http://license.coscl.org.cn/MulanPSL2 | ||
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, | ||
* EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, | ||
* MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE. | ||
* See the Mulan PSL v2 for more details. | ||
*/ | ||
|
||
syntax = "proto3"; | ||
|
||
package scow.scheduler_adapter; | ||
|
||
message MigrateNodeRequest { | ||
string node_name = 1; | ||
repeated string destination_partitions = 3; | ||
string origin_cluster_name = 4; | ||
string destination_cluster_name = 5; | ||
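Review comment on lines +20 to +21: What exactly happens when the MigrateNode RPC is called on a cluster? An adapter should manage only its own cluster, so why does it need to know the destination cluster? This makes the adapters depend on one another, which is not good practice. Interaction between clusters should be implemented by the caller of the adapters (SCOW). For example, suppose each node stores information about the cluster it belongs to, and migrating a node to a cluster means changing that stored information to another cluster. The correct design would then be: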
|
||
} | ||
|
||
message MigrateNodeResponse { | ||
|
||
} | ||
|
||
service NodeService { | ||
/* | ||
* description: migrate node | ||
* errors: | ||
* - node not found | ||
* NOT_FOUND, NODE_NOT_FOUND, {} | ||
*/ | ||
rpc MigrateNode(MigrateNodeRequest) returns (MigrateNodeResponse); | ||
} |
Review comment: This message only indicates whether a node can be migrated. Since SCOW is responsible for implementing the migration, SCOW really only needs to know whether a node can be removed from its cluster. That information can be added directly to NodeInfo as a `bool removable` field.
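To illustrate the suggestion, here is a minimal caller-side sketch in Python, not taken from the PR. It assumes the adapter's node list has already been decoded into simple objects; the `NodeInfo` dataclass below is a hypothetical, trimmed-down stand-in for the proto message, and only the `removable` flag comes from the review comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    # Hypothetical stand-in for the proto NodeInfo message.
    # Only `removable` is taken from the review suggestion.
    name: str
    removable: bool = False  # proto3 bool fields default to false when unset


def removable_nodes(nodes):
    """Return the names of the nodes an adapter reports as removable."""
    return [node.name for node in nodes if node.removable]


nodes = [NodeInfo("node01", True), NodeInfo("node02", True), NodeInfo("node03")]
print(removable_nodes(nodes))  # prints ['node01', 'node02']
```

With a single boolean per node, the adapter reports only facts about its own cluster, and the caller decides what to do with them.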
Review comment: Yes, that is a more elegant implementation. It does mean, though, that SCOW has to work out which nodes can be migrated between which clusters.
For example, suppose SCOW manages four clusters: slurm01, slurm02, ai01 and ai02. The nodes of these four clusters are as follows:
SCOW then has to recognize that node01 and node02 can be migrated between slurm01 and ai01, and that node04 and node05 can be migrated between slurm02 and ai02.
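One way the caller could derive such pairs is sketched below in Python. The matching rule is an assumption, not something stated in this thread: a node counts as a candidate for two clusters when both clusters report a removable node with the same name. The per-cluster membership data is also hypothetical, since the node list from the comment is not available here.

```python
from itertools import combinations


def migration_candidates(removable_by_cluster):
    """Map each cluster pair to the node names both clusters report as removable.

    Assumed rule (not from the PR): a shared node name means the node can move
    between the two clusters.
    """
    result = {}
    for a, b in combinations(sorted(removable_by_cluster), 2):
        shared = removable_by_cluster[a] & removable_by_cluster[b]
        if shared:
            result[(a, b)] = sorted(shared)
    return result


# Hypothetical data: removable node names as reported by each cluster's adapter.
removable_by_cluster = {
    "slurm01": {"node01", "node02", "node03"},
    "ai01": {"node01", "node02"},
    "slurm02": {"node04", "node05"},
    "ai02": {"node04", "node05", "node06"},
}

for pair, names in migration_candidates(removable_by_cluster).items():
    print(pair, names)
```

All cross-cluster knowledge lives in the caller here; each adapter contributes only the state of its own cluster.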
Review comment: SCOW manages all the clusters, knows the state of every one of them, and is the party that initiates a migration, so it should indeed be SCOW that decides which nodes can be migrated between which clusters. The interface only has to return enough information for SCOW to make that decision.