feat: add interfaces for migratable nodes #42

Open
wants to merge 1 commit into
base: master

283713406

No description provided.

@283713406 283713406 requested a review from vanstriker December 19, 2024 06:21
@283713406
Author

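Node migration requirement: see issue https://github.com/PKUHPC/scow-internal-dev/issues/640

Example:

Assume that initially node1 and node2 are used by the hpc01 cluster, with jobs running on node2, and node3 is used by the ai01 cluster.
GetClusterMigrateNodesInfo on hpc01 returns: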

{
    {
        node_name: node1,
        partitions: [compute],
        state: MIGRATABLE,
        cluster_name: hpc01,
    },
    {
        node_name: node2,
        partitions: [compute],
        state: NOT_MIGRATABLE,
        cluster_name: hpc01,
    },
}

GetClusterMigrateNodesInfo on ai01 returns:

{
    {
        node_name: node3,
        partitions: [gpu-queue],
        state: MIGRATABLE,
        cluster_name: ai01,
    },
}

Migrate node1:

MigrateNodeRequest: {
    node_name: node1;
    destination_partitions: [cpu-queue];
    origin_cluster_name: hpc01;
    destination_cluster_name: ai01;
}

After the migration, GetClusterMigrateNodesInfo on hpc01 returns:

{
    {
        node_name: node2,
        partitions: [compute],
        state: NOT_MIGRATABLE,
        cluster_name: hpc01,
    },
}

GetClusterMigrateNodesInfo on ai01 returns:

{
    {
        node_name: node1,
        partitions: [cpu-queue],
        state: MIGRATABLE,
        cluster_name: hpc01,
    },
    {
        node_name: node3,
        partitions: [gpu-queue],
        state: MIGRATABLE,
        cluster_name: ai01,
    },
}
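
The walkthrough above can be replayed with a small in-memory model. This is only a sketch under assumptions: the `Node` class, the `clusters` dictionary, and the rule that a node with running jobs is NOT_MIGRATABLE are illustrative stand-ins, not the adapter's actual implementation, and the `cluster_name` field is left out of the returned tuples.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    partitions: list[str]
    # Assumption: a node with running jobs cannot be migrated.
    has_running_jobs: bool = False

    @property
    def state(self) -> str:
        return "NOT_MIGRATABLE" if self.has_running_jobs else "MIGRATABLE"


def get_cluster_migrate_nodes_info(clusters, cluster_name):
    """Return (node_name, partitions, state) for every node of one cluster."""
    nodes = sorted(clusters[cluster_name].values(), key=lambda n: n.name)
    return [(n.name, n.partitions, n.state) for n in nodes]


def migrate_node(clusters, node_name, destination_partitions,
                 origin_cluster_name, destination_cluster_name):
    """Move a node between clusters, mirroring the MigrateNodeRequest fields."""
    node = clusters[origin_cluster_name][node_name]
    if node.state != "MIGRATABLE":
        raise ValueError(f"{node_name} is not migratable")
    del clusters[origin_cluster_name][node_name]
    clusters[destination_cluster_name][node_name] = Node(
        node_name, list(destination_partitions))


def initial_clusters():
    """Initial environment from the example: node2 has a running job."""
    return {
        "hpc01": {
            "node1": Node("node1", ["compute"]),
            "node2": Node("node2", ["compute"], has_running_jobs=True),
        },
        "ai01": {"node3": Node("node3", ["gpu-queue"])},
    }


clusters = initial_clusters()
migrate_node(clusters, "node1", ["cpu-queue"], "hpc01", "ai01")
print(get_cluster_migrate_nodes_info(clusters, "hpc01"))
print(get_cluster_migrate_nodes_info(clusters, "ai01"))
```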

Member

@ddadaal ddadaal left a comment

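An adapter should manage its own cluster and only that cluster; it should not know anything about other clusters. See my comment on the MigrateNode RPC for details.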

string node_name = 1;
repeated string partitions = 2;
NodeState state = 3;
string cluster_name = 4;
Member

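An adapter is deployed in a single cluster, so whoever sends a request to an adapter necessarily knows which cluster the request targets. There is no need to return the cluster's own name.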

repeated string partitions = 2;
NodeState state = 3;
string cluster_name = 4;
repeated string migratable_clusters = 5;
Member

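The description does not explain what this field means. Going by the name, does it list the clusters this node can be migrated to? This field implies that each adapter knows about other clusters; dependencies between adapters should be avoided.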

Comment on lines +20 to +21
string origin_cluster_name = 4;
string destination_cluster_name = 5;
Member

@ddadaal ddadaal Dec 26, 2024

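What exactly happens when the MigrateNode RPC is called on a cluster? An adapter should manage only its own cluster, so why does it need to know the destination cluster? This creates dependencies between adapters, which is not good practice. Interactions between clusters should be implemented by the caller of the adapters (SCOW).

For example: suppose each node stores which cluster it belongs to, and migrating a node to a cluster means changing that stored cluster to another one. The correct design would then be:

  • Add two RPCs
    • RemoveNodeFromCluster: removes a node from the current cluster, implemented by clearing the cluster information stored on that node
    • AddNodeToCluster: adds a node to the current cluster, implemented by recording the cluster information on that node
  • Migration logic: SCOW first calls RemoveNodeFromCluster on one cluster, then calls AddNodeToCluster on the other cluster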
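
The two-RPC design above can be sketched as follows. Everything here is hypothetical: `ClusterAdapter` is an in-memory stand-in for one cluster's adapter, the method names only mirror the suggested RPC names, and `scow_migrate_node` stands for the caller-side logic. Failure handling between the two calls is deliberately left out.

```python
class ClusterAdapter:
    """Hypothetical stand-in for an adapter; it only knows its own nodes."""

    def __init__(self, nodes):
        self.nodes = set(nodes)

    def remove_node_from_cluster(self, node_name):
        # Mirrors RemoveNodeFromCluster: drop the node from this cluster.
        if node_name not in self.nodes:
            raise KeyError(f"{node_name} is not in this cluster")
        self.nodes.remove(node_name)

    def add_node_to_cluster(self, node_name):
        # Mirrors AddNodeToCluster: register the node in this cluster.
        self.nodes.add(node_name)


def scow_migrate_node(origin, destination, node_name):
    """Caller-side migration: neither adapter learns about the other one."""
    origin.remove_node_from_cluster(node_name)
    destination.add_node_to_cluster(node_name)


hpc01 = ClusterAdapter(["node1", "node2"])
ai01 = ClusterAdapter(["node3"])
scow_migrate_node(hpc01, ai01, "node1")
print(sorted(hpc01.nodes), sorted(ai01.nodes))
```

The point of the design is visible in the signatures: each adapter method takes only a node name, and the pairing of origin and destination exists only in the caller.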

@@ -135,6 +135,29 @@ message GetClusterNodesInfoResponse {
repeated NodeInfo nodes = 1;
}

message MigrateNodeInfo {
Member

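This message only expresses whether a node can be migrated. Given that SCOW is responsible for implementing the migration, SCOW only needs to know whether a node can be removed from its cluster. That information can be added directly to NodeInfo as a bool removable.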
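
As a rough illustration of that suggestion, the sketch below models a trimmed-down `NodeInfo` carrying a `removable` flag and shows how a caller might filter on it. The class is a hypothetical Python stand-in with only three fields; the real NodeInfo message has more fields, and `removable_nodes` is an invented helper.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    # Hypothetical, reduced version of the NodeInfo message.
    name: str
    partitions: list[str]
    removable: bool  # the proposed flag: can this node leave the cluster?


def removable_nodes(nodes):
    """Names of the nodes the caller may remove from this cluster."""
    return [n.name for n in nodes if n.removable]


hpc01_nodes = [
    NodeInfo("node1", ["compute"], removable=True),
    NodeInfo("node2", ["compute"], removable=False),
]
print(removable_nodes(hpc01_nodes))
```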

Author

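Yes, that is a more elegant implementation. But then SCOW has to work out which nodes can be migrated between which clusters.
For example, suppose SCOW manages four clusters: slurm01, slurm02, ai01, and ai02, with the following nodes:

slurm01  [node01, node02, node03]
slurm02  [node04, node05, node06]
ai01     [node01, node02, node07]
ai02     [node04, node05, node08]

SCOW then has to recognize that node01 and node02 can be migrated between slurm01 and ai01, and that node04 and node05 can be migrated between slurm02 and ai02.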
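
One way SCOW could derive those pairs is to intersect the node lists of every pair of clusters. This is a minimal sketch, assuming that the same node name appearing in two clusters refers to the same physical machine; the function name `migratable_between` is made up for illustration.

```python
from itertools import combinations


def migratable_between(cluster_nodes):
    """Map each pair of clusters to the node names they share."""
    result = {}
    for (a, nodes_a), (b, nodes_b) in combinations(cluster_nodes.items(), 2):
        shared = sorted(set(nodes_a) & set(nodes_b))
        if shared:
            result[(a, b)] = shared
    return result


cluster_nodes = {
    "slurm01": ["node01", "node02", "node03"],
    "slurm02": ["node04", "node05", "node06"],
    "ai01": ["node01", "node02", "node07"],
    "ai02": ["node04", "node05", "node08"],
}
print(migratable_between(cluster_nodes))
# {('slurm01', 'ai01'): ['node01', 'node02'], ('slurm02', 'ai02'): ['node04', 'node05']}
```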
