This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

schedule task by GpuType #1416

Closed
feng257 opened this issue Sep 20, 2018 · 13 comments

@feng257

feng257 commented Sep 20, 2018

Hi,
I submitted a task that specifies GpuType, but PAI does not schedule the task by GpuType.
I went into the nodemanager pod and looked at the user log. Based on the frameworklauncher code, I searched for the keyword "NodeGpuType" and found this log entry:

2018-09-20 01:35:54,697 WARN [pool-1-thread-1] com.microsoft.frameworklauncher.applicationmaster.SelectionManager: Configured Nodes is not found in ClusterConfiguration: Ignore Request NodeGpuType: [TIANXP]

So I ran curl http://ip:9086/v1/LauncherRequest/ClusterConfiguration to check the ClusterConfiguration, and it returned:

{
  "nodes" : null
}

So I want to know: does the GpuType value in the task JSON relate to machine-type in cluster-configuration.yaml?
Thanks.
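
For readers following along, the check above can be sketched in Python. This is only an illustration: the endpoint path is the one quoted in this comment, the host and port are placeholders, and the per-node "gpuType" field is an assumption about the response shape that this thread does not confirm. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_cluster_configuration(base_url):
    """GET the ClusterConfiguration from the launcher web server.

    base_url is a placeholder such as "http://ip:9086"; the path is the
    one quoted in this thread.
    """
    url = base_url.rstrip("/") + "/v1/LauncherRequest/ClusterConfiguration"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def gpu_types_by_node(cluster_config):
    """Return {hostname: gpuType}, or {} when "nodes" is null or missing.

    Assumption: each node entry is an object with a "gpuType" field.
    """
    nodes = cluster_config.get("nodes") or {}
    return {name: (info or {}).get("gpuType") for name, info in nodes.items()}


# Example (requires a reachable launcher, so it is left commented out):
# config = fetch_cluster_configuration("http://ip:9086")
# if not gpu_types_by_node(config):
#     print("No node GPU types configured; GpuType requests will be ignored")
```

An empty result corresponds to the `"nodes" : null` response shown above, which matches the "Configured Nodes is not found in ClusterConfiguration" warning in the log.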

@qinchen123
Contributor

Hi Feng, there is a known bug in GpuType scheduling that currently prevents this feature from working. We are fixing it.

@feng257
Author

feng257 commented Sep 20, 2018

Thanks for your reply!
Have you found what causes this bug?
Can you tell me how scheduling by gputype is designed to work? For example, which value in the ClusterConfiguration does gputype in the JSON correspond to?

@qinchen123
Contributor

The root cause of this bug is that the node GPU type information was not uploaded to the etcd server, so at runtime it couldn't get the GPU type information :(

As a workaround, you can try using the ClusterConfiguration API to put this setting into etcd manually :)

@fanyangCS
Contributor

#783

@feng257
Author

feng257 commented Sep 20, 2018

I'm a little confused. As far as I know, frameworklauncher communicates with ZooKeeper.
And node filtering happens in the AM (Application Master), so the AM should know each node's gputype.
I have 3 questions:

  1. You said gputype is saved in the etcd server. Does that mean the AM gets the gputype from etcd when it filters nodes by gputype?
  2. Does the ClusterConfiguration API mean http://ip:9086/v1/LauncherRequest/ClusterConfiguration?
  3. What is the format of the ClusterConfiguration?
    For example, I have a node config like this:
machine-sku: 
 NC24R:
    mem: 128
    gpu:
      type: teslak80
      count: 4
    cpu:
      vcore: 24
    #dataFolder: "/mnt"
    #Note: Up to now, the only supported os version is Ubuntu16.04. Please do not change it here.
    os: ubuntu16.04

machine-list:
    - hostname: gpu103
      hostip: 192.168.6.103
      machine-type: NC24R
      sshport: 22
      username: root
      password: root
      k8s-role: worker
      pai-worker: "true"

So the ClusterConfiguration should look like this:

nodes:
  gpu103: {gpuType: teslak80}
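
The mapping proposed above, from machine-sku and machine-list to a per-node gpuType, can be sketched as follows. This is a minimal illustration rather than PAI's own code: the function name is hypothetical, the input dictionaries are assumed to be the already-parsed YAML sections shown in this comment, and the output shape is the one guessed here, not a confirmed schema.

```python
def build_nodes_payload(machine_sku, machine_list):
    """Derive {"nodes": {hostname: {"gpuType": ...}}} from the cluster config.

    machine_sku and machine_list are the parsed "machine-sku" and
    "machine-list" sections of cluster-configuration.yaml. Machines whose
    SKU has no GPU type are skipped.
    """
    nodes = {}
    for machine in machine_list:
        sku = machine_sku.get(machine["machine-type"]) or {}
        gpu_type = (sku.get("gpu") or {}).get("type")
        if gpu_type is not None:
            nodes[machine["hostname"]] = {"gpuType": gpu_type}
    return {"nodes": nodes}


# Values taken from the configuration quoted in this comment.
machine_sku = {
    "NC24R": {
        "mem": 128,
        "gpu": {"type": "teslak80", "count": 4},
        "cpu": {"vcore": 24},
        "os": "ubuntu16.04",
    }
}
machine_list = [
    {"hostname": "gpu103", "hostip": "192.168.6.103", "machine-type": "NC24R"}
]

payload = build_nodes_payload(machine_sku, machine_list)
print(payload)
```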

@qinchen123
Contributor

You are right, this GPU type information should be stored in ZooKeeper.

  1. In the current version, the GPU type is stored in ZooKeeper, and the AM downloads the GPU type information before scheduling.
  2. The API is "PUT http://ip:9086/v1/LauncherRequest/ClusterConfiguration".
    Actually, there is a bug around calling "PUT http://ip:9086/v1/LauncherRequest/ClusterConfiguration" that currently keeps this function from working.

@feng257
Author

feng257 commented Sep 20, 2018

So the method you provided doesn't work right now? (=.=)!

@qinchen123
Contributor

The method works; the issue is that the module that is supposed to call it forgets to do so :(. If you use a REST API call to set this file, it should work.

@qinchen123
Contributor

You can find the JSON template for the PUT body at https://github.com/Microsoft/pai/blob/master/src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json.template
or look for the gpu-configuration.json file in your image folder after the build.
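
The manual workaround described in this thread, sending the GPU configuration to the ClusterConfiguration endpoint with a PUT, might look roughly like the sketch below. Treat it as an assumption-laden illustration: the host and port are placeholders, the `Content-Type` header is a guess, the helper name is hypothetical, and the payload should be the contents of your generated gpu-configuration.json rather than the sample used here.

```python
import json
import urllib.request


def build_put_request(base_url, payload):
    """Build (but do not send) a PUT request for the ClusterConfiguration.

    base_url is a placeholder such as "http://ip:9086". The JSON
    Content-Type header is an assumption, not confirmed by this thread.
    """
    url = base_url.rstrip("/") + "/v1/LauncherRequest/ClusterConfiguration"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Sample payload only; in practice, load gpu-configuration.json instead.
sample_payload = {"nodes": {"gpu103": {"gpuType": "teslak80"}}}
request = build_put_request("http://localhost:9086", sample_payload)

# To actually send it (requires a reachable launcher):
# with urllib.request.urlopen(request, timeout=10) as resp:
#     print(resp.status)
```

After a successful PUT, a GET on the same URL should return the node entries instead of `"nodes" : null`.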

@feng257
Author

feng257 commented Sep 20, 2018

OK, I see. The format is that of the gpu-configuration configmap, and I succeeded. Thanks for your help.
One more question:
What is your plan for fixing this bug?

@fanyangCS
Contributor

@DongZhaoYu , please reference your fix in this issue and close it.

@DongZhaoYu
Member

#1562

@DongZhaoYu
Member

#1416
