cntk-test-19844-17854 job failed #1396
requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0

The maximum GPU count is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.
I have the same problem!
I have found two yarn-site.xml files that contain the "yarn.scheduler.maximum-allocation-gpus" configuration entry under dev-box/pai/pai-management/bootstrap. Did you mean one of these? The value in both yarn-site.xml files is "8".
You can open http://yarn_resource_manager_address:8088/conf to get all the live configurations.
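For reference, a minimal sketch of pulling that live configuration dump programmatically and checking the GPU limit; the /conf endpoint and property name are the ones mentioned above, while the resource manager address is a placeholder you need to fill in:

# Sketch: fetch the live YARN configuration dump (XML) and print the
# configured GPU limit. Port 8088 is the default web UI port assumed here.
import urllib.request
import xml.etree.ElementTree as ET

RM_ADDRESS = "http://yarn_resource_manager_address:8088"  # replace with your RM host

with urllib.request.urlopen(RM_ADDRESS + "/conf") as resp:
    root = ET.parse(resp).getroot()

for prop in root.findall("property"):
    if prop.findtext("name") == "yarn.scheduler.maximum-allocation-gpus":
        print("value:", prop.findtext("value"), "| source:", prop.findtext("source"))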
<property>
  <name>yarn.scheduler.maximum-allocation-gpus</name>
  <value>8</value>
  <final>false</final>
  <source>yarn-site.xml</source>
</property>
The value is 8. In fact, I have 3 nodes and each node has 1 GPU.
If the total GPU count is 0, it means the Hadoop node manager has an issue reporting the GPUs.
I downloaded some Hadoop node manager logs from the k8s web portal: https://github.com/leonfg/leonfg.github.io/blob/master/pai/logs-from-hadoop-node-manager.7z
You can put the log on pastebin for easy access.
It seems no GPUs were registered:

18/09/19 07:02:25 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.11.201:8041 with total resource of <memory:51200, vCores:12, GPUs:0, GPUAttribute:0, ports: [100-630],[632-1079],[1081-4193],[4195-5352],[5354-6009],[6011-8117],[8119-10247],[10251-10254],[10257-33721],[33723-33733],[33735-42251],[42253-45687],45689,45691,45693,[45695-45697],45699,[45701-45715],45717,[45719-46541],[46543-50009],[50011-50019],[50021-50074],[50076-58971],[58973-59328],[59330-65535]>
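As a quick check (a sketch, not part of PAI), you can confirm on the node that registered with GPUs:0 whether the driver itself sees any GPU at all; this assumes nvidia-smi is installed and on the PATH of that node:

# Sketch: run on the node that registered with GPUs:0 (e.g. 192.168.11.201)
# to confirm the driver can see the GPU before blaming the node manager.
import subprocess

result = subprocess.run(["nvidia-smi", "--list-gpus"], capture_output=True, text=True)
if result.returncode != 0 or not result.stdout.strip():
    print("No GPUs visible to the driver:", result.stderr.strip())
else:
    print("Driver sees these GPUs:")
    print(result.stdout.strip())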
Thanks! Then how do I register the GPU? Is it related to cluster-configuration.yaml? I modified the yaml file as follows before deploying PAI.
Thanks, the log is enough for us to detect the issue. Will keep you updated soon.
I think I have found the root cause. The nvidia-smi tool's output has a different format from all the ones we tested. @leonfg, you can enable your GPU's ECC to work around this issue. In the meantime, we will enhance the GPU detection code in the Node Manager to cover this scenario.
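If it helps, a hedged sketch of applying that workaround on the node; the assumptions are that nvidia-smi -e 1 is available, the GPU actually supports ECC, and a reboot is needed before the mode change takes effect:

# Sketch: enable ECC on all GPUs of this node (requires root; "-e 1" means ECC enabled).
import subprocess

subprocess.run(["sudo", "nvidia-smi", "-e", "1"], check=True)
print("ECC enabled; reboot the node so the new mode takes effect.")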
Thanks very much, I will test it tomorrow.
If ECC is enabled, this value will be 0 when healthy, 1 on error, and N/A when the GPU has no ECC feature.
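To make the failure mode concrete, here is an illustrative sketch (not the actual Node Manager code) of how a parser that expects that field to be numeric breaks on "N/A", and how a tolerant version would behave:

# Illustration only: how an "N/A" ECC field can break GPU detection code
# that assumes the column is always numeric.
def parse_ecc_field(field):
    # Naive: assumes the ECC column is always a number; raises on "N/A".
    return int(field)

def parse_ecc_field_tolerant(field):
    # Tolerant: treat "N/A" (no ECC feature) the same as 0 (healthy).
    return 0 if field.strip().upper() == "N/A" else int(field)

for sample in ["0", "1", "N/A"]:
    try:
        print(sample, "->", parse_ecc_field(sample))
    except ValueError:
        print(sample, "-> naive parser fails; tolerant parser gives",
              parse_ecc_field_tolerant(sample))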
I enabled ECC and now everything is OK! Thank you!
Hi, has an alternative to ECC been identified for this problem? I have two GPUs (GTX) which do not support ECC configuration.
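For GPUs where ECC support is unclear, a small sketch to query the current ECC mode; the query property names are standard nvidia-smi fields, but treat this as an illustration rather than an official check:

# Sketch: query the current ECC mode per GPU. An "[N/A]" or "[Not Supported]"
# value indicates the card (e.g. most consumer GTX models) cannot enable ECC.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,ecc.mode.current", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print(out.stdout.strip() or out.stderr.strip())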
job:
{
  "jobName": "cntk-test-19844-17854",
  "image": "aiplatform/pai.run.cntk",
  "dataDir": "hdfs://192.168.11.202:9000/Test/cntk/Data",
  "outputDir": "hdfs://192.168.11.202:9000/Test/cntk/cntk-test-19844-17854",
  "codeDir": "hdfs://192.168.11.202:9000/Test/cntk/BrainScript",
  "taskRoles": [
    {
      "name": "g2p_train",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 8196,
      "gpuNumber": 1,
      "portList": [
        {
          "label": "web",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "grpc",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "cd BrainScript && /bin/bash cntk.sh"
    }
  ]
}
log: