This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

cntk-test-19844-17854 job failed #1396

Closed
LMQ12345 opened this issue Sep 18, 2018 · 17 comments

Comments

@LMQ12345

job:
{
  "jobName": "cntk-test-19844-17854",
  "image": "aiplatform/pai.run.cntk",
  "dataDir": "hdfs://192.168.11.202:9000/Test/cntk/Data",
  "outputDir": "hdfs://192.168.11.202:9000/Test/cntk/cntk-test-19844-17854",
  "codeDir": "hdfs://192.168.11.202:9000/Test/cntk/BrainScript",
  "taskRoles": [
    {
      "name": "g2p_train",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 8196,
      "gpuNumber": 1,
      "portList": [
        {
          "label": "web",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "grpc",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "cd BrainScript && /bin/bash cntk.sh"
    }
  ]
}
log:

User: admin
[cntk-test-19844-17854][10][UNKNOWN][172.17.0.2][UNKNOWN]
LAUNCHER
 
0 (Higher Integer value indicates higher priority)
FINISHED
default
FAILED
Tue Sep 18 02:05:41 +0000 2018
3sec
History
SUCCEEDED
Unlimited
[ExitStatus]: AM_INTERNAL_NON_TRANSIENT_ERROR
[ExitCode]: 184
[ExitDiagnostics]: AM internal non-transient error
[ExitType]: NON_TRANSIENT
[ExitCustomizedDiagnostics]: onError called into AM from RM due to non-transient error, maybe application is non-compliant.
Exception: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:284)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:247)
    at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
    at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:214)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy8.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:312)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:270)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:284)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:247)
    at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
    at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:214)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    at org.apache.hadoop.ipc.Client.call(Client.java:1349)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy7.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 12 more
@DongZhaoYu
Member

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0

The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

@leonfg

leonfg commented Sep 18, 2018

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
>
> The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

I have the same problem!
In my environment, all PAI services are running OK, and the nvidia-smi tool reports correct GPU information in each driver container. I can also see the GPU count and GPU memory usage history on the PAI_ClusterView page. So I think all of this means my GPUs are available.
Where can I configure yarn.scheduler.maximum-allocation-gpus?

@LMQ12345
Author

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
>
> The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

I am not familiar with YARN. Where can I configure yarn.scheduler.maximum-allocation-gpus?

@leonfg

leonfg commented Sep 19, 2018

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
>
> The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

I found two yarn-site.xml files that contain the "yarn.scheduler.maximum-allocation-gpus" configuration entry under dev-box/pai/pai-management/bootstrap. Did you mean one of these? The value in both yarn-site.xml files is "8".

@DongZhaoYu
Member

You can open http://yarn_resource_manager_address:8088/conf to get all the live configurations,
and search for yarn.scheduler.maximum-allocation-gpus to see its current value.
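
For example, something like the command below should show the effective value. This is only a sketch: it assumes the ResourceManager web UI is on the default port 8088 at the master node's address, so replace the host with your own ResourceManager address if it differs.

    # Fetch the ResourceManager's live configuration and filter for the GPU limit.
    # The host and port here are assumptions for this cluster; adjust as needed.
    curl -s http://192.168.11.202:8088/conf | grep -A 3 "yarn.scheduler.maximum-allocation-gpus"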

@leonfg

leonfg commented Sep 19, 2018

> You can open http://yarn_resource_manager_address:8088/conf to get all the live configurations,
> and search for yarn.scheduler.maximum-allocation-gpus to see its current value.

<property>
  <name>yarn.scheduler.maximum-allocation-gpus</name>
  <value>8</value>
  <final>false</final>
  <source>yarn-site.xml</source>
</property>

The value is 8. In fact I have 3 nodes and each node has 1 GPU.
In the Grafana dashboard all GPUs are visible, but at http://yarn_resource_manager_address:8088/cluster/nodes, "GPUs Total" is 0.

@qinchen123
Contributor

If GPUs Total is 0, it means the hadoop-node-manager has an issue reporting the GPUs.
Can you grab some hadoop-node-manager logs?
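
If it is easier than the web portal, the node-manager logs can usually be pulled with kubectl. The commands below are only a sketch, since the exact pod names and namespace depend on your OpenPAI deployment:

    # Find the hadoop-node-manager pods (the name pattern may differ in your deployment).
    kubectl get pods --all-namespaces | grep hadoop-node-manager

    # Dump one pod's log to a file so it can be shared (fill in the pod name and namespace).
    kubectl logs <hadoop-node-manager-pod-name> -n <namespace> > hadoop-node-manager.log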

@leonfg

leonfg commented Sep 19, 2018

> If GPUs Total is 0, it means the hadoop-node-manager has an issue reporting the GPUs.
> Can you grab some hadoop-node-manager logs?

I downloaded some hadoop-node-manager logs from the k8s web portal: https://github.com/leonfg/leonfg.github.io/blob/master/pai/logs-from-hadoop-node-manager.7z
If these are not what you want, please give me some instructions on how to get the logs.

@DongZhaoYu
Member

You can put the logs on pastebin for easy access.
I put the above logs here:
https://paste.ubuntu.com/p/Zzg8Yjd5Hw/
https://paste.ubuntu.com/p/Zr22qRYxWT/

@DongZhaoYu
Member

It seems no GPUs were registered;
see line 1699 of https://paste.ubuntu.com/p/Zzg8Yjd5Hw:

18/09/19 07:02:25 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.11.201:8041 with total resource of <memory:51200, vCores:12, GPUs:0, GPUAttribute:0, ports: [100-630],[632-1079],[1081-4193],[4195-5352],[5354-6009],[6011-8117],[8119-10247],[10251-10254],[10257-33721],[33723-33733],[33735-42251],[42253-45687],45689,45691,45693,[45695-45697],45699,[45701-45715],45717,[45719-46541],[46543-50009],[50011-50019],[50021-50074],[50076-58971],[58973-59328],[59330-65535]>
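
To check the same thing on your own nodes, grepping the node-manager log for the registration line is enough. The log path below is only an illustration; use wherever your node manager actually writes its logs:

    # Show what resources (memory, vCores, GPUs) the node registered with the ResourceManager.
    # The log file location is deployment-specific; this path is just an example.
    grep "Registered with ResourceManager" /var/log/hadoop/yarn-nodemanager.log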

@leonfg

leonfg commented Sep 19, 2018

> It seems no GPUs were registered;
> see line 1699 of https://paste.ubuntu.com/p/Zzg8Yjd5Hw:
>
> 18/09/19 07:02:25 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.11.201:8041 with total resource of <memory:51200, vCores:12, GPUs:0, GPUAttribute:0, ports: [100-630],[632-1079],[1081-4193],[4195-5352],[5354-6009],[6011-8117],[8119-10247],[10251-10254],[10257-33721],[33723-33733],[33735-42251],[42253-45687],45689,45691,45693,[45695-45697],45699,[45701-45715],45717,[45719-46541],[46543-50009],[50011-50019],[50021-50074],[50076-58971],[58973-59328],[59330-65535]>

Thanks! Then how do I register the GPUs? Is it related to cluster-configuration.yaml? I modified the yaml file as follows before deploying PAI.

machine-sku:
  GENERIC:
    mem: 1
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 1
    os: ubuntu16.04
  DELL:
    mem: 64
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 6
    os: ubuntu16.04
  LENOVO:
    mem: 256
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 12
    os: ubuntu16.04

machine-list:
  - hostname: openpai-node2
    hostip: 192.168.11.202
    machine-type: LENOVO
    k8s-role: master
    etcdid: etcdid1
    zkid: "1"
    dashboard: "true"
    pai-master: "true"
  - hostname: openpai-node1
    hostip: 192.168.11.201
    machine-type: DELL
    k8s-role: worker
    pai-worker: "true"
  - hostname: openpai-node3
    hostip: 192.168.11.203
    machine-type: LENOVO
    k8s-role: worker
    pai-worker: "true"

@qinchen123
Contributor

Thanks, the log is enough for us to track down the issue. We will keep you updated soon.

@qinchen123
Contributor

I think I have found the root cause. The nvidia-smi output on your cluster has a different format from everything we tested: the GPUs have ECC turned off, which puts an "Off" value in the output, while our GPU detection currently only recognizes 0/1/N/A for the ECC field :(.

@leonfg, you can enable ECC on your GPUs to work around this issue.

In the meantime, we will enhance the GPU detection code in the Node Manager to cover this scenario.
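
For reference, checking and changing the ECC mode is a one-liner with nvidia-smi. The commands below are a sketch: they assume you have root on the node, and the new mode only takes effect after a reboot.

    # Show the current and pending ECC mode for every GPU.
    nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv

    # Enable ECC on all GPUs (1 = enabled, 0 = disabled), then reboot the node
    # so the pending mode becomes the current one.
    sudo nvidia-smi -e 1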

@leonfg

leonfg commented Sep 19, 2018

> I think I have found the root cause. The nvidia-smi output on your cluster has a different format from everything we tested: the GPUs have ECC turned off, which puts an "Off" value in the output, while our GPU detection currently only recognizes 0/1/N/A for the ECC field :(.
>
> @leonfg, you can enable ECC on your GPUs to work around this issue.
>
> In the meantime, we will enhance the GPU detection code in the Node Manager to cover this scenario.

Thanks very much, I will test it tomorrow.
If ECC is enabled, won't the ECC value in the nvidia-smi output be "On"? Since the current GPU detection can only recognize 0/1/N/A, are you sure it will bypass this issue?

@qinchen123
Contributor

If ECC is enabled, this value will be 0 for good, 1 for error, and N/A if there is no ECC feature.

@leonfg

leonfg commented Sep 20, 2018

> If ECC is enabled, this value will be 0 for good, 1 for error, and N/A if there is no ECC feature.

I enabled ECC and now everything is OK! Thank you!

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P5000        On   | 00000000:65:00.0 Off |                    0 |
| 26%   39C    P0    42W / 180W |    500MiB / 15245MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23534      C   cntk                                       489MiB   |
+-----------------------------------------------------------------------------+

@leonobrien

Hi, has an alternative to enabling ECC been identified for this problem? I have two GPUs (GTX) which do not support ECC configuration.
