This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

cntk-test-19844-17854 job failed #1396

Closed
LMQ12345 opened this issue Sep 18, 2018 · 17 comments

Comments

@LMQ12345

job:
{
  "jobName": "cntk-test-19844-17854",
  "image": "aiplatform/pai.run.cntk",
  "dataDir": "hdfs://192.168.11.202:9000/Test/cntk/Data",
  "outputDir": "hdfs://192.168.11.202:9000/Test/cntk/cntk-test-19844-17854",
  "codeDir": "hdfs://192.168.11.202:9000/Test/cntk/BrainScript",
  "taskRoles": [
    {
      "name": "g2p_train",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 8196,
      "gpuNumber": 1,
      "portList": [
        {
          "label": "web",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "grpc",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "cd BrainScript && /bin/bash cntk.sh"
    }
  ]
}
log:

User: admin
[cntk-test-19844-17854][10][UNKNOWN][172.17.0.2][UNKNOWN]
LAUNCHER
 
0 (Higher Integer value indicates higher priority)
FINISHED
default
FAILED
Tue Sep 18 02:05:41 +0000 2018
3sec
History
SUCCEEDED
Unlimited
[ExitStatus]: AM_INTERNAL_NON_TRANSIENT_ERROR
[ExitCode]: 184
[ExitDiagnostics]: AM internal non-transient error
[ExitType]: NON_TRANSIENT
[ExitCustomizedDiagnostics]: onError called into AM from RM due to non-transient error, maybe application is non-compliant.
Exception: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:284)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:247)
    at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
    at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:214)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy8.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:312)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:270)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:284)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:231)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:247)
    at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
    at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:214)
    at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1493)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    at org.apache.hadoop.ipc.Client.call(Client.java:1349)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy7.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 12 more
@DongZhaoYu
Member

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0

The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

@leonfg

leonfg commented Sep 18, 2018

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
>
> The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

I have the same problem!
In my environment, all PAI services are running OK, and the nvidia-smi tool reports correct GPU information in each driver container. I can also see the GPU count and GPU memory usage history on the PAI_ClusterView page. So I think all of this means my GPUs are available.
Where can I configure yarn.scheduler.maximum-allocation-gpus?

@LMQ12345
Author

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
>
> The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

I am not familiar with YARN. Where can I configure yarn.scheduler.maximum-allocation-gpus?

@leonfg

leonfg commented Sep 19, 2018

> requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=1, maxGPUs=0
>
> The maximum number of GPUs is 0 in your cluster. Please make sure the configuration entry yarn.scheduler.maximum-allocation-gpus is set correctly.

I found two yarn-site.xml files that contain the "yarn.scheduler.maximum-allocation-gpus" configuration entry under dev-box/pai/pai-management/bootstrap. Did you mean one of these? The value in both yarn-site.xml files is "8".

@DongZhaoYu
Member

You can open http://yarn_resource_manager_address:8088/conf to get all the live configurations,
and search for yarn.scheduler.maximum-allocation-gpus to see its current value.
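
For example, something like the command below should show the effective value. This is only a sketch: it assumes the ResourceManager web UI is on the default port 8088 at the master node's address, so replace the host with your own ResourceManager address if it differs.

    # Fetch the ResourceManager's live configuration and filter for the GPU limit.
    # The host and port here are assumptions for this cluster; adjust as needed.
    curl -s http://192.168.11.202:8088/conf | grep -A 3 "yarn.scheduler.maximum-allocation-gpus"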

@leonfg

leonfg commented Sep 19, 2018

> You can open http://yarn_resource_manager_address:8088/conf to get all the live configurations,
> and search for yarn.scheduler.maximum-allocation-gpus to see its current value.

<property>
  <name>yarn.scheduler.maximum-allocation-gpus</name>
  <value>8</value>
  <final>false</final>
  <source>yarn-site.xml</source>
</property>

The value is 8. In fact I have 3 nodes and each node has 1 GPU.
In the Grafana dashboard all GPUs are visible, but at http://yarn_resource_manager_address:8088/cluster/nodes, "GPUs Total" is 0.

@qinchen123
Contributor

If GPUs Total is 0, it means the hadoop-node-manager has an issue reporting the GPUs.
Can you grab some hadoop-node-manager logs?
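
If it is easier than the web portal, the node-manager logs can usually be pulled with kubectl. The commands below are only a sketch, since the exact pod names and namespace depend on your OpenPAI deployment:

    # Find the hadoop-node-manager pods (the name pattern may differ in your deployment).
    kubectl get pods --all-namespaces | grep hadoop-node-manager

    # Dump one pod's log to a file so it can be shared (fill in the pod name and namespace).
    kubectl logs <hadoop-node-manager-pod-name> -n <namespace> > hadoop-node-manager.log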

@leonfg

leonfg commented Sep 19, 2018

> If GPUs Total is 0, it means the hadoop-node-manager has an issue reporting the GPUs.
> Can you grab some hadoop-node-manager logs?

I downloaded some hadoop-node-manager logs from the k8s web portal: https://github.com/leonfg/leonfg.github.io/blob/master/pai/logs-from-hadoop-node-manager.7z
If these are not what you want, please give me some instructions on how to get the logs.

@DongZhaoYu
Member

You can put the logs on pastebin for easy access.
I put the above logs here:
https://paste.ubuntu.com/p/Zzg8Yjd5Hw/
https://paste.ubuntu.com/p/Zr22qRYxWT/

@DongZhaoYu
Member

It seems no GPUs were registered;
see line 1699 of https://paste.ubuntu.com/p/Zzg8Yjd5Hw:

18/09/19 07:02:25 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.11.201:8041 with total resource of <memory:51200, vCores:12, GPUs:0, GPUAttribute:0, ports: [100-630],[632-1079],[1081-4193],[4195-5352],[5354-6009],[6011-8117],[8119-10247],[10251-10254],[10257-33721],[33723-33733],[33735-42251],[42253-45687],45689,45691,45693,[45695-45697],45699,[45701-45715],45717,[45719-46541],[46543-50009],[50011-50019],[50021-50074],[50076-58971],[58973-59328],[59330-65535]>
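
To check the same thing on your own nodes, grepping the node-manager log for the registration line is enough. The log path below is only an illustration; use wherever your node manager actually writes its logs:

    # Show what resources (memory, vCores, GPUs) the node registered with the ResourceManager.
    # The log file location is deployment-specific; this path is just an example.
    grep "Registered with ResourceManager" /var/log/hadoop/yarn-nodemanager.log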

@leonfg

leonfg commented Sep 19, 2018

> It seems no GPUs were registered;
> see line 1699 of https://paste.ubuntu.com/p/Zzg8Yjd5Hw:
>
> 18/09/19 07:02:25 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.11.201:8041 with total resource of <memory:51200, vCores:12, GPUs:0, GPUAttribute:0, ports: [100-630],[632-1079],[1081-4193],[4195-5352],[5354-6009],[6011-8117],[8119-10247],[10251-10254],[10257-33721],[33723-33733],[33735-42251],[42253-45687],45689,45691,45693,[45695-45697],45699,[45701-45715],45717,[45719-46541],[46543-50009],[50011-50019],[50021-50074],[50076-58971],[58973-59328],[59330-65535]>

Thanks! Then how do I register the GPUs? Is it related to cluster-configuration.yaml? I modified the yaml file as follows before deploying PAI.

machine-sku:
  GENERIC:
    mem: 1
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 1
    os: ubuntu16.04
  DELL:
    mem: 64
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 6
    os: ubuntu16.04
  LENOVO:
    mem: 256
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 12
    os: ubuntu16.04

machine-list:
  - hostname: openpai-node2
    hostip: 192.168.11.202
    machine-type: LENOVO
    k8s-role: master
    etcdid: etcdid1
    zkid: "1"
    dashboard: "true"
    pai-master: "true"
  - hostname: openpai-node1
    hostip: 192.168.11.201
    machine-type: DELL
    k8s-role: worker
    pai-worker: "true"
  - hostname: openpai-node3
    hostip: 192.168.11.203
    machine-type: LENOVO
    k8s-role: worker
    pai-worker: "true"

@qinchen123
Contributor

Thanks, the log is enough for us to track down the issue. We will keep you updated soon.

@qinchen123
Contributor

I think I have found the root cause. The nvidia-smi output on your cluster has a different format from everything we tested: the GPUs have ECC turned off, which puts an "Off" value in the output, while our GPU detection currently only recognizes 0/1/N/A for the ECC field :(.

@leonfg, you can enable ECC on your GPUs to work around this issue.

In the meantime, we will enhance the GPU detection code in the Node Manager to cover this scenario.
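
For reference, checking and changing the ECC mode is a one-liner with nvidia-smi. The commands below are a sketch: they assume you have root on the node, and the new mode only takes effect after a reboot.

    # Show the current and pending ECC mode for every GPU.
    nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv

    # Enable ECC on all GPUs (1 = enabled, 0 = disabled), then reboot the node
    # so the pending mode becomes the current one.
    sudo nvidia-smi -e 1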

@leonfg

leonfg commented Sep 19, 2018

> I think I have found the root cause. The nvidia-smi output on your cluster has a different format from everything we tested: the GPUs have ECC turned off, which puts an "Off" value in the output, while our GPU detection currently only recognizes 0/1/N/A for the ECC field :(.
>
> @leonfg, you can enable ECC on your GPUs to work around this issue.
>
> In the meantime, we will enhance the GPU detection code in the Node Manager to cover this scenario.

Thanks very much, I will test it tomorrow.
If ECC is enabled, won't the ECC value in the nvidia-smi output be "On"? Since the current GPU detection can only recognize 0/1/N/A, are you sure it will bypass this issue?

@qinchen123
Contributor

If ECC is enabled, this value will be 0 for good, 1 for error, and N/A if there is no ECC feature.

@leonfg

leonfg commented Sep 20, 2018

> If ECC is enabled, this value will be 0 for good, 1 for error, and N/A if there is no ECC feature.

I enabled ECC and now everything is OK! Thank you!

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P5000        On   | 00000000:65:00.0 Off |                    0 |
| 26%   39C    P0    42W / 180W |    500MiB / 15245MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23534      C   cntk                                       489MiB   |
+-----------------------------------------------------------------------------+

@leonobrien

Hi, has an alternative to enabling ECC been identified for this problem? I have two GPUs (GTX) which do not support ECC configuration.
