This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Job keep waiting, can't submit to YARN #1274

Closed
hao1939 opened this issue Sep 5, 2018 · 6 comments
Assignees
Labels
deployment PAI deployment related known issue

Comments

@hao1939
Contributor

hao1939 commented Sep 5, 2018

A lot of jobs are stuck in 'waiting':
image

Only the first one was submitted to YARN:
image

@mzmssg mzmssg self-assigned this Sep 5, 2018
@yqwang-ms
Member

yqwang-ms commented Sep 5, 2018

The first job submission cannot finish because of RM issues, so it blocks the submissions that follow.
However, even if we tried to submit the other jobs, they still could not be submitted in time, so the waiting state is expected.
It is a platform issue (the RM is abnormal). The DRI should handle this, and we should have a watchdog on the RM or the Launcher for this case.

The YarnClient is looping on this:
2018-09-04 13:08:03,770 INFO [pool-1-thread-1] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Application submission is not finished, submitted application application_1536066266746_0001 is still in NEW_SAVING

@hao1939
Contributor Author

hao1939 commented Sep 5, 2018

Hi @yqwang-ms ,

Could we raise a timeout error after the submission has been waiting for some period of time? (I'm not against figuring out and fixing the root cause.)

My concern is this: no matter what happens to YARN or ZK or anything else, should we keep the user waiting forever? Especially when there is no way to succeed.
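
A minimal sketch of what such a caller-side timeout could look like, using only the Java standard library. The class name `SubmitWithTimeout`, the helper `callWithTimeout`, and the sleeping stand-in for a stuck submission are all hypothetical. This is not the Launcher's actual code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds how long a blocking submission may take. */
class SubmitWithTimeout {

    /**
     * Runs the blocking call on a worker thread and waits at most timeoutMs.
     * Throws TimeoutException instead of letting the caller wait forever.
     */
    static <T> T callWithTimeout(Callable<T> blockingCall, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> result = worker.submit(blockingCall);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck call
            throw e;
        } finally {
            worker.shutdownNow(); // never leave the worker thread behind
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in (assumption) for a submission that never leaves NEW_SAVING.
        Callable<String> stuckSubmission = () -> {
            Thread.sleep(60_000);
            return "submitted";
        };
        try {
            callWithTimeout(stuckSubmission, 200);
        } catch (TimeoutException e) {
            System.out.println("submission timed out; report an error to the user");
        }
    }
}
```

This only stops the caller from waiting. It relies on the blocked call reacting to thread interruption, and it does nothing about the root cause on the RM side.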

@yqwang-ms
Member

It is YARN that keeps waiting forever, so we should first make YARN time out rather than build a workaround on top of it.
It is a livesite issue. The launcher may help, but the DRI / watchdog also needs to notify the customer.

@yqwang-ms
Member

YARN has a configuration to set the timeout; you can leverage it:
/**
 * The duration that the yarn client library waits, cumulatively across polls,
 * for an expected state change to occur. Defaults to -1, which indicates no
 * limit.
 */
public static final String YARN_CLIENT_APPLICATION_CLIENT_PROTOCOL_POLL_TIMEOUT_MS =
    YARN_PREFIX + "client.application-client-protocol.poll-timeout-ms";
public static final long DEFAULT_YARN_CLIENT_APPLICATION_CLIENT_PROTOCOL_POLL_TIMEOUT_MS =
    -1;
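
To illustrate the semantics that the comment describes, here is a rough standalone model of a poll loop with a cumulative time budget, where a negative value means no limit. The class `PollTimeoutSketch` and the method `pollUntil` are hypothetical names for illustration. This is not the YarnClientImpl code, and how the real client reports an exhausted budget may differ.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical model of a poll loop with a cumulative timeout; not Hadoop code. */
class PollTimeoutSketch {

    /**
     * Polls until stateReached returns true.
     * timeoutMs < 0 means no limit, mirroring the -1 default quoted above.
     * Returns true if the state was reached, false if the time budget ran out.
     */
    static boolean pollUntil(BooleanSupplier stateReached, long pollIntervalMs, long timeoutMs)
            throws InterruptedException {
        long start = System.nanoTime();
        while (!stateReached.getAsBoolean()) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (timeoutMs >= 0 && elapsedMs >= timeoutMs) {
                return false; // budget is counted across all polls, not per poll
            }
            Thread.sleep(pollIntervalMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // An application that never leaves NEW_SAVING, polled with a 300 ms budget.
        boolean reached = pollUntil(() -> false, 50, 300);
        System.out.println("state reached: " + reached);
    }
}
```

With the default of -1 and a state that never changes, this loop would spin forever, which matches the hang reported in this issue. Setting the quoted property to a positive number of milliseconds in the client's YARN configuration should bound the wait.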

@hao1939 hao1939 added the deployment PAI deployment related label Sep 5, 2018
mzmssg added a commit that referenced this issue Sep 6, 2018
15mins timeout for waiting yarn, prevent hang in jenkins test #1274
@mzmssg
Member

mzmssg commented Sep 7, 2018

Restarting the RM fixes it, but there is an exception in the ZooKeeper log, which needs further investigation.

@fanyangCS
Contributor

RM issue.
