
Dr-Elephant not fetching RUNNING spark application (only succeeded and failed applications are fetched) #696

nelhaj opened this issue Aug 28, 2020 · 4 comments

nelhaj commented Aug 28, 2020

Hi,

Dr-Elephant only fetches completed applications (filtered by SUCCEEDED or FAILED status).
Our Spark streaming applications run non-stop (except for weekly restarts).
We want to be able to analyze them and generate real-time heuristics.

Why does Dr-Elephant exclude running applications?
Is there a way to include them when fetching the jobs list?

More details:

  • We are using SparkFetcher.

  • Dr. Elephant gets the list of only succeeded and failed applications from the YARN ResourceManager API:

com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is https://{YARN_RM_HOST}/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1598632623112&finishedTimeEnd=1598632683376
com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is https://{YARN_RM_HOST}/ws/v1/cluster/apps?finalStatus=FAILED&state=FINISHED&finishedTimeBegin=1598632623112&finishedTimeEnd=1598632683376
  • The data are then fetched from the Spark History Server API:
com.linkedin.drelephant.spark.fetchers.SparkRestClient : calling REST API at http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx
08-28-2020 18:56:09 INFO  [ForkJoinPool-1-worker-3] com.linkedin.drelephant.spark.fetchers.SparkRestClient : creating SparkApplication by calling REST API at http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx/1/logs to get eventlogs

Running Spark applications are visible in both YARN and the Spark History Server. I can retrieve event logs by accessing http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx/1/logs
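
To illustrate, here is a minimal sketch of the two REST calls above, assuming a Java 11+ HttpClient with placeholder hosts and a placeholder application id. The states=RUNNING filter on /ws/v1/cluster/apps is part of the standard YARN ResourceManager REST API, and the /logs endpoint is the one quoted in the log lines:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RunningAppsProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoints; substitute your own cluster hosts.
        String rmHost = "http://YARN_RM_HOST:8088";
        String shsHost = "http://SHS_HOST:18080";
        HttpClient client = HttpClient.newHttpClient();

        // The RM endpoint Dr-Elephant already queries for SUCCEEDED/FAILED
        // apps also accepts a states filter, so RUNNING apps can be listed
        // with the same API.
        HttpRequest listRunning = HttpRequest.newBuilder()
                .uri(URI.create(rmHost
                        + "/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK"))
                .build();
        String apps = client.send(listRunning, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(apps);

        // For a given running application, the event logs are already
        // downloadable from the Spark History Server, as noted above.
        String appId = "application_xxxxxxxxxx_xxxxxx"; // placeholder id
        HttpRequest logs = HttpRequest.newBuilder()
                .uri(URI.create(shsHost + "/api/v1/applications/" + appId + "/1/logs"))
                .build();
        byte[] eventLogZip = client.send(logs, HttpResponse.BodyHandlers.ofByteArray()).body();
        System.out.println("fetched " + eventLogZip.length + " bytes of event logs");
    }
}
```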

Thank you

nelhaj commented Sep 25, 2020

Hi @ShubhamGupta29: Could you help us with this subject, please?
PS: We have made good progress implementing this feature, and it seems to work fine. We can see Spark Streaming heuristics.
We are using the Spark FSFetcher.
We will keep you posted on our progress.

I would like to know why Dr-Elephant does not support fetching RUNNING applications natively. Is there a reason for this choice (performance, technical constraints, ...)?

Thx

ShubhamGupta29 (Contributor) commented

Initially, Dr. Elephant was designed to profile a Hadoop job after it finishes, and this idea carried over to the Spark heuristics too. But with the increased demand for Spark streaming, we do understand the importance of a tool to track your jobs' performance.

The reason for not supporting Spark Streaming applications is the size of their logs. Currently, SHS doesn't provide any incremental parsing of logs, so if Dr. Elephant analyzes a RUNNING application at some short interval, it has to parse the whole log every time. With streaming jobs this becomes critical, as their log size keeps growing; it would hog Dr. Elephant's resources and delay report generation. With batch jobs the need for real-time profiling is not felt as strongly, so there are real challenges to supporting RUNNING apps in Dr. Elephant.
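
To make the missing piece concrete, here is a rough sketch of what incremental event-log parsing could look like. It is purely hypothetical (as said above, neither SHS nor Dr. Elephant does this today): persist the byte offset reached on the previous analysis run and resume from there, so each cycle parses only the new events.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of incremental event-log parsing. In a real
// integration the offset would be persisted between analysis runs, and a
// partially written last line would need handling; both are glossed over.
public class IncrementalEventLogReader {
    private long offset = 0; // byte position reached on the previous run

    public void readNewEvents(Path eventLog) throws IOException {
        try (FileChannel ch = FileChannel.open(eventLog, StandardOpenOption.READ)) {
            ch.position(offset); // skip everything parsed on earlier runs
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(ch)));
            String line;
            while ((line = reader.readLine()) != null) {
                handleEvent(line); // uncompressed Spark event logs hold one JSON event per line
            }
            offset = ch.size(); // next run starts where this one stopped
        }
    }

    private void handleEvent(String jsonLine) {
        // Placeholder: feed the event into whatever heuristic update applies.
        System.out.println("new event: "
                + jsonLine.substring(0, Math.min(80, jsonLine.length())));
    }
}
```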

I would be glad to know how you are approaching these challenges and would try to provide any needed assistance from my end.

nelhaj commented Nov 10, 2020

Hi @ShubhamGupta29,
Thank you for the clarification, and sorry for the late reply.
We are indeed facing the same performance issues when analyzing Spark streaming apps.

We are trying to deal with these problems in the following ways:

  • Increase the analysis fetch interval for streaming applications (example: spark.streaming.analysis.fetch.interval = 10 * analysis.fetch.interval; requires custom development; see the sketch after this list).
  • Use FSFetcher instead of SparkFetcher. FSFetcher is much more stable, which solves the timeout and memory-overhead issues on the SHS.
  • Limit the event log file size (using the event_log_size_limit_in_mb param). A representative dataset covering a few hours or days should be sufficient for relevant heuristics.
  • Read and parse event log files on disk instead of in memory (using LevelDB, for example; requires custom development). This should reduce RAM usage but will increase analysis time.
  • Depending on the complexity, use separate queues for batch and streaming applications, so the analysis of batch applications is not delayed (requires custom development).
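
As a concrete illustration of the first bullet, here is a minimal sketch of the interval-gating idea. Everything in it is hypothetical custom development: Dr-Elephant has no spark.streaming.analysis.fetch.interval setting out of the box, and isStreamingApp() is an assumed classifier (e.g. by application name or a YARN tag).

```java
// Hypothetical custom development: gate streaming apps to a longer
// re-analysis interval than batch apps, as described in the first bullet.
public class AnalysisScheduler {
    private final long batchIntervalMs;     // analysis.fetch.interval
    private final long streamingIntervalMs; // e.g. 10x the batch interval

    public AnalysisScheduler(long batchIntervalMs) {
        this.batchIntervalMs = batchIntervalMs;
        this.streamingIntervalMs = 10 * batchIntervalMs;
    }

    /** Decide whether an app is due for (re-)analysis on this fetch cycle. */
    public boolean isDue(AppInfo app, long nowMs) {
        long interval = isStreamingApp(app) ? streamingIntervalMs : batchIntervalMs;
        return nowMs - app.lastAnalyzedMs >= interval;
    }

    // Assumed classifier: identify streaming jobs by naming convention.
    private boolean isStreamingApp(AppInfo app) {
        return app.name.toLowerCase().contains("streaming");
    }

    public static class AppInfo {
        String name;
        long lastAnalyzedMs;
    }
}
```

The log-size cap in the third bullet, by contrast, relies on the existing event_log_size_limit_in_mb fetcher parameter, so it needs configuration only, not code changes.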

Javid-Shaik commented Jul 10, 2024

Hi @nelhaj
We also want to implement Spark streaming job analysis, so could you please share how you achieved this?

Could you share how you modified Dr. Elephant to fetch and analyze running applications?

Any additional tips or considerations for implementing this feature would also help.

Your insights would be greatly appreciated.

Thank you
