[K8S] Executors restart issue when running the HiBench KMeans workload on an AWS EKS environment with the HiBench small-scale dataset #109
@xwu99 @carsonwang, please take a look when you are available. Thanks.
Which version are you testing? The master branch has some GPU library dependencies that I am working to fix in #111.
@zhixingheyi-tian Please try the latest master branch to see if everything is OK. Also @haojinIntel.
Our team's Cloud integration testing (covering EMR, Dataproc, EKS, etc.) currently targets branch-1.2; Cloud integration is also a goal of the 1.2 release.
"EKS" is Amazon Elastic Kubernetes Service. We use the K8S scheduler to launch the Spark executors. The "KMeansDAL" log shows that the oap-mllib jars are enabled on the classpath. The driver log is shown above, and the executor log before it was killed is as below:
I suspect there may be some limitations on Spark configurations, for example "executor.memory".
@zhixingheyi-tian From the log, it's not related to memory. The oneCCL ranks are not able to connect to rank 0, which is the executor listening on 192.168.82.1 port 3000. There is some network issue; maybe the port is blocked or the IP is not reachable.
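A quick way to test that hypothesis from inside the cluster (the pod and namespace names are placeholders, and `nc` being present in the executor image is an assumption):

```bash
# Hedged sketch: from an executor pod, probe the rank-0 endpoint that
# oneCCL is failing to reach (IP/port taken from the log above).
kubectl exec -n <namespace> <executor-pod> -- nc -zv 192.168.82.1 3000
```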
By the way, it's encouraged to use build.sh instead of calling mvn directly; some additional environment variables need to be set before invoking mvn.
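For context, a minimal sketch of that flow (the setvars.sh path assumes a default oneAPI install and the module directory is an assumption; check build.sh itself for the authoritative environment setup):

```bash
# Hedged sketch of the recommended build flow; paths are assumptions.
source /opt/intel/oneapi/setvars.sh   # assumed oneAPI install location
cd mllib-dal                          # assumed module directory
./build.sh                            # wraps mvn with the required env vars
```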
@xwu99
After adding the "spark.shuffle.reduceLocality.enabled=false" conf item: in the two-executor case, we encounter a hang issue on both the driver and executor sides.
driver hang log
executor hang log
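For reference, a sketch of how that conf was added (the rest of the submit command is omitted):

```bash
# Sketch: disabling reduce-task locality, per the workaround above;
# all other confs are unchanged.
spark-submit \
  --conf spark.shuffle.reduceLocality.enabled=false \
  ...
```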
In the one-executor case, we encounter an executor restart issue without any error log. It restarted only once, and the job completed with the second executor.
Driver log
Executor exited without error.
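When an executor exits without an error in its own log, the pod-level view may have more detail; a hedged diagnostic sketch (pod and namespace names are placeholders):

```bash
# Shows the container exit code/reason (e.g. OOMKilled, Error) and the
# log of the previous, restarted container instance.
kubectl describe pod <executor-pod> -n <namespace>
kubectl logs <executor-pod> -n <namespace> --previous
```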
Added another test making the domain name shorter with "spark.kubernetes.executor.podNamePrefix=oapmllib".
The executor still restarted.
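For reference, a sketch of this test's conf and a way to check the resulting pod names (the namespace is a placeholder):

```bash
# Sketch: shorter executor pod name prefix, per the test above.
spark-submit \
  --conf spark.kubernetes.executor.podNamePrefix=oapmllib \
  ...

# Spark on K8S labels executor pods with spark-role=executor.
kubectl get pods -n <namespace> -l spark-role=executor
```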
We are integrating OAP MLlib into the Cloud. We encountered an executor restart issue when running the HiBench KMeans workload on an AWS EKS environment with the HiBench small-scale dataset, although the result was still output.
Spark confs and command:
data scale conf
logs
But when vanilla Spark ran this workload with the same confs, everything was OK.
Vanilla spark command:
Vanilla spark running log
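The actual commands and confs were attached as collapsed logs above. For illustration only, a hypothetical sketch of how the two runs usually differ; the class name, jar name, and endpoint are assumptions, not the commands used here:

```bash
# Vanilla Spark run (sketch): a plain spark-submit against the EKS API server.
spark-submit \
  --master k8s://https://<eks-api-endpoint> \
  --deploy-mode cluster \
  --class com.intel.hibench.sparkbench.ml.DenseKMeans \
  ...

# OAP MLlib run (sketch): the same submit, plus the OAP MLlib jar on the
# driver and executor classpaths (jar name is an assumption):
#   --jars oap-mllib-x.y.z.jar \
#   --conf spark.driver.extraClassPath=./oap-mllib-x.y.z.jar \
#   --conf spark.executor.extraClassPath=./oap-mllib-x.y.z.jar
```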