This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Pure K8S Alpha Release Plan #3382

Closed
36 of 42 tasks
abuccts opened this issue Aug 15, 2019 · 10 comments

Comments

@abuccts
Member

abuccts commented Aug 15, 2019

Alpha release for k8s-based PAI

Code complete date: Nov. 12

Plan Items

Deferred

Finished

Backlogs

  • k8s version Jenkins
  • choose build images
  • drivers installation
  • 🕒REST API for VC (HiveD @abuccts and default VC @Binyang2014 )
  • 🕒Implement v1 api for jobs
@mzmssg
Member

mzmssg commented Aug 15, 2019

Details about AAD are in #3378.
Besides, I think we need to add the deployment of the controller and scheduler.

@abuccts pinned this issue Aug 15, 2019
@abuccts self-assigned this Aug 19, 2019
@mzmssg
Member

mzmssg commented Aug 22, 2019

We need to replace the metrics that come from Hadoop, for example:

  1. dashboard
  2. hardware

@mzmssg
Member

mzmssg commented Aug 22, 2019

Currently, the default taskRole name contains capital letters and underscores, which are illegal in k8s names.
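
To illustrate the naming constraint, here is a minimal sketch. It assumes the taskRole name ends up in a Kubernetes object name that must be a DNS-1123 label (lowercase alphanumerics or '-', starting and ending with an alphanumeric, at most 63 characters). The function names and the sample name `Task_role` are hypothetical and are not taken from the PAI code base.

```python
import re

# DNS-1123 label rule used for many Kubernetes object names:
# lowercase alphanumerics or '-', must start and end with an alphanumeric.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def is_valid_k8s_label(name: str) -> bool:
    """Return True if `name` is a valid DNS-1123 label."""
    return len(name) <= MAX_LABEL_LENGTH and DNS1123_LABEL.fullmatch(name) is not None


def to_k8s_label(name: str) -> str:
    """Hypothetical helper: map an arbitrary name to a DNS-1123 label.

    Lowercases the name, replaces every disallowed character with '-',
    truncates to 63 characters, and strips leading/trailing '-'.
    """
    replaced = re.sub(r"[^a-z0-9-]", "-", name.lower())
    trimmed = replaced[:MAX_LABEL_LENGTH].strip("-")
    if not trimmed:
        raise ValueError(f"cannot derive a valid label from {name!r}")
    return trimmed


if __name__ == "__main__":
    sample = "Task_role"  # hypothetical default name with a capital and an underscore
    print(is_valid_k8s_label(sample))
    print(to_k8s_label(sample))
```

Note that such a mapping is lossy: two different names (for example `a_b` and `a-b`) map to the same label, so changing the default name itself may be the simpler fix.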

@wangdian
Member

The job keeps retrying and cannot be stopped.

@fanyangCS
Contributor

@yqwang-ms can you take a look at the "job keeps retrying" issue?

@abuccts
Member Author

abuccts commented Aug 28, 2019

The job keeps retrying and cannot be stopped.

@yqwang-ms can you take a look at the "job keeps retrying" issue?

This is caused by someone else's test and is not related to the code; the whole bed is down.

@yqwang-ms
Member

All nodes became unknown, so the controller and scheduler in the statefulset are down.
I have manually restarted them, but all worker nodes are still down.
We need to investigate why all worker nodes in the 15 and 16 beds are down.

@yqwang-ms
Member

It seems kubelet has been removed from all worker nodes (probably because someone cleaned the k8s cluster).

There is not even any kubelet history on 10.151.41.21 (15 bed worker):

core@paigcr-a-gpu-1115:~$ sudo docker ps -a
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS                      PORTS               NAMES
71a06c62c9d5        hello-world                          "/hello"                 3 months ago        Exited (0) 3 months ago                         quizzical_kare
9882be1aa427        hello-world                          "/hello"                 3 months ago        Exited (0) 3 months ago                         nervous_einstein
b14e8b513c45        openpai.azurecr.io/dls/load_client   "python /load_client…"   6 months ago        Exited (137) 3 months ago                       ecstatic_leavitt
6fd11bd7983e        openpai.azurecr.io/dls/load_client   "python /load_client…"   6 months ago        Exited (137) 3 months ago                       zen_kapitsa
a54f9aa4fa7f        openpai.azurecr.io/dls/load_client   "python /load_client…"   6 months ago        Exited (137) 3 months ago                       agitated_khorana
c2ce59b04a8f        openpai.azurecr.io/dls/load_client   "python /load_client…"   6 months ago        Exited (137) 3 months ago                       jovial_boyd
adfca1e60e7e        openpai.azurecr.io/dls/load_client   "python /load_client…"   6 months ago        Exited (137) 3 months ago                       nostalgic_lovelace
d94900c2a098        openpai.azurecr.io/dls/load_client   "python /load_client…"   11 months ago       Exited (137) 3 months ago                       condescending_swirles
d38a16902aef        openpai.azurecr.io/dls/load_client   "python /load_client…"   11 months ago       Exited (137) 3 months ago                       fervent_dijkstra
0e1991722258        827fd3415f7f                         "/bin/bash"              11 months ago       Exited (0) 11 months ago                        wonderful_heisenberg
0bd2ec5f51dc        827fd3415f7f                         "/bin/bash"              11 months ago       Exited (0) 11 months ago                        dreamy_curran
cc2496d1fdad        827fd3415f7f                         "/bin/bash"              11 months ago       Exited (0) 11 months ago                        optimistic_euler
c5a9ebfc72a5        openpai.azurecr.io/dls/load_client   "python /load_client…"   11 months ago       Exited (137) 3 months ago                       admiring_davinci

@yqwang-ms
Member

Synced with Hanyu; it seems this was caused by his paictl operations.
Let's redeploy 15 and 16.

@yqwang-ms
Member

15 is recovered.

7 participants