User job write too much logs will cause disk pressure #4694
Comments
@Binyang2014, please first make sure the OpenPAI service pods are of a higher QoS class than job pods. In some cases the service pods get evicted.
We may need to mark these pods as critical to achieve this: https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
Checked the QoS class of job-exporter and node-exporter. In this case the resource under pressure is disk, and no pod declares a request for disk, so the eviction order ranks by pod priority, then by resource usage.
Since user jobs usually consume more disk, they are evicted first. But we use hostPath for the log folder, so evicting a user job does not free the space, and the kubelet goes on to evict PAI service pods. To leverage the k8s eviction policy against disk pressure, we'd better not store job logs on each host.
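For reference, a minimal Python sketch of how Kubernetes derives a pod's QoS class. It only looks at CPU and memory, which is why disk usage does not affect the class. The function name and the dict layout are made up for illustration, and quantities are compared as plain strings (so "1" and "1000m" would wrongly count as different):

```python
from typing import Dict, List

# Hypothetical layout: {"requests": {"cpu": "1"}, "limits": {"cpu": "1"}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(name, limit)
            if request is not None or limit is not None:
                any_set = True
            # Guaranteed needs a limit on every resource, equal to its request.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "1"}}]))
```

A pod that sets no CPU or memory requests or limits at all comes out as BestEffort, the lowest class.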
Also, our log-manager is misconfigured: it rotates logs by time rather than by size. After reconfiguring the log-manager and fixing some bugs, this issue can be mitigated.
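To illustrate why size-based rotation bounds disk usage where time-based rotation does not, here is a small Python sketch using the standard library's `RotatingFileHandler`. This is not the log-manager's actual configuration; the file name, `MAX_BYTES`, and `BACKUP_COUNT` are assumptions chosen for the demo:

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

MAX_BYTES = 1024   # hypothetical per-file size cap
BACKUP_COUNT = 2   # hypothetical number of rotated files to keep


def write_with_size_cap(log_dir: str, lines: int) -> int:
    """Write `lines` records through a size-rotated handler; return bytes on disk."""
    handler = RotatingFileHandler(
        os.path.join(log_dir, "job.log"),
        maxBytes=MAX_BYTES,
        backupCount=BACKUP_COUNT,
    )
    logger = logging.getLogger("size_cap_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        for i in range(lines):
            logger.info("noisy job output line %d", i)
    finally:
        logger.removeHandler(handler)
        handler.close()
    # Sum the active log file and every rotated backup.
    return sum(
        os.path.getsize(os.path.join(log_dir, name)) for name in os.listdir(log_dir)
    )


with tempfile.TemporaryDirectory() as tmp:
    used = write_with_size_cap(tmp, 1000)
    print(used, "bytes used; cap is", MAX_BYTES * (BACKUP_COUNT + 1))
```

However much the job writes, the space used stays within `MAX_BYTES * (BACKUP_COUNT + 1)`, whereas a time-based policy lets a single period's log grow without limit.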
@Binyang2014 Moreover, it seems we need the following:
Anything more?
Since the QoS class is assigned by k8s according to the pod's resource configuration, we can do the following:
Closed; this issue is already fixed.
PAI keeps user job logs under /var/log/pai.
If a user job writes too many logs, it causes disk pressure on the machine.
We need to: