Background

Hi! I'm trying to launch an elastic PyTorchJob on training-operator with horizontal pod autoscaling (HPA), but I cannot get it to work.

Here is the manifest of the job I am trying to launch; the image contains the ImageNet example training code from the examples in this repo.
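Abridged to the relevant fields, it looks like this; as far as I can tell, the elasticPolicy block is what makes training-operator create the HPA (the image name below is a placeholder):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet-c10d-tcp
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 3
    maxRestarts: 100
    metrics:
      # resource metrics for the HPA that training-operator creates
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <imagenet-example-image>   # placeholder for the image built from this repo's example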
Cannot Find PyTorchJob Kind

The first problem I hit is that the HPA cannot find the PyTorchJob kind:
yzhang2343@C02WC268HTDD ds-repos % kubectl get horizontalpodautoscaler/elastic-example-imagenet-c10d-tcp -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"False","lastTransitionTime":"2022-08-01T16:16:44Z","reason":"FailedGetScale","message":"the HPA controller was unable to get the target''s current scale: no matches for kind \"PyTorchJob\" in group \"\""}]'
According to this link, the HPA must specify .spec.scaleTargetRef.apiVersion so that it can resolve the target resource, but training-operator does not set this field when it creates the HPA.
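Concretely, the missing field looks like this once added by hand (kubeflow.org/v1 being the group/version of PyTorchJob):

spec:
  scaleTargetRef:
    apiVersion: kubeflow.org/v1   # the field training-operator leaves unset
    kind: PyTorchJob
    name: elastic-example-imagenet-c10d-tcp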
Missing a Selector

After I manually added the apiVersion to the HPA, it could find the PyTorchJob kind, but it still could not autoscale because of this new error:
yzhang2343@C02WC268HTDD test % kubectl get horizontalpodautoscaler/elastic-example-imagenet-c10d-tcp -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2022-08-01T16:51:48Z","reason":"SucceededGetScale","message":"the HPA controller was able to get the target''s current scale"},{"type":"ScalingActive","status":"False","lastTransitionTime":"2022-08-01T16:51:48Z","reason":"InvalidSelector","message":"the HPA target''s scale is missing a selector"}]'
According to the Kubernetes HPA doc, the target resource should have a .spec.selector field so that the HPA can find its pods via label selectors. However, the PyTorchJobSpec struct has no selector field at all:
type PyTorchJobSpec struct {
	// RunPolicy encapsulates various runtime policies of the distributed training
	// job, for example how to clean up resources and how long the job can stay
	// active.
	//+kubebuilder:validation:Optional
	RunPolicy commonv1.RunPolicy `json:"runPolicy"`

	ElasticPolicy *ElasticPolicy `json:"elasticPolicy,omitempty"`

	// A map of PyTorchReplicaType (type) to ReplicaSpec (value). Specifies the PyTorch cluster configuration.
	// For example,
	//   {
	//     "Master": PyTorchReplicaSpec,
	//     "Worker": PyTorchReplicaSpec,
	//   }
	PyTorchReplicaSpecs map[commonv1.ReplicaType]*commonv1.ReplicaSpec `json:"pytorchReplicaSpecs"`
}
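If I read the HPA docs right, for a custom resource the selector does not literally have to live at .spec.selector; the HPA reads it from the CRD's scale subresource (status.selector). So a fix would presumably need something like the following kubebuilder marker plus matching fields — this is my sketch, not code from this repo, and the paths are assumptions:

// Sketch: expose a scale subresource so the HPA can read the replica count
// and a label selector from the CRD. PyTorchJob would need spec/status
// fields that actually resolve at these paths, populated by the controller.
//+kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector
type PyTorchJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PyTorchJobSpec   `json:"spec,omitempty"`
	Status PyTorchJobStatus `json:"status,omitempty"`
}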
The expected behavior is that the HorizontalPodAutoscaler is launched successfully and that the number of pods in the training job is updated dynamically and automatically by the HPA according to resource utilization.
It seems to me that training-operator's support for HPA is incomplete. Does anybody know how to launch elastic PyTorch jobs with HPA? Thanks!