[SDK] Support PyTorchJob as a Trial Worker (#2512)
* [SDK] Support PyTorchJob as Trial Worker

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix pod spec for Job

Signed-off-by: Andrey Velichkevich <[email protected]>

* Set default restart_policy to Never

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix primary_container_name for PyTorchJob

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add unit tests for PyTorchJob as Trial

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add e2e test for PyTorchJob as Trial

Signed-off-by: Andrey Velichkevich <[email protected]>

* Bump kubeflow-training SDK

Signed-off-by: Andrey Velichkevich <[email protected]>

* Deploy Training Operator with server side apply

Signed-off-by: Andrey Velichkevich <[email protected]>

* Decrease CPUs for E2E

Signed-off-by: Andrey Velichkevich <[email protected]>

* Install Training Operator for tune workflow

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix comments

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
andreyvelich authored Feb 13, 2025
1 parent 6389cba commit 7b46520
Showing 15 changed files with 395 additions and 194 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/e2e-test-tune-api.yaml
@@ -21,11 +21,12 @@ jobs:
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}

- name: Run e2e test with tune API
uses: ./.github/workflows/template-e2e-test
with:
tune-api: true
training-operator: true

strategy:
fail-fast: false
3 changes: 3 additions & 0 deletions .gitignore
@@ -78,3 +78,6 @@ $RECYCLE.BIN/

## Vendor dir
vendor
+
+ # Jupyter Notebooks.
+ **/.ipynb_checkpoints
4 changes: 1 addition & 3 deletions hack/gen-python-sdk/post_gen.py
@@ -42,9 +42,7 @@ def _rewrite_helper(input_file, output_file, rewrite_rules):
lines.append("# Import Katib API client.\n")
lines.append("from kubeflow.katib.api.katib_client import KatibClient\n")
lines.append("# Import Katib TrainerResources class.\n")
- lines.append(
- "from kubeflow.katib.types.trainer_resources import TrainerResources\n"
- )
+ lines.append("from kubeflow.katib.types.types import TrainerResources\n")
lines.append("# Import Katib report metrics functions\n")
lines.append("from kubeflow.katib.api.report_metrics import report_metrics\n")
lines.append("# Import Katib helper functions.\n")
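The hunk above changes one of the import lines that the post-generation script appends to the SDK's `__init__.py`. The following is only a minimal sketch of that append step. The helper `append_sdk_imports` and the constant `SDK_IMPORT_LINES` are hypothetical names chosen for illustration, and the duplicate check is an assumption; the real `_rewrite_helper` body is not shown in this diff.

```python
# Hypothetical illustration: these names are NOT part of hack/gen-python-sdk/post_gen.py.
SDK_IMPORT_LINES = [
    "# Import Katib TrainerResources class.\n",
    "from kubeflow.katib.types.types import TrainerResources\n",
]


def append_sdk_imports(lines):
    """Return a copy of `lines` with the SDK import lines appended at most once."""
    result = list(lines)
    for line in SDK_IMPORT_LINES:
        # Assumption: skip lines that are already present so reruns stay stable.
        if line not in result:
            result.append(line)
    return result


generated = ["# coding: utf-8\n"]
rewritten = append_sdk_imports(generated)
```

Running the helper a second time on `rewritten` returns the same list, which is the property a regeneration script generally wants.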
36 changes: 18 additions & 18 deletions pkg/apis/v1beta1/swagger.json
@@ -132,7 +132,7 @@
"$ref": "#/definitions/v1beta1.AlgorithmSetting"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
@@ -148,7 +148,7 @@
"$ref": "#/definitions/.v1beta1.SuggestionCondition"
},
"x-kubernetes-list-map-keys": [
- "Type"
+ "type"
],
"x-kubernetes-list-type": "map"
},
@@ -173,7 +173,7 @@
"$ref": "#/definitions/.v1beta1.TrialAssignment"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -217,7 +217,7 @@
"$ref": "#/definitions/v1beta1.EarlyStoppingRule"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
@@ -241,7 +241,7 @@
"$ref": "#/definitions/v1beta1.ParameterAssignment"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -323,7 +323,7 @@
"$ref": "#/definitions/v1beta1.EarlyStoppingRule"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
@@ -356,7 +356,7 @@
"$ref": "#/definitions/v1beta1.ParameterAssignment"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
@@ -402,7 +402,7 @@
"$ref": "#/definitions/.v1beta1.TrialCondition"
},
"x-kubernetes-list-map-keys": [
- "Type"
+ "type"
],
"x-kubernetes-list-type": "map"
},
@@ -450,7 +450,7 @@
"$ref": "#/definitions/v1beta1.AlgorithmSetting"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -539,7 +539,7 @@
"$ref": "#/definitions/v1beta1.EarlyStoppingSetting"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -681,7 +681,7 @@
"$ref": "#/definitions/v1beta1.ParameterSpec"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
@@ -711,7 +711,7 @@
"$ref": "#/definitions/v1beta1.ExperimentCondition"
},
"x-kubernetes-list-map-keys": [
- "Type"
+ "type"
],
"x-kubernetes-list-type": "map"
},
@@ -968,7 +968,7 @@
"$ref": "#/definitions/v1beta1.Operation"
},
"x-kubernetes-list-map-keys": [
- "OperationType"
+ "operationType"
],
"x-kubernetes-list-type": "map"
}
@@ -1000,7 +1000,7 @@
"$ref": "#/definitions/v1beta1.MetricStrategy"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
@@ -1025,7 +1025,7 @@
"$ref": "#/definitions/v1beta1.Metric"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -1045,7 +1045,7 @@
"$ref": "#/definitions/v1beta1.ParameterSpec"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -1072,7 +1072,7 @@
"$ref": "#/definitions/v1beta1.ParameterAssignment"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
}
@@ -1193,7 +1193,7 @@
"$ref": "#/definitions/v1beta1.TrialParameterSpec"
},
"x-kubernetes-list-map-keys": [
- "Name"
+ "name"
],
"x-kubernetes-list-type": "map"
},
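Every hunk in this file lowercases an `x-kubernetes-list-map-keys` entry, presumably so that each key matches the serialized JSON field name (`name`, `type`, `operationType`) of the list items. The sketch below illustrates why the casing matters when a list is merged by key. `merge_list_map` is a simplified, hypothetical routine written for this note, not the Kubernetes server-side apply implementation, and the sample items are made up.

```python
def merge_list_map(existing, incoming, key):
    """Merge two lists of dicts, treating `key` as the unique identifier of each item."""
    merged = {}
    for item in existing + incoming:
        if key not in item:
            # An item without the declared key cannot be matched to any other item.
            raise KeyError(f"list-map key {key!r} missing from item {item!r}")
        # Items sharing the same key value are combined; later fields win.
        merged.setdefault(item[key], {}).update(item)
    return list(merged.values())


# Made-up items shaped like serialized JSON, where the field is `name`, not `Name`.
existing = [{"name": "lr", "value": "0.01"}]
incoming = [{"name": "lr", "value": "0.05"}, {"name": "momentum", "value": "0.9"}]

merged = merge_list_map(existing, incoming, key="name")

try:
    merge_list_map(existing, incoming, key="Name")
    capitalized_key_works = True
except KeyError:
    capitalized_key_works = False
```

With `key="name"` the two `lr` entries collapse into one updated item, while `key="Name"` finds no such field in the serialized items and the merge cannot proceed.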
36 changes: 18 additions & 18 deletions pkg/apis/v1beta1/zz_generated.openapi.go

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion sdk/python/v1beta1/kubeflow/katib/__init__.py
@@ -72,7 +72,7 @@
# Import Katib API client.
from kubeflow.katib.api.katib_client import KatibClient
# Import Katib TrainerResources class.
- from kubeflow.katib.types.trainer_resources import TrainerResources
+ from kubeflow.katib.types.types import TrainerResources
# Import Katib report metrics functions
from kubeflow.katib.api.report_metrics import report_metrics
# Import Katib helper functions.