Add Conformance Program Doc for AutoML and Training WG #2048
7b3c450 to 22e4467
/hold for the review
docs/proposals/conformance-test.md
That will help to avoid environment variance and improve fault tolerance. Driver is required to trigger the deployment and download the results.

- We are going to use
  [the unify Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
'unify' -> 'unified'
/lgtm
LGTM. Thanks for leading this effort!
[APPROVALNOTIFIER] This PR is APPROVED.
This pull-request has been approved by: andreyvelich, terrytangyuan
docs/proposals/conformance-test.md
## Kubeflow Conformance

Kubeflow conformance consists the 3 category of tests:
Initially the Kubeflow conformance will include CRD based tests. In the future, API and UI based tests may be added.
docs/proposals/conformance-test.md
In the following versions, we should design conformance program for the
Katib API-based tests.

- CRD-based tests
This should be the first bullet in this list, before the API and UI tests.
- The tests should cover basic functionality of Katib and the Training Operator.
  It will not cover all features.
- The tests are expected to evolve in the future versions.
- The tests should have a well documented and short list of set-up requirements.
- The tests should install and complete in a relatively short period of time (< 30 minutes) with suggested minimum infrastructure requirements, i.e. 3 nodes, 24 vCPUs, 64 GB RAM, 500 GB disk.
I am not sure that we can achieve the < 30 minutes requirement.
If we are going to run more than one Katib Experiment in the future, we might need more time. WDYT @johnugeorge?
What about the Pipelines team, @james-jwu?
One more idea - The Katib and Training Operator configuration and tests should make attempts to be integrated with the Pipeline configuration and test configuration. (My point is that we should try to minimize the conformance testing configuration and resource requirements if/when possible).
The Pipeline requirement is relatively light. See the following in setup.yaml:
cpu: "2"
memory: 2Gi
requests.storage: "5Gi"
It's been a while since I last ran the Pipeline tests, but they are quite fast (<15 min for sure).
How long do the current Katib and Training tests take to run?
@james-jwu Are these resources a mandatory requirement? We have been running the Katib deployment and tests on GitHub CI, which has a 2-core CPU and 7 GB of memory. Since the allocated resources are a bit tight, we have seen certain runs exceed the 30-minute limit. However, with slightly more CPU resources, we can easily finish within 30 minutes.
Also, Katib's hyperparameter search doesn't depend much on how each training step actually goes, so we could use a very small number of epochs or a very small neural network for the conformance test's experiments.
For the 1st version I think it is okay to require more resources. Jaeyeon's suggestion also sounds great.
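To make the numbers in this thread concrete, here is a minimal Python sketch of how a driver could check a run against them. The thresholds are only the figures suggested above (3 nodes, 24 vCPUs, 64 GB RAM, 500 GB disk, < 30 minutes), not an agreed policy, and the `Cluster` class and both helper functions are hypothetical names introduced for illustration; nothing like this exists in the proposal or the Makefile.

```python
from dataclasses import dataclass

# Figures quoted in the review thread; assumptions, not an agreed-upon policy.
MIN_NODES = 3
MIN_VCPU = 24
MIN_RAM_GB = 64
MIN_DISK_GB = 500
TIME_BUDGET_MINUTES = 30


@dataclass
class Cluster:
    """Hypothetical summary of the infrastructure used for a conformance run."""
    nodes: int
    vcpu: int
    ram_gb: int
    disk_gb: int


def meets_minimum(cluster: Cluster) -> bool:
    """Return True if the cluster satisfies every suggested minimum."""
    return (
        cluster.nodes >= MIN_NODES
        and cluster.vcpu >= MIN_VCPU
        and cluster.ram_gb >= MIN_RAM_GB
        and cluster.disk_gb >= MIN_DISK_GB
    )


def within_budget(duration_minutes: float) -> bool:
    """Return True if the run finished strictly under the time budget."""
    return duration_minutes < TIME_BUDGET_MINUTES
```

For example, `meets_minimum(Cluster(nodes=1, vcpu=2, ram_gb=7, disk_gb=100))` is false for a runner of roughly the GitHub CI size mentioned above (the disk figure there is made up), which matches the observation that such runs can exceed the limit.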
docs/proposals/conformance-test.md
- The above report can be downloaded from the test deployment by running `make report`.

- When all reports have been collected, the distributions are going to create PR
  to publish the reports. The Kubeflow Conformance Committee will verify it and
to publish the reports and to update the appropriate Kubeflow.org web pages on conformant Kubeflow distributions.
Thank you for the review @jbottum @james-jwu. I addressed your points.
46a5be1 to 604cf2c
@andreyvelich Thanks for this proposal!
LGTM
LGTM, thanks!
I guess all the comments have been addressed. Thanks for the review!
@andreyvelich Thanks!
/lgtm
Thanks everyone for the review, looking forward to our next steps.
Related: kubeflow/training-operator#1695, #2044.
Original doc: https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#.
I've added conformance doc for CRD-based test for AutoML and Training WG.
Please take a look.
/assign @james-jwu @johnugeorge @tenzen-y @jbottum @anencore94
cc other WGs
@kubeflow/wg-training-leads
@kubeflow/wg-pipeline-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-manifests-leads