Can not run ENAS experiment #2494

shubham-ojha-weheal · 2025-01-17T10:28:16Z

What happened?

I want to run an ENAS experiment on my own dataset, but first I wanted to confirm that ENAS experiments run properly in my Kubeflow deployment.
So I copied the YAML file at https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/enas-cpu.yaml to run it in the Katib UI.
I created the experiment by pasting the YAML into the UI as it is.
The experiment gets stuck after it creates and runs the trials. The trial pods end up in a NotReady state, as shown by kubectl get pods -n moderation (moderation is the namespace I am using).

One odd thing I noticed: instead of displaying numbers for the Validation-Accuracy metric, the UI displays the input passed to the Docker image. See Trial Details on Experiment Page below.
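For context on where metric values normally come from: with the StdOut collector, the training container's log lines are scanned for name=value pairs. The sketch below assumes the default text filter is the regular expression given in the Katib metrics collector docs (quoted from memory, so treat the exact pattern as an assumption). `extract_metrics` is a hypothetical helper, not part of Katib.

```python
import re

# Assumed default Katib text filter (quoted from memory of the docs).
DEFAULT_FILTER = re.compile(r"([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)")


def extract_metrics(line, wanted):
    """Return {name: value} for wanted metrics that appear as name=number in line."""
    found = {}
    for match in DEFAULT_FILTER.finditer(line):
        name, raw = match.group(1), match.group(2)
        if name not in wanted:
            continue
        try:
            found[name] = float(raw)
        except ValueError:
            # The pattern also matches an empty or sign-only value; skip those.
            continue
    return found


if __name__ == "__main__":
    wanted = {"Validation-Accuracy"}
    print(extract_metrics("Training-Accuracy=0.5, Validation-Accuracy=0.7134", wanted))
    print(extract_metrics('--architecture="[[11]]"', wanted))
```

Under this assumption, a log line must contain the objective metric name followed by a number for a value to be recorded. A command-line argument such as --architecture="[[11]]" does not yield a numeric value.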

I installed Kubeflow v1.9.1, which is running on AWS, using the command while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done

Since the issue might be caused by my using Kubernetes version 1.31, I also installed the latest Katib with the command kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-with-kubeflow?ref=master" . The only visible change was that the buttons in the UI turned a duller blue; the results were the same.

Here are the screenshots of the relevant pages:
[Screenshot: Katib UI]

[Screenshot: Katib UI Experiment Details]

[Screenshot: Katib UI Trial Details]

Pod Statuses (running kubectl get pods -n moderation)

NAME                                              READY   STATUS     RESTARTS   AGE
dataset-handler-0                                 2/2     Running    0          3d22h
enas-cpu-66jqk54g-6p6kn                           2/3     NotReady   0          7m11s
enas-cpu-cbndflxf-7c56m                           2/3     NotReady   0          7m11s
enas-cpu-enas-6595f7f74b-smwcl                    1/1     Running    0          7m22s
ml-pipeline-ui-artifact-6b44b849d7-9fthm          2/2     Running    0          26h
ml-pipeline-visualizationserver-5fcb5568f-fzm6z   2/2     Running    0          6d19h
pipelines-0                                       2/2     Running    0          2d3h
spanner-test-0                                    2/2     Running    0          4d3h
zas-74b68bb967-vnsq7                              2/2     Running    0          6d4h
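The listing above shows both trial pods with two of three containers ready. A small sketch for filtering such a listing, assuming the default kubectl get pods column layout (NAME, READY, STATUS, ...). `not_ready_pods` is a hypothetical helper name.

```python
def not_ready_pods(table):
    """Return (name, ready, total, status) for pods where ready < total."""
    rows = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready_col, status = fields[0], fields[1], fields[2]
        ready, total = (int(part) for part in ready_col.split("/"))
        if ready < total:
            rows.append((name, ready, total, status))
    return rows


# A few rows copied from the listing in this report.
SAMPLE = """\
NAME                                              READY   STATUS     RESTARTS   AGE
dataset-handler-0                                 2/2     Running    0          3d22h
enas-cpu-66jqk54g-6p6kn                           2/3     NotReady   0          7m11s
enas-cpu-cbndflxf-7c56m                           2/3     NotReady   0          7m11s
enas-cpu-enas-6595f7f74b-smwcl                    1/1     Running    0          7m22s
"""

if __name__ == "__main__":
    for name, ready, total, status in not_ready_pods(SAMPLE):
        print(f"{name}: {ready}/{total} containers ready ({status})")
```

To see which of the three containers is the one not ready, kubectl describe pod <pod-name> -n moderation lists the state of each container.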

[Screenshot: Trial Details on Experiment Page]

YAML for Experiment from Katib UI

metadata:
  name: enas-cpu
  namespace: moderation
  uid: 39920b93-2bc8-40bb-9565-ec8c24c361d2
  resourceVersion: '5141549'
  generation: 1
  creationTimestamp: '2025-01-17T10:17:13Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:13Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
    - manager: katib-ui
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:13Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:nasConfig:
            .: {}
            f:graphConfig:
              .: {}
              f:inputSizes: {}
              f:numLayers: {}
              f:outputSizes: {}
            f:operations: {}
          f:objective:
            .: {}
            f:goal: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:trialTemplate:
            .: {}
            f:primaryContainerName: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:template:
                  .: {}
                  f:spec:
                    .: {}
                    f:containers: {}
                    f:restartPolicy: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:24Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:observation: {}
          f:runningTrialList: {}
          f:startTime: {}
          f:trials: {}
          f:trialsRunning: {}
      subresource: status
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-Accuracy
    metricStrategies:
      - name: Validation-Accuracy
        value: max
  algorithm:
    algorithmName: enas
  trialTemplate:
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - command:
                  - python3
                  - '-u'
                  - RunTrial.py
                  - '--num_epochs=1'
                  - >-
                    --architecture="${trialParameters.neuralNetworkArchitecture}"
                  - '--nn_config="${trialParameters.neuralNetworkConfig}"'
                image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:latest
                name: training-container
            restartPolicy: Never
    trialParameters:
      - name: neuralNetworkArchitecture
        description: >-
          NN architecture contains operations ID on each NN layer and skip
          connections between layers
        reference: architecture
      - name: neuralNetworkConfig
        description: >-
          Configuration contains NN number of layers, input and output sizes,
          description what each operation ID means
        reference: nn_config
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 2
  maxTrialCount: 3
  maxFailedTrialCount: 2
  metricsCollectorSpec:
    collector:
      kind: StdOut
  nasConfig:
    graphConfig:
      numLayers: 1
      inputSizes:
        - 32
        - 32
        - 3
      outputSizes:
        - 10
    operations:
      - operationType: convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - '3'
                - '5'
                - '7'
          - name: num_filter
            parameterType: categorical
            feasibleSpace:
              list:
                - '32'
                - '48'
                - '64'
                - '96'
                - '128'
          - name: stride
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
      - operationType: separable_convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - '3'
                - '5'
                - '7'
          - name: num_filter
            parameterType: categorical
            feasibleSpace:
              list:
                - '32'
                - '48'
                - '64'
                - '96'
                - '128'
          - name: stride
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
          - name: depth_multiplier
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
      - operationType: depthwise_convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - '3'
                - '5'
                - '7'
          - name: stride
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
          - name: depth_multiplier
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
      - operationType: reduction
        parameters:
          - name: reduction_type
            parameterType: categorical
            feasibleSpace:
              list:
                - max_pooling
                - avg_pooling
          - name: pool_size
            parameterType: int
            feasibleSpace:
              max: '3'
              min: '2'
              step: '1'
  resumePolicy: Never
status:
  startTime: '2025-01-17T10:17:13Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2025-01-17T10:17:13Z'
      lastTransitionTime: '2025-01-17T10:17:13Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2025-01-17T10:17:24Z'
      lastTransitionTime: '2025-01-17T10:17:24Z'
  currentOptimalTrial:
    observation: {}
  runningTrialList:
    - enas-cpu-cbndflxf
    - enas-cpu-66jqk54g
  trials: 2
  trialsRunning: 2

YAML for Trial from Katib UI

metadata:
  name: enas-cpu-66jqk54g
  namespace: moderation
  uid: f4f2eeea-11a3-4637-b190-b3b9c7190836
  resourceVersion: '5141540'
  generation: 1
  creationTimestamp: '2025-01-17T10:17:24Z'
  labels:
    katib.kubeflow.org/experiment: enas-cpu
  ownerReferences:
    - apiVersion: kubeflow.org/v1beta1
      kind: Experiment
      name: enas-cpu
      uid: 39920b93-2bc8-40bb-9565-ec8c24c361d2
      controller: true
      blockOwnerDeletion: true
  finalizers:
    - clean-metrics-in-db
  managedFields:
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:24Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"clean-metrics-in-db": {}
          f:labels:
            .: {}
            f:katib.kubeflow.org/experiment: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"39920b93-2bc8-40bb-9565-ec8c24c361d2"}: {}
        f:spec:
          .: {}
          f:failureCondition: {}
          f:metricsCollector:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
          f:objective:
            .: {}
            f:goal: {}
            f:metricStrategies: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parameterAssignments: {}
          f:primaryContainerName: {}
          f:runSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:metadata:
              .: {}
              f:name: {}
              f:namespace: {}
            f:spec:
              .: {}
              f:template:
                .: {}
                f:spec:
                  .: {}
                  f:containers: {}
                  f:restartPolicy: {}
          f:successCondition: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:24Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:startTime: {}
      subresource: status
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-Accuracy
    metricStrategies:
      - name: Validation-Accuracy
        value: max
  parameterAssignments:
    - name: architecture
      value: '[[11]]'
    - name: nn_config
      value: >-
        {'num_layers': 1, 'input_sizes': [32, 32, 3], 'output_sizes': [10],
        'embedding': {'11': {'opt_id': 11, 'opt_type': 'convolution',
        'opt_params': {'filter_size': '5', 'num_filter': '32', 'stride': '2'}}}}
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: enas-cpu-66jqk54g
      namespace: moderation
    spec:
      template:
        spec:
          containers:
            - command:
                - python3
                - '-u'
                - RunTrial.py
                - '--num_epochs=1'
                - '--architecture="[[11]]"'
                - >-
                  --nn_config="{'num_layers': 1, 'input_sizes': [32, 32, 3],
                  'output_sizes': [10], 'embedding': {'11': {'opt_id': 11,
                  'opt_type': 'convolution', 'opt_params': {'filter_size': '5',
                  'num_filter': '32', 'stride': '2'}}}}"
              image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:latest
              name: training-container
          restartPolicy: Never
  metricsCollector:
    collector:
      kind: StdOut
  primaryContainerName: training-container
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
status:
  startTime: '2025-01-17T10:17:24Z'
  conditions:
    - type: Created
      status: 'True'
      reason: TrialCreated
      message: Trial is created
      lastUpdateTime: '2025-01-17T10:17:24Z'
      lastTransitionTime: '2025-01-17T10:17:24Z'
    - type: Running
      status: 'True'
      reason: TrialRunning
      message: Trial is running
      lastUpdateTime: '2025-01-17T10:17:24Z'
      lastTransitionTime: '2025-01-17T10:17:24Z'
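
Comparing the experiment's trialTemplate with the trial's runSpec shows how each ${trialParameters.<name>} placeholder was filled from the parameterAssignments entry named by its reference. A minimal sketch of that mapping, using the values from the two YAML dumps above. It reproduces the observed result only and is not Katib's actual implementation; `render_command` is a hypothetical helper.

```python
import re

PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render_command(command, trial_parameters, assignments):
    """Fill ${trialParameters.<name>} placeholders in each command argument."""
    # trialParameters name -> name of the parameter assignment it references
    ref_by_name = {p["name"]: p["reference"] for p in trial_parameters}
    # parameter assignment name -> assigned value
    value_by_ref = {a["name"]: a["value"] for a in assignments}

    def substitute(match):
        return value_by_ref[ref_by_name[match.group(1)]]

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


# Values copied from the experiment and trial YAML in this report.
NN_CONFIG = (
    "{'num_layers': 1, 'input_sizes': [32, 32, 3], 'output_sizes': [10], "
    "'embedding': {'11': {'opt_id': 11, 'opt_type': 'convolution', "
    "'opt_params': {'filter_size': '5', 'num_filter': '32', 'stride': '2'}}}}"
)
COMMAND = [
    "python3",
    "-u",
    "RunTrial.py",
    "--num_epochs=1",
    '--architecture="${trialParameters.neuralNetworkArchitecture}"',
    '--nn_config="${trialParameters.neuralNetworkConfig}"',
]
TRIAL_PARAMETERS = [
    {"name": "neuralNetworkArchitecture", "reference": "architecture"},
    {"name": "neuralNetworkConfig", "reference": "nn_config"},
]
ASSIGNMENTS = [
    {"name": "architecture", "value": "[[11]]"},
    {"name": "nn_config", "value": NN_CONFIG},
]

if __name__ == "__main__":
    for arg in render_command(COMMAND, TRIAL_PARAMETERS, ASSIGNMENTS):
        print(arg)
```

The rendered arguments match the command in the trial's runSpec, so the template substitution itself appears to have worked as intended.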

What did you expect to happen?

The experiment should complete without any issues.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.31.4-eks-2d5f260

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:latest

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/tensorflow/lib/python3.12/site-packages
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by:

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍


Electronic-Waste · 2025-01-21T10:33:01Z

/remove-label lifecycle/needs-triage
/area nas
/help

/cc @kubeflow/wg-automl-leads

google-oss-prow · 2025-01-21T10:33:04Z

@Electronic-Waste:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/remove-label lifecycle/needs-triage
/area nas
/help

/cc @kubeflow/wg-automl-leads

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

shubham-ojha-weheal added kind/bug lifecycle/needs-triage labels Jan 17, 2025

google-oss-prow bot added area/nas help wanted Extra attention is needed and removed lifecycle/needs-triage labels Jan 21, 2025
