Metrics not reporting to Katib server - experiment timing out #1905

Closed
farisfirenze opened this issue Jun 23, 2022 · 5 comments
farisfirenze commented Jun 23, 2022

I am trying to create an experiment in a Kubeflow pipeline using Python, where I can hyperparameter-tune a simple script. I want to use Katib to tune the hyperparameters from Python (not by applying a YAML file). The problem is that I can't report the metrics to the Katib server, and since no metrics are reported, the experiment times out. So I need some help from the community.

Here is what I have tried:

  1. I created a GKE cluster and installed Katib, the training-operator, and Kubeflow Pipelines on it.
  2. I tried to create an experiment using a TFJob, as given below:

trial_spec = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            }
        }
    }
}

The JSON above is my trial spec. The entire pipeline code is given below:

import kfp
import kfp.dsl as dsl
from kfp import components

from kubeflow.katib import ApiClient
from kubeflow.katib import V1beta1ExperimentSpec
from kubeflow.katib import V1beta1AlgorithmSpec
from kubeflow.katib import V1beta1ObjectiveSpec
from kubeflow.katib import V1beta1ParameterSpec
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1TrialTemplate
from kubeflow.katib import V1beta1TrialParameterSpec
from kubeflow.katib import V1beta1MetricsCollectorSpec
from kubeflow.katib import V1beta1CollectorSpec
from kubeflow.katib import V1beta1FileSystemPath
from kubeflow.katib import V1beta1SourceSpec
from kubeflow.katib import V1beta1FilterSpec
# experiment_name = "tf-test17"
# experiment_namespace = "kubeflow"

# Trial count specification.
max_trial_count = 2
max_failed_trial_count = 2
parallel_trial_count = 1

# Objective specification.
objective = V1beta1ObjectiveSpec(
    type="minimize",
    # goal=100,
    objective_metric_name="loss"
    # additional_metric_names=["accuracy"]
)


# Metrics collector specification.
metrics_collector_specs = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(kind="File"),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            # format="TEXT",
            path="/opt/trainer/katib/metrics.log",
            kind="File"
        ),
        filter=V1beta1FilterSpec(
            metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]
        
        )
    )
)

# Algorithm specification.
algorithm = V1beta1AlgorithmSpec(
    algorithm_name="random",
)

# Experiment search space.
# In this example we tune learning rate and batch size.
parameters = [
    V1beta1ParameterSpec(
        name="epoch",
        parameter_type="int",
        feasible_space=V1beta1FeasibleSpace(
            min="5",
            max="12"
        ),
    ),
    V1beta1ParameterSpec(
        name="batch_size",
        parameter_type="int",
        feasible_space=V1beta1FeasibleSpace(
            min="12",
            max="32"
        ),
    )
]

# Experiment Trial template.


# TODO (andreyvelich): Use community image for the mnist example.
trial_spec = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "spec": {
        "tfReplicaSpecs": {
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "<image_name>",
                                "command": [
                                    "python",
                                    "/opt/trainer/task.py",
                                    "--epoch=${trialParameters.epoch}",
                                    "--batch_size=${trialParameters.batchSize}"
                                ]
                            }
                        ]
                    }
                }
            }
        }
    }
}

# Configure parameters for the Trial template.
trial_template = V1beta1TrialTemplate(
    primary_container_name="tensorflow",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="epoch",
            description="epoch",
            reference="epoch"
        ),
        V1beta1TrialParameterSpec(
            name="batchSize",
            description="Batch size for the model",
            reference="batch_size"
        ),
    ],
    trial_spec=trial_spec
)

# Create an Experiment from the above parameters.
experiment_spec = V1beta1ExperimentSpec(
    max_trial_count=max_trial_count,
    max_failed_trial_count=max_failed_trial_count,
    parallel_trial_count=parallel_trial_count,
    # metrics_collector_spec=metrics_collector_specs,
    objective=objective,
    algorithm=algorithm,
    parameters=parameters,
    trial_template=trial_template
)

# Create the KFP task for the Katib Experiment.
# Experiment Spec should be serialized to a valid Kubernetes object.
katib_experiment_launcher_op = components.load_component_from_file("component.yaml")


@dsl.pipeline(
    name="Launch Katib early stopping Experiment",
    description="An example to launch Katib Experiment with early stopping"
)
def pipeline_func(

    experiment_name: str = "tf-test-1",
    experiment_namespace: str = "kubeflow",
    experiment_timeout_minutes: int = 5
):

    # Katib launcher component.
    # Experiment Spec should be serialized to a valid Kubernetes object.
    op = katib_experiment_launcher_op(
        experiment_name=experiment_name,
        experiment_namespace=experiment_namespace,
        experiment_spec=ApiClient().sanitize_for_serialization(experiment_spec),
        experiment_timeout_minutes=experiment_timeout_minutes,
        delete_finished_experiment=False)

    
    # restricting the maximum usable memory and cpu for preprocess stage
    op.set_memory_limit("8G")
    op.set_cpu_limit("1")
    
    # Output container to print the results.
    op_out = dsl.ContainerOp(
        name="best-hp",
        image="library/bash:4.4.23",
        command=["sh", "-c"],
        arguments=["echo Best HyperParameters: %s" % op.output],
    )
    
    op_out.set_memory_limit("4G")
    op_out.set_cpu_limit("1")

if __name__ == '__main__':
    
    # compiling the model and generating tar.gz file to upload to Kubeflow Pipeline UI
    import kfp.compiler as compiler

    compiler.Compiler().compile(
        pipeline_func, 'pipeline_tf_text.tar.gz'
    )
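As a side note, the metricsFormat filter defined above can be exercised locally to confirm that a metrics line will actually be picked up by the File metrics collector. This is only a sketch of the regex matching, not Katib's actual collector code:

```python
import re

# The same filter regex passed to V1beta1FilterSpec above.
METRICS_FORMAT = r"{metricName: ([\w|-]+), metricValue: ((-?\d+)(\.\d+)?)}"

# A line in the shape task.py writes to /opt/trainer/katib/metrics.log.
line = "{metricName: loss, metricValue: 725.6600};{metricName: accuracy, metricValue: 725.6600}"

# Each match yields (name, value, int_part, frac_part); only the first
# two groups matter for the collector.
matches = re.findall(METRICS_FORMAT, line)
metrics = {name: float(value) for name, value, _, _ in matches}
print(metrics)  # → {'loss': 725.66, 'accuracy': 725.66}
```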

And this is my /opt/trainer/task.py code:

import os
import re
import shutil
import string
import tensorflow as tf
import random
from tensorflow.keras import layers
from tensorflow.keras import losses
import argparse
import logging
from google.cloud import storage

os.mkdir("katib/")

logging.basicConfig(
            format="%(asctime)s %(levelname)-8s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%SZ",
            level=logging.DEBUG,
            filename="katib/metrics.log")


if __name__ == "__main__":
    
    try: 
        list1 = [1245.99, 7554.00, 725.66, 546.88, 423.99, 7866.00]
        loss = random.choice(list1)
        logging.info("{{metricName: loss, metricValue: {:.4f}}};{{metricName: accuracy, metricValue: {:.4f}}}\n".format(loss, loss))
        
    except Exception as e:
        print(e)
     

The task.py code had more training logic; I have removed it since I only have a problem with reporting metrics to the Katib server. Since I am simply taking a random number from the list and reporting it to the Katib server, I expected it to work, but it doesn't.
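For reference, a minimal task.py-style trainer that parses the trial parameters and writes a metrics line in the shape the filter expects can be sketched as below. This is only a sketch (the loss is a placeholder; writing with a plain open() also avoids the timestamp/level prefix that logging.basicConfig adds to each line):

```python
import argparse
import os


def main(argv=None):
    # Flag names mirror the trial spec's command:
    # --epoch=${trialParameters.epoch} --batch_size=${trialParameters.batchSize}
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=12)
    args = parser.parse_args(argv)

    # Placeholder for a real training result.
    loss = 0.1234 * args.epoch

    # One metrics record per line, matching the metricsFormat filter.
    os.makedirs("katib", exist_ok=True)
    with open("katib/metrics.log", "a") as f:
        f.write("{{metricName: loss, metricValue: {:.4f}}}\n".format(loss))


# Example invocation, standing in for the TFJob container command:
main(["--epoch=8", "--batch_size=16"])
```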

  3. I have also tried giving the path in file_system_path.path as katib/metrics.log, as given below:
# Metrics collector specification.
metrics_collector_specs = V1beta1MetricsCollectorSpec(
    collector=V1beta1CollectorSpec(kind="File"),
    source=V1beta1SourceSpec(
        file_system_path=V1beta1FileSystemPath(
            # format="TEXT",
            path="katib/metrics.log",
            kind="File"
        ),
        filter=V1beta1FilterSpec(
            metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]

        )
    )
)

This gives me an error as given below:

time="2022-06-23T05:53:55.247Z" level=info msg="capturing logs" argo=true
INFO:root:Creating Experiment: tf-test-1 in namespace: kubeflow
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/kubeflow/katib/api/katib_client.py", line 74, in create_experiment
    exp_object)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/custom_objects_api.py", line 178, in create_namespaced_custom_object
    (data) = self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/apis/custom_objects_api.py", line 277, in create_namespaced_custom_object_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.6/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '251506ca-f8e0-487a-9bea-9c87f435991b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4f14223c-b7ef-4a61-935e-760caef0517b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a0b375b1-ec21-45bd-a615-84415d25710b', 'Date': 'Thu, 23 Jun 2022 05:53:56 GMT', 'Content-Length': '279'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: file path where metrics file exists is required by .spec.metricsCollectorSpec.source.fileSystemPath.path","code":400}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "src/launch_experiment.py", line 115, in <module>
    output = katib_client.create_experiment(experiment, namespace=experiment_namespace)
  File "/usr/local/lib/python3.6/site-packages/kubeflow/katib/api/katib_client.py", line 78, in create_experiment
    %s\n" % e)
RuntimeError: Exception when calling CustomObjectsApi->create_namespaced_custom_object:         (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '251506ca-f8e0-487a-9bea-9c87f435991b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '4f14223c-b7ef-4a61-935e-760caef0517b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a0b375b1-ec21-45bd-a615-84415d25710b', 'Date': 'Thu, 23 Jun 2022 05:53:56 GMT', 'Content-Length': '279'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: file path where metrics file exists is required by .spec.metricsCollectorSpec.source.fileSystemPath.path","code":400}
time="2022-06-23T05:53:56.361Z" level=error msg="cannot save parameter /tmp/outputs/Best_Parameter_Set/data" argo=true error="open /tmp/outputs/Best_Parameter_Set/data: no such file or directory"
time="2022-06-23T05:53:56.361Z" level=error msg="cannot save artifact /tmp/outputs/Best_Parameter_Set/data" argo=true error="stat /tmp/outputs/Best_Parameter_Set/data: no such file or directory"
Error: exit status 1

So I changed the path to the full path, /opt/trainer/katib/metrics.log. The experiment just times out when I do so.
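The webhook rejection in step 3 is consistent with the collector needing a path that passes validation: the relative katib/metrics.log was denied, while the absolute /opt/trainer/katib/metrics.log got past the webhook. A tiny pre-submission sanity check (a hypothetical helper, not part of the Katib SDK) could catch this before the experiment is created:

```python
import os.path


def check_metrics_path(path: str) -> str:
    """Reject metrics file paths that appear invalid (hypothetical helper)."""
    if not path or not os.path.isabs(path):
        raise ValueError(
            "metricsCollectorSpec.source.fileSystemPath.path should be an "
            "absolute path, got {!r}".format(path)
        )
    return path


check_metrics_path("/opt/trainer/katib/metrics.log")  # passes
# check_metrics_path("katib/metrics.log")  # would raise ValueError
```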

  4. I have also tried changing the trial spec as follows:
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "tensorflow",
                        "image": "<image_name>",
                        "command": [
                            "python",
                            "/opt/trainer/task.py",
                            "--epoch=${trialParameters.epoch}",
                            "--batch_size=${trialParameters.batchSize}"
                        ]
                    }
                ]
            }
        }
    }
}

FYI: Katib can get into my container, and I can see the pod logs saying it succeeded, but the metrics are not being reported. I would like some help from the community ASAP, please.

Please comment if you need any more information. I have tried many other things, but I can't post everything here.

This is how I created my cluster and did all the installation:

CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
ZONE="us-central1-a"
MACHINE_TYPE="n1-standard-4"
SCOPES="cloud-platform"
NODES_NUM=1

gcloud container clusters create $CLUSTER_NAME --zone $ZONE --machine-type $MACHINE_TYPE --scopes $SCOPES --num-nodes $NODES_NUM

gcloud config set compute/zone $ZONE
gcloud container clusters get-credentials $CLUSTER_NAME

export PIPELINE_VERSION=1.8.1
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
# katib
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
kubectl apply -f ./test.yaml

test.yaml file

apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  labels:
    katib.kubeflow.org/metrics-collector-injection: enabled

References:

  1. https://github.com/kubeflow/katib/blob/master/examples/v1beta1/sdk/nas-with-darts.ipynb
  2. https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector
  3. https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/file-metrics-collector.yaml#L13-L22
  4. https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/katib-launcher/component.yaml
johnugeorge commented Jun 23, 2022

Can you provide

kubectl get trials -n kubeflow
kubectl get tfjobs -n kubeflow
kubectl get pods -n kubeflow

Can you also provide final experiment config by kubectl get experiment -o yaml ? I can try reproducing it

farisfirenze commented Jun 23, 2022

> Can you provide
>
> kubectl get trials -n kubeflow
> kubectl get tfjobs -n kubeflow
> kubectl get pods -n kubeflow
>
> Can you also provide final experiment config by kubectl get experiment -o yaml ? I can try reproducing it

(base) jupyter@tensorflow-2-6-new:~/katib/boston/sample_gke$ kubectl get trials -n kubeflow
NAME                 TYPE      STATUS   AGE
tf-test-2-k24lz692   Running   True     3h23m
tf-test-3-6kwcphhp   Created   True     3h16m
tf-test-4-qzrlhnpj   Created   True     3h7m
(base) jupyter@tensorflow-2-6-new:~/katib/boston/sample_gke$ kubectl get tfjobs -n kubeflow
NAME                 AGE
tf-test-2-k24lz692   3h24m
(base) jupyter@tensorflow-2-6-new:~/katib/boston/sample_gke$ kubectl get pods -n kubeflow
NAME                                                      READY   STATUS      RESTARTS        AGE
cache-deployer-deployment-54d9945778-pb44d                1/1     Running     0               3h35m
cache-server-64ff7d6cc5-bp5lq                             1/1     Running     0               3h35m
controller-manager-6d7b565545-6v27r                       1/1     Running     0               3h35m
katib-cert-generator-jhnww                                0/1     Completed   0               3h35m
katib-controller-694d8f5b89-tz7fk                         1/1     Running     0               3h35m
katib-db-manager-57cd769cdb-mnkrm                         1/1     Running     0               3h35m
katib-mysql-5bf95ddfcc-qwf7x                              1/1     Running     0               3h35m
katib-ui-5767cfccdc-s8qv6                                 1/1     Running     0               3h35m
launch-katib-early-stopping-experiment-4zfpl-2933989764   0/2     Error       0               3h25m
launch-katib-early-stopping-experiment-8dfrf-3055254602   0/2     Error       0               3h26m
launch-katib-early-stopping-experiment-f56b6-2833276051   0/2     Error       0               3h18m
launch-katib-early-stopping-experiment-fqtj5-3738587822   0/2     Error       0               3h9m
metadata-envoy-deployment-7b847ff6c5-cqds8                1/1     Running     0               3h35m
metadata-grpc-deployment-f8d68f687-mcnrv                  1/1     Running     5 (3h31m ago)   3h35m
metadata-writer-85f85cb76d-28grp                          1/1     Running     1 (3h30m ago)   3h35m
minio-5b65df66c9-fp2j7                                    1/1     Running     0               3h35m
ml-pipeline-55bd758bdc-q7v9p                              1/1     Running     0               3h35m
ml-pipeline-persistenceagent-89cd545c6-x72sq              1/1     Running     1 (3h32m ago)   3h35m
ml-pipeline-scheduledworkflow-8dffb8d47-qlzfv             1/1     Running     0               3h35m
ml-pipeline-ui-c4767db7c-x6hbb                            1/1     Running     0               3h35m
ml-pipeline-viewer-crd-68bdd5f6f5-nmh8p                   1/1     Running     0               3h35m
ml-pipeline-visualizationserver-66979c8f66-44qfv          1/1     Running     0               3h35m
mysql-f7b9b7dd4-649sh                                     1/1     Running     0               3h35m
proxy-agent-59dc7848f-sz5tq                               1/1     Running     0               3h35m
tf-test-2-k24lz692-ps-0                                   0/1     Completed   0               3h24m
tf-test-2-k24lz692-worker-0                               0/1     Completed   0               3h24m
tf-test-2-random-65cc4db558-kj9jb                         1/1     Running     0               3h25m
tf-test-3-random-68d5bf8869-sh79v                         1/1     Running     0               3h17m
tf-test-4-random-7c58968756-76wgq                         1/1     Running     0               3h9m
training-operator-7d98f9dd88-4vsbh                        1/1     Running     0               3h35m
workflow-controller-8f7b8c697-jmrtp                       1/1     Running     0               3h35m
(base) jupyter@tensorflow-2-6-new:~/katib/boston/sample_gke$ kubectl get experiment -o yaml
apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

johnugeorge (Member) commented:
Your experiment seems to be in the kubeflow namespace. Can you provide kubectl get experiment -o yaml -n kubeflow?

I see that you are using Katib from the master branch, while the training operator is from the 1.3 release.
Can you try the latest released versions, Katib v0.13.0 and training operator v1.4.0? There were a few changes in labels on the operator side: https://github.com/kubeflow/katib/releases/tag/v0.13.0
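Following this suggestion, the install commands from the issue description would change only in their refs. A sketch, assuming the same kustomize manifest paths as in the original install script:

```shell
# Pin Katib and the training operator to matching releases
# (refs per the suggestion above; manifest paths as in the original install).
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.4.0"
```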

farisfirenze commented Jun 24, 2022


(base) jupyter@tensorflow-2-6-new:~/katib/boston/sample_gke$ kubectl get experiment -o yaml -n kubeflow
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1beta1
  kind: Experiment
  metadata:
    creationTimestamp: "2022-06-23T05:55:21Z"
    finalizers:
    - update-prometheus-metrics
    generation: 1
    name: tf-test-2
    namespace: kubeflow
    resourceVersion: "6748"
    uid: 31011d70-19b1-4556-9a9a-590239a109c2
  spec:
    algorithm:
      algorithmName: random
    maxFailedTrialCount: 2
    maxTrialCount: 2
    metricsCollectorSpec:
      collector:
        kind: File
      source:
        fileSystemPath:
          format: TEXT
          kind: File
          path: /opt/trainer/katib/metrics.log
        filter:
          metricsFormat:
          - '{metricName: ([\w|-]+), metricValue: ((-?\d+)(\.\d+)?)}'
    objective:
      additionalMetricNames:
      - accuracy
      metricStrategies:
      - name: loss
        value: min
      - name: accuracy
        value: min
      objectiveMetricName: loss
      type: minimize
    parallelTrialCount: 1
    parameters:
    - feasibleSpace:
        max: "12"
        min: "5"
      name: epoch
      parameterType: int
    - feasibleSpace:
        max: "32"
        min: "12"
      name: batch_size
      parameterType: int
    resumePolicy: LongRunning
    trialTemplate:
      failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
      primaryContainerName: tensorflow
      primaryPodLabels:
        training.kubeflow.org/job-role: master
      successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
      trialParameters:
      - description: epoch
        name: epoch
        reference: epoch
      - description: Batch size for the model
        name: batchSize
        reference: batch_size
      trialSpec:
        apiVersion: kubeflow.org/v1
        kind: TFJob
        spec:
          tfReplicaSpecs:
            PS:
              replicas: 1
              restartPolicy: Never
              template:
                metadata:
                  annotations:
                    sidecar.istio.io/inject: "false"
                spec:
                  containers:
                  - command:
                    - python
                    - /opt/trainer/task.py
                    - --epoch=${trialParameters.epoch}
                    - --batch_size=${trialParameters.batchSize}
                    image: gcr.io/prj-vertex-ai/textclasstf:v18
                    name: tensorflow
            Worker:
              replicas: 1
              restartPolicy: Never
              template:
                metadata:
                  annotations:
                    sidecar.istio.io/inject: "false"
                spec:
                  containers:
                  - command:
                    - python
                    - /opt/trainer/task.py
                    - --epoch=${trialParameters.epoch}
                    - --batch_size=${trialParameters.batchSize}
                    image: gcr.io/prj-vertex-ai/textclasstf:v18
                    name: tensorflow
  status:
    conditions:
    - lastTransitionTime: "2022-06-23T05:55:21Z"
      lastUpdateTime: "2022-06-23T05:55:21Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2022-06-23T05:55:52Z"
      lastUpdateTime: "2022-06-23T05:55:52Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "True"
      type: Running
    currentOptimalTrial:
      observation: {}
    runningTrialList:
    - tf-test-2-k24lz692
    startTime: "2022-06-23T05:55:21Z"
    trials: 1
    trialsRunning: 1
- apiVersion: kubeflow.org/v1beta1
  kind: Experiment
  metadata:
    creationTimestamp: "2022-06-23T06:02:25Z"
    finalizers:
    - update-prometheus-metrics
    generation: 1
    name: tf-test-3
    namespace: kubeflow
    resourceVersion: "9434"
    uid: f7abc751-aa26-44e7-ab29-99c318f596ac
  spec:
    algorithm:
      algorithmName: random
    maxFailedTrialCount: 2
    maxTrialCount: 2
    metricsCollectorSpec:
      collector:
        kind: File
      source:
        fileSystemPath:
          format: TEXT
          kind: File
          path: /opt/trainer/katib/metrics.log
        filter:
          metricsFormat:
          - '{metricName: ([\w|-]+), metricValue: ((-?\d+)(\.\d+)?)}'
    objective:
      additionalMetricNames:
      - accuracy
      metricStrategies:
      - name: loss
        value: min
      - name: accuracy
        value: min
      objectiveMetricName: loss
      type: minimize
    parallelTrialCount: 1
    parameters:
    - feasibleSpace:
        max: "12"
        min: "5"
      name: epoch
      parameterType: int
    - feasibleSpace:
        max: "32"
        min: "12"
      name: batch_size
      parameterType: int
    resumePolicy: LongRunning
    trialTemplate:
      failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
      primaryContainerName: tensorflow
      successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
      trialParameters:
      - description: epoch
        name: epoch
        reference: epoch
      - description: Batch size for the model
        name: batchSize
        reference: batch_size
      trialSpec:
        apiVersion: batch/v1
        kind: Job
        spec:
          template:
            spec:
              containers:
              - command:
                - python
                - /opt/trainer/task.py
                - --epoch=${trialParameters.epoch}
                - --batch_size=${trialParameters.batchSize}
                image: gcr.io/prj-vertex-ai/textclasstf:v18
                name: tensorflow
  status:
    conditions:
    - lastTransitionTime: "2022-06-23T06:02:25Z"
      lastUpdateTime: "2022-06-23T06:02:25Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2022-06-23T06:02:46Z"
      lastUpdateTime: "2022-06-23T06:02:46Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "True"
      type: Running
    currentOptimalTrial:
      observation: {}
    pendingTrialList:
    - tf-test-3-6kwcphhp
    startTime: "2022-06-23T06:02:25Z"
    trials: 1
    trialsPending: 1
- apiVersion: kubeflow.org/v1beta1
  kind: Experiment
  metadata:
    creationTimestamp: "2022-06-23T06:11:15Z"
    finalizers:
    - update-prometheus-metrics
    generation: 1
    name: tf-test-4
    namespace: kubeflow
    resourceVersion: "12888"
    uid: 2394b3e0-b23d-4fba-8e88-438e3648cfa7
  spec:
    algorithm:
      algorithmName: random
    maxFailedTrialCount: 2
    maxTrialCount: 2
    metricsCollectorSpec:
      collector:
        kind: StdOut
    objective:
      metricStrategies:
      - name: loss
        value: min
      objectiveMetricName: loss
      type: minimize
    parallelTrialCount: 1
    parameters:
    - feasibleSpace:
        max: "12"
        min: "5"
      name: epoch
      parameterType: int
    - feasibleSpace:
        max: "32"
        min: "12"
      name: batch_size
      parameterType: int
    resumePolicy: LongRunning
    trialTemplate:
      failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
      primaryContainerName: tensorflow
      successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
      trialParameters:
      - description: epoch
        name: epoch
        reference: epoch
      - description: Batch size for the model
        name: batchSize
        reference: batch_size
      trialSpec:
        apiVersion: batch/v1
        kind: Job
        spec:
          template:
            spec:
              containers:
              - command:
                - python
                - /opt/trainer/task.py
                - --epoch=${trialParameters.epoch}
                - --batch_size=${trialParameters.batchSize}
                image: gcr.io/prj-vertex-ai/textclasstf:v19
                name: tensorflow
  status:
    conditions:
    - lastTransitionTime: "2022-06-23T06:11:15Z"
      lastUpdateTime: "2022-06-23T06:11:15Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2022-06-23T06:11:36Z"
      lastUpdateTime: "2022-06-23T06:11:36Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "True"
      type: Running
    currentOptimalTrial:
      observation: {}
    pendingTrialList:
    - tf-test-4-qzrlhnpj
    startTime: "2022-06-23T06:11:15Z"
    trials: 1
    trialsPending: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

And yes, I will try installing the versions you mentioned and see.

farisfirenze (Author) commented:

The problem was fixed after installing the specified versions. Thank you.
