MLFlow Model on MinIO Not Loading #2213

Closed
srajabi opened this issue Jul 29, 2020 · 9 comments · Fixed by #2412
srajabi commented Jul 29, 2020

Setup:

  • SeldonCore on Kubernetes
  • MLFlow (1.10.0) running with MinIO Storage
  • Jupyter notebooks on Kubernetes

A Jupyter notebook generates a simple sklearn model, which is sent to MLflow and stored in MinIO. I'm now trying to get Seldon to create a deployment from it:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: minio-mlflow
spec:
  annotations:
    seldon.io/executor: "true"
  name: newsgroup_nb
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 180
            periodSeconds: 5
            successThreshold: 1
            tcpSocket:
              port: http
            timeoutSeconds: 1
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/artifacts/2/adaaee4b5c694f02b5ff9745c53ae75e/artifacts/nb
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

The container starts up and gets all the way to:

[2020-07-29 20:00:51 +0000] [6] [INFO] Using worker: threads
[2020-07-29 20:00:51 +0000] [931] [INFO] Booting worker with pid: 931
2020-07-29 20:00:51,947 - root:load:27 - INFO:  Downloading model from /mnt/models
2020-07-29 20:00:51,947 - root:download:47 - INFO:  Copying contents of /mnt/models to local
[2020-07-29 20:00:51 +0000] [931] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 92, in init_process
    super().init_process()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/workers/base.py", line 119, in init_process
    self.load_wsgi()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/seldon_core/app.py", line 71, in load
    self.user_object.load()
  File "/microservice/MLFlowServer.py", line 29, in load
    self._model = pyfunc.load_model(model_folder)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 297, in load_model
    return importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 230, in _load_pyfunc
    return _load_model_from_local_file(path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 217, in _load_model_from_local_file
    with open(path, "rb") as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
[2020-07-29 20:00:51 +0000] [931] [INFO] Worker exiting (pid: 931)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/util.py", line 322, in _exit_function
    p.join()
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/process.py", line 138, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
[2020-07-29 20:00:51 +0000] [924] [INFO] Handling signal: term
[2020-07-29 20:00:51 +0000] [930] [INFO] Worker exiting (pid: 930)
[2020-07-29 20:00:52 +0000] [924] [INFO] Shutting down: Master
[2020-07-29 20:00:52 +0000] [6] [INFO] Shutting down: Master
[2020-07-29 20:00:52 +0000] [6] [INFO] Reason: Worker failed to boot.

Looking at what /mnt/models contains:

bash-4.4$ ls
MLFlowServer.py  before-run  conda_env_create.py  image_metadata.json  license.txt  python  requirements.txt
bash-4.4$ ls /mnt/models/
MLmodel  conda.yaml  model.pkl

I can load this successfully via sk_model = mlflow.sklearn.load_model("s3://mlflow/artifacts/2/adaaee4b5c694f02b5ff9745c53ae75e/artifacts/nb")

Just not from Seldon. Any ideas? Am I missing something in setting this up?

@ukclivecox
Contributor

Strange. I've seen issues like this reported for empty directories: mlflow/mlflow#1881

@adriangonz
Contributor

Hey @srajabi, could you share the content of your MLmodel file?

@mafs12

mafs12 commented Aug 26, 2020

I ran into the same issue as @srajabi

My deployment:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: wine-model
  namespace: seldon
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting high failureThreshold as installing conda dependencies
        # can take long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          livenessProbe:
            initialDelaySeconds: 150
            failureThreshold: 300
            periodSeconds: 10
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 150
            failureThreshold: 300
            periodSeconds: 10
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlruns/1/4989e93b9b5b4bb9adf300e251bb4b7b/artifacts/model
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

Contents of /mnt/models directory:

sh-4.4$ ls /mnt/models/
MLmodel  conda.yaml  model.pkl 

Contents of MLmodel file:

artifact_path: model
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.6.9
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.23.2
run_id: 4989e93b9b5b4bb9adf300e251bb4b7b
utc_time_created: '2020-08-25 13:33:15.219554'

@adriangonz
Contributor

Hey @srajabi @mafs12, after looking a bit deeper at the MLflow side, it seems that their v1.10.0 release changed a few things in the MLmodel format. I couldn't find much documentation on it, but there is this comment describing the change in a bit more detail:

    if os.path.isfile(path):
        # Scikit-learn models saved in older versions of MLflow (<= 1.9.1) specify the ``data``
        # field within the pyfunc flavor configuration. For these older models, the ``path``
        # parameter of ``_load_pyfunc()`` refers directly to a serialized scikit-learn model
        # object. In this case, we assume that the serialization format is ``pickle``, since
        # the model loading procedure in older versions of MLflow used ``pickle.load()``.
        serialization_format = SERIALIZATION_FORMAT_PICKLE
    else:
        # In contrast, scikit-learn models saved in versions of MLflow > 1.9.1 do not
        # specify the ``data`` field within the pyfunc flavor configuration. For these newer
        # models, the ``path`` parameter of ``load_pyfunc()`` refers to the top-level MLflow
        # Model directory. In this case, we parse the model path from the MLmodel's pyfunc
        # flavor configuration and attempt to fetch the serialization format from the
        # scikit-learn flavor configuration

Based on that, this should be fixed by updating MLflow to the latest version in Seldon Core's pre-packaged MLFLOW_SERVER.

In the meantime, you can either:

  • Train / save your models using mlflow<=1.9.1.
  • Build your own version of the MLFLOW_SERVER wrapper, explicitly using mlflow>=1.10.0.
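To make the quoted MLflow change concrete, here is a rough sketch of the path-resolution logic that MLflow > 1.9.1 applies when loading a sklearn model. This is illustrative only: `resolve_sklearn_model_path` is a hypothetical name, not an MLflow API, and the real implementation also handles serialization formats.

```python
import os
import yaml  # PyYAML, already a dependency of MLflow


def resolve_sklearn_model_path(path):
    """Mirror the MLflow > 1.9.1 behaviour quoted above.

    If ``path`` is already a file, it is the serialized model itself
    (the <= 1.9.1 layout). Otherwise ``path`` is the top-level model
    directory, so read the MLmodel file and join the sklearn flavor's
    ``pickled_model`` entry onto it.
    """
    if os.path.isfile(path):
        return path
    with open(os.path.join(path, "MLmodel")) as f:
        conf = yaml.safe_load(f)
    return os.path.join(path, conf["flavors"]["sklearn"]["pickled_model"])
```

This is why the older MLflow inside the MLFLOW_SERVER image fails: it passes `/mnt/models` (a directory) straight to `open()`, producing the `IsADirectoryError` seen above.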

@adriangonz
Contributor

/priority p1

@adriangonz adriangonz self-assigned this Aug 28, 2020
@adriangonz adriangonz added this to the 1.3 milestone Aug 28, 2020
@ukclivecox
Contributor

ukclivecox commented Sep 14, 2020

I see this when trying to run with the mlflow model from gs://seldon-models/mlflow/diabetes, which has an MLmodel of

artifact_path: random-forest-model
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.7.7
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.23.2
run_id: c8e400e1c4494a7fb718befdc64b825e
signature:
  inputs: '[{"type": "double"}, {"type": "double"}]'
  outputs: '[{"type": "double"}]'
utc_time_created: '2020-09-08 10:15:37.203662'

and conda.yaml

channels:
- defaults
- conda-forge
dependencies:
- python=3.7.7
- scikit-learn=0.23.2
- pip
- pip:
  - mlflow
  - cloudpickle==1.6.0
name: mlflow-env

and the error:

    self._model = pyfunc.load_model(model_folder)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 297, in load_model
    return importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 230, in _load_pyfunc
    return _load_model_from_local_file(path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 217, in _load_model_from_local_file
    with open(path, "rb") as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
[2020-09-14 07:12:02 +0000] [956] [INFO] Worker exiting (pid: 956)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/util.py", line 334, in _exit_function
    p.join()
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/process.py", line 138, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

@ukclivecox
Contributor

The above error seems to happen with the mlflow 1.8.0 version we have in our MLflow server image, but not with 1.11.0.

@Subhraj07

I am using MLflow version 1.17 and getting the following error when trying to use MLFLOW_SERVER in Seldon:
[Screenshot from 2021-06-14 09-20-32]

@ukclivecox
Contributor

Have you tried ensuring you have the correct rclone settings?
See https://docs.seldon.io/projects/seldon-core/en/latest/servers/overview.html

If there is still an issue, please open a new one.
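For MinIO specifically, the rclone-style secret the docs describe looks roughly like this. This is a sketch based on the linked documentation; the secret name, endpoint, and credentials below are placeholders you would replace with your own:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: seldon-rclone-secret   # referenced from envSecretRefName
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: minioadmin
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: minioadmin
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.minio-system.svc.cluster.local:9000
```

The `RCLONE_CONFIG_S3_*` variables configure an rclone remote named `s3`, which the storage initializer then uses to resolve `s3://...` model URIs.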
