MLFlow Model on MinIO Not Loading #2213

Closed
srajabi opened this issue Jul 29, 2020 · 9 comments · Fixed by #2412
srajabi commented Jul 29, 2020

Setup:

  • SeldonCore on Kubernetes
  • MLFlow (1.10.0) running with MinIO Storage
  • Jupyter notebooks on Kubernetes

A Jupyter notebook generates a simple sklearn model, which is sent to MLflow and stored in MinIO. I'm now trying to get Seldon to create a deployment from it:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: minio-mlflow
spec:
  annotations:
    seldon.io/executor: "true"
  name: newsgroup_nb
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          livenessProbe:
            failureThreshold: 3
            initialDelaySeconds: 180
            periodSeconds: 5
            successThreshold: 1
            tcpSocket:
              port: http
            timeoutSeconds: 1
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/artifacts/2/adaaee4b5c694f02b5ff9745c53ae75e/artifacts/nb
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

The container starts up and gets all the way to:

[2020-07-29 20:00:51 +0000] [6] [INFO] Using worker: threads
[2020-07-29 20:00:51 +0000] [931] [INFO] Booting worker with pid: 931
2020-07-29 20:00:51,947 - root:load:27 - INFO:  Downloading model from /mnt/models
2020-07-29 20:00:51,947 - root:download:47 - INFO:  Copying contents of /mnt/models to local
[2020-07-29 20:00:51 +0000] [931] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 92, in init_process
    super().init_process()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/workers/base.py", line 119, in init_process
    self.load_wsgi()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/seldon_core/app.py", line 71, in load
    self.user_object.load()
  File "/microservice/MLFlowServer.py", line 29, in load
    self._model = pyfunc.load_model(model_folder)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 297, in load_model
    return importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 230, in _load_pyfunc
    return _load_model_from_local_file(path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 217, in _load_model_from_local_file
    with open(path, "rb") as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
[2020-07-29 20:00:51 +0000] [931] [INFO] Worker exiting (pid: 931)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/util.py", line 322, in _exit_function
    p.join()
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/process.py", line 138, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
[2020-07-29 20:00:51 +0000] [924] [INFO] Handling signal: term
[2020-07-29 20:00:51 +0000] [930] [INFO] Worker exiting (pid: 930)
[2020-07-29 20:00:52 +0000] [924] [INFO] Shutting down: Master
[2020-07-29 20:00:52 +0000] [6] [INFO] Shutting down: Master
[2020-07-29 20:00:52 +0000] [6] [INFO] Reason: Worker failed to boot.

Looking at what /mnt/models contains:

bash-4.4$ ls
MLFlowServer.py  before-run  conda_env_create.py  image_metadata.json  license.txt  python  requirements.txt
bash-4.4$ ls /mnt/models/
MLmodel  conda.yaml  model.pkl

I can load this successfully via sk_model = mlflow.sklearn.load_model("s3://mlflow/artifacts/2/adaaee4b5c694f02b5ff9745c53ae75e/artifacts/nb")

Just not from Seldon. Any ideas? Am I missing something in setting this up?

@ukclivecox
Contributor

Strange. I've seen issues like this reported for empty directories: mlflow/mlflow#1881

@adriangonz
Contributor

Hey @srajabi, could you share the content of your MLmodel file?

@mafs12

mafs12 commented Aug 26, 2020

I ran into the same issue as @srajabi

My deployment:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: wine-model
  namespace: seldon
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting high failureThreshold as installing conda dependencies
        # can take long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          livenessProbe:
            initialDelaySeconds: 150
            failureThreshold: 300
            periodSeconds: 10
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 150
            failureThreshold: 300
            periodSeconds: 10
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlruns/1/4989e93b9b5b4bb9adf300e251bb4b7b/artifacts/model
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

Contents of /mnt/models directory:

sh-4.4$ ls /mnt/models/
MLmodel  conda.yaml  model.pkl 

Contents of MLmodel file:

artifact_path: model
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.6.9
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.23.2
run_id: 4989e93b9b5b4bb9adf300e251bb4b7b
utc_time_created: '2020-08-25 13:33:15.219554'

@adriangonz
Contributor

Hey @srajabi @mafs12, after looking a bit deeper at the MLflow side, it seems that their v1.10.0 release changed a few things in the MLmodel format. I couldn't find much documentation on it, but there is this comment describing the change in a bit more detail:

    if os.path.isfile(path):
        # Scikit-learn models saved in older versions of MLflow (<= 1.9.1) specify the ``data``
        # field within the pyfunc flavor configuration. For these older models, the ``path``
        # parameter of ``_load_pyfunc()`` refers directly to a serialized scikit-learn model
        # object. In this case, we assume that the serialization format is ``pickle``, since
        # the model loading procedure in older versions of MLflow used ``pickle.load()``.
        serialization_format = SERIALIZATION_FORMAT_PICKLE
    else:
        # In contrast, scikit-learn models saved in versions of MLflow > 1.9.1 do not
        # specify the ``data`` field within the pyfunc flavor configuration. For these newer
        # models, the ``path`` parameter of ``load_pyfunc()`` refers to the top-level MLflow
        # Model directory. In this case, we parse the model path from the MLmodel's pyfunc
        # flavor configuration and attempt to fetch the serialization format from the
        # scikit-learn flavor configuration

Based on that, this should be fixed by updating MLflow to the latest version in Seldon Core's pre-packaged MLFLOW_SERVER.

In the meantime, you can either:

  • Train / save your models using mlflow<=1.9.1.
  • Build your own version of the MLFLOW_SERVER wrapper, explicitly using mlflow>=1.10.0.
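To make the quoted MLflow change concrete, here is a rough sketch of the path-resolution logic that MLflow > 1.9.1 applies when loading a sklearn model. This is illustrative only: `resolve_sklearn_model_path` is a hypothetical name, not an MLflow API, and the real implementation also handles serialization formats.

```python
import os
import yaml  # PyYAML, already a dependency of MLflow


def resolve_sklearn_model_path(path):
    """Mirror the MLflow > 1.9.1 behaviour quoted above.

    If ``path`` is already a file, it is the serialized model itself
    (the <= 1.9.1 layout). Otherwise ``path`` is the top-level model
    directory, so read the MLmodel file and join the sklearn flavor's
    ``pickled_model`` entry onto it.
    """
    if os.path.isfile(path):
        return path
    with open(os.path.join(path, "MLmodel")) as f:
        conf = yaml.safe_load(f)
    return os.path.join(path, conf["flavors"]["sklearn"]["pickled_model"])
```

This is why the older MLflow inside the MLFLOW_SERVER image fails: it passes `/mnt/models` (a directory) straight to `open()`, producing the `IsADirectoryError` seen above.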

@adriangonz
Contributor

/priority p1

@adriangonz adriangonz self-assigned this Aug 28, 2020
@adriangonz adriangonz added this to the 1.3 milestone Aug 28, 2020
@ukclivecox
Contributor

ukclivecox commented Sep 14, 2020

I see this when trying to run with the mlflow model from gs://seldon-models/mlflow/diabetes, which has an MLmodel of

artifact_path: random-forest-model
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.7.7
  sklearn:
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.23.2
run_id: c8e400e1c4494a7fb718befdc64b825e
signature:
  inputs: '[{"type": "double"}, {"type": "double"}]'
  outputs: '[{"type": "double"}]'
utc_time_created: '2020-09-08 10:15:37.203662'

and conda.yaml

channels:
- defaults
- conda-forge
dependencies:
- python=3.7.7
- scikit-learn=0.23.2
- pip
- pip:
  - mlflow
  - cloudpickle==1.6.0
name: mlflow-env

and the error:

    self._model = pyfunc.load_model(model_folder)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 297, in load_model
    return importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 230, in _load_pyfunc
    return _load_model_from_local_file(path)
  File "/opt/conda/envs/mlflow/lib/python3.7/site-packages/mlflow/sklearn.py", line 217, in _load_model_from_local_file
    with open(path, "rb") as f:
IsADirectoryError: [Errno 21] Is a directory: '/mnt/models'
[2020-09-14 07:12:02 +0000] [956] [INFO] Worker exiting (pid: 956)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/util.py", line 334, in _exit_function
    p.join()
  File "/opt/conda/envs/mlflow/lib/python3.7/multiprocessing/process.py", line 138, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

@ukclivecox
Contributor

The above error seems to happen with the mlflow 1.8.0 version we have in our MLflow server image, but not with 1.11.0.

@Subhraj07

I am using MLflow version 1.17 and getting the following error when trying to use MLFLOW_SERVER in Seldon:
[Screenshot from 2021-06-14 09-20-32]

@ukclivecox
Contributor

Have you tried ensuring you have the correct rclone settings?
See https://docs.seldon.io/projects/seldon-core/en/latest/servers/overview.html

If there is still an issue, please open a new one.
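For MinIO specifically, the rclone-style secret the docs describe looks roughly like this. This is a sketch based on the linked documentation; the secret name, endpoint, and credentials below are placeholders you would replace with your own:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: seldon-rclone-secret   # referenced from envSecretRefName
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: minioadmin
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: minioadmin
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.minio-system.svc.cluster.local:9000
```

The `RCLONE_CONFIG_S3_*` variables configure an rclone remote named `s3`, which the storage initializer then uses to resolve `s3://...` model URIs.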
