Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of ML Registry Functionality #110

Merged
merged 8 commits into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -104,8 +104,9 @@ This library contains functions that cover the following areas:
- Utility functions relating to areas including statistical analysis, data preprocessing and array manipulation.
- A multi-processing framework to parallelize work across many cores or nodes.
- Functions for seamless integration with PyKX or EmbedPy, which ensure seamless interoperability between Python and kdb+/q in either environment.
- A location for the storage and versioning of ML models on-prem along with a common model retrieval API allowing models regardless of underlying requirements to be retrieved and used on kdb+ data. This allows for enhanced team collaboration opportunities and management oversight by centralising team work to a common storage location.

These sections are explained in greater depth within the [FRESH](ml/docs/fresh.md), [cross validation](ml/docs/xval.md), [clustering](ml/docs/clustering/algos.md), [timeseries](ml/docs/timeseries/README.md), [optimization](ml/docs/optimize.md), [graph/pipeline](ml/docs/graph/README.md) and [utilities](ml/docs/utilities/metric.md) documentation.
These sections are explained in greater depth within the [FRESH](ml/docs/fresh.md), [cross validation](ml/docs/xval.md), [clustering](ml/docs/clustering/algos.md), [timeseries](ml/docs/timeseries/README.md), [optimization](ml/docs/optimize.md), [graph/pipeline](ml/docs/graph/README.md), [utilities](ml/docs/utilities/metric.md) and [registry](ml/docs/registry/README.md) documentation.


### nlp
Expand Down Expand Up @@ -171,3 +172,4 @@ The Machine Learning Toolkit is provided here under an Apache 2.0 license.
If you find issues with the interface or have feature requests, please [raise an issue](https://github.com/KxSystems/ml/issues).

To contribute to this project, please follow the [contributing guide](CONTRIBUTING.md).

1 change: 0 additions & 1 deletion automl/automl.q
Original file line number Diff line number Diff line change
Expand Up @@ -45,4 +45,3 @@ if[all`config`run in commandLineArguments;
testRun:`test in commandLineArguments;
runCommandLine[testRun];
exit 0]

5 changes: 5 additions & 0 deletions docker/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,11 @@

FROM registry.gitlab.com/kxdev/kxinsights/data-science/ml-tools/automl:embedpy-gcc-deb12

# Java and jq packages required for registry tests
RUN apt-get update && apt-get install -y openjdk-17-jdk && rm -rf /var/lib/apt/lists/*

ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/

COPY requirements_pinned.txt /opt/kx/automl/

USER kx
Expand Down
112 changes: 112 additions & 0 deletions ml/docs/registry/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# ML Registry

The ML Model Registry defines a centralised location for the storage of the following versioned entities:

1. Machine Learning Models
2. Model parameters
3. Performance metrics
4. Model configuration
5. Model monitoring information

The ML Registry is intended to allow models and all important metadata information associated with them to be stored locally.

In the context of an MLOps offering the model registry is a collaborative location allowing teams to work together on different stages of a machine learning workflow from model experimentation to publishing a model to production. It is designed to aid in this in the following ways:

1. Provide a solution with which users can store models generated in q/Python to a centralised location on-prem.
2. A common model retrieval API allowing models regardless of underlying requirements to be retrieved and used on kdb+ data.
3. The ability to store information related to model training/monitoring requirements, allowing sysadmins to control the promotion of models to production environments.
4. Enhanced team collaboration opportunities and management oversight by centralising team work to a common storage location.

## Contents

- [Quick start](#quick-start)
- [Documentation](#documentation)
- [Testing](#testing)
- [Status](#status)


## Quick start

Start by following the installation step found [here](../../../README.md) or alternatively start a q session using the code below from the `ml` folder

```
$ q init.q
q)
```

Generate a model registry in the current directory and display the contents

```
q).ml.registry.new.registry[::;::];
q)\ls
"CODEOWNERS"
"CONTRIBUTING.md"
"KX_ML_REGISTRY"
...
q)\ls KX_ML_REGISTRY
"modelStore"
"namedExperiments"
"unnamedExperiments"
```

Add an experiment folder to the registry

```
q).ml.registry.new.experiment[::;"test";::];
q)\ls KX_ML_REGISTRY/namedExperiments/
"test"
```

Add a basic q model associated with the experiment

```
q).ml.registry.set.model[::;{x+1};"mymodel";"q";enlist[`experimentName]!enlist "test"]
```

Check that the model has been added to the modelStore

```
q)modelStore
registrationTime experimentName modelName uniqueID ..
-----------------------------------------------------------------------------..
2021.08.02D10:27:04.863096000 "test" "mymodel" 66f12a71-175b-cd56-7d0..
```

Retrieve the model and model information based on the model name and version

```
q).ml.registry.get.model[::;::;"mymodel";1 0]
modelInfo| `major`description`experimentName`folderPath`registryPath`modelSto..
model | {x+1}
```

## Documentation

### Static Documentation

Further information on the breakdown of the API for interacting with the ML-Registry and extended examples can be found in [Registry API](api/setting.md) and [Registry Examples](examples/basic.md).

This provides users with:

1. A breakdown of the API for interacting with the ML-Registry
2. Examples of interacting with a registry

# Testing

Unit tests are provided for testing the operation of this interface both as a local service. In order to facilitate this users must have embedPy or pykx installed alongside the following additional Python requirements, it is also advisable to have the python requirements_pinned.txt installed before running the below.

```
$ pip install pyspark xgboost
```

The local tests are run using a bespoke q script. The local tests can be run standalone using the instructions outlined below.

## Local testing

The below tests are ran from the `ml` directory and test results will output to console

```bash
$ q ../test.q registry/tests/registry.t
```

This should present a summary of results of the unit tests.
Loading