diff --git a/doc/source/examples/notebooks.rst b/doc/source/examples/notebooks.rst index 2f4973c648..20117d9ccc 100644 --- a/doc/source/examples/notebooks.rst +++ b/doc/source/examples/notebooks.rst @@ -151,6 +151,7 @@ Complex Graph Examples :titlesonly: Chainer MNIST + Custom pre-processors with the V2 Protocol Ingress ------- diff --git a/doc/source/examples/transformers-v2-protocol.nblink b/doc/source/examples/transformers-v2-protocol.nblink new file mode 100644 index 0000000000..bc687ca63e --- /dev/null +++ b/doc/source/examples/transformers-v2-protocol.nblink @@ -0,0 +1,4 @@ +{ + "path": "../../../examples/transformers/v2-protocol/README.ipynb", + "extra-media": ["../../../examples/transformers/v2-protocol"] +} diff --git a/examples/transformers/v2-protocol/README.ipynb b/examples/transformers/v2-protocol/README.ipynb new file mode 100644 index 0000000000..a213f25193 --- /dev/null +++ b/examples/transformers/v2-protocol/README.ipynb @@ -0,0 +1,343 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cdb42efd-86dd-4d4e-904f-5956aa4727ad", + "metadata": {}, + "source": [ + "# Custom pre-processors with the V2 protocol\n", + "\n", + "Most of the time, the requests that we send to our model need some kind of processing.\n", + "For example, extra information may need to be fetched (e.g. from a feature store), or processed, in order to obtain the actual tensors required by the model. One example of this use case is NLP models, where natural language first needs to be tokenised according to a vocabulary, or embedded by a second model.\n", + "\n", + "In this tutorial, we will focus on this latter scenario.\n", + "In particular, we will explore how to deploy a _tokeniser_ pre-transformer that converts our natural language text to tokens.\n", + "This tokeniser will then be part of an inference graph, so that its output gets routed to a [GPT-2 model deployed using Triton](https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html).\n", + "\n", + "> **NOTE**: The tokeniser logic and the Triton artifacts are taken from the [GPT-2 Model example](https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html). To learn more about these, feel free to check that tutorial.\n", + "\n", + "![Inference graph with tokeniser and GPT-2 model](./gpt2-graph.svg)" + ] + }, + { + "cell_type": "markdown", + "id": "cd60a985-54e3-4973-a640-40ed52d30203", + "metadata": {}, + "source": [ + "## Creating a Tokeniser\n", + "\n", + "In order to create a custom pre-processing step, we will first need to [write a **custom runtime**](https://mlserver.readthedocs.io/en/latest/runtimes/custom.html) using [MLServer](https://mlserver.readthedocs.io/en/latest/).\n", + "MLServer is a production-grade inference server whose main goal is to make it easy to serve models through a REST and gRPC interface compatible with the [V2 Inference Protocol](https://kserve.github.io/website/modelserving/inference_api/).\n", + "\n", + "As well as an inference server, MLServer also exposes a *framework* which can be leveraged to easily **write your custom inference runtimes**.\n", + "These custom runtimes can be used to implement any custom logic, including (you guessed it!) our tokeniser pre-processor.\n", + "Therefore, we will start by extending the base `mlserver.MLModel` class, adding our custom logic.\n", + "Note that this logic is taken (almost) verbatim from the [GPT-2 Model example](https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html).",
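+ "\n", + "As a quick illustration of what this runtime will do under the hood, the snippet below (a minimal sketch, not part of the original GPT-2 example) shows the tokenisation step in isolation, using the same `GPT2Tokenizer` that the runtime in the next cell loads:\n", + "\n", + "```python\n", + "from transformers import GPT2Tokenizer\n", + "\n", + "# Load the same pre-trained GPT-2 vocabulary used by the runtime below\n", + "tokeniser = GPT2Tokenizer.from_pretrained(\"gpt2\")\n", + "\n", + "# Tokenise a sentence into NumPy arrays (one per output, e.g. input_ids and\n", + "# attention_mask); \"hello world\" maps to the token ids [31373, 995]\n", + "tokenised = tokeniser(\"hello world\", return_tensors=\"np\")\n", + "\n", + "for name, payload in tokenised.items():\n", + "    print(name, payload)\n", + "```\n"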
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d0c2ffb-a268-4ad7-af43-a94edfc5c2d6", + "metadata": {}, + "outputs": [], + "source": [ + "# %load tokeniser/runtime.py\n", + "from mlserver import MLModel\n", + "from mlserver.types import InferenceRequest, InferenceResponse\n", + "from mlserver.codecs import NumpyCodec\n", + "from mlserver.codecs.string import StringRequestCodec\n", + "from transformers import GPT2Tokenizer\n", + "\n", + "\n", + "class Tokeniser(MLModel):\n", + " async def load(self) -> bool:\n", + " self._tokeniser = GPT2Tokenizer.from_pretrained(\"gpt2\")\n", + "\n", + " self.ready = True\n", + " return self.ready\n", + "\n", + " async def predict(self, inference_request: InferenceRequest) -> InferenceResponse:\n", + " sentences = StringRequestCodec.decode(inference_request)\n", + " tokenised = self._tokeniser(sentences, return_tensors=\"np\")\n", + "\n", + " outputs = []\n", + " for name, payload in tokenised.items():\n", + " inference_output = NumpyCodec.encode(name=name, payload=payload)\n", + " # Transformer's TF GPT2 model expects `INT32` inputs by default, so\n", + " # let's enforce them\n", + " inference_output.datatype = \"INT32\"\n", + " outputs.append(inference_output)\n", + "\n", + " return InferenceResponse(\n", + " model_name=self.name, model_version=self.version, outputs=outputs\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "id": "f8b4ca8d-5948-4ad1-a401-15457c29a5e8", + "metadata": {}, + "source": [ + "Note that the pre-processing logic is implemented in the `predict()` method.\n", + "At the moment, the MLServer framework doesn't expose the concept of pre- and post-processing.\n", + "However, it's possible to implement this as a _\"pseudo-model\"_, relying on Seldon Core's service orchestrator, which will be responsible for chaining the output of our tokeniser to the next model.\n" + ] + }, + { + "cell_type": "markdown", + "id": "6d46ea7b-455b-4df0-aebb-9ef5296cd2b9", + "metadata": {}, + "source": [ + "### Requirements and default model settings\n", + "\n", + "Besides writing the logic of our custom runtime, we will also need to specify the extra dependencies required by our environment.\n", + "This can be done through a plain `requirements.txt` file.\n", + "Alternatively, for finer control, it'd also be possible to leverage [Conda's environment files](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually) to specify our environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fec97bba-08a4-444f-a2b5-f6b6af461567", + "metadata": {}, + "outputs": [], + "source": [ + "# %load tokeniser/requirements.txt\n", + "mlserver==0.6.0.dev3\n", + "transformers==4.12.3\n" + ] + }, + { + "cell_type": "markdown", + "id": "8422729e-1b9c-4e94-8ce0-12466f467003", + "metadata": {}, + "source": [ + "On top of this, we will also add a `model-settings.json` file with the default settings for our model.\n", + "MLServer uses these files to provide extra configuration (e.g. number of parallel workers, adaptive batching configuration, etc.) for each model.\n", + "In our case, we will use this file to tell MLServer that it should always use our custom runtime by default and name our models `tokeniser` (unless another name is specified)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d919825e-bcbf-4d07-b03b-09d07a516663", + "metadata": {}, + "outputs": [], + "source": [ + "# %load tokeniser/model-settings.json\n", + "{\n", + " \"name\": \"tokeniser\",\n", + " \"implementation\": \"runtime.Tokeniser\"\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "id": "4ffdfa09-61b8-46d6-82d1-097e938f3bea", + "metadata": {}, + "source": [ + "### Testing our tokeniser\n", + "\n", + "> **NOTE**: To test our custom runtime locally, we will need to install the same set of dependencies that will be bundled and deployed remotely.\n", + " To achieve this, we can re-use the environment that was described in the previous section:\n", + " \n", + " ```bash\n", + " pip install -r ./tokeniser/requirements.txt\n", + " ```\n", + "\n", + "\n", + "Since we're leveraging MLServer to write our custom pre-processor, it should be **easy to test it locally**.\n", + "For this, we will start MLServer using the [`mlserver start` subcommand](https://mlserver.readthedocs.io/en/latest/reference/cli.html#mlserver-start).\n", + "Note that this command has to be run in a separate terminal:\n", + "\n", + "```bash\n", + "mlserver start ./tokeniser\n", + "```\n", + "\n", + "We can then send a test request using `curl` as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a0ee123-07e3-45e1-ae3c-c80a96ffed58", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "curl localhost:8080/v2/models/tokeniser/infer \\\n", + " -H 'Content-Type: application/json' \\\n", + " -d '{\"inputs\": [{\"name\": \"sentences\", \"datatype\": \"BYTES\", \"shape\": [1, 11], \"data\": \"hello world\"}]}' \\\n", + " | python -m json.tool " + ] + }, + { + "cell_type": "markdown", + "id": "846d4500-00eb-46eb-a800-6b71486e1457", + "metadata": {}, + "source": [ + "As we can see above, the input `hello world` gets tokenised into `[31373, 995]`, thus confirming that our custom runtime is working as expected locally." + ] + }, + { + "cell_type": "markdown", + "id": "3953dfee-7c21-4586-a1b5-a3e33a2e9231", + "metadata": {}, + "source": [ + "### Building the image\n", + "\n", + "Once our custom code is tested and ready, we can build our custom image using the [`mlserver build` subcommand](https://mlserver.readthedocs.io/en/latest/reference/cli.html#mlserver-build).\n", + "This image will be created under the `gpt2-tokeniser:0.1.0` tag."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d2a8821-b584-4cfe-9f1d-a26d4f92b06b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%bash\n", + "mlserver build ./tokeniser --tag gpt2-tokeniser:0.1.0" + ] + }, + { + "cell_type": "markdown", + "id": "308134b7-a0fb-4714-bc52-0942e9f9686d", + "metadata": {}, + "source": [ + "## Deploying our inference graph\n", + "\n", + "Now that we have our custom tokeniser built and ready, we can deploy it alongside our GPT-2 model.\n", + "This can be achieved through a `SeldonDeployment` manifest which **links both models**: our tokeniser, followed by the actual GPT-2 model.\n", + "\n", + "As outlined above, this manifest will re-use the image and resources built in the [GPT-2 Model example](https://docs.seldon.io/projects/seldon-core/en/latest/examples/triton_gpt2_example.html), which is accessible from GCS.\n", + "\n", + "> **NOTE:** This manifest expects that the `gpt2-tokeniser:0.1.0` image built in the previous section **is accessible** from within the cluster where Seldon Core has been installed. If you are [using `kind`](https://docs.seldon.io/projects/seldon-core/en/latest/install/kind.html), you should be able to load the image into your local cluster with the following command:\n", + "```bash\n", + "kind load docker-image gpt2-tokeniser:0.1.0\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bcb676bc-21ef-438c-b398-d18e51ad6188", + "metadata": {}, + "outputs": [], + "source": [ + "# %load seldondeployment.yaml\n", + "apiVersion: machinelearning.seldon.io/v1\n", + "kind: SeldonDeployment\n", + "metadata:\n", + " name: gpt2\n", + "spec:\n", + " protocol: kfserving\n", + " predictors:\n", + " - name: default\n", + " graph:\n", + " name: tokeniser\n", + " children:\n", + " - name: gpt2\n", + " implementation: TRITON_SERVER\n", + " modelUri: gs://seldon-models/triton/onnx_gpt2\n", + " componentSpecs:\n", + " - spec:\n", + " containers:\n", + " - name: tokeniser\n", + " image: gpt2-tokeniser:0.1.0\n", + " securityContext:\n", + " # Workaround for multi-user bug in custom images in MLServer:\n", + " # https://github.com/SeldonIO/MLServer/issues/388\n", + " runAsUser: 1000\n", + " ports:\n", + " - containerPort: 8080\n", + " name: http\n", + " protocol: TCP\n", + " - containerPort: 8081\n", + " name: grpc\n", + " protocol: TCP\n" + ] + }, + { + "cell_type": "markdown", + "id": "a103cd18-6fb0-4db5-aaa0-c887b860f613", + "metadata": {}, + "source": [ + "The final step will be to apply this manifest to the cluster where Seldon Core is running.\n", + "For example, to deploy the manifest into the `models` namespace, we could run the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7fabbd5f-8729-46ce-9eae-588114ab279c", + "metadata": {}, + "outputs": [], + "source": [ + "!kubectl create namespace models --dry-run=client -o yaml | kubectl apply -f - \n", + "!kubectl apply -f seldondeployment.yaml -n models" + ] + }, + { + "cell_type": "markdown", + "id": "5ca23aec-f3e3-4174-8ce6-2df1d7d6cea7", + "metadata": {}, + "source": [ + "### Testing our deployed inference graph\n", + "\n", + "Finally, we can test that our deployed inference graph is working as expected by sending a request.\n", + "If we assume that our cluster can be reached at `localhost:8003`, we can send a request using `curl` as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "faff9193-b7c8-4989-95f3-817ebc276a1c", + "metadata": {
+ "tags": [] + }, + "outputs": [], + "source": [ + "%%bash\n", + "curl localhost:8003/seldon/models/gpt2/v2/models/infer \\\n", + " -H 'Content-Type: application/json' \\\n", + " -d '{\"inputs\": [{\"name\": \"sentences\", \"datatype\": \"BYTES\", \"shape\": [1, 11], \"data\": \"hello world\"}]}' \\\n", + " | python -m json.tool " + ] + }, + { + "cell_type": "markdown", + "id": "7322029d-70f0-4c20-9f81-62c84b2b846a", + "metadata": {}, + "source": [ + "As we can see above, our plain-text request is now going successfully through the `tokeniser`, acting as a pre-processor, whose output then gets routed to the actual GPT-2 model. " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/transformers/v2-protocol/gpt2-graph.svg b/examples/transformers/v2-protocol/gpt2-graph.svg new file mode 100644 index 0000000000..c22f2215d7 --- /dev/null +++ b/examples/transformers/v2-protocol/gpt2-graph.svg @@ -0,0 +1,3 @@ + + +
[gpt2-graph.svg: diagram of the inference graph — a SeldonDeployment containing tokeniser → GPT-2, taking "hello world" as input and returning the GPT-2 logits + prediction]
\ No newline at end of file diff --git a/examples/transformers/v2-protocol/seldondeployment.yaml b/examples/transformers/v2-protocol/seldondeployment.yaml new file mode 100644 index 0000000000..6aa792caf1 --- /dev/null +++ b/examples/transformers/v2-protocol/seldondeployment.yaml @@ -0,0 +1,30 @@ +apiVersion: machinelearning.seldon.io/v1 +kind: SeldonDeployment +metadata: + name: gpt2 +spec: + protocol: kfserving + predictors: + - name: default + graph: + name: tokeniser + children: + - name: gpt2 + implementation: TRITON_SERVER + modelUri: gs://seldon-models/triton/onnx_gpt2 + componentSpecs: + - spec: + containers: + - name: tokeniser + image: gpt2-tokeniser:0.1.0 + securityContext: + # Workaround for multi-user bug in custom images in MLServer: + # https://github.com/SeldonIO/MLServer/issues/388 + runAsUser: 1000 + ports: + - containerPort: 8080 + name: http + protocol: TCP + - containerPort: 8081 + name: grpc + protocol: TCP diff --git a/examples/transformers/v2-protocol/tokeniser/model-settings.json b/examples/transformers/v2-protocol/tokeniser/model-settings.json new file mode 100644 index 0000000000..2dbcbd2b80 --- /dev/null +++ b/examples/transformers/v2-protocol/tokeniser/model-settings.json @@ -0,0 +1,4 @@ +{ + "name": "tokeniser", + "implementation": "runtime.Tokeniser" +} diff --git a/examples/transformers/v2-protocol/tokeniser/requirements.txt b/examples/transformers/v2-protocol/tokeniser/requirements.txt new file mode 100644 index 0000000000..702b4ed31e --- /dev/null +++ b/examples/transformers/v2-protocol/tokeniser/requirements.txt @@ -0,0 +1,2 @@ +mlserver==0.6.0.dev3 +transformers==4.12.3 diff --git a/examples/transformers/v2-protocol/tokeniser/runtime.py b/examples/transformers/v2-protocol/tokeniser/runtime.py new file mode 100644 index 0000000000..3ca6078675 --- /dev/null +++ b/examples/transformers/v2-protocol/tokeniser/runtime.py @@ -0,0 +1,29 @@ +from mlserver import MLModel +from mlserver.types import InferenceRequest, InferenceResponse +from mlserver.codecs import NumpyCodec +from mlserver.codecs.string import StringRequestCodec +from transformers import GPT2Tokenizer + + +class Tokeniser(MLModel): + async def load(self) -> bool: + self._tokeniser = GPT2Tokenizer.from_pretrained("gpt2") + + self.ready = True + return self.ready + + async def predict(self, inference_request: InferenceRequest) -> InferenceResponse: + sentences = StringRequestCodec.decode(inference_request) + tokenised = self._tokeniser(sentences, return_tensors="np") + + outputs = [] + for name, payload in tokenised.items(): + inference_output = NumpyCodec.encode(name=name, payload=payload) + # Transformer's TF GPT2 model expects `INT32` inputs by default, so + # let's enforce them + inference_output.datatype = "INT32" + outputs.append(inference_output) + + return InferenceResponse( + model_name=self.name, model_version=self.version, outputs=outputs + )
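For convenience, the final `curl` request from the notebook can also be reproduced from Python. The snippet below is a minimal sketch (not part of the original example files); it assumes the `gpt2` SeldonDeployment above has been applied to the `models` namespace, that the cluster's ingress is reachable at `localhost:8003` as in the notebook, and that the third-party `requests` library is installed:

```python
import requests

# Same endpoint and payload as the notebook's final curl request
url = "http://localhost:8003/seldon/models/gpt2/v2/models/infer"
payload = {
    "inputs": [
        {
            "name": "sentences",
            "datatype": "BYTES",
            "shape": [1, 11],
            "data": "hello world",
        }
    ]
}

response = requests.post(url, json=payload)
response.raise_for_status()

# The response follows the V2 inference protocol: a list of named output
# tensors returned by the GPT-2 model (e.g. its logits)
for output in response.json()["outputs"]:
    print(output["name"], output["shape"])
```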