From 3edc532d1389b53c6c973e19276cc5b10b28fb5a Mon Sep 17 00:00:00 2001
From: Chang She <759245+changhiskhan@users.noreply.github.com>
Date: Tue, 18 Oct 2022 14:39:23 -0700
Subject: [PATCH] Add model inference notebook (#244)
---
.gitignore | 4 +
python/notebooks/peeking_duck.ipynb | 619 ++++++++++++++++++++++++++++
2 files changed, 623 insertions(+)
create mode 100644 python/notebooks/peeking_duck.ipynb
diff --git a/.gitignore b/.gitignore
index 8ac5003e1a..8fd8d0c844 100644
--- a/.gitignore
+++ b/.gitignore
@@ -62,3 +62,7 @@ docs/api/python
**/.ipynb_checkpoints/
docs/notebooks
+
+
+integration/duckdb/manylinux-build
+python/notebooks/lance.duckdb_extension
\ No newline at end of file
diff --git a/python/notebooks/peeking_duck.ipynb b/python/notebooks/peeking_duck.ipynb
new file mode 100644
index 0000000000..31e598cd55
--- /dev/null
+++ b/python/notebooks/peeking_duck.ipynb
@@ -0,0 +1,619 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "1713d835",
+ "metadata": {},
+ "source": [
+ "# Peeking Duck: duckdb + lance for computer vision\n",
+ "`SELECT predict('resnet', image) FROM tbl`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c17856d3",
+ "metadata": {},
+ "source": [
+ "DuckDB gives us the opportunity to simplify a huge part\n",
+ "of the ML workflow for computer vision. This notebook shows\n",
+ "how to use the Lance DuckDB extension to run a PyTorch\n",
+ "model on images in SQL:\n",
+ "\n",
+ "```sql\n",
+ "SELECT filename, class, predict('resnet', image) as pred\n",
+ "FROM oxford_pet\n",
+ "WHERE split='train' AND class='samoyed'\n",
+ "USING SAMPLE 1000;\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7e0088ac",
+ "metadata": {},
+ "source": [
+ "This is made possible using DuckDB in conjunction with Lance,\n",
+ "a new columnar data format for computer vision (CV).\n",
+ "Lance is like Parquet but built with CV in mind,\n",
+ "with fast point-access, partial reads and optimisations for nested annotation columns. \n",
+ "\n",
+ "tl;dr - Lance is a more performant, CV-specific take on Parquet.\n",
+ "\n",
+ "Lance can be accessed by tools like Pandas and DuckDB via Apache Arrow. Let's see it in action."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6adde508",
+ "metadata": {},
+ "source": [
+ "For reference:\n",
+ "1. The Lance file format lives [here](https://github.com/eto-ai/lance)\n",
+ "2. The Lance DuckDB extension is under the [/integration/duckdb](https://github.com/eto-ai/lance/tree/main/integration/duckdb) subdirectory in Lance"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5315186",
+ "metadata": {},
+ "source": [
+ "## Model setup\n",
+ "It only takes a few lines of code to prepare a model for inference"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "750d8b66",
+ "metadata": {},
+ "source": [
+ "### Creating the model\n",
+ "\n",
+ "Convert a pre-trained ResNet to TorchScript and save it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "065e6cd3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
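+ "import torch\n",
+ "from torchvision.models import resnet50, ResNet50_Weights\n",
+ "\n",
+ "# Load pre-trained weights and switch to inference mode (BatchNorm uses its stored statistics)\n",
+ "resnet = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()\n",
+ "# Compile to TorchScript and save the serialized model to disk\n",
+ "m = torch.jit.script(resnet)\n",
+ "torch.jit.save(m, '/tmp/model.pth')"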
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bf16a2c4",
+ "metadata": {},
+ "source": [
+ "### Set up the DuckDB extension\n",
+ "\n",
+ "For now, the extension should be [built from source](https://github.com/eto-ai/lance/tree/main/integration/duckdb#development).\n",
+ "\n",
+ "Once that's done, we can install and load the extension. In this example we copied the artifact `lance.duckdb_extension` into the same directory as this notebook. Alternatively, you can pass a relative path to the artifact in the `install_extension` call below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "f4a554ea",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "import duckdb\n",
+ "con = duckdb.connect(config={\"allow_unsigned_extensions\": True})\n",
+ "con.install_extension(\"lance.duckdb_extension\", force_install=True)\n",
+ "con.load_extension(\"lance\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6aa616cc",
+ "metadata": {},
+ "source": [
+ "### Load the TorchScript model\n",
+ "For each model that we saved as TorchScript,\n",
+ "use the `create_pytorch_model` function to register the model\n",
+ "from where we saved it. We can list all registered models\n",
+ "using the `ml_models` table function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "6a293bf7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
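+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>name</th>\n",
+ "      <th>uri</th>\n",
+ "      <th>type</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>resnet</td>\n",
+ "      <td>/tmp/model.pth</td>\n",
+ "      <td>torchscript</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"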
+ ],
+ "text/plain": [
+ " name uri type\n",
+ "0 resnet /tmp/model.pth torchscript"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "con.query(\"CALL create_pytorch_model('resnet', '/tmp/model.pth');\")\n",
+ "con.query(\"SELECT * FROM ml_models();\").to_df()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "055dbcc0",
+ "metadata": {},
+ "source": [
+ "## Load the dataset\n",
+ "\n",
+ "ResNet was trained on ImageNet, but it'd be fun to run\n",
+ "it on a different dataset. Here we use the [Oxford Pet\n",
+ "dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/).\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "645b8d32",
+ "metadata": {},
+ "source": [
+ "The raw dataset is organized into two directories:\n",
+ "```\n",
+ "/images\n",
+ "/annotations\n",
+ "```\n",
+ "where `/annotations` contains:\n",
+ "1. data indices: `list.txt`, `test.txt`, `trainval.txt`\n",
+ "2. `/xmls`: annotations in Pascal VOC format\n",
+ "3. `/trimap`: trimap segmentation masks (PNG files)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ab391f06",
+ "metadata": {},
+ "source": [
+ "To make it queryable, we've converted it into a Lance dataset hosted in a public S3 bucket."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "a19d5e1a",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "['_pk', 'filename', 'class', 'species', 'breed', 'split', 'folder', 'source', 'size', 'segmented', 'object', 'external_image', 'image']\n"
+ ]
+ }
+ ],
+ "source": [
+ "from lance import LanceFileFormat\n",
+ "import pyarrow.dataset as ds\n",
+ "\n",
+ "# Open the remote Lance dataset through the Arrow dataset API\n",
+ "uri = \"s3://eto-public/datasets/oxford_pet/oxford_pet.lance\"\n",
+ "oxford_pet = ds.dataset(uri, format=LanceFileFormat())\n",
+ "print(oxford_pet.schema.names)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f12a3249",
+ "metadata": {},
+ "source": [
+ "If you're interested in more details on `LanceFileFormat`,\n",
+ "head over to the [Lance GitHub page](https://github.com/eto-ai/lance) or stay tuned for another post!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "44216237",
+ "metadata": {},
+ "source": [
+ "## Query the data\n",
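+ "You can use DuckDB to query Lance data via Apache Arrow.\n",
+ "DuckDB's Python client resolves the table name `oxford_pet` to the Arrow dataset variable of the same name defined above."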
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "06cc3e6c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ""
+ ],
+ "text/plain": [
+ "Image(s3://eto-public/datasets/oxford_pet/images/samoyed_100.jpg)"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from lance.types import Image\n",
+ "\n",
+ "df = con.query(\"\"\"\n",
+ "SELECT external_image as image_uri\n",
+ "FROM oxford_pet\n",
+ "WHERE class='samoyed'\n",
+ "LIMIT 10;\n",
+ "\"\"\").to_df()\n",
+ "\n",
+ "Image.create(df.image_uri[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc0ff204",
+ "metadata": {},
+ "source": [
+ "## Let's make a prediction\n",
+ "Now it's time to use the registered resnet model to do inference"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "367856f7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
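+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>class</th>\n",
+ "      <th>pred</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>6</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>7</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>8</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>9</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>258</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"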
+ ],
+ "text/plain": [
+ " class pred\n",
+ "0 samoyed 258\n",
+ "1 samoyed 258\n",
+ "2 samoyed 258\n",
+ "3 samoyed 258\n",
+ "4 samoyed 258\n",
+ "5 samoyed 258\n",
+ "6 samoyed 258\n",
+ "7 samoyed 258\n",
+ "8 samoyed 258\n",
+ "9 samoyed 258"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "predictions = con.query(\"\"\"\n",
+ "SELECT class, list_argmax(predict('resnet', image)) as pred\n",
+ "FROM oxford_pet\n",
+ "WHERE split='train' AND class='samoyed'\n",
+ "LIMIT 10;\n",
+ "\"\"\").to_df()\n",
+ "predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e3e2416c",
+ "metadata": {},
+ "source": [
+ "- `predict` invokes a registered model by name\n",
+ "- `image` is a binary column holding the image bytes\n",
+ "- `list_argmax` returns the position of the highest model output, i.e. the predicted class index"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "416303b6",
+ "metadata": {},
+ "source": [
+ "### What does 258 mean?\n",
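+ "How do we know if the predictions are reasonable?\n",
+ "The model outputs ImageNet class indices, so we join them against a list of human-readable ImageNet labels."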
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "f4bcb24d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "labels_uri = (\"https://raw.githubusercontent.com/anishathalye/imagenet-simple-labels\"\n",
+ " \"/master/imagenet-simple-labels.json\")\n",
+ "\n",
+ "labels = (pd.read_json(labels_uri)\n",
+ " .reset_index()\n",
+ " .rename(columns={0: \"label\", \"index\": \"label_id\"}))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "2f150c09",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>gt_label</th>\n",
+ "      <th>pred_label</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>6</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>7</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>8</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>9</th>\n",
+ "      <td>samoyed</td>\n",
+ "      <td>Samoyed</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " gt_label pred_label\n",
+ "0 samoyed Samoyed\n",
+ "1 samoyed Samoyed\n",
+ "2 samoyed Samoyed\n",
+ "3 samoyed Samoyed\n",
+ "4 samoyed Samoyed\n",
+ "5 samoyed Samoyed\n",
+ "6 samoyed Samoyed\n",
+ "7 samoyed Samoyed\n",
+ "8 samoyed Samoyed\n",
+ "9 samoyed Samoyed"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "con.query(\n",
+ " \"\"\"\n",
+ " SELECT \n",
+ " predictions.class as gt_label,\n",
+ " labels.label as pred_label\n",
+ " FROM predictions \n",
+ " INNER JOIN labels on predictions.pred=labels.label_id;\n",
+ " \"\"\"\n",
+ ").to_df()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc94a61c",
+ "metadata": {},
+ "source": [
+ "## Conclusion"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f3841ad7",
+ "metadata": {},
+ "source": [
+ "As this notebook shows, the Lance extension for DuckDB makes it easy to do analytics and model inference in SQL. With Lance (via Arrow), we can manage images, metadata, and annotations all in one place, and we can query them efficiently even when the data lives in cheap remote storage.\n",
+ "\n",
+ "You can find Lance at https://github.com/eto-ai/lance. If you like the project, we'd love a star, and we'd appreciate your feedback even more!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "55ff5597",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}