Improve ergonomic of the Pytorch dataset and Generate embeddings for oxford pet #157

eddyxu · 2022-09-12T17:19:54Z

Improve ergonomics of the PyTorch Dataset when loading a local directory.
Auto-tune the learning rate and use the other hyperparameters from the torchvision site.
Add example code to generate embeddings.

changhiskhan · 2022-09-13T01:46:39Z

python/benchmarks/bench_utils.py

@@ -39,7 +40,11 @@ def read_file(uri) -> bytes:
    return fs.open_input_file(key).read()


-def download_uris(uris: Iterable[str], func=read_file) -> Iterable[bytes]:
+def download_image(uri: str) -> Image:
+    return ImageUri(uri).to_embedded()


Image.create(uri).to_embedded() should work

OK, I will try it now.

As discussed in #157, use the downloaded binary for now.

changhiskhan · 2022-09-13T01:48:12Z

python/benchmarks/bench_utils.py

+    return ImageUri(uri).to_embedded()
+
+
+def download_uris(uris: Iterable[str], func=read_file) -> Iterable[Image]:


Just double checking -- does pool.map return results in the same order as the input?

Yes, that's my understanding.
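
For reference, a minimal sketch of the ordering guarantee. Which pool class `download_uris` uses is not visible in this hunk, so `ThreadPool` is an assumption here, and `slow_square` and `ordered_map` are hypothetical helpers. The standard-library `map` on a pool returns results in input order even when later inputs finish first:

```python
import time
from multiprocessing.pool import ThreadPool


def slow_square(n: int) -> int:
    # Earlier inputs sleep longer, so they tend to finish last.
    time.sleep(0.01 * (5 - n))
    return n * n


def ordered_map(items, workers: int = 4):
    # pool.map blocks until every task is done and returns a list
    # whose positions correspond to the positions in `items`.
    with ThreadPool(workers) as pool:
        return pool.map(slow_square, items)


results = ordered_map([0, 1, 2, 3, 4])
print(results)
```

The unordered variant is `imap_unordered`, which is the one to avoid if results must line up with the input URIs.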

changhiskhan · 2022-09-13T02:48:42Z

python/benchmarks/oxford_pet/common.py


 import io
 import os
 import time
-from typing import Callable, Optional
+from typing import Optional, Callable


Shouldn't this be alphabetical?

changhiskhan · 2022-09-13T02:50:05Z

python/benchmarks/oxford_pet/common.py

 import lance.pytorch.data
+from PIL import Image


Should we just always call this PILImage, or use a qualified PIL.Image?

OK, let me change it to PIL.Image.

changhiskhan · 2022-09-13T02:51:16Z

python/benchmarks/oxford_pet/common.py

-        images.append(img)
-        labels.append(label)
-    return torch.stack(images), torch.tensor(labels)
+NUM_CLASSES = 38


Is this something we can get from torchvision? Do we want to hard-code this and check it against the dataset? Or do we want to just compute it from the dataset dictionary?

This number is specific to the dataset. I feel it is overkill to calculate it dynamically from the dataset, especially since we need to support different formats (i.e., it takes some effort to compute this number dynamically in the raw format).
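
A possible middle ground, sketched under assumptions: keep the constant hard-coded, but validate it against whatever integer labels a loader yields. `required_output_size` and `check_num_classes` are hypothetical helpers, not part of this PR, and the sketch assumes labels are plain non-negative integers.

```python
from typing import Iterable


def required_output_size(labels: Iterable[int]) -> int:
    """Smallest output-layer size that can index every label seen."""
    return max(labels) + 1


def check_num_classes(labels: Iterable[int], num_classes: int) -> None:
    """Raise ValueError if any label falls outside range(num_classes)."""
    bad = sorted({label for label in labels if not 0 <= label < num_classes})
    if bad:
        raise ValueError(f"labels {bad} fall outside range({num_classes})")


# Toy labels for illustration only, not read from the real dataset.
check_num_classes([1, 5, 37], num_classes=38)
```

This keeps the cheap constant while still failing loudly if a format produces labels the model cannot index.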

changhiskhan · 2022-09-13T02:58:57Z

python/benchmarks/oxford_pet/embeddings.py

+        train_loader = torch.utils.data.DataLoader(
+            dataset, num_workers=num_workers, batch_size=None, collate_fn=collate_fn
+        )
+    elif data_format == "raw":


Do we also need to compare against parquet, or is this sufficient?

I will add support for parquet later. Maybe tfrecord as well.

changhiskhan · 2022-09-13T03:00:02Z

python/benchmarks/oxford_pet/embeddings.py

+    extractor = create_feature_extractor(model.backbone, {"avgpool": "features"})
+    extractor = extractor.to("cuda")
+    with torch.no_grad():
+        dfs = []
+        for batch, pk in train_loader:
+            batch = batch.to("cuda")
+            features = extractor(batch)["features"].squeeze()
+            df = pd.DataFrame(
+                {
+                    "pk": pk,
+                    "features": features.tolist(),
+                }
+            )
+            dfs.append(df)


Does fast.ai / Hugging Face have any conveniences for this?

Since the rest uses PyTorch Lightning, using fast.ai / HF would seem to require converting it to raw PyTorch and then adapting that to fastai/HF?

We can probably use https://pytorch-lightning.readthedocs.io/en/stable/deploy/production_basic.html

It still needs to match pk, though.

I can make it use PyTorch Lightning's predict if desired.

Oh, actually, this needs to be wrapped in a separate module, since it uses the create_feature_extractor(model.backbone, {"avgpool": "features"}) feature extractor, while the original module should do basic predictions (i.e., just return the detected class).
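
To make the "match pk" requirement concrete, here is a framework-free sketch. `collect_embeddings` and `toy_extract` are hypothetical stand-ins: the real loop uses a torchvision feature extractor and pandas, neither of which is used here. The length check guards against an extractor returning a different number of rows than the batch has keys; for instance, a bare `.squeeze()` on a batch of one would also drop the batch dimension.

```python
from typing import Callable, Iterable, Sequence, Tuple


def collect_embeddings(
    loader: Iterable[Tuple[Sequence, Sequence]],
    extract: Callable[[Sequence], Sequence[Sequence[float]]],
) -> list:
    """Pair each primary key with the feature row computed for it."""
    rows = []
    for batch, pks in loader:
        features = extract(batch)
        if len(features) != len(pks):
            raise ValueError(
                f"got {len(features)} feature rows for {len(pks)} keys"
            )
        for pk, feature in zip(pks, features):
            rows.append({"pk": pk, "features": list(feature)})
    return rows


def toy_extract(batch):
    # Stand-in "model": one feature per item, the sum of its values.
    return [[sum(item)] for item in batch]


toy_loader = [
    ([[1.0, 2.0], [3.0, 4.0]], ["a", "b"]),
    ([[5.0, 6.0]], ["c"]),
]
rows = collect_embeddings(toy_loader, toy_extract)
```

The same pairing would carry over to a Lightning predict step, as long as each step returns the keys alongside the features.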

changhiskhan · 2022-09-13T03:00:19Z

python/benchmarks/oxford_pet/train.py

+import lance
+import lance.pytorch.data
+
+NUM_CLASSES = 38


changhiskhan · 2022-09-13T03:00:53Z

python/benchmarks/oxford_pet/train.py

+    """
+    Image transform for training.
+
+    Adding random argumentations.


augmentation

changhiskhan · 2022-09-13T03:01:12Z

python/benchmarks/oxford_pet/train.py

+    def __init__(
+        self,
+        crop_size: float,
+        mean: tuple[float] = (0.485, 0.456, 0.406),


Where do these defaults come from?

These numbers are used across torchvision, and in the Python code referenced above.
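
For context, the mean values match the per-channel ImageNet statistics commonly used with torchvision's pretrained models. The sketch below shows what the normalization does to a single RGB pixel scaled to [0, 1]. The standard deviations (0.229, 0.224, 0.225) do not appear in this hunk and are an assumption here, as is the hypothetical helper `normalize_pixel`.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
# Assumed companion values; not shown in the diff above.
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (channel - mean) / std to one RGB pixel in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


# A pixel equal to the mean maps to zero in every channel.
centered = normalize_pixel(IMAGENET_MEAN)
white = normalize_pixel((1.0, 1.0, 1.0))
```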

changhiskhan

LGTM

eddyxu self-assigned this Sep 12, 2022

eddyxu force-pushed the lei/train_pet branch from 76f1362 to 5ce1701 Compare September 13, 2022 01:08

eddyxu changed the title ~~Fixes on oxford pet dataset training~~ Improve ergonomic of the Pytorch dataset. generate embeddings for oxford pet Sep 13, 2022

changhiskhan marked this pull request as ready for review September 13, 2022 01:46

changhiskhan reviewed Sep 13, 2022

eddyxu and others added 19 commits September 13, 2022 20:02

be able to handle local related directory

c03e4cb

add shuffle

a51c2bb

use image in datagen

3394a07

make shuffle an option

5bc00c2

auto tune learning rate

68a5bb6

add randomlization in training data prep

438bede

add comments

4fe49e0

load

9d951b6

generate embeddings

845a027

generate embeddings over all dataset

0b95646

generate embeddings

11fe8ea

reduce the diff to main

a4157ea

more cleanups

d8d442a

customize output directory

3013f8e

address some comments

a01d2e9

try to use image ext

70eadd0

create image column

b1b9a56

cleanup and train on Image type

9ac7419

handle local directory

1eef4e7

eddyxu force-pushed the lei/train_pet branch from f43f274 to 1eef4e7 Compare September 13, 2022 20:02

changhiskhan approved these changes Sep 13, 2022

eddyxu changed the title ~~Improve ergonomic of the Pytorch dataset. generate embeddings for oxford pet~~ Improve ergonomic of the Pytorch dataset and Generate embeddings for oxford pet Sep 13, 2022

eddyxu merged commit 2aeade4 into main Sep 13, 2022

eddyxu deleted the lei/train_pet branch September 13, 2022 20:15
