Improve ergonomic of the Pytorch dataset and Generate embeddings for oxford pet #157

Merged (19 commits) on Sep 13, 2022

@eddyxu eddyxu commented Sep 12, 2022

  • Improve ergonomics of the PyTorch Dataset for loading a local directory.
  • Auto-tune the learning rate and use the other hyperparameters from the torchvision site.
  • Example code to generate embeddings.

@eddyxu eddyxu self-assigned this Sep 12, 2022
@eddyxu eddyxu changed the title from "Fixes on oxford pet dataset training" to "Improve ergonomic of the Pytorch dataset. generate embeddings for oxford pet" Sep 13, 2022
@changhiskhan changhiskhan marked this pull request as ready for review September 13, 2022 01:46
@@ -39,7 +40,11 @@ def read_file(uri) -> bytes:
     return fs.open_input_file(key).read()


-def download_uris(uris: Iterable[str], func=read_file) -> Iterable[bytes]:
+def download_image(uri: str) -> Image:
+    return ImageUri(uri).to_embedded()

Contributor

Image.create(uri).to_embedded() should work

Contributor Author

OK, I will try it now.

Contributor Author

As discussed in #157, use the downloaded binary for now.

+    return ImageUri(uri).to_embedded()
+
+
+def download_uris(uris: Iterable[str], func=read_file) -> Iterable[Image]:

Contributor

Just double checking -- does pool.map return results in the same order as the input?

Contributor Author

Yes, that's my understanding.
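
For reference, a minimal sketch of the ordering guarantee in question. It assumes `pool` is a `multiprocessing.pool.ThreadPool`; the thread does not show which pool type `download_uris` actually uses, and `fake_download` is a hypothetical stand-in for `read_file`. `Pool.map` returns results in input order even when tasks finish out of order, whereas `imap_unordered` is the variant that does not.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(uri: str) -> bytes:
    """Hypothetical stand-in for read_file: later inputs finish sooner."""
    index = int(uri.rsplit("/", 1)[-1])
    time.sleep(0.02 / (1 + index))
    return uri.encode()


def download_all(uris: list[str], workers: int = 4) -> list[bytes]:
    # Pool.map blocks until every task is done and returns the results
    # in the same order as the inputs, regardless of completion order.
    with ThreadPool(workers) as pool:
        return pool.map(fake_download, uris)


uris = [f"s3://bucket/{i}" for i in range(8)]
results = download_all(uris)
```

Because the result list lines up index by index with `uris`, the caller can pair each input with its output using `zip(uris, results)`.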


import io
import os
import time
from typing import Callable, Optional
from typing import Optional, Callable

Contributor

shouldn't this be alphabetical?

Contributor Author

Fixed

import lance.pytorch.data
from PIL import Image

Contributor

Should we just always call this PILImage, or use a qualified PIL.Image?

Contributor Author

Ok, lemme change it to PIL.Image

        images.append(img)
        labels.append(label)
    return torch.stack(images), torch.tensor(labels)
NUM_CLASSES = 38

Contributor

Is this something we can get from torchvision? Do we want to hard-code this and check it against the dataset? Or do we want to just compute it from the dataset dictionary?

Contributor Author

This number is specific to the dataset. I feel it is overkill to compute it dynamically from the dataset, especially since we need to support different formats (e.g., computing this number from the raw format takes some effort).
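
One lightweight middle ground between hard-coding the value and computing it is to keep the constant and validate it against whatever labels a given format yields. The sketch below is illustrative only: `check_num_classes` is a hypothetical helper, and it assumes labels are integer class ids in the range `[0, NUM_CLASSES)`, which the thread does not confirm.

```python
from typing import Iterable

NUM_CLASSES = 38  # value hard-coded in the PR


def check_num_classes(labels: Iterable[int], num_classes: int = NUM_CLASSES) -> int:
    """Return the number of distinct labels seen; raise if an id is out of range.

    Hypothetical helper: assumes integer class ids in [0, num_classes).
    """
    seen = set()
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} outside [0, {num_classes})")
        seen.add(label)
    return len(seen)


# Works for any format that can yield labels (Lance, raw files, ...).
distinct = check_num_classes([0, 5, 5, 37])
```

A check like this only needs an iterable of labels, so it does not depend on how each format stores them.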

        train_loader = torch.utils.data.DataLoader(
            dataset, num_workers=num_workers, batch_size=None, collate_fn=collate_fn
        )
    elif data_format == "raw":

Contributor

do we also need to compare against parquet or is this sufficient?

Contributor Author

Will add support for Parquet later, and maybe TFRecord as well.

Comment on lines +73 to +86
extractor = create_feature_extractor(model.backbone, {"avgpool": "features"})
extractor = extractor.to("cuda")
with torch.no_grad():
    dfs = []
    for batch, pk in train_loader:
        batch = batch.to("cuda")
        features = extractor(batch)["features"].squeeze()
        df = pd.DataFrame(
            {
                "pk": pk,
                "features": features.tolist(),
            }
        )
        dfs.append(df)

Contributor

does fast.ai / huggingface have any conveniences for this?

Contributor Author

Since the rest uses PyTorch Lightning, using fast.ai / HF would seem to require converting it to raw PyTorch and then adapting it to fast.ai / HF?

We can probably use https://pytorch-lightning.readthedocs.io/en/stable/deploy/production_basic.html

It still needs to match pk, though.

I can make it use PyTorch Lightning's predict if desired.

Contributor Author

Oh, actually, this needs to be wrapped in a separate module, since it uses the create_feature_extractor(model.backbone, {"avgpool": "features"}) feature extractor, while the original module should only do basic predictions (i.e., just return the detected class).
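
Whichever wrapper ends up being used, the requirement noted in this thread is that each feature vector stays paired with its `pk`. A framework-agnostic sketch of that bookkeeping follows. `collect_embeddings` and `embed` are hypothetical names (in the PR the role of `embed` is played by the `create_feature_extractor` output), and plain lists stand in for tensors.

```python
from typing import Callable, Iterable, Sequence

Batch = tuple[Sequence[object], Sequence[int]]


def collect_embeddings(
    batches: Iterable[Batch],
    embed: Callable[[Sequence[object]], Sequence[Sequence[float]]],
) -> list[dict]:
    """Run embed on each batch and keep every vector next to its pk."""
    rows = []
    for inputs, pks in batches:
        features = embed(inputs)
        # Guard against shape surprises, e.g. a batch dimension that was
        # dropped for a single-item batch.
        if len(features) != len(pks):
            raise ValueError(f"{len(features)} feature rows for {len(pks)} pks")
        rows.extend(
            {"pk": pk, "features": list(vec)} for pk, vec in zip(pks, features)
        )
    return rows


# Toy usage: the "embedding" of x is [x, x * 2].
toy_batches = [([1.0, 2.0], [10, 11]), ([3.0], [12])]
rows = collect_embeddings(toy_batches, lambda xs: [[x, x * 2] for x in xs])
```

One thing that may be worth checking in the loop under review: `.squeeze()` with no arguments removes every size-1 dimension, so if the `avgpool` output has shape (N, C, 1, 1), a batch containing a single image would lose its batch dimension as well. `torch.flatten(features, 1)` is one way to keep it.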

import lance
import lance.pytorch.data

NUM_CLASSES = 38

Contributor

duplicate?

Contributor Author

Fixed.

"""
Image transform for training.

Adding random argumentations.

Contributor

augmentation

Contributor Author

Done

    def __init__(
        self,
        crop_size: float,
        mean: tuple[float] = (0.485, 0.456, 0.406),

Contributor

where do these defaults come from?

Contributor Author

These numbers are used across torchvision, and in the Python code referenced above.
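
For context, these are the per-channel ImageNet means that torchvision's reference transforms use for normalization, usually together with a standard deviation of (0.229, 0.224, 0.225). The std values do not appear in this excerpt and are stated here as an assumption about the constructor's other default. A dependency-free sketch of what the normalization does to one RGB pixel scaled to [0, 1] follows; `normalize_pixel` is a hypothetical helper, not part of the PR.

```python
MEAN = (0.485, 0.456, 0.406)  # defaults from the PR
STD = (0.229, 0.224, 0.225)   # assumed: the usual ImageNet companion values


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std independently to each channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


# A pixel equal to the mean maps to zero in every channel.
centered = normalize_pixel(MEAN)
```

Pretrained torchvision backbones were trained on inputs normalized this way, which is why the same constants are reused when fine-tuning on another dataset.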

@changhiskhan changhiskhan left a comment

Lgtm

@eddyxu eddyxu changed the title from "Improve ergonomic of the Pytorch dataset. generate embeddings for oxford pet" to "Improve ergonomic of the Pytorch dataset and Generate embeddings for oxford pet" Sep 13, 2022
@eddyxu eddyxu merged commit 2aeade4 into main Sep 13, 2022
@eddyxu eddyxu deleted the lei/train_pet branch September 13, 2022 20:15