Consistent Dataset Handling #28

Open
AmitMY opened this issue Feb 19, 2022 · 5 comments

AmitMY commented Feb 19, 2022

Very nice repo and documentation!

I think this repository can benefit from using https://github.com/sign-language-processing/datasets as data loaders.

It is fast, consistent across datasets, and allows loading videos / poses from multiple datasets.
If a dataset you are using is not there, you can ask for it or add it yourself; it is a breeze.

The repo supports many datasets, multiple pose estimation formats, binary pose files, fps and resolution manipulations, and dataset disk mapping.

Finally, this would make this repo less complex. This repo does pre-training and fine-tuning, the other repo does datasets, and they could be used together.

Please consider :)

Member

GokulNC commented Feb 21, 2022

Thanks for this suggestion, @AmitMY. Interesting! We will check it out in detail and get back to you here.
We are not familiar with using tfds, so we'll have to see if there are any setbacks in using it for our case.

Also, it would be great if you could share some resources or pointers on how to get started with creating this custom tfds dataset from .pose files in the way your datasets library expects (probably as an .md file in your repo itself).

One challenge in our case is that we use PyTorch Lightning in this repo. So we're not sure how those dataloader flows could be used with tfds.

Author

AmitMY commented Feb 21, 2022

Thanks for being open to this.

TensorFlow has many tutorials about adding datasets, including: https://www.tensorflow.org/datasets/add_dataset

But perhaps also just looking at the code of one dataset might be useful.

Regarding PyTorch Lightning - that is no problem. I have consistently worked with tfds for pytorch without any issues.

The simplest way would be to just make it all NumPy: https://www.tensorflow.org/datasets/api_docs/python/tfds/as_numpy

But you can also perform whatever operations you want on the tfds dataset (batching, mapping, prefetching, shuffling, etc.) and then call as_numpy on each batch, which keeps memory usage low.
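As a rough sketch of the per-batch idea described above: the example below uses plain Python generators as a stand-in for the tfds pipeline, so it runs without TensorFlow installed. The names `lazy_batches`, `convert_batch`, and `fake_stream` are hypothetical, and `convert_batch` only marks the place where a real conversion such as `tfds.as_numpy` would happen. The point it illustrates is that only one batch is materialized at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def lazy_batches(examples: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Lazily group examples into lists of at most batch_size items."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def convert_batch(batch: List[dict]) -> Dict[str, list]:
    """Stand-in for the per-batch conversion step (hypothetical).

    Collates a list of example dicts into one dict of lists.
    """
    return {key: [example[key] for example in batch] for key in batch[0]}


def fake_stream(n: int) -> Iterator[dict]:
    """Hypothetical example source that yields one record at a time."""
    for i in range(n):
        yield {"id": f"{i}_DGS", "fps": 25}


# Generator expression: each batch is built and converted only when requested.
converted = (convert_batch(b) for b in lazy_batches(fake_stream(5), batch_size=2))
first = next(converted)
```

In a real setup, the generator stages would be replaced by the dataset's own batching and prefetching, with the conversion applied to each batch as it is consumed.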

Please let me know if there's anything concrete that you are not sure about, and I'll see if I can make an example.

Contributor

Prem-kumar27 commented Feb 21, 2022

Thanks @AmitMY

Currently, our data pipeline lazily loads the pose data for only the current batch of videos, applies augmentations to it, and then passes the data to the model. We also use PyTorch Lightning's LightningDataModule for this. This can be found here.

We are not sure how to use TFDS's dataset module here. One way would be to convert the whole TFDS dataset to a torch Tensor and wrap it with the torch Dataset class. But this would require the whole dataset to be in memory. Is there any other way to do this?

Basically, iterating over batches is handled by PyTorch Lightning in our case, so we are not sure how to make use of TFDS here.

Author

AmitMY commented Feb 21, 2022

How about wrapping the tfds dataset in a generic wrapper that converts the data to torch?

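import itertools

import tensorflow_datasets as tfds

# Importing this module registers the sign language datasets with tfds
import sign_language_datasets.datasets
from sign_language_datasets.datasets.config import SignDatasetConfig
from sign_language_datasets.utils.torch_dataset import TFDSTorchDataset

# Fast download and load dataset using TFDS
config = SignDatasetConfig(name="holistic-poses", version="1.0.0", include_video=False, include_pose="holistic")
dicta_sign = tfds.load(name='dicta_sign', builder_kwargs={"config": config})

# Convert to torch dataset
train_dataset = TFDSTorchDataset(dicta_sign["train"])

# Print the first 10 examples
for datum in itertools.islice(train_dataset, 0, 10):
    print(datum)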

In this case, for example, each item is a dictionary like the following:

{
    "gloss": "ERLAUBNIS2", 
    "hamnosys": "\xee\x83\xa9\xee\x80\x85\xee\x80\x8c\xee\x81\xb2\xee\x80\x90\xee\x80\xa0\xee\x80\xbf\xee\x83\xa2\xee\x81\x82\xee\x81\x99\xee\x83\x91\xee\x83\xa7\xee\x81\x92\xee\x83\xa3\xee\x83\xa2\xee\x82\x90\xee\x82\xaa\xee\x80\xb1\xee\x80\xbc\xee\x83\xa3", 
    "id": "54_DGS", 
    "pose": {
      "data": tensor([[[[ 9.4747e+01,  8.0048e+01, -1.2109e-04],
              [ 9.8266e+01,  7.4415e+01,  2.2603e-03],
              [ 1.0062e+02,  7.4430e+01, -3.8285e-03],
              ...,
              [ 6.1661e+01,  1.7587e+02, -3.8705e-02]]]]), 
      "conf": tensor([[[1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000]],
            ...,
            [[1.0000, 1.0000, 1.0000,  ..., 1.0000, 1.0000, 1.0000]]]), 
      "fps": 25
    }, 
    "signed_language": "DGS", 
    "spoken_language": "de", 
    "text": "Erlaubnis", 
    "video": "https://www.sign-lang.uni-hamburg.de/dicta-sign/portal/concepts/dgs/54.webm"
}
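To make the "generic wrapper" idea concrete, here is a minimal sketch of how such a wrapper could work. It is not the actual implementation of `TFDSTorchDataset`; the class `LazyConvertedDataset` and the helpers `convert_nested` and `list_to_tuple` are hypothetical names. It uses only the standard library, with `tuple` standing in for a real tensor conversion, and shows the two parts that matter: converting nested dictionaries like the one above leaf by leaf, and doing so one example at a time so the whole dataset never has to sit in memory.

```python
from typing import Any, Callable, Iterable, Iterator


def convert_nested(value: Any, leaf_fn: Callable[[Any], Any]) -> Any:
    """Apply leaf_fn to every non-dict value, preserving nested dict structure."""
    if isinstance(value, dict):
        return {key: convert_nested(item, leaf_fn) for key, item in value.items()}
    return leaf_fn(value)


class LazyConvertedDataset:
    """Hypothetical wrapper that converts each example only while iterating."""

    def __init__(self, source: Iterable[dict], leaf_fn: Callable[[Any], Any]):
        self.source = source
        self.leaf_fn = leaf_fn

    def __iter__(self) -> Iterator[dict]:
        for example in self.source:
            yield convert_nested(example, self.leaf_fn)


def list_to_tuple(value: Any) -> Any:
    """Stand-in for a real tensor conversion: lists become tuples."""
    return tuple(value) if isinstance(value, list) else value


# Toy records loosely shaped like the dictionary above (values are made up).
records = [
    {"id": "a", "pose": {"data": [1.0, 2.0], "fps": 25}},
    {"id": "b", "pose": {"data": [3.0], "fps": 25}},
]
dataset = LazyConvertedDataset(records, list_to_tuple)
first = next(iter(dataset))
```

With PyTorch available, the same shape of class could subclass `torch.utils.data.IterableDataset`, which lets a `DataLoader` (and therefore a LightningDataModule) drive the iteration.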

@Prem-kumar27
Contributor

Thanks. I think this could work.
We will try this and get back to you if we have any questions.


3 participants