Flatten dataset #1610

Closed
CourchesneA opened this issue Sep 18, 2024 · 2 comments

@CourchesneA

I have a Datumaro dataset with nested items: files often have paths such as mydir/file1.jpg. Is there a way to flatten it using Datumaro? I would like to iterate over each item, move the file at item.media.path to the root (possibly checking whether a file of that name already exists), update the item.id, and then re-export. Is there a way to do this?

@jihyeonyi

Hi @CourchesneA, thank you for your continued interest.

Currently, there is no flatten feature in Datumaro, but there are workarounds that achieve the same result.
When exporting a dataset in Datumaro format, the image path is determined by the id and the subset of the DatasetItem. For instance, if the id is "mydir/img1" and the subset is "mysubset", the image is saved as "images/mysubset/mydir/img1.jpg".
Therefore, if all items share the same subset, you can save all images in one folder (e.g., images/mysubset) by changing the ids so that they contain no directory component.
To achieve this, you can use the reindex transform, which replaces each id with a sequential index.

dataset.transform("reindex", start=0)
dataset.export("flattened", "datumaro", save_media=True)
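To make the path rule above concrete, here is a minimal sketch in plain Python. The helper exported_image_path is hypothetical and is not part of Datumaro; it only mirrors the images/<subset>/<id> layout described in this comment, and the ".jpg" extension and the id "0" are assumptions for illustration.

```python
from pathlib import PurePosixPath


def exported_image_path(item_id, subset, ext=".jpg"):
    """Hypothetical helper: build images/<subset>/<id><ext> as described above."""
    return str(PurePosixPath("images") / subset / f"{item_id}{ext}")


# An id containing a directory component yields a nested path.
print(exported_image_path("mydir/img1", "mysubset"))
# An index-style id, as reindex would assign, yields a flat path.
print(exported_image_path("0", "mysubset"))
```

The first call gives a path nested under mydir, while the second stays directly inside images/mysubset, which is why replacing the ids is enough to flatten the export.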

If the dataset contains multiple subsets, you should use the map_subsets transform to merge the subsets into one, then perform the reindex transform to prevent duplicate ids before exporting.

mapping = {subset: "default" for subset in dataset.subsets()}
dataset.transform("map_subsets", mapping=mapping)
dataset.transform("reindex", start=0)
dataset.export("flattened", "datumaro", save_media=True)

If there are no duplicates among the file names, you could consider using the id_from_image_name transform. However, if there are duplicates, it will unfortunately result in a RepeatedItemError during export, meaning you cannot retain the original file names.
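Before choosing id_from_image_name, it may help to check whether any file names would collide. The sketch below uses only the standard library and assumes that the new id is the file name without its extension; the function find_stem_collisions and the sample paths are hypothetical, standing in for the item.media.path values of a real dataset.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_stem_collisions(paths):
    """Group paths by file name without extension; keep groups with more than one path."""
    by_stem = defaultdict(list)
    for path in paths:
        by_stem[PurePosixPath(path).stem].append(path)
    return {stem: group for stem, group in by_stem.items() if len(group) > 1}


# Hypothetical paths; in practice these would come from item.media.path.
paths = ["mydir/file1.jpg", "otherdir/file1.jpg", "mydir/file2.jpg"]
collisions = find_stem_collisions(paths)
print(collisions)  # {'file1': ['mydir/file1.jpg', 'otherdir/file1.jpg']}
```

If the result is empty, the original file names can probably be kept; otherwise reindex is the safer option. Collisions should only matter between items that end up in the same subset, so the check is best run per subset or after merging subsets.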

@CourchesneA
Author

Hi @jihyeonyi, thanks for the information, that is exactly what I was looking for. Specifically, I think the id_from_image_name transform will do what I want.
