Flatten dataset #1610

CourchesneA · 2024-09-18T14:11:07Z

I have a Datumaro dataset with nested items: files often have paths such as mydir/file1.jpg. Is there a way to flatten it using Datumaro? I would like to iterate over each item, move item.media.path to the root (possibly checking whether a file of that name already exists), update item.id, and then re-export. Is there a way to do this?


jihyeonyi · 2024-09-20T03:59:33Z

Hi @CourchesneA, thank you for your continued interest.

Datumaro currently has no flatten feature, but there are workarounds.
When a dataset is exported in Datumaro format, each image path is determined by the id and the subset of its DatasetItem. For instance, if the id is "mydir/img1" and the subset is "mysubset", the image is written to "images/mysubset/mydir/img1.jpg".
Therefore, if all items are in the same subset, you can save all images in one folder (e.g., images/mysubset) by changing the ids so that they contain no directory component.
To achieve this, you can use the reindex transform, which replaces each id with a sequential number.

dataset.transform("reindex", start=0)
dataset.export("flattened", "datumaro", save_media=True)
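
To see why changing the id flattens the output, the path rule described above can be sketched in plain Python. This is only an illustration of that rule, not Datumaro's actual export code: the helper name exported_image_path is hypothetical, and the ".jpg" extension and the reindexed id "0" are assumptions for the example.

```python
import posixpath


def exported_image_path(item_id: str, subset: str, ext: str = ".jpg") -> str:
    """Hypothetical helper mirroring the rule images/<subset>/<id><ext>."""
    return posixpath.join("images", subset, item_id + ext)


# An id with a directory component produces a nested path.
nested = exported_image_path("mydir/img1", "mysubset")

# After reindexing, the id is assumed to be a plain number such as "0",
# so the image lands directly in the subset folder.
flat = exported_image_path("0", "mysubset")

print(nested)
print(flat)
```

Any id without a "/" gives the same flat layout, which is why the choice of transform only matters for what the resulting file names look like.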

If the dataset contains multiple subsets, first use the map_subsets transform to merge them into one, then apply the reindex transform so that no ids are duplicated before exporting.

mapping = {subset: "default" for subset in dataset.subsets()}
dataset.transform("map_subsets", mapping=mapping)
dataset.transform("reindex", start=0)
dataset.export("flattened", "datumaro", save_media=True)

If no two files share the same name, you could use the id_from_image_name transform instead. If there are duplicates, however, the export will fail with a RepeatedItemError, so in that case the original file names cannot be retained.
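
Before relying on id_from_image_name, it may help to check whether any file names would collide once the directories are dropped. The following is a minimal sketch in plain Python, assuming the media paths have already been collected into a list (for example from item.media.path); the function name duplicate_stems and the sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def duplicate_stems(paths):
    """Return the file-name stems (name without extension) seen more than once."""
    counts = Counter(PurePosixPath(p).stem for p in paths)
    return sorted(stem for stem, n in counts.items() if n > 1)


# Hypothetical media paths; two of them share the name "file1".
paths = ["mydir/file1.jpg", "other/file1.jpg", "mydir/file2.jpg"]
clashes = duplicate_stems(paths)
print(clashes)
```

If the returned list is empty, the file names are unique and can serve as ids; otherwise reindex is the safer choice.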

CourchesneA · 2024-09-20T13:00:39Z

Hi @jihyeonyi, thanks for the information, that is exactly what I was looking for. Specifically, I think id_from_image_name will do what I want.

github-actions bot assigned jihyeonyi Sep 18, 2024

CourchesneA closed this as completed Sep 20, 2024
