Split subset #1281
Labels
user experience
Questions about our products or things to improve user experience
Comments
Hi @CourchesneA,
import datumaro as dm
import numpy as np

# Create a synthetic dataset from code
src_dataset = dm.Dataset.from_iterable(
    [
        dm.DatasetItem(
            id=f"{subset}_{idx}",
            subset=subset,
            media=dm.Image.from_numpy(np.zeros([3, 10, 10])),
            annotations=[dm.Label(label=idx % 2)],
        )
        for idx in range(20)
        for subset in ["train", "test"]
    ],
    categories=["cat", "dog"],
)
print(src_dataset)

Dataset
    size=40
    source_path=None
    media_type=<class 'datumaro.components.media.Image'>
    annotated_items_count=40
    annotations_count=40
subsets
    test: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
    train: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
infos
categories
    label: ['cat', 'dog']
train_only_dataset = dm.Dataset(source=src_dataset.get_subset("train"))
test_only_dataset = dm.Dataset(source=src_dataset.get_subset("test"))
print(train_only_dataset)

Dataset
    size=20
    source_path=None
    media_type=<class 'datumaro.components.media.Image'>
    annotated_items_count=20
    annotations_count=20
subsets
    train: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
infos
categories
    label: ['cat', 'dog']
train_val_dataset = train_only_dataset.transform(
    "random_split",
    splits=[("train", 0.67), ("val", 0.33)],
)
print(train_val_dataset)

Dataset
    size=20
    source_path=None
    media_type=<class 'datumaro.components.media.Image'>
    annotated_items_count=20
    annotations_count=20
subsets
    train: # of items=13, # of annotated items=13, # of annotations=13, annotation types=['label']
    val: # of items=7, # of annotated items=7, # of annotations=7, annotation types=['label']
infos
categories
    label: ['cat', 'dog']
dst_dataset = dm.HLOps.merge(train_val_dataset, test_only_dataset)
print(dst_dataset)

Dataset
    size=40
    source_path=None
    media_type=<class 'datumaro.components.media.Image'>
    annotated_items_count=40
    annotations_count=40
subsets
    test: # of items=20, # of annotated items=20, # of annotations=20, annotation types=['label']
    train: # of items=13, # of annotated items=13, # of annotations=13, annotation types=['label']
    val: # of items=7, # of annotated items=7, # of annotations=7, annotation types=['label']
infos
categories
    label: ['cat', 'dog']
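
The same idea can be sketched without Datumaro, to show what the steps above amount to: only the items of one subset are reshuffled and relabelled, and every other subset passes through as-is. This is a plain-Python illustration, not Datumaro's implementation. The helper name `split_one_subset`, the dict-based items, and the rounding rule are assumptions made for the example.

```python
import random


def split_one_subset(items, source="train",
                     splits=(("train", 0.67), ("val", 0.33)), seed=0):
    """Re-split only the items in `source`; other subsets are returned unchanged.

    Hypothetical helper for illustration; items are plain dicts with
    "id" and "subset" keys rather than Datumaro DatasetItem objects.
    """
    picked = [item for item in items if item["subset"] == source]
    others = [item for item in items if item["subset"] != source]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(picked)

    result = []
    start = 0
    for i, (name, ratio) in enumerate(splits):
        # The last split takes the remainder so rounding never drops an item.
        if i == len(splits) - 1:
            end = len(picked)
        else:
            end = start + round(ratio * len(picked))
        result.extend({**item, "subset": name} for item in picked[start:end])
        start = end

    # Items outside `source` (e.g. "test") are appended untouched.
    return result + others


items = [
    {"id": f"{subset}_{idx}", "subset": subset}
    for idx in range(20)
    for subset in ("train", "test")
]
resplit = split_one_subset(items)
```

With 20 training items and ratios 0.67/0.33, this yields 13 "train" and 7 "val" items, and the 20 "test" items keep their original ids and order.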
vinnamkim added the "user experience" label on Feb 29, 2024
That's exactly what I was looking for, thanks for the detailed example!
I have a case where my dataset already comes split into "train" and "test", but I need to add a validation set.
It seems the "split" transform cannot do this: as a first step it merges everything together.
Is there a way to achieve this? I would like either to specify a subset in the "split" transform, or to run the split on one subset and then reassign or overwrite an existing subset of my original dataset.
ex. before:
after:
I need the test set to remain untouched, i.e. it should contain the same items as before.