[Doc] Add explanation and usage instructions for data configuration #1548

Merged 11 commits on May 6, 2022
79 changes: 79 additions & 0 deletions docs/en/tutorials/customize_datasets.md
# Tutorial 2: Customize Datasets

## Data configuration

The `data` variable in the config file holds the data configuration: it defines the arguments used to build the datasets and dataloaders.

Here is an example of data configuration:

```python
data = dict(
samples_per_gpu=4,
workers_per_gpu=4,
train=dict(
type='ADE20KDataset',
data_root='data/ade/ADEChallengeData2016',
img_dir='images/training',
ann_dir='annotations/training',
pipeline=train_pipeline),
val=dict(
type='ADE20KDataset',
data_root='data/ade/ADEChallengeData2016',
img_dir='images/validation',
ann_dir='annotations/validation',
pipeline=test_pipeline),
test=dict(
type='ADE20KDataset',
data_root='data/ade/ADEChallengeData2016',
img_dir='images/validation',
ann_dir='annotations/validation',
pipeline=test_pipeline))
```
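
The `train_pipeline` and `test_pipeline` referenced above are lists of data transforms defined elsewhere in the same config file. Below is a minimal sketch of what they might look like for ADE20K; the transform types are standard mmseg pipeline components, but the exact scales, crop size and normalization values are illustrative and depend on your model:

```python
# Illustrative pipelines; real configs often add further augmentations
# such as PhotoMetricDistortion.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
```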

- `train`, `val` and `test`: The [`config`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/config.md)s used to build dataset instances for model training, validation and testing via the
[`build and registry`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/registry.md) mechanism (a sketch follows this list).

- `samples_per_gpu`: How many samples per GPU to load in each batch during model training. The training `batch_size` equals `samples_per_gpu` times the number of GPUs, e.g. when using 8 GPUs for distributed data parallel training and `samples_per_gpu=2`, the `batch_size` is `8*2=16`.
If you would like to define the `batch_size` for testing and validation, please use `test_dataloader` and
`val_dataloader` with mmseg >=0.24.1.

- `workers_per_gpu`: How many subprocesses per GPU to use for data loading. `0` means the data will be loaded in the main process.
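
As a concrete illustration of the build-and-registry mechanism mentioned above, here is a hedged sketch of how a training dataset and dataloader can be built from this configuration using the mmseg 0.x builder helpers (the config path is a placeholder):

```python
from mmcv import Config
from mmseg.datasets import build_dataset, build_dataloader

# Placeholder path; point this at your own config file.
cfg = Config.fromfile('configs/my_config.py')

# build_dataset looks up `type` (e.g. 'ADE20KDataset') in the DATASETS
# registry and instantiates it with the remaining keys as keyword arguments.
dataset = build_dataset(cfg.data.train)

# build_dataloader wraps the dataset in a PyTorch DataLoader, using
# samples_per_gpu and workers_per_gpu from the data configuration.
data_loader = build_dataloader(
    dataset,
    samples_per_gpu=cfg.data.samples_per_gpu,
    workers_per_gpu=cfg.data.workers_per_gpu,
    dist=False,
    shuffle=True)
```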

**Note:** `samples_per_gpu` only takes effect during model training. During model testing and validation, mmseg uses a default of `samples_per_gpu=1` (batch inference is NOT supported yet).

**Note:** before v0.24.1, all keys in `data` other than `train`, `val`, `test`, `samples_per_gpu` and `workers_per_gpu` had to be
keyword arguments of the PyTorch `DataLoader`, and the dataloaders used for model training, validation and testing shared the same arguments.
Since v0.24.1, mmseg supports `train_dataloader`, `val_dataloader` and `test_dataloader` to specify different keyword arguments for each; the overall arguments definition is still supported, but the specific dataloader settings take higher priority.

Here is an example with specific dataloader settings:

```python
data = dict(
samples_per_gpu=4,
workers_per_gpu=4,
shuffle=True,
train=dict(type='xxx', ...),
val=dict(type='xxx', ...),
test=dict(type='xxx', ...),
# Use different batch size during validation and testing.
val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

Assuming only one GPU is used for model training and testing: because the overall arguments definition has lower priority, the `batch_size`
for training is `4` and the training dataset is shuffled, while the `batch_size` for validation and testing is `1` and those datasets are not shuffled.
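
The priority rule can be pictured as a plain dictionary merge over the `data` dict above (an illustrative sketch, not mmseg's actual implementation):

```python
# Keys that hold dataset configs or per-split dataloader settings,
# as opposed to overall dataloader arguments.
special_keys = ('train', 'val', 'test',
                'train_dataloader', 'val_dataloader', 'test_dataloader')
overall_args = {k: v for k, v in data.items() if k not in special_keys}
# -> {'samples_per_gpu': 4, 'workers_per_gpu': 4, 'shuffle': True}

# Per-split settings override the overall ones.
val_args = {**overall_args, **data.get('val_dataloader', {})}
# -> {'samples_per_gpu': 1, 'workers_per_gpu': 4, 'shuffle': False}
```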

To make the data configuration clearer, we recommend using the specific dataloader settings instead of the overall dataloader settings from v0.24.1 onward, like this:

```python
data = dict(
train=dict(type='xxx', ...),
val=dict(type='xxx', ...),
test=dict(type='xxx', ...),
# Use specific dataloader setting
train_dataloader=dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True),
val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
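
With this layout, the validation-specific settings can be unpacked straight into the builder; a hedged sketch, again assuming the mmseg 0.x helpers and a `cfg` loaded as above:

```python
from mmseg.datasets import build_dataset, build_dataloader

val_dataset = build_dataset(cfg.data.val)
# cfg.data.val_dataloader already carries samples_per_gpu=1,
# workers_per_gpu=4 and shuffle=False, so it can be unpacked directly.
val_loader = build_dataloader(val_dataset, dist=False, **cfg.data.val_dataloader)
```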
```

**Note:** during model training, mmseg's default dataloader values are `shuffle=True` and `drop_last=True`;
during model validation and testing, the defaults are `shuffle=False` and `drop_last=False`.
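
If you prefer to make those defaults visible, they can be spelled out explicitly in the per-split dataloader settings (the values below simply restate the defaults for the example above):

```python
data = dict(
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    train_dataloader=dict(samples_per_gpu=4, workers_per_gpu=4,
                          shuffle=True, drop_last=True),
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4,
                        shuffle=False, drop_last=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4,
                         shuffle=False, drop_last=False))
```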

## Customize datasets by reorganizing data

The simplest way is to convert your dataset by reorganizing your data into the folder structure the dataset expects.