
Consider major re-design: Pre-process & save data (maybe as Zarr with batch_size = chunk_size) #58

Closed · 6 tasks done
JackKelly opened this issue Jul 19, 2021 · 12 comments

@JackKelly

JackKelly commented Jul 19, 2021

So a single Zarr chunk might be:

64 examples, each with 24 timesteps of 128 x 128 satellite & NWP imagery, plus PV (with coordinates time and pv_id).
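For concreteness, here is a minimal sketch (not the repo's actual schema) of what one such batch could look like as an xarray.Dataset written to Zarr with chunk size equal to batch size. The variable names and the number of PV systems per example are assumptions:

```python
import numpy as np
import xarray as xr

# Illustrative shapes: 64 examples, 24 timesteps, 128x128 images,
# and (say) 32 PV systems per example. Variable names are assumptions.
n_examples, n_timesteps, image_size, n_pv_systems = 64, 24, 128, 32

batch = xr.Dataset(
    {
        "satellite": (
            ("example", "time", "x", "y"),
            np.zeros((n_examples, n_timesteps, image_size, image_size), dtype=np.float32),
        ),
        "nwp": (
            ("example", "time", "x", "y"),
            np.zeros((n_examples, n_timesteps, image_size, image_size), dtype=np.float32),
        ),
        "pv": (
            ("example", "time", "pv_id"),
            np.zeros((n_examples, n_timesteps, n_pv_systems), dtype=np.float32),
        ),
    }
)

# chunk_size == batch_size: one Zarr chunk holds one whole batch.
batch.chunk({"example": n_examples}).to_zarr("batches.zarr", mode="w")
```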

Could probably re-use quite a lot of the existing code to create the training dataset

TODO:

  • Batch-convert Zarr batches to NetCDF4 (as described above)
  • Write new dataloader for NetCDF4 files. Map dataset. Each worker process loads a single batch at a time. Benchmark speed
  • Benchmark with ML model
  • Create Python script to batch-convert from original Zarr to NetCDF (with datetime features)
  • Create validation set
  • Modify model to use new NetCDF data (and don't compute datetime features on the fly!)
@JackKelly

Ultimately, the data pipeline for satellite imagery could look like:

  1. Raw data
  2. Reprojected data in Zarr, full bit depth etc.
  3. Pre-prepared ML training data (and it should be pretty easy to spit out multiple formats like Zarr or webdataset)

First steps:

Use the existing code. Create a DataLoader with multiple worker processes and write out a Zarr of batches. If the chunks are large enough, use different data arrays for satellite vs NWP. We definitely want a separate data array for PV, maybe with the PV metadata for each in-range PV system stored in the Zarr for each example.
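A rough sketch of that first step, assuming the existing code exposes a map-style Dataset that assembles one batch (an xarray.Dataset) per index. `ExistingBatchDataset` is a stand-in name, not the repo's actual class:

```python
from torch.utils.data import DataLoader

# Stand-in for the current code that builds one batch (an xarray.Dataset) per index.
dataset = ExistingBatchDataset()

# batch_size=None disables automatic batching (each item is already a batch);
# the identity collate_fn stops PyTorch from trying to convert the
# xarray.Dataset into tensors.
loader = DataLoader(dataset, batch_size=None, num_workers=8,
                    collate_fn=lambda batch: batch)

zarr_path = "prepared_batches.zarr"
for i, batch in enumerate(loader):
    if i == 0:
        batch.to_zarr(zarr_path, mode="w")              # create the store
    else:
        batch.to_zarr(zarr_path, append_dim="example")  # append along "example"
```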

@JackKelly

Zarr chunk shape for satellite data:

example: 32, time: 19, variable: 12, x: 32, y: 32

Also need DataArrays to store the x, y, and time coordinates for each example. Maybe called sat_time (dims: example, time), sat_x, and sat_y (dims: example, {x, y})?
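Something like this, perhaps (a sketch using the names above; the fill values and dtypes are placeholders):

```python
import numpy as np
import xarray as xr

n_examples, n_timesteps, n_channels, size = 32, 19, 12, 32

ds = xr.Dataset(
    {
        "sat_data": (
            ("example", "time", "variable", "x", "y"),
            np.zeros((n_examples, n_timesteps, n_channels, size, size), dtype=np.float32),
        ),
        # Per-example coordinates stored as their own DataArrays:
        "sat_time": (
            ("example", "time"),
            np.zeros((n_examples, n_timesteps), dtype="datetime64[ns]"),
        ),
        "sat_x": (("example", "x"), np.zeros((n_examples, size), dtype=np.float32)),
        "sat_y": (("example", "y"), np.zeros((n_examples, size), dtype=np.float32)),
    }
)

# One Zarr chunk per batch along "example":
ds.chunk({"example": n_examples}).to_zarr("sat_batches.zarr", mode="w")
```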

@JackKelly

JackKelly commented Jul 21, 2021

Actually, it might be better to save each batch as a NetCDF file. Zarr saves a bunch of very small files with each batch (e.g. the coordinates), which are inefficient to load from a cloud bucket.

Maybe it's as simple as having a directory filled with NetCDF files, where each filename is just an integer giving the batch number.

And, to maximise efficiency on GCS, prepend the first 6 characters of the MD5 hash of the filename to the filename: https://cloud.google.com/storage/docs/request-rate
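A minimal sketch of that naming scheme (the exact format, e.g. the zero-padding and the underscore separator, is an assumption):

```python
import hashlib

def hashed_filename(batch_idx: int) -> str:
    """Prepend the first 6 hex characters of the MD5 of the base filename,
    so object names are spread evenly across GCS's keyspace (see the
    request-rate docs linked above)."""
    base = f"{batch_idx:06d}.nc"  # zero-padding is illustrative
    prefix = hashlib.md5(base.encode()).hexdigest()[:6]
    return f"{prefix}_{base}"
```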

@JackKelly

NetCDF vs Zarr

tl;dr: Yes, I think I should use one NetCDF4 file per batch. Using a ramdisk and netCDF4, it takes about 180 ms to load a batch, and the majority of that (the ~100 ms download from GCS) can run concurrently in the background. In contrast, opening a batch from Zarr takes between 250 ms and 500 ms.

Reading NetCDF4

gcs.get('gs://solar-pv-nowcasting-data/prepared_ML_training_data/test_nc4.nc', '/mnt/ramdisk/test_nc4.nc')
batch_loaded = xr.load_dataset('/mnt/ramdisk/test_nc4.nc')
  • gcs.get() takes about 100 ms to a ramdisk, and about 130 ms to a standard GCS VM disk (although it's possible I'm imagining that difference; reading to the VM's local disk might be fine).
  • gcs.get() will concurrently download multiple files from GCS (see the sketch after this list).
  • xr.load_dataset() takes about 75 ms from local disk or ram disk.
  • h5netcdf is about twice as slow as netcdf4 at reading (my batch files), but netcdf4 can only read from the local filesystem.
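On the concurrent-download point, a sketch assuming `gcs` is a `gcsfs.GCSFileSystem`; the numbered filenames are illustrative:

```python
import gcsfs
import xarray as xr

gcs = gcsfs.GCSFileSystem()

# Passing lists of paths lets gcsfs/fsspec fetch several files concurrently.
remote = [f"gs://solar-pv-nowcasting-data/prepared_ML_training_data/{i}.nc"
          for i in range(4)]
local = [f"/mnt/ramdisk/{i}.nc" for i in range(4)]
gcs.get(remote, local)

batches = [xr.load_dataset(path, engine="netcdf4") for path in local]
```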

Writing NetCDF4

The only engine which understands how to write LZF-compressed files is h5netcdf! LZF files are slightly larger (16 MB) than zlib (11 MB), but zlib decompression takes twice as long (150 ms for zlib from ramdisk, compared to 75 ms for LZF).

Only the scipy engine can write directly to a byte array. But scipy can't write netCDF4 (only netCDF3), and it doesn't appear to understand compression! So we need to use h5netcdf to write to the local filesystem and then use gcs.put() to upload to GCS (which, again, can concurrently upload multiple files).
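A sketch of that write path (the function name and arguments are illustrative):

```python
import gcsfs
import xarray as xr

def write_batch(batch: xr.Dataset, local_path: str, remote_path: str,
                gcs: gcsfs.GCSFileSystem) -> None:
    """Write one batch as NetCDF4 with LZF compression (h5netcdf is the only
    engine that supports LZF), then upload it to GCS."""
    encoding = {name: {"compression": "lzf"} for name in batch.data_vars}
    batch.to_netcdf(local_path, engine="h5netcdf", encoding=encoding)
    gcs.put(local_path, remote_path)
```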

@JackKelly

JackKelly commented Jul 21, 2021

TODO:

  • Batch-convert Zarr batches to NetCDF4 (as described above)
  • Write new dataloader for NetCDF4 files. Map-style dataset; each worker process loads a single batch at a time. Benchmark speed (see the sketch below).
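A sketch of such a map-style dataset (the class name, filename pattern, and conversion to tensors are assumptions):

```python
import os

import torch
import xarray as xr
from torch.utils.data import DataLoader, Dataset


class NetCDFBatchDataset(Dataset):
    """Map-style dataset where each item is one pre-prepared batch file."""

    def __init__(self, local_dir: str, n_batches: int):
        self.local_dir = local_dir
        self.n_batches = n_batches

    def __len__(self) -> int:
        return self.n_batches

    def __getitem__(self, batch_idx: int) -> dict:
        # Filename pattern is illustrative (see the MD5-prefix note above).
        path = os.path.join(self.local_dir, f"{batch_idx}.nc")
        ds = xr.load_dataset(path, engine="netcdf4")
        return {name: torch.from_numpy(ds[name].values) for name in ds.data_vars}


# batch_size=None: each worker returns one whole pre-batched file per iteration.
loader = DataLoader(NetCDFBatchDataset("/mnt/ramdisk", n_batches=1_000),
                    batch_size=None, num_workers=30)
```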

@JackKelly

TODO:

  • Put datetime features into the batch on disk (see the sketch below)
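A sketch of pre-computing cyclical datetime features and storing them in each batch on disk; the exact feature set (sin/cos of hour and day-of-year) is an assumption:

```python
import numpy as np
import xarray as xr

def add_datetime_features(batch: xr.Dataset) -> xr.Dataset:
    """Add cyclical datetime features so __getitem__ doesn't have to compute
    them on the fly. Assumes a per-example "sat_time" DataArray, as sketched
    earlier in this thread."""
    dt = batch["sat_time"].dt
    hour = dt.hour + dt.minute / 60
    day_of_year = dt.dayofyear
    batch["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    batch["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    batch["day_sin"] = np.sin(2 * np.pi * day_of_year / 365)
    batch["day_cos"] = np.cos(2 * np.pi * day_of_year / 365)
    return batch
```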

@JackKelly

JackKelly commented Jul 22, 2021

Looking good! With 30 workers, the DataLoader returns a batch roughly every 20 ms (so it should be able to do 50 it/s!) and hits about 680 MB/s. (This is just a tight loop iterating over the DataLoader, not actually training a model.)

@JackKelly

JackKelly commented Jul 22, 2021

Oh, wow: if __getitem__ only gets data from disk and sticks it into an Example, and doesn't compute datetime features, then the time per batch drops to 12 ms, it pulls a little over 1 GB/s of data, and CPU usage hovers around 50% per core instead of 80% per core. We should definitely pre-compute the datetime features.

@JackKelly

Swapping the ramdisk for local disk doesn't seem to make a difference to read speed (for the temporary store).

JackKelly added a commit that referenced this issue Jul 22, 2021
@JackKelly

Using NetCDF4 (with LZF compression) also results in less disk usage: for exactly the same data, Zarr uses 610 GB, whilst NetCDF uses 469 GB.

@JackKelly

JackKelly commented Jul 22, 2021

Using NetCDF4 batches (still computing datetime features on-the-fly) increases training speed 12x (from 1.5 it/s to 19 it/s). Training loss looks very similar to previous experiment.

Other advantages:

  • CPU utilisation is lower (40% vs 100%)
  • RAM usage is lower (16 GB vs 40 GB)
  • GPU utilisation is higher (75% vs 50%)
  • The Dataset code is super-simple (just load a batch from disk & stick it into Tensors)
  • The ML training starts very quickly (10 seconds?) after the script starts
  • So should be able to get away with a less powerful VM (maybe 16 cores and 32 GB of RAM instead of 24 cores and 153 GB RAM!)

Network utilisation is about the same (~230 MB/s)

@JackKelly

Yup, putting datetime features into the saved data works, and maybe gives a very slight speed increase (up from 19 it/s to 19.5 it/s).
