Consider major re-design: Pre-process & save data (maybe as Zarr with batch_size = chunk_size) #58
Ultimately, the data pipeline for satellite imagery could look like:
First steps: Use existing code. Create a DataLoader with multiple worker processes and create a Zarr of batches. If the chunks are large enough, use different data arrays for satellite vs NWPs. We definitely want a separate data array for PV, maybe with the PV metadata for each PV system in range stored in the Zarr for each example.
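A rough sketch of that first step (the class and file names are hypothetical, not the repo's actual code): a multi-worker DataLoader yields whole batches as xarray Datasets, and each batch is appended to a single Zarr store so that one chunk corresponds to one batch.

```python
import numpy as np
import torch
import xarray as xr


class BatchDataset(torch.utils.data.IterableDataset):
    """Stand-in for the existing dataset code: yields whole batches as xarray Datasets."""

    def __iter__(self):
        for _ in range(4):  # a few random batches per worker, just for illustration
            yield xr.Dataset({
                'sat_data': (('example', 'time', 'variable', 'x', 'y'),
                             np.random.rand(32, 19, 12, 32, 32).astype(np.float32)),
            })


# batch_size=None because each item from the Dataset is already a full batch.
loader = torch.utils.data.DataLoader(BatchDataset(), batch_size=None, num_workers=2)

for i, batch in enumerate(loader):
    # Chunk along `example` so one Zarr chunk == one batch (uses dask), then append.
    batch = batch.chunk({'example': 32})
    if i == 0:
        batch.to_zarr('prepared_batches.zarr', mode='w')
    else:
        batch.to_zarr('prepared_batches.zarr', mode='a', append_dim='example')
```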
Zarr chunk shape for satellite data: (example: 32, time: 19, variable: 12, x: 32, y: 32). We also need DataArrays to store the x, y and time coordinates for each example, maybe called sat_time (dims: example, time), sat_x and sat_y (dims: example, {x, y})?
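A sketch of that layout for a single batch (random data; the coordinate values are placeholders):

```python
import numpy as np
import pandas as pd
import xarray as xr

n_examples, n_time, n_vars, nx, ny = 32, 19, 12, 32, 32

batch = xr.Dataset({
    'sat_data': (('example', 'time', 'variable', 'x', 'y'),
                 np.random.rand(n_examples, n_time, n_vars, nx, ny).astype(np.float32)),
    # Per-example coordinates stored as their own DataArrays:
    'sat_time': (('example', 'time'),
                 np.stack([pd.date_range('2021-01-01', periods=n_time, freq='5min').values
                           for _ in range(n_examples)])),
    'sat_x': (('example', 'x'), np.random.rand(n_examples, nx)),
    'sat_y': (('example', 'y'), np.random.rand(n_examples, ny)),
})

print(batch)
```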
Actually, it might be better to save each batch as a NetCDF file. Zarr saves a bunch of very small files with each batch (e.g. the coordinates), which are inefficient to load from a cloud bucket. It could be as simple as having a directory filled with NetCDF files where the filename is just an integer giving the batch number. And, to maximise efficiency on GCS, prepend the first 6 characters of the MD5 hash of the filename to the filename: https://cloud.google.com/storage/docs/request-rate
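For example (a hypothetical helper; the exact naming scheme isn't fixed in this thread):

```python
import hashlib

def batch_filename(batch_number: int) -> str:
    """Prepend the first 6 hex characters of the MD5 hash of the plain filename,
    so object names are spread across GCS's keyspace (see the request-rate docs)."""
    name = f'{batch_number}.nc'
    prefix = hashlib.md5(name.encode()).hexdigest()[:6]
    return f'{prefix}_{name}'

print(batch_filename(42))  # prints something like 'a1b2c3_42.nc'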
NetCDF vs Zarr

tl;dr: Yes, I think I should use one NetCDF4 file per batch.

Reading NetCDF4: using a ramdisk (`gcs` is a gcsfs.GCSFileSystem):

```python
gcs.get('gs://solar-pv-nowcasting-data/prepared_ML_training_data/test_nc4.nc', '/mnt/ramdisk/test_nc4.nc')
batch_loaded = xr.load_dataset('/mnt/ramdisk/test_nc4.nc')
```
Writing NetCDF4: only the scipy engine can write directly to a byte array. But scipy can't write NetCDF4 (only v3), and doesn't appear to understand compression! So we need to use one of the other engines and write to a local file (e.g. the ramdisk) first, then upload it to the bucket.
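A minimal writing sketch under those constraints, assuming the h5netcdf engine (which accepts LZF compression, matching the disk-usage numbers further down); the paths and variable name are illustrative:

```python
import gcsfs
import numpy as np
import xarray as xr

batch = xr.Dataset({
    'sat_data': (('example', 'time', 'variable', 'x', 'y'),
                 np.random.rand(32, 19, 12, 32, 32).astype(np.float32)),
})

# scipy can't write NetCDF4 or compress, so write to a local file first...
local_path = '/mnt/ramdisk/000000.nc'
batch.to_netcdf(local_path, engine='h5netcdf',
                encoding={'sat_data': {'compression': 'lzf'}})

# ...then copy the finished file up to the bucket.
gcs = gcsfs.GCSFileSystem()
gcs.put(local_path, 'gs://solar-pv-nowcasting-data/prepared_ML_training_data/000000.nc')
```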
Looking good! It looks like, with 30 workers, the DataLoader returns a batch roughly every 20 ms (so it should be able to do 50 it/s!) and hits about 680 MB/s. (This is just a tight loop iterating over the DataLoader, not actually training a model.)
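A sketch of that kind of tight-loop benchmark, reusing the hypothetical BatchDataset from the sketch further up (in practice it would be the dataset that reads the prepared batches):

```python
import time
import torch

loader = torch.utils.data.DataLoader(BatchDataset(), batch_size=None, num_workers=30)

start = time.time()
n_batches = 0
for batch in loader:   # tight loop: no model, just measuring the data pipeline
    n_batches += 1
elapsed = time.time() - start

print(f'{n_batches / elapsed:.1f} batches/s '
      f'({elapsed / n_batches * 1000:.1f} ms per batch)')
```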
Swapping the ramdisk for local disk doesn't seem to make a difference to read speed (for the temporary store).
Using NetCDF4 (with LZF compression) also results in less disk usage: for exactly the same data, Zarr uses 610 GB, whilst NetCDF uses 469 GB.
Using NetCDF4 batches (still computing datetime features on-the-fly) increases training speed 12x (from 1.5 it/s to 19 it/s). Training loss looks very similar to the previous experiment. Other advantages:

- Network utilisation is about the same (~230 MB/s)
Yup, putting datetime features into the saved data works, and maybe gives a very slight speed increase (up from 19 it/s to 19.5 it/s).
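The thread doesn't spell out exactly which datetime features are saved; as a sketch, assuming sin/cos encodings of hour-of-day and day-of-year computed from the `sat_time` coordinate and stored in the batch before it is written:

```python
import numpy as np
import xarray as xr

def add_datetime_features(batch: xr.Dataset) -> xr.Dataset:
    # Hypothetical helper: encode hour-of-day and day-of-year as sin/cos pairs
    # so they are stored with the batch rather than recomputed during training.
    dt = batch['sat_time'].dt
    hour = dt.hour + dt.minute / 60
    return batch.assign(
        hour_of_day_sin=np.sin(2 * np.pi * hour / 24),
        hour_of_day_cos=np.cos(2 * np.pi * hour / 24),
        day_of_year_sin=np.sin(2 * np.pi * dt.dayofyear / 365),
        day_of_year_cos=np.cos(2 * np.pi * dt.dayofyear / 365),
    )
```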
So a single Zarr chunk might be:
64 examples, each with 24 timesteps, each with 128 x 128 satellite & NWP imagery, plus PV (with coordinates time and pv_id)
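A sketch of such a chunk as an xarray Dataset (the variable names, the number of PV systems, and giving NWP the same resolution as satellite are all assumptions):

```python
import numpy as np
import xarray as xr

n_examples, n_time, nx, ny, n_pv = 64, 24, 128, 128, 32

chunk = xr.Dataset({
    'sat_data': (('example', 'time', 'x', 'y'),
                 np.random.rand(n_examples, n_time, nx, ny).astype(np.float32)),
    'nwp_data': (('example', 'time', 'x', 'y'),
                 np.random.rand(n_examples, n_time, nx, ny).astype(np.float32)),
    'pv_yield': (('example', 'time', 'pv_id'),
                 np.random.rand(n_examples, n_time, n_pv).astype(np.float32)),
})

print(chunk)
```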
Could probably re-use quite a lot of the existing code to create the training dataset