
Consider major re-design: Pre-process & save data (maybe as Zarr with batch_size = chunk_size) #58

Closed · 6 tasks done
JackKelly opened this issue Jul 19, 2021 · 12 comments

@JackKelly

JackKelly commented Jul 19, 2021

So a single Zarr chunk might be:

64 examples, each with 24 timesteps of 128 x 128 satellite & NWP imagery, plus PV (with coordinates time and pv_id).
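For concreteness, here is a minimal sketch (not the repo's actual schema) of what one such batch could look like as an xarray.Dataset written to Zarr with chunk size equal to batch size. The variable names and the number of PV systems per example are assumptions:

```python
import numpy as np
import xarray as xr

# Illustrative shapes: 64 examples, 24 timesteps, 128x128 images,
# and (say) 32 PV systems per example. Variable names are assumptions.
n_examples, n_timesteps, image_size, n_pv_systems = 64, 24, 128, 32

batch = xr.Dataset(
    {
        "satellite": (
            ("example", "time", "x", "y"),
            np.zeros((n_examples, n_timesteps, image_size, image_size), dtype=np.float32),
        ),
        "nwp": (
            ("example", "time", "x", "y"),
            np.zeros((n_examples, n_timesteps, image_size, image_size), dtype=np.float32),
        ),
        "pv": (
            ("example", "time", "pv_id"),
            np.zeros((n_examples, n_timesteps, n_pv_systems), dtype=np.float32),
        ),
    }
)

# chunk_size == batch_size: one Zarr chunk holds one whole batch.
batch.chunk({"example": n_examples}).to_zarr("batches.zarr", mode="w")
```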

Could probably re-use quite a lot of the existing code to create the training dataset

TODO:

  • Batch-convert Zarr batches to NetCDF4 (as described above)
  • Write new dataloader for NetCDF4 files. Map dataset. Each worker process loads a single batch at a time. Benchmark speed
  • Benchmark with ML model
  • Create Python script to batch-convert from original Zarr to NetCDF (with datetime features)
  • Create validation set
  • Modify model to use new NetCDF data (and don't compute datetime features on the fly!)
@JackKelly

Ultimately, the data pipeline for satellite imagery could look like:

  1. Raw data
  2. Reprojected data in Zarr, full bit depth etc.
  3. Pre-prepared ML training data (and it should be pretty easy to spit out multiple formats like Zarr or webdataset)

First steps:

Use the existing code. Create a DataLoader with multiple worker processes and write out a Zarr of batches. If the chunks are large enough, use different data arrays for satellite vs NWP. We definitely want a separate data array for PV, maybe with the PV metadata for each in-range PV system stored in the Zarr for each example.
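A rough sketch of that first step, assuming the existing code exposes a map-style Dataset that assembles one batch (an xarray.Dataset) per index. `ExistingBatchDataset` is a stand-in name, not the repo's actual class:

```python
from torch.utils.data import DataLoader

# Stand-in for the current code that builds one batch (an xarray.Dataset) per index.
dataset = ExistingBatchDataset()

# batch_size=None disables automatic batching (each item is already a batch);
# the identity collate_fn stops PyTorch from trying to convert the
# xarray.Dataset into tensors.
loader = DataLoader(dataset, batch_size=None, num_workers=8,
                    collate_fn=lambda batch: batch)

zarr_path = "prepared_batches.zarr"
for i, batch in enumerate(loader):
    if i == 0:
        batch.to_zarr(zarr_path, mode="w")              # create the store
    else:
        batch.to_zarr(zarr_path, append_dim="example")  # append along "example"
```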

@JackKelly

Zarr chunk shape for satellite data:

example: 32, time: 19, variable: 12, x: 32, y: 32

Also need DataArrays to store the x, y, and time coordinates for each example. Maybe called sat_time (dims: example, time), sat_x, and sat_y (dims: example, {x, y})?
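Something like this, perhaps (a sketch using the names above; the fill values and dtypes are placeholders):

```python
import numpy as np
import xarray as xr

n_examples, n_timesteps, n_channels, size = 32, 19, 12, 32

ds = xr.Dataset(
    {
        "sat_data": (
            ("example", "time", "variable", "x", "y"),
            np.zeros((n_examples, n_timesteps, n_channels, size, size), dtype=np.float32),
        ),
        # Per-example coordinates stored as their own DataArrays:
        "sat_time": (
            ("example", "time"),
            np.zeros((n_examples, n_timesteps), dtype="datetime64[ns]"),
        ),
        "sat_x": (("example", "x"), np.zeros((n_examples, size), dtype=np.float32)),
        "sat_y": (("example", "y"), np.zeros((n_examples, size), dtype=np.float32)),
    }
)

# One Zarr chunk per batch along "example":
ds.chunk({"example": n_examples}).to_zarr("sat_batches.zarr", mode="w")
```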

@JackKelly

JackKelly commented Jul 21, 2021

Actually, it might be better to save each batch as a NetCDF file. Zarr saves a bunch of very small files with each batch (e.g. the coordinates), which are inefficient to load from a cloud bucket.

Maybe it's as simple as having a directory filled with NetCDF files, where each filename is just an integer giving the batch number.

And, to maximise efficiency on GCS, prepend the first 6 characters of the MD5 hash of the filename to the filename: https://cloud.google.com/storage/docs/request-rate
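A minimal sketch of that naming scheme (the exact format, e.g. the zero-padding and the underscore separator, is an assumption):

```python
import hashlib

def hashed_filename(batch_idx: int) -> str:
    """Prepend the first 6 hex characters of the MD5 of the base filename,
    so object names are spread evenly across GCS's keyspace (see the
    request-rate docs linked above)."""
    base = f"{batch_idx:06d}.nc"  # zero-padding is illustrative
    prefix = hashlib.md5(base.encode()).hexdigest()[:6]
    return f"{prefix}_{base}"
```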

@JackKelly

NetCDF vs Zarr

tl;dr: Yes, I think I should use one NetCDF4 file per batch. Using a ramdisk and netCDF4, it takes about 180 ms to load a batch, and the majority of that (the ~100 ms download from GCS) can run concurrently in the background. In contrast, opening a batch from Zarr takes between 250 ms and 500 ms.

Reading NetCDF4

gcs.get('gs://solar-pv-nowcasting-data/prepared_ML_training_data/test_nc4.nc', '/mnt/ramdisk/test_nc4.nc')
batch_loaded = xr.load_dataset('/mnt/ramdisk/test_nc4.nc')
  • gcs.get() takes about 100 ms to a ramdisk, and about 130 ms to a standard GCS VM disk (although it's possible I'm imagining that difference; reading to the VM's local disk might be fine).
  • gcs.get() will concurrently download multiple files from GCS (see the sketch after this list).
  • xr.load_dataset() takes about 75 ms from local disk or ram disk.
  • h5netcdf is about twice as slow as netcdf4 at reading (my batch files), but netcdf4 can only read from the local filesystem.
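On the concurrent-download point, a sketch assuming `gcs` is a `gcsfs.GCSFileSystem`; the numbered filenames are illustrative:

```python
import gcsfs
import xarray as xr

gcs = gcsfs.GCSFileSystem()

# Passing lists of paths lets gcsfs/fsspec fetch several files concurrently.
remote = [f"gs://solar-pv-nowcasting-data/prepared_ML_training_data/{i}.nc"
          for i in range(4)]
local = [f"/mnt/ramdisk/{i}.nc" for i in range(4)]
gcs.get(remote, local)

batches = [xr.load_dataset(path, engine="netcdf4") for path in local]
```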

Writing NetCDF4

The only engine which understands how to write LZF-compressed files is h5netcdf! LZF files are slightly larger (16 MB) than zlib (11 MB), but zlib decompression takes twice as long (150 ms for zlib from ramdisk, compared to 75 ms for LZF).

Only the scipy engine can write directly to a byte array. But scipy can't write netCDF4 (only netCDF3), and it doesn't appear to understand compression! So we need to use h5netcdf to write to the local filesystem and then use gcs.put() to upload to GCS (which, again, can concurrently upload multiple files).
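A sketch of that write path (the function name and arguments are illustrative):

```python
import gcsfs
import xarray as xr

def write_batch(batch: xr.Dataset, local_path: str, remote_path: str,
                gcs: gcsfs.GCSFileSystem) -> None:
    """Write one batch as NetCDF4 with LZF compression (h5netcdf is the only
    engine that supports LZF), then upload it to GCS."""
    encoding = {name: {"compression": "lzf"} for name in batch.data_vars}
    batch.to_netcdf(local_path, engine="h5netcdf", encoding=encoding)
    gcs.put(local_path, remote_path)
```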

@JackKelly

JackKelly commented Jul 21, 2021

TODO:

  • Batch-convert Zarr batches to NetCDF4 (as described above)
  • Write new dataloader for NetCDF4 files. Map-style dataset; each worker process loads a single batch at a time. Benchmark speed (see the sketch below).
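A sketch of such a map-style dataset (the class name, filename pattern, and conversion to tensors are assumptions):

```python
import os

import torch
import xarray as xr
from torch.utils.data import DataLoader, Dataset


class NetCDFBatchDataset(Dataset):
    """Map-style dataset where each item is one pre-prepared batch file."""

    def __init__(self, local_dir: str, n_batches: int):
        self.local_dir = local_dir
        self.n_batches = n_batches

    def __len__(self) -> int:
        return self.n_batches

    def __getitem__(self, batch_idx: int) -> dict:
        # Filename pattern is illustrative (see the MD5-prefix note above).
        path = os.path.join(self.local_dir, f"{batch_idx}.nc")
        ds = xr.load_dataset(path, engine="netcdf4")
        return {name: torch.from_numpy(ds[name].values) for name in ds.data_vars}


# batch_size=None: each worker returns one whole pre-batched file per iteration.
loader = DataLoader(NetCDFBatchDataset("/mnt/ramdisk", n_batches=1_000),
                    batch_size=None, num_workers=30)
```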

@JackKelly

TODO:

  • Put datetime features into the batch on disk (see the sketch below)
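A sketch of pre-computing cyclical datetime features and storing them in each batch on disk; the exact feature set (sin/cos of hour and day-of-year) is an assumption:

```python
import numpy as np
import xarray as xr

def add_datetime_features(batch: xr.Dataset) -> xr.Dataset:
    """Add cyclical datetime features so __getitem__ doesn't have to compute
    them on the fly. Assumes a per-example "sat_time" DataArray, as sketched
    earlier in this thread."""
    dt = batch["sat_time"].dt
    hour = dt.hour + dt.minute / 60
    day_of_year = dt.dayofyear
    batch["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    batch["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    batch["day_sin"] = np.sin(2 * np.pi * day_of_year / 365)
    batch["day_cos"] = np.cos(2 * np.pi * day_of_year / 365)
    return batch
```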

@JackKelly

JackKelly commented Jul 22, 2021

Looking good! With 30 workers, the DataLoader returns a batch roughly every 20 ms (so it should be able to do 50 it/s!) and hits about 680 MB/s. (This is just a tight loop iterating over the DataLoader, not actually training a model.)

@JackKelly

JackKelly commented Jul 22, 2021

Oh, wow: if __getitem__ only gets data from disk and sticks it into an Example, and doesn't compute datetime features, then the time per batch drops to 12 ms, it pulls a little over 1 GB/s of data, and CPU usage hovers around 50% per core instead of 80% per core. We should definitely pre-compute the datetime features.

@JackKelly

Swapping the ramdisk for local disk doesn't seem to make a difference to read speed (for the temporary store).

JackKelly added a commit that referenced this issue Jul 22, 2021
@JackKelly

Using NetCDF4 (with LZF compression) also results in less disk usage: for exactly the same data, Zarr uses 610 GB, whilst NetCDF uses 469 GB.

@JackKelly

JackKelly commented Jul 22, 2021

Using NetCDF4 batches (still computing datetime features on-the-fly) increases training speed 12x (from 1.5 it/s to 19 it/s). Training loss looks very similar to previous experiment.

Other advantages:

  • CPU utilisation is lower (40% vs 100%)
  • RAM usage is lower (16 GB vs 40 GB)
  • GPU utilisation is higher (75% vs 50%)
  • The Dataset code is super-simple (just load a batch from disk & stick it into Tensors)
  • The ML training starts very quickly (10 seconds?) after the script starts
  • So should be able to get away with a less powerful VM (maybe 16 cores and 32 GB of RAM instead of 24 cores and 153 GB RAM!)

Network utilisation is about the same (~230 MB/s)

@JackKelly

Yup, putting datetime features into the saved data works, and maybe gives a very slight speed increase (up from 19 it/s to 19.5 it/s).
