
Performance Issue: Opening a DataSet from the Cloud takes a long time #14

Closed
darkquasar opened this issue May 4, 2020 · 7 comments · Fixed by #23
@darkquasar
Collaborator

There is an issue when opening NetCDF4 files from S3 with xarray.open_dataset() coupled directly with s3fs. Regardless of network bandwidth, some files take much longer to open than others even though file size remains constant. This appears to be due to the way s3fs and xarray handle the byte chunks when loading the remote file.

A potential solution would involve using fsspec to open the remote file.
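A minimal sketch of that idea (assuming the files sit in a public S3 bucket and that h5netcdf and s3fs are installed; the URL and function name below are placeholders, not BestiaPop's actual code):

import fsspec
import xarray as xr

def open_remote_netcdf(s3_url):
    # Read the whole remote NetCDF4 file through fsspec in one pass, instead of
    # letting xarray/s3fs fetch it through many small byte-range requests.
    with fsspec.open(s3_url, mode="rb", anon=True) as remote_file:
        # h5netcdf accepts file-like objects; .load() pulls the data into memory
        # before the remote file handle is closed.
        return xr.open_dataset(remote_file, engine="h5netcdf").load()

# Hypothetical usage:
# ds = open_remote_netcdf("s3://some-bucket/2020.daily_rain.nc")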

@darkquasar darkquasar added the enhancement New feature or request label May 4, 2020
@darkquasar darkquasar self-assigned this May 4, 2020
@JJguri
Owner

JJguri commented May 4, 2020

Some statistics about the current code's efficiency: it downloaded 585 files (1960-1990) in 1395 minutes, which works out to about 2.4 min/file.

@JJguri
Owner

JJguri commented Jun 9, 2020

I checked the new functionality of bestiapop v2.5 with multiprocessing, downloading the data from the cloud (1) and using local NetCDF4 files (2). I selected 32 years because I have a machine with 32 cores. Here are the example commands and the efficiency metrics:

(1) python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.85 -41.8" -lon "147.25 147.3" -o D:\TEST -m

(2) python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.85 -41.8" -lon "147.25 147.3" -input D:\Netcdf -o D:\TEST -m

Metrics:

  • Downloading from the cloud: 3.39 min/file and 6.4 sec/year
  • Using local NetCDF4 files: 0.22 min/file and 0.4 sec/year (!)

See also #11

@darkquasar
Collaborator Author

So these are good metrics. However, as you mentioned in a call, there now seems to be a performance penalty for sequential processing (what was previously the "standard" mode)? This could be due to the way the data extraction loops are structured. We saw this before but I forgot: when using this for loop:

for year in year_range:
    for climate_variable in climate_variables:
        for lat in lat_range:
            for lon in lon_range:
                #<do something here...>

there is a performance penalty for sequential (non-parallelized) processing, whereas with this loop structure:

for climate_variable in climate_variables:
    for year in year_range:
        for lat in lat_range:
            for lon in lon_range:
                #<do something here...>

the performance is optimal for sequential processing.
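My assumption for why the second ordering wins in sequential mode (not verified against the actual code) is that the source data is laid out as one NetCDF file per climate variable per year, so iterating climate_variable → year lets each file be opened exactly once before slicing out all lat/lon points. A rough sketch, where open_dataset is a hypothetical callable returning the xarray Dataset for a given (climate_variable, year) pair:

def extract_sequential(climate_variables, year_range, lat_range, lon_range, open_dataset):
    results = {}
    for climate_variable in climate_variables:
        for year in year_range:
            ds = open_dataset(climate_variable, year)  # one open per file
            for lat in lat_range:
                for lon in lon_range:
                    # nearest-neighbour selection of the grid cell's time series
                    values = ds[climate_variable].sel(lat=lat, lon=lon, method="nearest").values
                    results[(climate_variable, year, lat, lon)] = values
            ds.close()
    return results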

Were you able to confirm in your tests that the sequential processing seems to be slower now? If so, then we have two options here:

  1. Allow both modes of execution, but switch the loop structure depending on the user's choice (sequential or parallel processing). This option requires some medium-level code adjustments.
  2. Make parallel processing the default, since most systems nowadays are multicore. The caveat is that users might sometimes want to avoid running in parallel because they are running other heavy computations on the side. An option to switch off multiprocessing would still be provided, but with the aforementioned performance penalty in mind (see the sketch after this list).
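For reference, a rough sketch of what option 2 could look like (the flag and function names here are hypothetical, not BestiaPop's actual CLI):

import argparse
from multiprocessing import Pool, cpu_count

def process_year(year):
    # placeholder for the per-year extraction work
    return year

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-multiprocessing", action="store_true",
                        help="run sequentially (slower, but leaves other cores free)")
    args = parser.parse_args()

    years = range(1960, 1992)
    if args.no_multiprocessing:
        # sequential mode: accept the documented performance penalty
        results = [process_year(y) for y in years]
    else:
        # parallel mode (default): one worker per available core
        with Pool(cpu_count()) as pool:
            results = pool.map(process_year, years)
    return results

if __name__ == "__main__":
    main()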

What do you reckon is the best choice?

@JJguri
Owner

JJguri commented Jun 10, 2020

I have double-checked the performance of bestiapop v2.5 when downloading data from the cloud and it is still low compared with the local version. I would go with option 2, but alert people to what could happen to the tool's performance if they deactivate multiprocessing, and why.

I ran the following command:

python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.15 -41.05" -lon "145.5 145.6" -o D:\TEST -m

and I got the following metrics: 1.21 min/file and 2.26 sec/year. I do not know why it was faster this time than in the previous example.

@darkquasar darkquasar linked a pull request Aug 11, 2020 that will close this issue
Merged
@JJguri
Owner

JJguri commented Aug 14, 2020

@darkquasar I have calculated the current performance of bestiapop v2.5 for creating MET and WHT files from the SILO API with multiprocessing activated or not. Please have a look at the table. There seems to be an issue: when -m is activated, the process is slower than when -m is deactivated. Another result is that the package worked more slowly for a file covering a range of years than for a file with a single year, which makes sense (results are in time consumed per year, not per file).

Run                       FileType  Time/year for a file with 10 years (sec)  Time/year for a file with 1 year (sec)
SILO API, -m deactivated  MET       2.6                                       1
SILO API, -m activated    MET       1.9                                       16
SILO API, -m deactivated  WHT       1.3                                       1
SILO API, -m activated    WHT       2                                         16

@darkquasar
Collaborator Author

We need to run these benchmarks again with the new ability to extract data from NASAPOWER. Multiprocessing has also been fixed: there was an issue in the way longitude ranges were calculated which made it much slower; now it is really fast. API backend calls have also been improved, which further increases speed. Your experience of BestiaPop should be much faster now with or without -m, but with -m we should see a real performance difference.

@JJguri
Owner

JJguri commented Aug 27, 2020

Please have a look at the new BestiaPop performance section in the documentation.
