
Performance Issue: Opening a DataSet from the Cloud takes a long time #14

Closed
darkquasar opened this issue May 4, 2020 · 7 comments · Fixed by #23
@darkquasar
Collaborator

There is an issue when opening NetCDF4 files from S3 with xarray.open_dataset() coupled directly with s3fs. Regardless of network bandwidth, some files take much longer to open than others even though file size remains constant. This appears to be due to the way s3fs and xarray handle the byte chunks when loading the remote file.

A potential solution would involve using fsspec to open the remote file.
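A minimal sketch of that idea (assuming the files sit in a public S3 bucket and that h5netcdf and s3fs are installed; the URL and function name below are placeholders, not BestiaPop's actual code):

import fsspec
import xarray as xr

def open_remote_netcdf(s3_url):
    # Read the whole remote NetCDF4 file through fsspec in one pass, instead of
    # letting xarray/s3fs fetch it through many small byte-range requests.
    with fsspec.open(s3_url, mode="rb", anon=True) as remote_file:
        # h5netcdf accepts file-like objects; .load() pulls the data into memory
        # before the remote file handle is closed.
        return xr.open_dataset(remote_file, engine="h5netcdf").load()

# Hypothetical usage:
# ds = open_remote_netcdf("s3://some-bucket/2020.daily_rain.nc")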

@darkquasar darkquasar added the enhancement New feature or request label May 4, 2020
@darkquasar darkquasar self-assigned this May 4, 2020
@JJguri
Owner

JJguri commented May 4, 2020

Some statistics about the current code's efficiency: it downloaded 585 files (1960-1990) in 1395 minutes, which works out to about 2.4 min/file.

@JJguri
Owner

JJguri commented Jun 9, 2020

I checked the new functionality of bestiapop v2.5 with multiprocessing, downloading the data from the cloud (1) and using local NetCDF4 files (2). I selected 32 years because I have a machine with 32 cores. Here are the example commands and the efficiency metrics:

(1) python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.85 -41.8" -lon "147.25 147.3" -o D:\TEST -m

(2) python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.85 -41.8" -lon "147.25 147.3" -input D:\Netcdf -o D:\TEST -m

Metrics:

  • Downloading from the cloud: 3.39 min/file and 6.4 sec/year
  • Using local NetCDF4 files: 0.22 min/file and 0.4 sec/year (!)

See also #11

@darkquasar
Collaborator Author

So these are good metrics. However, as you mentioned in a call, there now seems to be a performance penalty for sequential processing (what was previously the "standard" mode)? This could be due to the way the data extraction loops are structured. We saw this before but I forgot: when using this for loop:

for year in year_range:
    for climate_variable in climate_variables:
        for lat in lat_range:
            for lon in lon_range:
                #<do something here...>

there is a performance penalty for sequential (non-parallelized) processing, whereas with this loop structure:

for climate_variable in climate_variables:
    for year in year_range:
        for lat in lat_range:
            for lon in lon_range:
                #<do something here...>

the performance is optimal for sequential processing.
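My assumption for why the second ordering wins in sequential mode (not verified against the actual code) is that the source data is laid out as one NetCDF file per climate variable per year, so iterating climate_variable → year lets each file be opened exactly once before slicing out all lat/lon points. A rough sketch, where open_dataset is a hypothetical callable returning the xarray Dataset for a given (climate_variable, year) pair:

def extract_sequential(climate_variables, year_range, lat_range, lon_range, open_dataset):
    results = {}
    for climate_variable in climate_variables:
        for year in year_range:
            ds = open_dataset(climate_variable, year)  # one open per file
            for lat in lat_range:
                for lon in lon_range:
                    # nearest-neighbour selection of the grid cell's time series
                    values = ds[climate_variable].sel(lat=lat, lon=lon, method="nearest").values
                    results[(climate_variable, year, lat, lon)] = values
            ds.close()
    return results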

Were you able to confirm in your tests that the sequential processing seems to be slower now? If so, then we have two options here:

  1. Allow both modes of execution, but switch the loop structure depending on the user's choice (sequential or parallel processing). This option requires some medium-level code adjustments.
  2. Make parallel processing the default, since most systems nowadays are multicore. The caveat is that users might sometimes want to avoid running in parallel because they are running other heavy computations on the side. An option to switch off multiprocessing would still be provided, but with the aforementioned performance penalty in mind (see the sketch after this list).
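For reference, a rough sketch of what option 2 could look like (the flag and function names here are hypothetical, not BestiaPop's actual CLI):

import argparse
from multiprocessing import Pool, cpu_count

def process_year(year):
    # placeholder for the per-year extraction work
    return year

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-multiprocessing", action="store_true",
                        help="run sequentially (slower, but leaves other cores free)")
    args = parser.parse_args()

    years = range(1960, 1992)
    if args.no_multiprocessing:
        # sequential mode: accept the documented performance penalty
        results = [process_year(y) for y in years]
    else:
        # parallel mode (default): one worker per available core
        with Pool(cpu_count()) as pool:
            results = pool.map(process_year, years)
    return results

if __name__ == "__main__":
    main()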

What do you reckon is the best choice?

@JJguri
Owner

JJguri commented Jun 10, 2020

I have double-checked the performance of bestiapop v2.5 when downloading data from the cloud and it is still low compared with the local version. I would go with option 2, but alert people to what could happen to the tool's performance if they deactivate multiprocessing, and why.

I ran the following command:

python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.15 -41.05" -lon "145.5 145.6" -o D:\TEST -m

and I got the following metrics: 1.21 min/file and 2.26 sec/year. I do not know why it was faster this time than in the previous example.

@darkquasar darkquasar linked a pull request Aug 11, 2020 that will close this issue
Merged
@JJguri
Owner

JJguri commented Aug 14, 2020

@darkquasar I have calculated the current performance of bestiapop v2.5 for creating MET and WHT files from the SILO API with multiprocessing activated or not. Please have a look at the table. There seems to be an issue: when -m is activated, the process is slower than when -m is deactivated. Another result is that the package worked more slowly for a file covering a range of years than for a file with a single year, which makes sense (results are in time consumed per year, not per file).

Run                       FileType  Time/year for a file with 10 years (sec)  Time/year for a file with 1 year (sec)
SILO API, -m deactivated  MET       2.6                                       1
SILO API, -m activated    MET       1.9                                       16
SILO API, -m deactivated  WHT       1.3                                       1
SILO API, -m activated    WHT       2                                         16

@darkquasar
Collaborator Author

We need to run these benchmarks again with the new ability to extract data from NASAPOWER. Multiprocessing has also been fixed: there was an issue in the way longitude ranges were calculated which made it much slower; now it is really fast. API backend calls have also been improved, which further increases speed. Your experience of BestiaPop should be much faster now with or without -m, but with -m we should see a real performance difference.

@JJguri
Owner

JJguri commented Aug 27, 2020

Please have a look at the new BestiaPop performance section in the documentation.
