Performance Issue: Opening a DataSet from the Cloud takes a long time #14
Some statistics about the current code's efficiency: it downloaded 585 files (1960-1990) in 1,395 minutes, which works out to roughly 2.4 min/file. |
I checked the new multiprocessing functionality of bestiapop v2.5, downloading the data from the cloud (1) and using local NetCDF4 files (2). I selected 32 years because I have a machine with 32 cores. Here are the commands and efficiency metrics:

(1) Cloud:

```
python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.85 -41.8" -lon "147.25 147.3" -o D:\TEST -m
```

(2) Local NetCDF4 files:

```
python bestiapop.py -a generate-met-file -y "1960-1991" -c "radiation max_temp min_temp daily_rain" -lat "-41.85 -41.8" -lon "147.25 147.3" -input D:\Netcdf -o D:\TEST -m
```

Metrics:
See also #11 |
So these are good metrics. However, as you mentioned in a call, there seems to have been a performance penalty for sequential processing (what was previously the "standard" way)? This could be due to the way the data extraction loops are structured. We saw this before but I forgot: with this loop ordering there is a performance penalty for sequential (non-parallelized) processing:

```python
for year in year_range:
    for climate_variable in climate_variables:
        for lat in lat_range:
            for lon in lon_range:
                ...  # <do something here>
```

Whereas with this loop structure the performance is optimal for sequential processing (a fleshed-out sketch follows this comment):

```python
for climate_variable in climate_variables:
    for year in year_range:
        for lat in lat_range:
            for lon in lon_range:
                ...  # <do something here>
```

Were you able to confirm in your tests that sequential processing seems to be slower now? If so, then we have two options here:
What do you reckon is the best choice? |
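For context, here is a minimal sketch of the second (variable-outer) structure fleshed out with xarray, assuming one NetCDF file per (climate variable, year) pair, which is how SILO publishes its annual data. The `silo_file_path` helper and the local paths are hypothetical stand-ins, not BestiaPop's actual internals:

```python
import xarray as xr

def silo_file_path(climate_variable, year):
    # Hypothetical path builder -- adjust to the actual SILO file layout.
    return f"./silo/{year}.{climate_variable}.nc"

def extract_values(climate_variables, year_range, lat_range, lon_range):
    results = {}
    for climate_variable in climate_variables:
        for year in year_range:
            # One (potentially expensive) file open per variable/year pair...
            with xr.open_dataset(silo_file_path(climate_variable, year)) as ds:
                for lat in lat_range:
                    for lon in lon_range:
                        # ...then cheap in-memory lookups for each coordinate.
                        value = ds[climate_variable].sel(
                            lat=lat, lon=lon, method="nearest"
                        ).values
                        results[(climate_variable, year, lat, lon)] = value
    return results
```

The key property is that `lat`/`lon` stay innermost, so each file is opened exactly once per variable/year pair and every coordinate lookup hits the already-open dataset.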
I have double-checked the performance of bestiapop v2.5 downloading data from the cloud and it is still slow in comparison with the local version. I would consider option 2, but alert people about what could happen to the tool's performance if they deactivate multiprocessing, and why. I applied the following code:
and I got the following metrics: 1.21 min/file and 2.26 sec/year. I do not know why this run was faster than the previous example. |
@darkquasar I have calculated the current performance of bestiapop v2.5 when creating MET and WHT files from the SILO API, with multiprocessing activated and deactivated. Please have a look at the table. There seems to be an issue: if -m is activated, the process is slower than when -m is not activated. Another result is that the package worked more slowly for files covering a range of years than for a file with a single year, which makes sense (results are in time consumed per year, not per file).
|
We need to run these benchmarks again with the new ability to extract data from NASAPOWER. Multiprocessing has also been fixed: there was an issue in the way longitude ranges were calculated which made it much slower; now it is really fast. API backend calls have also been improved, enhancing speed. Your experience of BestiaPop should be much faster now with or without -m, but with -m we should see a real performance difference. |
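For readers wondering what -m does at a high level, here is a minimal sketch of the general fan-out pattern, not BestiaPop's actual implementation: the requested coordinates are split across worker processes, e.g. one task per latitude. The latitude range and the 0.05-degree grid step are assumptions used for illustration:

```python
import numpy as np
from multiprocessing import Pool

def process_latitude(lat):
    # Placeholder worker: in the real tool, each worker would extract all
    # requested variable/year/longitude values for its latitude and write
    # the MET/WHT output files.
    return lat

if __name__ == "__main__":
    # Hypothetical range mirroring the -lat "-41.85 -41.8" example above;
    # the 0.05-degree step is an assumption about the grid resolution.
    lat_range = np.arange(-41.85, -41.80 + 1e-6, 0.05)
    with Pool() as pool:  # defaults to one worker per available CPU core
        results = pool.map(process_latitude, lat_range)
    print(results)
```

With one worker per core, a 32-core machine can process 32 latitudes (or, in the examples above, 32 years) concurrently, which is where the -m speed-up comes from.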
Please have a look at the new |
There is an issue when opening NetCDF4 files from S3 using `xarray.open_dataset()` coupled directly with `s3fs`. Regardless of network bandwidth, some files take much longer to open than others, even though file size remains constant. This seems to be due to the way s3fs and xarray handle the byte chunks when loading the remote file. A potential solution would involve using `fsspec` to open the remote file (see the sketch below).