Costing of 1/40th simulation #23
Costing for storage: the current 1/20th standard output is 16 GB per month = 192 GB / year. The current 1/20th daily output is 4.3 GB per month, covering 7 variables at daily frequency. For the 1/40th we might want u, v, and transport on 3 density surfaces, plus the depth of those density surfaces, so 8 variables at 3-hourly frequency. Scaling up (4× the grid points, 8/7 the variables, 8× the output frequency) gives about 157 GB per month ≈ 2 TB per year, so asking for 20 TB would probably be sufficient. Thoughts on any of this appreciated.
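To make that arithmetic explicit, here's a rough sketch of how the 157 GB / 2 TB figures come about. The scaling factors (4× grid points, 8/7 variables, daily to 3-hourly) are assumptions based on the numbers above, not measurements:

```python
# Rough storage arithmetic for the proposed 1/40th 3-hourly output, assuming
# output volume scales linearly with grid points, variable count and frequency.

gb_per_month_1_20_daily = 4.3   # 7 variables at daily frequency on the 1/20th grid
var_factor = 8 / 7              # 7 variables -> 8 (u, v, transport on 3 surfaces + depths)
freq_factor = 24 / 3            # daily -> 3-hourly = 8x more records
grid_factor = 4                 # 1/40th has ~4x the horizontal grid points of the 1/20th

gb_per_month_1_40 = gb_per_month_1_20_daily * var_factor * freq_factor * grid_factor
tb_per_year = gb_per_month_1_40 * 12 / 1000

print(f"{gb_per_month_1_40:.0f} GB / month, {tb_per_year:.1f} TB / year")
# -> ~157 GB / month, ~1.9 TB / year, so ~20 TB would cover a 10-year run
```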
I am hoping that, with a little work, we can get more scaling out of the model (perhaps 30k cores?). I saw that Ben asked if we needed help on the model; perhaps we could ask for Paul or Rui to help @micaeljtoliveira with the profiling?
These estimates assume perfect parallel speedup, which is a best-case scenario and unlikely in practice. More cores also means more crashes, so actual throughput may not improve as much as we'd like, and the SU requirement with more cores would be inflated by both crashes and imperfect parallel scaling. So these SU estimates are a lower bound, and we won't know how close we can get to it without doing test runs.
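To make the lower-bound point concrete, a toy sketch; the efficiency and crash-overhead figures here are made up purely for illustration:

```python
# Hypothetical illustration of how imperfect scaling and crashes inflate the
# ideal-scaling SU estimate. None of these factors have been measured yet.

ideal_msu_per_yr = 2.1          # ideal-scaling estimate for the 1/40th (see the costing further down)

parallel_efficiency = 0.8       # assumed: 80% of ideal speedup at ~15000 cores
crash_overhead = 0.1            # assumed: 10% of walltime lost to crashes and reruns

actual_msu_per_yr = ideal_msu_per_yr / parallel_efficiency * (1 + crash_overhead)
print(f"~{actual_msu_per_yr:.1f} MSU / year")   # ~2.9 MSU/yr under these assumptions
```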
Yep, agreed. I've passed these numbers on to Al and Ben at NCI and indicated that they're a lower bound. I've also asked whether they can help with optimisation to see if we can get any speed-up.
This could be useful if we decide to do more fine-grained profiling. Currently I'm using the timings provided by MOM6, but these cover relatively large portions of the code. To get more detailed profiling we will probably need specialized HPC instrumentation tools, and NCI staff will surely have lots of experience with those.
Looks like the actual numbers are a bit better than predicted when using ~10000 cores. The question now is whether we can use more cores and what the parallel efficiency would be in that case.
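For that follow-up question, parallel efficiency from two benchmark runs is just the measured speedup divided by the ideal speedup; a minimal sketch, with placeholder timings rather than our actual numbers:

```python
# Parallel efficiency relative to a reference run: 1.0 means perfect scaling.
def parallel_efficiency(n_ref, t_ref, n_new, t_new):
    speedup = t_ref / t_new            # how much faster the bigger run actually is
    ideal_speedup = n_new / n_ref      # how much faster it would be with perfect scaling
    return speedup / ideal_speedup

# Hypothetical example: 2.75 h/month on ~10000 cores vs 1.6 h/month on 20000 cores
print(parallel_efficiency(10000, 2.75, 20000, 1.6))   # ~0.86
```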
Update on the cost: the Panan 1/40th has been running for a few months now, and it has indeed been using ~1.5 MSU/yr. With the tile collation I believe the final cost of the 10-year simulation will be about 16 MSU.
NCI have asked for an updated costing of our planned 1/40th run. Based on scaling up the 1/20th, I have:
1/20th:
cpus: 3744
Walltime: 2 hr 45 min for 1 month = 33 hours for 1 year
SU: 22 kSU for 1 month = 264 kSU for 1 year
1/40th:
cpus: 4 × the 1/20th cpus ≈ 15000 cpus
Walltime: 2× (due to the halved time step) = 66 hours / year
SU: 8× (4× cells × 2× time steps) = 2.1 MSU / year
If we run 10 years on 15000 cpus, that will cost 21 MSU (plus more for crashes and for when we output high-frequency data) and take 28 days…
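Writing those scaling assumptions out explicitly (ideal scaling only, i.e. 4× grid points and a halved time step, ignoring crashes and extra output):

```python
# Ideal-scaling arithmetic from the measured 1/20th numbers to the 1/40th estimate.

cpus_1_20 = 3744
walltime_hr_per_yr_1_20 = 33       # 2 hr 45 min per month
ksu_per_yr_1_20 = 264              # 22 kSU per month

cpus_1_40 = 4 * cpus_1_20                                # ~15000 cores (4x grid points)
walltime_hr_per_yr_1_40 = 2 * walltime_hr_per_yr_1_20    # 66 hours / year (2x time steps)
msu_per_yr_1_40 = 8 * ksu_per_yr_1_20 / 1000             # ~2.1 MSU / year (4x cells * 2x steps)

years = 10
print(f"{years} years: ~{years * msu_per_yr_1_40:.0f} MSU, "
      f"~{years * walltime_hr_per_yr_1_40 / 24:.0f} days of continuous running")
# -> ~21 MSU and ~28 days
```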
It sounds like we may need to ask for somewhat more than that, because their current benchmarking suggests the new cores' performance is not as good as the existing Cascade Lakes. One problem I see here is that this is going to take 28 days (assuming it runs continuously)! Do we think there's any possibility of scaling this up to more cores to get more throughput? @micaeljtoliveira @angus-g @aekiss @AndyHoggANU ?