Archive the 4 seasons runs to HPSS #2131

Closed
PeterCaldwell opened this issue Jan 20, 2023 · 4 comments

Comments

@PeterCaldwell (Contributor) commented Jan 20, 2023

My current strategy for archiving the 4 seasons runs' output isn't working. I've spent months trying to use zstash to upload our data to HPSS. In contrast, I was able to move all of our output data to NERSC with ~10 min of work and ~3 hrs of wait time per season. It's time to give up on my half-done zstash archives and try something new. The critical problem seems to be tarring the files; verifying file integrity is also problematic.

Plan:

  • Use globus directly to move all the netcdf files in each run directory to HPSS. Each 3D file is ~200 GB and OLCF wants tar files of ~800 GB, so tarring doesn't buy much. We do have ~60 files of ~100 MB each (T2m and Qv2m), though; in future runs we should put all 3 variables in 1 file. I can't figure out the zstash command for just those files (zstash has an --exclude option but not an --include), so I've just moved them individually (see the globus CLI sketch after this list). The appropriate globus endpoints are "OLCF DTN" and "OLCF HPSS". Note that globus verifies checksums by default.
  • Archive everything in the case_scripts directory (this takes ~10 sec): zstash create --hpss /hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/case_scripts case_scripts/ --exclude="cmake_macros/*"
  • Archive all the non-netcdf stuff in the run directory (this takes ~10 sec): zstash create --hpss /hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/run run/ --exclude="*.nc,core.*"
  • Should I zstash check the 2 tar files I've created? We don't care much about these files (I think), so maybe not, but they're so small that checking should be easy.
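For reference, here's a rough sketch of what that per-file globus transfer could look like with the globus CLI. The endpoint UUIDs, the run-directory path, and the batch-file name are placeholders, and the exact --batch syntax may differ between globus-cli versions, so treat this as a sketch rather than the exact commands I ran:

```bash
# Placeholder endpoint UUIDs -- look them up once with e.g.
#   globus endpoint search "OLCF DTN"
#   globus endpoint search "OLCF HPSS"
DTN_EP="PUT-OLCF-DTN-UUID-HERE"
HPSS_EP="PUT-OLCF-HPSS-UUID-HERE"

RUN_DIR="PUT-SCRATCH-RUN-DIRECTORY-HERE"   # placeholder: the run dir on GPFS
HPSS_DIR=/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/run

# Build a batch file mapping each netcdf file to its HPSS destination,
# then submit a single transfer. Checksum verification is on by default.
for f in "$RUN_DIR"/*.nc; do
  echo "$f $HPSS_DIR/$(basename "$f")"
done > dyamond2_nc.txt

globus transfer "$DTN_EP" "$HPSS_EP" --batch dyamond2_nc.txt \
  --label "DYAMOND2 run netcdf to HPSS"
```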

Note that to check whether zstash actually grabbed the desired data, you need to use zstash ls -l --hpss="/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/run"; a bunch of other variants I tried made it look like the archive was empty.
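To answer my own question above: since both archives are tiny, a full zstash check is cheap. A minimal sketch, assuming zstash check takes the same --hpss argument as zstash ls (it pulls the tars back down into a local cache and compares checksums against the index):

```bash
# Verify both small archives; run from a directory with a little scratch space,
# since check re-downloads the tars locally before comparing.
zstash check --hpss=/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/case_scripts
zstash check --hpss=/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/run
```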

Does this plan seem workable? Am I missing any directories or files that we care about? Does not tarring things seem evil?

Globus transfers are still in flight, but otherwise I think I've finished this task.
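For anyone watching those transfers, the globus CLI task commands make monitoring easy. The task ID below is a placeholder for whatever ID globus transfer printed at submission:

```bash
# Show recent transfer tasks and their status (ACTIVE / SUCCEEDED / FAILED).
globus task list --limit 10

# Details for one transfer, and block until it finishes (TASK_ID is a placeholder).
globus task show TASK_ID
globus task wait TASK_ID --timeout 3600
```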

@PeterCaldwell (Contributor, Author) commented:

Scratch and HPSS locations for each season (a per-season transfer sketch follows the list):

DYAMOND2:

/gpfs/alpine/cli115/proj-shared/donahue/e3sm_scratch/ne1024pg2_ne1024pg2.F2010-SCREAMv1-DYAMOND2.20221220_dyamond2.24ab0b8bdbdebccd2fc717e55d3f8a8a16cebfff
/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/

DYAMOND1:

/gpfs/alpine/cli115/proj-shared/donahue/e3sm_scratch/ne1024pg2_ne1024pg2.F2010-SCREAMv1-DYAMOND1.20221216_dyamond1.081538b5e95a5ae54533e533176fd29c2a1b98ab
/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND1/

Oct 1:

/gpfs/alpine/cli115/proj-shared/donahue/e3sm_scratch/ne1024pg2_ne1024pg2.F2010-SCREAMv1.20221014_production_run.27604ccf3f1aaa88ea3413b774ef3817cad7343a/
/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/Oct1_2013/

Apr 1:

/gpfs/alpine/cli115/proj-shared/donahue/e3sm_scratch/ne1024pg2_ne1024pg2.F2010-SCREAMv1.20221208_April_2013.ea484ef5161699ffd0192e23972d6858e5cd9be0
/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/Apr1_2013/
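In case it's useful later, here is a sketch of how the netcdf transfers could be scripted over all four seasons. It assumes the run directory is the run/ subdirectory of each scratch case directory above, that the netcdf files land under a run/ subdirectory of each HPSS season path, and it reuses the placeholder endpoint UUIDs from the sketch in my first comment:

```bash
# Endpoint UUIDs as in the earlier sketch (placeholders).
DTN_EP="PUT-OLCF-DTN-UUID-HERE"
HPSS_EP="PUT-OLCF-HPSS-UUID-HERE"

SCRATCH=/gpfs/alpine/cli115/proj-shared/donahue/e3sm_scratch
HPSS_ROOT=/hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons

# Season name (HPSS subdirectory) -> scratch case directory, per the list above.
declare -A RUNS=(
  [DYAMOND2]="$SCRATCH/ne1024pg2_ne1024pg2.F2010-SCREAMv1-DYAMOND2.20221220_dyamond2.24ab0b8bdbdebccd2fc717e55d3f8a8a16cebfff"
  [DYAMOND1]="$SCRATCH/ne1024pg2_ne1024pg2.F2010-SCREAMv1-DYAMOND1.20221216_dyamond1.081538b5e95a5ae54533e533176fd29c2a1b98ab"
  [Oct1_2013]="$SCRATCH/ne1024pg2_ne1024pg2.F2010-SCREAMv1.20221014_production_run.27604ccf3f1aaa88ea3413b774ef3817cad7343a"
  [Apr1_2013]="$SCRATCH/ne1024pg2_ne1024pg2.F2010-SCREAMv1.20221208_April_2013.ea484ef5161699ffd0192e23972d6858e5cd9be0"
)

for season in "${!RUNS[@]}"; do
  run_dir="${RUNS[$season]}/run"           # assumed layout: <case dir>/run
  dest="$HPSS_ROOT/$season/run"            # assumed HPSS layout
  for f in "$run_dir"/*.nc; do
    echo "$f $dest/$(basename "$f")"
  done > "${season}_nc.txt"
  globus transfer "$DTN_EP" "$HPSS_EP" --batch "${season}_nc.txt" \
    --label "4-seasons $season netcdf to HPSS"
done
```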

@PeterCaldwell (Contributor, Author) commented Jan 21, 2023

Issues with zstash:
1. Does zstash have the ability to --include instead of just --exclude? This would be helpful when files are huge and only some of them should be tarred (a possible staging workaround is sketched after this list).
2. zstash create literally takes weeks to copy all the data from a run to HPSS, and zstash's strategy of downloading all the data back to verify file integrity takes even longer. This is unworkable for 50 TB/sim-month workloads.
3. I had issues with file permissions. Maybe zstash could automatically set collaboration-friendly defaults?
4. I had lots of issues with ghost "no disk space left" errors, where some files would fail to get added to a tar file but identically sized files would make it in fine. OLCF never figured out why.
5. There's weird inconsistency regarding whether --opt=x or --opt x is the correct form.
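To expand on item 1: absent an --include option, one possible workaround (not what I actually did -- I just moved the files with globus) is to hard-link the wanted files into a staging directory and run zstash create on that. The file-name patterns and the run_2d HPSS destination below are hypothetical:

```bash
# Hard-link only the small 2D files into a staging directory (links cost no
# extra disk space, but must live on the same filesystem as run/).
mkdir staging_2d
ln run/*T2m*.nc run/*Qv2m*.nc staging_2d/   # file-name patterns are placeholders

# Archive just the staged files; the run_2d destination is hypothetical.
zstash create --hpss /hpss/prod/cli115/world-shared/EAMxx/V1FourSeasons/DYAMOND2/run_2d staging_2d/
```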

@PeterCaldwell (Contributor, Author) commented:

This ended up not really being a GitHub issue, but rather a scratch pad for me to figure out what to do. I would like feedback from @golaz, @AaronDonahue, and @crterai on whether my strategy seems sound, just to make sure we don't discover 3 months from now that we're missing some important datasets...

@AaronDonahue (Contributor) commented:

This has been done and is documented in the 4-Seasons paper and on e3sm-docs.
