Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

808 Checkpoint/Restart collection #809

Merged
merged 12 commits into from
Jun 5, 2020
Merged

Conversation

lifflander
Copy link
Collaborator

@lifflander lifflander commented May 11, 2020

Fixes #808

Has a test, working.

@codecov
Copy link

codecov bot commented May 11, 2020

Codecov Report

Merging #809 into develop will increase coverage by 0.34%.
The diff coverage is 95.75%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #809      +/-   ##
===========================================
+ Coverage    80.07%   80.41%   +0.34%     
===========================================
  Files          343      347       +4     
  Lines        10724    10933     +209     
===========================================
+ Hits          8587     8792     +205     
- Misses        2137     2141       +4     
Impacted Files Coverage Δ
src/vt/vrt/collection/manager.impl.h 85.24% <88.88%> (+1.22%) ⬆️
src/vt/vrt/collection/collection_directory.h 100.00% <100.00%> (ø)
src/vt/vrt/collection/staged_token/token.h 100.00% <100.00%> (ø)
tests/unit/collection/test_checkpoint.cc 100.00% <100.00%> (ø)
src/vt/vrt/collection/staged_token/token.impl.h 100.00% <0.00%> (ø)
src/vt/topos/mapping/dense/dense.impl.h 98.21% <0.00%> (+0.06%) ⬆️

@lifflander lifflander force-pushed the 808-restart-collection branch from bad9807 to a4021ea Compare May 11, 2020 20:36
@PhilMiller
Copy link
Member

This generally looks good, and pretty straightforward.

Something we may want to consider in checkpointToFile is putting the serialized size in the first sizeof(size_t) bytes, before the payload, so that we can compare that to the stat'd size of the file that we read in restartFromFile

@PhilMiller
Copy link
Member

This does make an assumption of the array being dense in its index space.

Looking ahead a little bit, will that be a problem if we wanted to use this in NimbleSM? Does that even care about preserving a collection instance across a hypothetical checkpoint?

@PhilMiller
Copy link
Member

As we look toward large-scale runs, we're going to have a lot of objects. We may want to create subdirectories to hold limited numbers of objects each. I was going to suggest per-index-component, but that falls over if one of the index components is simply process rank, since even that gets too big. Instead, maybe we should hash the indices, and use a few bytes from the hash to set subdirectories?

@lifflander
Copy link
Collaborator Author

This does make an assumption of the array being dense in its index space.

Looking ahead a little bit, will that be a problem if we wanted to use this in NimbleSM? Does that even care about preserving a collection instance across a hypothetical checkpoint?

I can fix that easily by just not failing if the file is not found during restart---assume it never existed.

@PhilMiller
Copy link
Member

This does make an assumption of the array being dense in its index space.
Looking ahead a little bit, will that be a problem if we wanted to use this in NimbleSM? Does that even care about preserving a collection instance across a hypothetical checkpoint?

I can fix that easily by just not failing if the file is not found during restart---assume it never existed.

That seems a bit risky - How about each process recording the objects it knows it's the home process for, in an 'indices.$node' file?

@lifflander lifflander force-pushed the 808-restart-collection branch from a4021ea to 88e3e02 Compare June 5, 2020 18:09
@lifflander lifflander self-assigned this Jun 5, 2020
@lifflander lifflander marked this pull request as ready for review June 5, 2020 18:09
@lifflander lifflander force-pushed the 808-restart-collection branch from 88e3e02 to 94ea92c Compare June 5, 2020 18:11
@lifflander
Copy link
Collaborator Author

@PhilMiller I've addressed all your concerns and written a test for it!

Copy link
Collaborator Author

Codacy Here is an overview of what got changed by this pull request:

Clones added
============
- tests/unit/collection/test_checkpoint.cc  2
         

See the complete overview on Codacy

@PhilMiller
Copy link
Member

All lgtm

@lifflander
Copy link
Collaborator Author

@PhilMiller Good for beta.9?

@lifflander lifflander merged commit e34ddc5 into develop Jun 5, 2020
@PhilMiller
Copy link
Member

PhilMiller commented Jun 5, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restart collection from serialized files per element
2 participants