-
Notifications
You must be signed in to change notification settings - Fork 882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MPI_Info update (2nd / corrected PR) #1308
Conversation
c44bd41
to
5dcaa3f
Compare
Test FAILed. |
@dsolt will fix the "make check" problems and update this PR. |
@@ -65,7 +65,9 @@ headers = \ | |||
strncpy.h \ | |||
sys_limits.h \ | |||
timings.h \ | |||
uri.h | |||
uri.h \ | |||
info_subscriber.h \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems you used a tab instead of whitespaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'm surprised there are not lots of these (maybe there are). I wasn't super careful because I thought that the commit hook script checked for this. I don't think my commit hook script is setup correctly though because the copyrights did not get updated either (as you noted in another comment). I'm looking into that.
@dsolt generally speaking, you have the option to update the copyright at the beginning of each file you update (e.g. 2016 IBM) |
ompi_communicator_t, ompi_win_t, ompi_file_t all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in previous code comm used c_base, but file used super). It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, then I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class. The key here is that we want to be able to pass comm, win and file objects as if they were opal_infosubscriber_t, so that one routine can heandle all 3 types of objects being passed to it. MPI_INFO_NULL is still an ompi_predefined_info_t type since an MPI_Info is part of ompi but the internal details of the underlying information concept is part of opal. An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object. Routines such as ompi_info_dup, etc have all been moved to opal_info_dup and related to the opal directory. Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of the ompi_info_t and not the opal_info_t The data structure changes are primarily in the following files: communicator/communicator.h ompi/info/info.h ompi/win/win.h ompi/file/file.h The following new files were created: opal/util/info.h opal/util/info.c opal/util/info_subscriber.h opal/util/info_subscriber.c This infosubscriber concept is that communicators, files and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window. When xxx_set_info is called, the new info is presented to each subscriber who can modify the info in any way they want. The new value is presented to the next subscriber and so on until all subscribers have had a chance to modify the value. Therefore, the order of subscribers can make a difference but we hope that there is generally only one subscriber that cares or modifies any given key/value pair. The final info is then stored and returned by a call to xxx_get_info. The new model can be seen in the following files: ompi/mpi/c/comm_get_info.c ompi/mpi/c/comm_set_info.c ompi/mpi/c/file_get_info.c ompi/mpi/c/file_set_info.c ompi/mpi/c/win_get_info.c ompi/mpi/c/win_set_info.c The current subscribers where changed as follows: mca/io/ompio/io_ompio_file_open.c mca/io/ompio/io_ompio_module.c mca/osc/rmda/osc_rdma_component.c (This one actually subscribes to "no_locks") mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig")
5dcaa3f
to
35137a8
Compare
@dsolt Have you been able to run all these through the IBM info tests, and any other info tests that you have? Have you been able to write a test or three showing that the new infrastructure inside OMPI works, beyond just using INFO_SET/GET? |
Not yet, but soon... hopefully this week. From: Jeff Squyres [email protected] @dsolt Have you been able to run all these through the IBM info tests, and |
Mark Allen has agreed to test this new functionality (next week), and possibly write a few new tests. If he does write new tests, should he push them to MTT? |
Yes, please put them in the ibm/info directory in the ompi-tests repo. We'll need to add a little configury to make sure they don't fire for the v1.10 series, but that's easy enough. |
Ok, so some code walk through and testing discussion led to the realization that the previous MPI_Comm_dup and MPI_Comm_dup_with_info both ignore the whole info concept entirely. So, there is some work to do yet to fix those (and MPI_Win_dup, MPI_File_dup). So, questions on what the MPI Standard intends for MPI_Comm_dup to do: MPI_Info_set(myinfo, "use_shared_memory", "yes");
MPI_Comm_set_info(comm, myinfo);
MPI_Comm_get_info(comm, myinfo_returned);
/// Assume here that there was not enough shared memory available to use shared-memory for
// communication on this comm, so MPI_Info_get() tells us that use_shared_memory is "no".
MPI_Comm_free(&anothercomm);
/// now there is more shared-memory available
MPI_Comm_dup(comm, &newcom);
MPI_comm_get_info(newcom, myinfo_returned);
// WHAT SHOULD THIS DO? WHAT SHOULD BE "PRESENTED" TO NEWCOM Should the info on the newcomm be "no" since it was "no" on its parent or should it be "yes" since MPI_Comm_set_info() was used to set "yes" for shared-memory on its parent? Consider the opposite, the original comm was asked to use shared-memory and it did, but new does not have enough shared-memory available to use shared-memory, should it return "yes" or "no"? It seems crazy that internally it isn't using shared-memory (because it wasn't available), but it is going to return "yes" just because MPI_Comm_get_info on its parent would return "yes" for "use_shared_memory". So, the boils down to two questions: Should MPI_Comm_dup blindly duplicate the key/value pairs of the previous communicator or should they be presented to the new communicator and the subscribers will determine what the get should return. I'm pretty sure for this we want to present values to the new communicator, not just blindly replicate them. Assuming they are presented, should the key/values that were set on the original communicator be presented to the new communicator or should the key/values that get would return on the original communicator be presented to the new communicator? I really hope we can go with the get option so we don't have to store separate set and get values. |
I guess I should say that I actually think that during a Dup, the MPI_Comm_set_info values of the parent should be presented to the new communicator, not the MPI_Comm_get_info values of the parent comm. I don't like that answer since it means we have to store the set values for the purpose of dup, but it seems like the right thing to do. |
When the MPI 3.1 standard describes MPI_Comm_dup() and MPI_Comm_dup_with_info() it twice implies that MPI_Comm_dup() will just blindly copy the info of its parent without subscriber callbacks modifying the values. But then in 6.4.4 "Communicator Info" it finally uses language that supports the more logical interpretation, where it says "when the communicator is duplicated using MPI_COMM_DUP or MPI_COMM_IDUP. In this case, all hints associated with the original communicator are also applied to the duplicated communicator". That key phrase "applied to" is what I want to see in the other two places, making it clear that subscriber-callbacks process the parent's info at MPI_Comm_dup() time rather than just duplicating them. I think it's logical that in the case of MPI_Comm_dup(), the paren'ts current info is inherited to be used as the starting point, which then goes through the callbacks to become the new comm's potentially different info. And in the case of MPI_Comm_dup_with_info() the input info argument is the starting point. That's an interesting idea, about saving all original info argument values rather than just using the current info values of the parent communicator, but I don't think that's what the standard is saying to do. Besides, I think MPI_Comm_dup_with_info() is the call that would be used by people who care about the info settings. |
As far as making tests, we haven't done much yet. If we're trying to specifically test the concept of components subscribing to info keys and potentially modifying them, we have the problem that MPI is allowed to keep a value, change it, or throw it away. So the testcase has no basis for deciding if MPI is behaving correctly or not. I'm tempted to make the testcase directly use opal_infosubscribe_subscribe() to add its own callback to be triggered at MPI_Comm_set_info() time, but that definitely woudn't be a standards compliant test. Or we could just write general tests using the MPI calls that were touched, and see if there are any catastrophic failures in the general areas of those functions. |
MPI_Comm_set_info is defined as a collective call. My understanding is that if we allow the new communicator to reevaluate the inherited hints, we basically force another collective upon communicator creation (one based on data with unknown size). |
MPI_Comm_set is collective, but does not necessarily mean it has to communicate. Our current implementation does not barrier and this PR doesn't introduce a barrier. But it is true that when presenting the hints to the underlying subscribers, the individual subscriber components may choose to do a collective operation which could be costly. |
@dsolt Pro github tip: prefix blocks of C code with a line of exactly three single-ticks and the letter "c" (the three single ticks tell Github Markdown to make it be a verbatim / fixed-width section; the letter "c" means to syntax highlight the block with C-style syntax) and suffix the block with a line of exactly three single ticks (to close the verbatim block). I edited your comment from a few days ago to add this to make it a bit more readable. |
This sounds like a thorny topic -- perhaps we should discuss this in Dallas next week? I'll add it to the agenda. One whacky idea while I'm thinking about it (perhaps to be discussed next week?): @dsolt correctly identifies a dichotomy between the user-specified values and the OMPI-returned values -- it creates an ambiguity in
In this way, when we Admittedly, this is a workaround for an unclear MPI standard issue. Perhaps the real fix will need to be debated up in the Forum...? |
@markalle Your approach to testing this sounds correct (i.e., it may be more of a unit test of this infrastructure vs. an actual MPI application). |
I like the idea of using prefix/suffixes. It seems like the easiest method for maintaining a copy of the incoming value set by the user, separate from the potentially modified value processed by callbacks. I'm tempted to do the reverse of your example though, eg keep a "ompi_saved_[keyname]" key with the unmodified user-input and let the original key/value be modified. This seems more intuitive for the user who requests some setting and then queries to see what they received. Nothing in the standard currently would clue them in to look for "[keyname]_provided". And to avoid a bunch of nuisance-code related to string-length, maybe only save the above extra key/val pair for recognized ompi keys which by convention we'd keep far enough below MPI_MAX_INFO_KEY to ensure room for a prefix/suffix. |
@markalle Fair enough -- you're probably right that a user would expect to SET a key and then GET the same key back to see what happened. I also agree that we only need to keep values for keys that OMPI recognizes -- I think there's standard language that supports this, about how MPI_COMM_INFO_GET will only return hints that are "being used," or somesuch (i.e., if OMPI doesn't recognize it, it won't use it, so there's no point in storing it). |
The standard mandates the length of the keys. By adding a prefix or suffix we might, depending on the length of the user keys, need more than this predefined length. How do we deal with the worst case where we will need to truncate. |
I guess we can wait then. I took the mpi-standard statement |
Back to the testing topic: where should unit tests go for Open MPI? If the API ends up simple enough I can just extern a couple functions and call them from an independent program that's just linked against libmpi.so I might try that, but if it involves essentially making a test that lives inside the Open MPI library, would I make it something that can be #ifdefed or maybe have the test activiated by environment var? Is there a mechanism / precedent to follow? |
@markalle The Forum is arguing over that exact statement. You can follow the thread: http://lists.mpi-forum.org/mpi-forum/2016/02/3173.php @markalle Regarding testing: we have 2 general areas for tests. Small unit tests live inside the test/ tree in the OMPI tree itself, and are activated by running "make check" (and they're run automatically when we make official distribution trarballs). Outside of that, we put stuff in the ompi-tests repo. We originally gathered a bunch of test suites in there, and then started supplementing those test suites with our own additions and whatnot. Probably the one that has grown the most with our own custom tests is the IBM test suite (which is MPI API-level tests, not unit tests of OPAL or ORTE). |
As a reminder, @jsquyres please discuss with MPI Forum and try to drive to some conclusion. Oh, and herd some cats into a bag while you're at it. |
Due to an extraordinarily-long agenda at the Forum this week, we did not get to fully address this issue. 😦 Here's what we got:
So it's now up to us (and possibly @jdinan, if he cares) to propose what MPI_COMM_SET_INFO and MPI_COMM_GET_INFO should do. We should probably all carefully read what MPI-3.1 says and then make a proposal. Bonus points will be awarded if the proposal can be an interpretation of the current text, but it's quite possible (likely?) that the current text is simply totally borked, and we'll just have to propose new text that isn't an interpretation of the current text. |
I'm okay with MPI_Comm_dup() inheriting nothing, as that would be in line with all the other comm creation routines. The code is pretty easy to swap between various interpretations if we need to later as well. Right now there is quite a bit of extra code that allows values to change while a backup of the original remains behind, and that's unnecessary baggage under the current interpretation. But I kind of hate to rip it out. Right now you get what you set, but the code is taking a round-about method of making that happen. Internally it has both the original and the modified settings lying around. |
I wouldn't rip it out yet -- I think there's still a question of what to do with MPI_COMM_GET_INFO. I voiced my opinion up above (see #1308 (comment)) -- do you agree?
True (no, you're not behind the times).
Good point. I'll pass this on to the RMA working group. |
@jsquyres, sure, I'm happy to help with the spec work. |
I'm not sure if this discussion should continue in this PR or be moved
Would this mean that MPI_Comm_dup_with_info is identical to MPI_Comm_dup + I agree with your interpretation of what "SET" should do (replace, not Dave EDIT: Lotus notes really does horrid things when replying to github issues... |
For some info hints, it may only be possible to apply them when the communicator is created (i.e. there's no pending communication since the communicator hasn't been created yet). |
My least favorite aspect of the current code (which I have a testcase for and which finally has no memory leaks, but for which I still need to create a pull request) is the location from which the
|
@markalle, for MPI_COMM_IDUP the info should be copied as early as possible in the duplication process, preferably before we return from MPI_COMM_IDUP (or we will need some versioning mechanism). Doing it during the call to activate might be too late, I would rather move it up in ompi_comm_idup_internal (right after ompi_comm_set_nb?). |
Okay, I changed it again. Previously the infosubscriber design wouldn't have worked well with duping the info as early as possible, because all activation of the callbacks including the initial activation was tied to the setting of the info. So that design would have resulted in setting the comm's info early, walking a non-existent list of callbacks, and then later having modules register callbacks they want to have triggered.
|
Can you rebase on master now that the mpool/rcache rewrite is in? |
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. |
This PR was discussed today in the OMPI face-to-face meeting in Dallas. Conclusions:
This allows other parts of OMPI to start using |
|
:bot:retest |
Build Failed with XL compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/dbb1e3d9ac621d7f658b22f51a29c5f3 |
A proposal was just filed with the MPI Forum on the "What happens when you MPI_*_GET_INFO?" question: mpi-forum/mpi-issues#63 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a Signed-off-by line to this PR's commit.
Replaced by PR #2941 |
This is my first pull request, please give careful review. Not sure if this should be on a branch first? This change is intended for 2.1 and is a new model of handling MPI_Info changes to communicators/windows/files. It moves much of the info handling to the opal layer, but retains a wrapper for info objects that are exposed to the user. This allows us to use the info keyvalue concept internally and also implement MPI_Info using this internal concept.
The main driver for this change was to allow different components to see changes to MPI_Info (via MPI_Comm_set_info, etc) and have an opportunity to respond to those changes by either accepting the change and modifying their future behavior or modifying the proposed info change if they do not support it.
There are some scenarios which are difficult for this model. Particularly if multiple components (called subscribers in the code) use the same key but conflict on the values they support. Currently the key is presented to each component in turn and each has a chance to modify the value. The ordering therefore matters. The key (no pun intended) is to not do this. Each module should either use its own keys or they should agree on what values they can accept.
This code has received very minimal testing. I verified that no_locks values could be updated and the new value rejected when the underlying component does not support it. I can work on further testing, but would like some eyes on this as I do that to make sure the overall direction makes sense and that my approach to moving code around makes sense for the Open MPI architecture.
I would especially like @hjelmn to make sure that this will work for him.
@jsquyres @hjelmn @nysal @jjhursey @gpaulsen