An Ember problem causes Detailed Network test to fail on RedHat-7 VM and on MultiRank #108
Comments
John, so is it just how the RedHat-7 VM handles write permissions by default? Vitus
I think this is really down to how SST Ember actually does file opening. There is a potential ordering effect here on who owns the final FILE*. It's more of a bug in Ember, which is now getting exposed.
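A minimal sketch of that ordering effect (written in Python rather than Ember's C++, with illustrative file and motif names; this is not Ember's actual code): whichever motif opens the shared log last truncates the earlier output, so the surviving contents depend on open/close ordering, the analogue of who ends up owning the final FILE*.

```python
# Illustrative only, not Ember's actual code: two motifs independently open
# the same motif-0.log in "w" mode, so each open truncates whatever the
# previous motif wrote, and the final contents depend on ordering.

def run_motif(name, log_path):
    # Each motif opens the shared log for writing, truncating it.
    with open(log_path, "w") as log:
        log.write(f"{name}: traffic pattern complete\n")

run_motif("allreduce", "motif-0.log")
run_motif("halo3d", "motif-0.log")   # silently clobbers the allreduce output

with open("motif-0.log") as log:
    print(log.read())                # only the last writer's output survives
```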
John, can you try running this again and see if it restores old behavior at least? There is still an issue where a "job" spans multiple SST ranks.
Nightly seems to have restored the old behavior. I'm on VPN from home at the moment; I'll be in after a bit.
John, can you confirm the test passes on RedHat 7.X? Si Hammond, Scalable Computer Architectures
The nightly FAILED on RedHat-7 as it had previously. It is looping on an empty motif-3.log file. The other platforms that run it are happy.
Nothing appears to have changed with respect to the file overwrite problem.
John, I have lost track of this a little bit. The last thing I remember is Si saying the problem is not in the scheduler. Vitus
This is an issue with Ember. :-( Si Hammond
At this point this is a solid failure on the RedHat-7 VM, on COERHEL-7, and on all three Multi Rank n=2 nightly tests: sst-test, El Capitan, and Yosemite.
@jpvandy - can you post the error into the issue please? Thank you!
Ok, no problem, I already have it! Sorry for the spam!
@afrodri: Si should evaluate whether this is still a 6.0.0 issue.
Fixed. |
This is not fixed; it may be deferred, or it may be "not to be fixed", but it's not fixed.
When the OpenMPI version was updated to 1.8.8, this problem became ubiquitous across all platforms. It was fixed with PR 498 on Nov. 8th.
This issue is about an Ember file naming problem that breaks the Scheduler Detailed Network test.
Issue 147 is about an occasional Ember assert related to a "bufLen" buffer usage problem.
Issue 274 is about an occasional Ember time-limit failure.
The test was running on all other tested platforms. The failure appears to have to do with overwriting the motif-xx.log files.
(These failures can trigger another problem: the SQE time-limit processor really deals with the time limit for a single SST run, but the Detailed Network test has a Python wrapper that loops through multiple sst invocations. Consequently, it can fail to terminate the wrapper loop.)
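A hypothetical sketch of why an empty motif log can cause that non-termination (the real wrapper is part of the SQE test suite; the file names, timeout value, and polling logic below are assumptions for illustration): if the wrapper waits for each motif log to become non-empty before moving on, an overwritten or empty motif-3.log never satisfies the check, and only an explicit outer timeout stops the loop.

```python
# Hypothetical illustration, not the actual SQE wrapper: the per-run SST time
# limit does not cover this outer loop, so without the explicit timeout below
# an empty motif log would make the wrapper spin forever.
import os
import time

def wait_for_log(path, timeout=60.0, poll=1.0):
    """Return True once `path` exists and is non-empty, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        time.sleep(poll)
    return False

for i in range(4):                      # one sst invocation per motif (illustrative count)
    log = f"motif-{i}.log"
    # ... launch the sst invocation for motif i here ...
    if not wait_for_log(log):
        raise RuntimeError(f"{log} stayed empty; aborting wrapper loop")
```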
The failure is also observed with MULTI_RANK execution on the Serialization Branch.