An Ember problem causes Detailed Network test to fail on RedHat-7 VM and on MultiRank #108

Closed
jpvandy opened this issue Feb 8, 2016 · 16 comments

@jpvandy
Contributor

jpvandy commented Feb 8, 2016

This issue is about an Ember file-naming problem that breaks the Scheduler Detailed Network test.
Issue 147 is about an occasional Ember assert concerning a "bufLen" buffer usage problem.
Issue 274 is about an occasional Ember time limit.

The test was running on all other tested platforms. The failure appears to be related to overwriting the motif-xx.log files.
(These failures can trigger another problem: the SQE timelimit processor is really designed to deal with a time limit for a single SST run, but the Detailed Network test has a Python wrapper that loops through multiple sst invocations. Consequently, it can fail to terminate the wrapper loop.)
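
For illustration only, here is a minimal sketch of how a wrapper that loops over several invocations could share one wall-clock budget, so that a hung run cannot keep the loop alive. The function name `run_invocations` and its parameters are assumptions made for this example; this is not the actual test-suite wrapper or the SQE timelimit processor.

```python
import subprocess
import sys
import time

def run_invocations(commands, total_limit_s):
    """Run each command in order under one shared wall-clock budget.

    Returns a list of (command, status) pairs. status is the exit code,
    or the string "timeout" if the shared budget ran out.
    (Hypothetical helper, not part of the SST test suite.)
    """
    deadline = time.monotonic() + total_limit_s
    results = []
    for cmd in commands:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Budget already spent: record it and leave the loop.
            results.append((cmd, "timeout"))
            break
        try:
            # subprocess.run kills the child and raises if it overruns.
            proc = subprocess.run(cmd, timeout=remaining)
            results.append((cmd, proc.returncode))
        except subprocess.TimeoutExpired:
            results.append((cmd, "timeout"))
            break
    return results

if __name__ == "__main__":
    # Stand-in commands; a real wrapper would launch sst here.
    quick = [sys.executable, "-c", "pass"]
    print(run_invocations([quick, quick], total_limit_s=10.0))
```

The point of the design is that the deadline belongs to the loop as a whole, not to any one invocation, so the loop always terminates.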

The failure is also observed with MULTI_RANK execution on the Serialization Branch.

@vjleung
Collaborator

vjleung commented Feb 8, 2016

John,

So is it just how the RedHat-7 VM handles write permissions by default?

Vitus

@nmhamster
Contributor

I think this really comes down to how SST Ember does its file opening. There is a potential ordering effect here on who owns the final FILE*. It's more of a bug in Ember which is now getting exposed.
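
As a rough illustration of that kind of ordering effect (a sketch under assumptions, not Ember's actual code, which is C++), the snippet below has two writers open the same log path in truncating mode. Whoever opens last wipes what the other already wrote, which is one way to end up with an empty log. The `per_rank_name` scheme is purely hypothetical and only shows one way names could be kept distinct.

```python
import os
import tempfile

def shared_name_result(path):
    """Two writers open the same path in "w" mode; return the final size."""
    first = open(path, "w")
    first.write("output from the first writer\n")
    first.flush()             # the data is in the file at this point
    second = open(path, "w")  # "w" truncates what the first writer wrote
    second.close()            # the second writer had nothing to log
    first.close()             # nothing left in the buffer to write back
    return os.path.getsize(path)

def per_rank_name(directory, job_id, rank):
    """Hypothetical naming scheme: one log file per (job, rank) pair."""
    return os.path.join(directory, "motif-%d.rank%d.log" % (job_id, rank))

if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(shared_name_result(os.path.join(workdir, "motif-3.log")))  # prints 0
    print(per_rank_name(workdir, 3, 0))
    print(per_rank_name(workdir, 3, 1))
```

Which writer "wins" depends only on open order, so the same run can behave differently from one platform or MPI version to the next.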

@nmhamster
Contributor

John, can you try running this again and see if it restores old behavior at least? There is still an issue where a "job" spans multiple SST ranks.

@nmhamster nmhamster added this to the SST 6.0.0 milestone Feb 9, 2016
@jpvandy
Contributor Author

jpvandy commented Feb 9, 2016

Nightly seems to have restored the old behavior.

I’m on VPN from home at the moment. I'll be in after a bit.
I was just trying to run on yesterday's sandbox when your email came in.

John

@nmhamster
Contributor

John,

Can you confirm the test passes on RedHat 7.X?

Si Hammond

Scalable Computer Architectures
Sandia National Laboratories, NM, USA

@jpvandy
Contributor Author

jpvandy commented Feb 9, 2016

The nightly FAILED on RedHat-7 as it had previously. It is looping on an empty motif-3.log file.
After updating Ember, I get the same result when I run it on the VM.
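
A hedged sketch of how a wait on a log file could be bounded, assuming the wrapper polls the log until it has content (the thread does not show the wrapper itself). `wait_for_log` and its parameters are names invented for this example.

```python
import os
import time

def wait_for_log(path, timeout_s, poll_s=0.1):
    """Return True once path is non-empty, or False after timeout_s.

    (Hypothetical helper; the real wrapper's logic is not shown here.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            # The file does not exist yet; keep waiting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With a bound like this, an empty motif-3.log turns into a reported failure instead of an endless loop.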

The other platforms that run it are happy.
John

@jpvandy
Contributor Author

jpvandy commented Feb 19, 2016

Nothing appears to have changed with respect to the file overwrite problem.
(The SQE problem of this particular infinite loop has been corrected in the Test Suite.)
Is this a scheduler bug or an Ember bug?

@vjleung
Collaborator

vjleung commented Feb 19, 2016

John,

I have lost track of this a little bit. The last thing I remember is Si saying the problem is not in the scheduler.

Vitus

@nmhamster
Contributor

This is an issue with Ember. :-(

S.

@jpvandy jpvandy changed the title from "The scheduler Detailed Network test fails on RedHat-7 VM" to "An Ember problem causes Detailed Network test to fail on RedHat-7 VM and on MultiRank" on Apr 6, 2016
@jpvandy
Contributor Author

jpvandy commented Apr 23, 2016

At this point this is a solid failure on the RedHat-7 VM, COERHEL-7, and on all three Multi Rank n=2 nightly tests: sst-test, El Capitan, and Yosemite.

@nmhamster
Contributor

@jpvandy - can you post the error into the issue please? Thank you!

@nmhamster
Contributor

Ok, no problem, I already have it! Sorry for the spam!

@jwilso
Contributor

jwilso commented Jun 29, 2016

@afrodri : Si should evaluate whether this is still a 6.0.0 issue.

@nmhamster
Contributor

Fixed.

@jpvandy
Contributor Author

jpvandy commented Jul 14, 2016

This is not fixed. It may be deferred, or it may be "not to be fixed", but it's not fixed.

@jpvandy jpvandy reopened this Jul 14, 2016
@jpvandy jpvandy modified the milestones: Future, SST v6.0.0 Jul 14, 2016
@jpvandy
Contributor Author

jpvandy commented Nov 9, 2016

When the OpenMPI version was updated to 1.8.8, this problem became ubiquitous across all platforms. It was fixed with PR 498 on Nov. 8th.

@jpvandy jpvandy closed this as completed Nov 9, 2016
@jpvandy jpvandy modified the milestones: SST v6.1.0, Future Nov 9, 2016