Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARC: runner: mdb: tweak searching for cld process pid #30301

Merged

Conversation

evgeniy-paltsev
Copy link
Collaborator

@evgeniy-paltsev evgeniy-paltsev commented Nov 29, 2020

mdb binary starts several subproceses and one of them is cld process. In runners/mdb.py we record process id of cld on each mdb launch to terminate simulator correctly later. However we can finish test and terminate mdb before the cld process was found (so cld won't be terminated correctly by sanitycheck infrastructure). It may happen if we launch mdb on fast host machine.

That leads to several issues. First of all we get ugly error in sanitycheck output:

FileNotFoundError: [Errno 2] No such file or directory: '/xxxx/mdb.pid'

Secondly (and it's more important) we terminate simulator incorrectly. We terminate mdb leaving cld process alive, running and consuming one cpu core permanently (until we kill it manually)

So, let's don't wait extra 0.5 seconds before the first lookup (as test may finish before we start lookup) and increase granularity of lookups (so we won't have big delays where test can start and finish between lookups). It shouldn't affect host machine performance as we usually find cld (and hence exit lookup cycle) in 1-2 iterations (based on my tests on local machine and our servers)

Copy link
Collaborator

@abrodkin abrodkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things:

  1. Could you please add some background why did you get into that business of tweaking MDB runner (i.e. I guess there was some problem you were looking at)
  2. Move of the sleep() to the end of the loop is well understood, but what about 10 times smaller granularity now (50 milliseconds instead of 500)? Won't it introduce not-needed overhead?

@ruuddw
Copy link
Member

ruuddw commented Nov 30, 2020

Why is it so important to always have the pid file early? If there is no pid file, there was no process so nothing to kill -> suggest to move the pid check to where the process gets killed, instead of 'tweaking' parameters around development host machine timing/performance.

@evgeniy-paltsev
Copy link
Collaborator Author

@ruuddw @abrodkin I've updated commit description and added more background.

If there is no pid file, there was no process so nothing to kill -> suggest to move the pid check to where the process gets killed
But we get situation that there is no pid file or mdb process but cld is alive and consumes cpu power. And we can't determine which process with name cld to kill (as we launches several simulations in parallel)

I've played with a pstree util a bit.

Here is the simulation launched:

systemd───systemd───mdb───cld

Here we've killed mdb process (the cld parent):

systemd───systemd───cld

The cld still runs after his parent mdb is terminated.

mdb binary starts several subproceses and one of them is cld process.
In runners/mdb.py we record process id of cld on each mdb launch
to terminate simulator correctly later. However we can finish test
and terminate mdb before the cld process was found (so cld won't
be terminated correctly by sanitycheck infrastructure). It may happen
if we launch mdb on fast host machine.

That leads to several issues. First of all we get ugly error in
sanitycheck output:
------------------------>8--------------------------------
FileNotFoundError: [Errno 2] No such file or directory: '/xxxx/mdb.pid'
------------------------>8--------------------------------

Secondly (and it's more important) we terminate simulator incorrectly.
We terminate mdb leaving cld process alive, running and consuming one
cpu core permanently (until we kill it manually)

So, let's increase granularity of lookups and don't wait extra 0.5
seconds before the first lookup.

Signed-off-by: Eugeniy Paltsev <[email protected]>
@ruuddw
Copy link
Member

ruuddw commented Dec 1, 2020

Understood, without the mdb process running anymore it will hard to get the specific cld pid to kill.

@nashif nashif merged commit 9858893 into zephyrproject-rtos:master Dec 2, 2020
@mbolivar-nordic
Copy link
Contributor

@nashif I know I was late to the review but this should not have been merged with an optional import being added without checking if it's installed. I'll send a follow up fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants