
Wait for daemon to fully terminate in "stop" #276

Merged: 4 commits, Jan 30, 2020

Conversation

@patrickfreed (Contributor)

Right now, mongo-orchestration stop on non-Windows platforms simply sends a signal to the mongo-orchestration daemon and exits immediately. In the background, the mongo child processes and the daemon itself will shut down after a brief period.

This PR updates the "stop" command to wait until the daemon has fully shut down before returning, effectively causing mongo-orchestration stop to block until both the daemon and the cluster are no longer running.

As part of this change, I added a dependency on psutil to facilitate polling whether the process is still alive or not. If adding a dependency is considered costly, that functionality could be reimplemented in mongo-orchestration. The library does seem like it could be useful in other areas of the codebase, however.
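Sketched, the blocking wait described above looks roughly like this (the function name, polling interval, and error handling are illustrative, not the PR's exact code):

```python
import time
from signal import SIGTERM

import psutil


def stop_daemon(pid, interval=0.25):
    """Signal the daemon and block until it has fully exited (sketch)."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return  # already gone, nothing to stop
    proc.send_signal(SIGTERM)
    # is_running() guards against PID reuse by comparing creation times.
    # Caveat: it reports True for zombies, which can't happen here because
    # the daemon is not a child of the "stop" process.
    while proc.is_running():
        time.sleep(interval)
```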

I was having trouble running the full test suite (with or without my changes) due to test_replica_sets.ReplicaSetAuthTestCase hanging, so unfortunately I do not know if this patch passes completely. I did test it manually and it seems to work, though.

@patrickfreed changed the title from "Wait for child mongo processes to terminate in "stop"" to "Wait for daemon to fully terminate in "stop"" (Jan 21, 2020)

```diff
-os.kill(pid, SIGTERM)
+process.send_signal(SIGTERM)
+while process.is_running():
+    time.sleep(0.25)
```
@ShaneHarvey (Collaborator), Jan 21, 2020

What about using os.kill(pid, 0) to check if the PID still exists instead of adding the psutil dependency?
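A minimal sketch of that suggestion, using only the standard library (the function name and interval are illustrative). Signal 0 performs error checking only, so `os.kill` raises `ProcessLookupError` once the PID no longer exists:

```python
import os
import time


def wait_for_exit(pid, interval=0.25):
    """Block until no process with this PID exists (sketch)."""
    while True:
        try:
            os.kill(pid, 0)  # signal 0: existence check only
        except ProcessLookupError:
            return  # PID is gone; the process has fully terminated
        except PermissionError:
            pass  # PID exists but belongs to another user
        time.sleep(interval)
```

One caveat: a terminated child of the current process keeps its PID until it is reaped, so this check is only reliable for non-child processes like the daemon here.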

@patrickfreed (Contributor, Author)

done

@ShaneHarvey (Collaborator) left a comment

There's another problem with this approach. I just realized that when MO shuts down a cluster it also removes the server's log file and dbpath. If we merge this change then I would expect there to be no log file to upload in patch builds.

What if you copy the log files you're interested in before shutting down the cluster?

@patrickfreed (Contributor, Author)

Are you sure that's the case? The log and db files seem to stick around for me after I shut the cluster down. When I start up a new cluster the old ones are deleted, however.

@ShaneHarvey (Collaborator)

Yes the order is:

When I run mongo-orchestration stop, the log files are gone after the daemon exits.

@patrickfreed (Contributor, Author)

Hm, something must be messed up on my end. Well, given that, I don't think we can merge this patch, since it would seemingly break everyone's log storage. I think, as you suggest, we'll have to take an approach outside of mongo-orchestration.

I originally planned on doing this by adding a separate script to drivers-evergreen-tools that issues a `curl DELETE`; doing that wouldn't cause MO to delete the log/data files too, right?

@patrickfreed (Contributor, Author)

Also, what do you think about a `--preserveMongoData` option that would toggle that cleanup behavior?

@ShaneHarvey (Collaborator)

DELETE would also remove the log files. MO deletes the files anytime it shuts down a cluster:

```
$ head /var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr/mongod.log
2020-01-21T14:36:25.702-0800 I  CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-01-21T14:36:25.703-0800 W  ASIO     [main] No TransportLayer configured during NetworkInterface startup
2020-01-21T14:36:25.708-0800 W  ASIO     [main] No TransportLayer configured during NetworkInterface startup
2020-01-21T14:36:25.709-0800 D1 NETWORK  [main] fd limit hard:524288 soft:262144 max conn: 209715
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] MongoDB starting : pid=35243 port=27017 dbpath=/var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr 64-bit host=Shanes-MacBook-Pro-2.local
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] db version v4.3.2-594-g24a8f59
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] git version: 24a8f59e21490f4dd101224ba01872d773c7f25e
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] allocator: system
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] modules: enterprise
2020-01-21T14:36:25.709-0800 I  CONTROL  [initandlisten] build environment:
$ curl -XDELETE http://localhost:8889/servers/e82cbabf-e6e8-406a-be80-2e8ee9c0fcac
$ head /var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr/mongod.log
head: /var/folders/lm/b1r2f8p503xg40r6x2rqv7fr0000gp/T/mongo-n3d4Jr/mongod.log: No such file or directory
```

@ShaneHarvey (Collaborator)

I think a simpler solution might be to copy the files you're interested in before uploading them. Would that avoid the tar failures you're seeing?

> Well, given that, I don't think we can merge this patch, since it would seemingly break everyone's log storage

It wouldn't break Python's log upload task because we upload the logs before shutting down mongo-orchestration:

```yaml
  - func: "upload mo artifacts"
  - func: "upload test results"
  - func: "stop mongo-orchestration"
```

@mbroadst (Member)

@ShaneHarvey I think the whole problem is that those files might still be written to, and tar would error out because bytes were changing while it was trying to do its job. I think it's reasonable that we try to find a more comprehensive solution to this problem.
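One way to sidestep the changing-bytes problem is to snapshot the logs into a quiescent directory before archiving them, so tar never reads a file mongod is still appending to. A sketch, where the `MO_LOG_DIR` variable, default path, and archive name are all illustrative rather than drivers-evergreen-tools conventions:

```shell
set -e
SRC="${MO_LOG_DIR:-/tmp/mo}"   # illustrative location of the dbpaths
SNAP="$(mktemp -d)"
# cp reads each log exactly once; writes after the copy can't affect tar.
# (A sharded setup would need unique per-node names; this is a sketch.)
find "$SRC" -name 'mongod.log' -exec cp {} "$SNAP" \; 2>/dev/null || true
tar czf mongodb-logs.tar.gz -C "$SNAP" .
```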

@patrickfreed (Contributor, Author) commented Jan 23, 2020

Here's a patch confirming what @mbroadst said. I think most drivers run "upload-mo-artifacts" in post (which suppresses failures), so they don't see these errors. I was hoping that the logs being append-only while upload-mo-artifacts ran would prevent the issue, but that appears not to be the case.

@ShaneHarvey (Collaborator)

Yeah, I think it's fine not to clean up the files after shutting down the clusters in the `cleanup_storage` signal handler.

@patrickfreed (Contributor, Author)

So I discovered why the logs were apparently not being deleted on my machine:

In the code snippet you pasted above (`cleanup_mprocess`), it attempts to delete `cfg["logPath"]`, but the log path is actually stored at `cfg["logpath"]`, not in camel case (see here). By default, the log files are written to `dbpath/mongod.log`, so if you don't specify a log path it does usually get deleted. The configurations I use specify log paths, so the logs don't get deleted.
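The mismatch is easy to reproduce in isolation. In the sketch below the dict mimics a server config; the keys mirror the described mismatch, while the values are illustrative:

```python
# mongod's own option is lowercase "logpath"; the cleanup code looked it
# up as camelCase "logPath" and silently got nothing back.
cfg = {"logpath": "/var/log/mongod.log", "dbpath": "/data/db"}

log_path = cfg.get("logPath")   # camelCase lookup misses the key
assert log_path is None         # so the log file is never deleted

assert cfg.get("logpath") == "/var/log/mongod.log"
```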

How about we make this behavior official and remove the attempted log-path deletion from cleanup? The nice thing is that since this won't change any behavior (the bug has been there for 7+ years), we know we won't break any Evergreen configs. Then I won't need to modify anything else and can just use the blocking-stop changes already made, in combination with a custom log path.

@ShaneHarvey (Collaborator)

SGTM. Can you open a follow-up issue to not delete the default logpath (`dbpath/mongod.log`) on shutdown too?

@ShaneHarvey (Collaborator) left a comment

LGTM

@ShaneHarvey (Collaborator)

One note: I've noticed that shutting down clusters can take a long time, especially with newer server versions. If that proves to be the case in Evergreen, we may need to optimize stop to shut down the clusters forcefully with SIGTERM instead of using the shutdown command.

@patrickfreed (Contributor, Author)

I wasn't able to run the entire test suite locally due to similar issues found in #273, but I did run the drivers-evergreen-tools and Swift driver Evergreen projects with this patch and they passed fine. As you pointed out, tasks that use latest sharded clusters can take up to 30s longer now that stop waits for shutdown, but in my experience that is usually only a small percentage of a task's total runtime. Do you think that's sufficient testing, or should I try some more drivers?

@ShaneHarvey (Collaborator)

Yes I think that's fine. I opened #278 to make mongo-orchestration stop faster.

@patrickfreed (Contributor, Author)

It looks like I don't have write access to this repository, so could you squash/merge it for me?
