All commands timeout after a couple hours #16

Closed
jjathman opened this issue Jan 17, 2019 · 15 comments

Comments

@jjathman

I'm just starting trying to use Vault with the Oracle DB plugin. I'm able to get it configured and working correctly. Vault can create new users, I'm able to renew the leases for those users, and I am able to revoke them.

However, after a couple hours all commands that interact with the Oracle plugin start timing out and failing with RPC errors. The Oracle plugin process is still running. I'm not sure if the issue is within the vault plugin process, or vault itself.

Hoping for some help troubleshooting what the issue could be. I don't see anything in the logs to indicate there is a problem.

Restarting vault (and the plugin process) always immediately solves the problem.

Vault version 1.0.1
Plugin version 0.1.4

@jjathman
Author

jjathman commented Jan 18, 2019

More details from today. I had an app running overnight which successfully renewed leases all evening. This morning I tried to shut the app down which revokes the user. This was not successful. Here is the end of the log.

Jan 18 10:34:58 mdl-test0001x vault: 2019-01-18T10:34:58.466-0600 [TRACE] secrets.database.database_9f4e0cf1.vault-plugin-database-oracle: revoke user: transport=gRPC status=finished err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" took=1m29.99800143s
Jan 18 10:34:58 mdl-test0001x vault: 2019-01-18T10:34:58.466-0600 [ERROR] expiration: failed to revoke lease: lease_id=database/creds/microservice-rw/154hHxWfzms0xA9lJx233mJu error="failed to revoke entry: resp: (*logical.Response)(nil) err: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

At this point forward I am no longer able to create new credentials.

@gerrat

gerrat commented May 15, 2019

We have what appears to be the same issue.
The error message is the same.

Additionally, we cannot revoke the lease no matter what we try (force doesn't help).
Restarting Vault has not helped either.
We have leases we cannot revoke, and that appears to be preventing us from creating any new ones.
We were using an older version (0.11.5), so we're going to try upgrading and see if these leases can be revoked.

@meshantz

meshantz commented May 16, 2019

I've been working with @gerrat on this problem. We updated to Vault 1.1.2 and plugin 0.1.5, and we still could not revoke the lease.

At this point we think we've pinned the bug down as a permissions issue for the backend connection's user:

We were able to select sessions but not revoke them.

The connection user needs to be able to execute both of these:

SELECT sid, serial#, username FROM v$session WHERE username = UPPER('{{name}}')

ALTER SYSTEM KILL SESSION '%d,%d' IMMEDIATE

It's the second one we don't seem to have.

@meshantz

I'm going to hazard the following guess at what happens in the plugin, though I don't have the specific setup needed to reproduce.

I see that in the RevokeUser handler it begins by acquiring a lock (on the plugin?).

After that it attempts to disconnect the session. With our permissions setup it happily selects the sessions, and then gets a permission-denied error when trying to kill them.

At that point it either never returns, or throws an error without releasing the lock. I'm betting the second is the case.

If I'm correct, this error only presents when you don't have the ALTER SYSTEM permission AND there is an active session.

So the workaround would be to:

  1. disconnect all sessions for the user
  2. restart vault (to get rid of the lock).

@jefferai
Member

The defer statement means it releases the lock whenever the function returns, whether success or not.
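
To illustrate the point about defer, here is a minimal sketch. The plugin type and revokeUser method are hypothetical stand-ins, not the plugin's actual code; the sketch only shows that a deferred Unlock runs on an error return as well as on success.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// plugin is a hypothetical stand-in for a type that guards its work with a mutex.
type plugin struct {
	mu sync.Mutex
}

// revokeUser takes the lock and defers the unlock, so the lock is released
// on every return path, including the error path.
func (p *plugin) revokeUser(fail bool) error {
	p.mu.Lock()
	defer p.mu.Unlock()

	if fail {
		return errors.New("permission denied")
	}
	return nil
}

func main() {
	p := &plugin{}
	err := p.revokeUser(true)

	// TryLock (Go 1.18+) succeeds only if the mutex is currently free.
	free := p.mu.TryLock()
	fmt.Println("error:", err, "lock free afterwards:", free)
}
```

The flip side is that a deferred call never runs if the function never returns, for example when a call inside it blocks forever.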

@meshantz

I've been learning my way through Go syntax and capabilities over the past few days and I saw that pretty quickly, though I wasn't sure exactly how it behaved in this context. Thanks for the confirmation.

We've still got this problem, even with what appear to be correct permissions at this point. We definitely need to restart vault after disconnecting sessions, so the lock is not being released. If the only way it doesn't release is if the function doesn't exit, then I'm led to believe that the Oracle client is never returning.

Perhaps this is a mismatch in the oracle client version... I realize we've been using the pre-built linux binary (with version 12.2?) and our vault server has the 12.1 client installed.

@jefferai I can't find anything that definitively says what client the pre-built binary uses. The README suggests that it is 11.2, and the build script looks like it is 12.2. Can you confirm that it is the 12.2 oracle client (at least for the v0.1.5 of the plugin)?

@jjathman have you been using the pre-built binary, or did you build the plugin against the exact version your vault server is using?

@jefferai
Member

> We definitely need to restart vault after disconnecting sessions, so the lock is not being released.

What makes you so sure that the issue is the lock?

We've seen similar types of errors with configurations either on a firewall or the third party server (in this case Oracle) being configured to drop a connection that seems idle. Especially in the case of a firewall, dropping often means black-holing traffic rather than rejecting a connection outright (as once it stops tracking the connection it treats it as any other unauthorized connection, which often means black-holing). This can lead the client to keep sending packets that will never go anywhere and keep retrying when it gets no response, which causes the client to essentially hang.
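
As a rough illustration of how such a hang surfaces to the caller, the sketch below simulates a call that never gets a reply and bounds it with a context deadline. blackHoledCall, callWithDeadline, and demoTimeout are made-up names for this example; nothing here is taken from Vault or the plugin.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// demoTimeout is an arbitrary short deadline chosen for this example.
const demoTimeout = 50 * time.Millisecond

// blackHoledCall simulates a request whose packets are silently dropped:
// it never completes on its own and only returns once the context ends.
func blackHoledCall(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// callWithDeadline runs the simulated call under a deadline and reports
// whether the deadline is what ended it.
func callWithDeadline(timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := blackHoledCall(ctx)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("deadline exceeded:", callWithDeadline(demoTimeout))
}
```

The caller gets an error back once the deadline passes, which would be consistent with the DeadlineExceeded errors in the log above, but a real client that does not watch the context could stay stuck underneath.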

You could also reload the plugin rather than restarting Vault entirely.

@meshantz

I've been thinking the lock makes sense as an explanation for why no other transaction is possible after the initial attempt, even though we see vault re-trying the revocation. We enabled some auditing on the database, and the only action done by the connection user is the initial SELECT sid, serial#, username FROM v$session WHERE username = UPPER('{{name}}').

Your explanation could make sense as well, though there shouldn't be an idle timeout between the SELECT and the ALTER. But if the connection is getting closed for any reason, black-holing would likely be the issue.

I don't actually think the lock is the problem, just a symptom of the problem. I'm continuing to dig into this, and I'll let you know how it goes, but it's slow going since I'm not familiar with Go...

Your help is much appreciated. Thanks!

@meshantz

I've pinned our version of this down. I added logging to see how far the revocation gets: it claims the lock, executes the session SELECT, and issues the ALTER SYSTEM, but the ALTER SYSTEM never completes. Subsequent retries get blocked waiting on the lock.

I hacked together a single run-through that assumed exactly one session existed and, instead of deferring rows.Close(), called it before issuing the ALTER SYSTEM call. This ran and returned just fine.

So it appears the client won't accept another call until the rows have been closed. I've verified this with 12.1 and 12.2.
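
A minimal sketch of that ordering, assuming the session rows are read fully into memory and closed before any ALTER SYSTEM is issued. The collectSessions and killSessions helpers, the two interfaces, and the fake types are all hypothetical and exist only so the example runs without an Oracle connection; this is not the plugin's code.

```go
package main

import (
	"database/sql"
	"fmt"
)

// session holds one (sid, serial#) pair read from v$session.
type session struct {
	sid, serial int
}

// rowScanner is the subset of *sql.Rows used here (hypothetical interface).
type rowScanner interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
	Close() error
}

// execer is satisfied by both *sql.DB and *sql.Tx.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// collectSessions drains rows into a slice and closes them before returning,
// so the result set is no longer open when the next statement is issued.
func collectSessions(rows rowScanner) ([]session, error) {
	defer rows.Close()
	var out []session
	for rows.Next() {
		var s session
		var username string
		if err := rows.Scan(&s.sid, &s.serial, &username); err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, rows.Err()
}

// killSessions issues one ALTER SYSTEM KILL SESSION per collected session.
func killSessions(ex execer, sessions []session) error {
	for _, s := range sessions {
		q := fmt.Sprintf("ALTER SYSTEM KILL SESSION '%d,%d' IMMEDIATE", s.sid, s.serial)
		if _, err := ex.Exec(q); err != nil {
			return err
		}
	}
	return nil
}

// fakeRows and fakeExec are in-memory stand-ins for a real database.
type fakeRows struct {
	data   []session
	i      int
	closed bool
}

func (f *fakeRows) Next() bool { f.i++; return f.i <= len(f.data) }
func (f *fakeRows) Scan(dest ...any) error {
	s := f.data[f.i-1]
	*dest[0].(*int) = s.sid
	*dest[1].(*int) = s.serial
	*dest[2].(*string) = "APPUSER"
	return nil
}
func (f *fakeRows) Err() error   { return nil }
func (f *fakeRows) Close() error { f.closed = true; return nil }

type fakeExec struct{ queries []string }

func (f *fakeExec) Exec(q string, args ...any) (sql.Result, error) {
	f.queries = append(f.queries, q)
	return nil, nil
}

func main() {
	rows := &fakeRows{data: []session{{12, 345}, {67, 890}}}
	sessions, err := collectSessions(rows)
	if err != nil {
		panic(err)
	}
	ex := &fakeExec{}
	if err := killSessions(ex, sessions); err != nil {
		panic(err)
	}
	fmt.Println("rows closed before kill:", rows.closed)
	fmt.Println(ex.queries)
}
```

With a real database, rows would come from a Query call on the same handle that is later passed to killSessions.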

@meshantz

After reading up on the database package, I found an explanation here that shows how connections should work during a loop like the one we're using:

rows, err := db.Query("select * from tbl1") // Uses connection 1
for rows.Next() {
	err = rows.Scan(&myvariable)
	// The following line will NOT use connection 1, which is already in-use
	db.Query("select * from tbl2 where id = ?", myvariable)
}

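However, in this case, prior to entering disconnectSession we've grabbed a transaction, which presumably ties up another connection, so the two queries in the loop are really using connections 2 and 3.

But the default value for max_open_connections is 2, so the third connection never becomes available and the ALTER SYSTEM call waits forever.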

Updating our configuration to use max_open_connections=3 solves the problem.
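
The arithmetic can be sketched with a toy pool. This is not database/sql or the plugin: pool, acquire, release, and revoke are hypothetical names, and acquire reports exhaustion instead of blocking so the example terminates. It assumes, as guessed above, that the transaction, the open result set, and the ALTER SYSTEM each need their own connection.

```go
package main

import (
	"errors"
	"fmt"
)

// pool is a toy connection pool with a hard cap on open connections.
type pool struct {
	slots chan struct{}
}

func newPool(maxOpen int) *pool {
	return &pool{slots: make(chan struct{}, maxOpen)}
}

var errExhausted = errors.New("pool exhausted: a real pool would block here")

// acquire takes a slot if one is free; otherwise it returns errExhausted
// rather than waiting, so the demo cannot hang.
func (p *pool) acquire() error {
	select {
	case p.slots <- struct{}{}:
		return nil
	default:
		return errExhausted
	}
}

// release returns a previously acquired slot to the pool.
func (p *pool) release() { <-p.slots }

// revoke walks through the three overlapping connection uses.
func revoke(p *pool) error {
	// Connection 1: the transaction opened before disconnecting sessions.
	if err := p.acquire(); err != nil {
		return err
	}
	defer p.release()

	// Connection 2: the session SELECT, whose rows are still open.
	if err := p.acquire(); err != nil {
		return err
	}
	defer p.release()

	// Connection 3: the ALTER SYSTEM issued while the rows are open.
	if err := p.acquire(); err != nil {
		return err
	}
	defer p.release()

	return nil
}

func main() {
	fmt.Println("max 2:", revoke(newPool(2)))
	fmt.Println("max 3:", revoke(newPool(3)))
}
```

With a cap of 2 the third acquire fails, which in a blocking pool would be the hang we saw; with a cap of 3 all three steps go through.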

Is there a place that would be good to document that this plugin requires a higher value to be set in order to revoke active sessions? Is the README here sufficient?

@gerrat

gerrat commented May 24, 2019

If the default max_open_connections for Oracle isn't high enough for the official Oracle plugin to work properly, the default is too low.

@meshantz

I've opened a ticket upstream. The sdk changes will need to be pulled in here once that is addressed.

Documentation here will still help in the meantime.

@jjathman
Author

Just checking back in, is this issue fixed now? I see there was a release in November. Sorry I haven't had an opportunity to try this out myself yet.

@meshantz

Provided you are running a version of vault >= 1.2.0, you should no longer need to set max_open_connections.
See the vault changelog.

Assuming that is the root cause of your timeouts, I believe this to be fixed upstream.

@jjathman
Author

OK great, that matches my expectations as well. I will just close this for now then. I can always reopen it if I see there is an issue.
