-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MySQL HA Backend Support #4686
MySQL HA Backend Support #4686
Conversation
This adds support for HA mode when using the MySQL backend. This needs some work still to clean up some ugly code and some tests added. However in it's current state this will failover fine to a secondary node.
Due to the need for checking if a lock is held even in the case when you don't have a lock open I have moved the code for this up into the MySQLHABackend struct.
This lets the world know that you can use MySQL as an HA backend! It also removes some code I commented out when I was getting this working and forgot to delete.
This correctly only passes in one argument which is the lock name to the mysql RELEASE_LOCK function.
It looks like the one test is consistently failing however I am unable to tell what exactly is going on. I ran the |
Caught this when testing the step-down function and ensuring that it works. It seems that about 50% of the time the lock will be returned to the current node. I think the 5 second sleep may be the reason for this although I would still expect the stanby node to take over since after step-down the former leader sleeps for 10 seconds.
This closes the MySQL connection instead of trying to release the lock. For whatever reason when calling RELEASE_LOCK mysql would just hang even when setting a timeout. I could find no documentation on this behavior online. Closing the connection completely will 100% of the time free the lock.
This removes some useless comments and also adds comments for code that is not super clear or is very specific to MySQL.
Found a bug since my local testing missed this where I'm not returning the leader UUID. Will push up a change to address this shortly. |
This properly find the node that is the leader and forwards requests to it. Previously requests would not be forwarded to the active leader and would just 500.
The latest commit fixes an issue since I didn't fully understand how the Value() HA function was working. This now properly forwards requests onto the leader. This has a fair amount of cleaning up to do but would like to get some eyes on it as this turned out to be a bit larger PR than I thought it would be 😆 |
This will wait until after we have given it a decent try and removing the active leader and then close the connection...
Don't log to stdout anymore... cleanup debug info used to track why a lock wasn't being found before.
Seems the unit tests were working... as is usually the case :) |
I believe that I need to move my table that I created for the lock to use the default datastore that is encrypted. |
The lock definitely needs to be encrypted! Vault should do this automatically for you though...? |
That is correct I wasn't thinking when I wrote that code though :| so i'll fix that up before wasting anyones time looking over this. |
Just to keep people in the loop I am still going to make the changes mentioned above but since I need to use this at work at the same time for another project the current posted diff is actively being used by me in a pre-prod environment. Hope to get the changes above done shortly. |
This was causing a large load on the DB since it was executed very often and kept having to prepair the query.
Took a look last night on how to refactor this so we are storing the leader uuid encrypted as well. You mentioned encrypting the lock which although it may be possible i'm not sure how much it will buy us. The lock is just a simple key i believe core/lock or something similar that is stored in the general mysql system tables since you can't use GET_LOCK and it's sister functions with a specific table. |
It also seems like the Value() Lock function gets called a lot i'm seeing 27 selects per second just from this call alone in a cluster size of three does that seem correct or possibly I messed something up in my implementation. EDIT: never mind me I see it's a call I am making :) fixing it now. |
I'm sure you all are busy but just want to make sure that you aren't waiting on me for anything currently. Small update this code is running in prod now for the past week without issue. |
From my perspective the nicest thing would be to get other MySQL users to test it out and/or review the code -- it's not one we support. We can do a quick look-see and see if it seems sane to us and merge it, but a lot of the HA stuff ends up being subtle, and |
Thanks @jefferai i'll send it out to the mailing list. |
Hi, @michael-k I wanted to cooperate with the test because I was longing for this function! |
Awesome glad to hear. Let me know if you need help with anything. |
I tried to your function.
instance1.hcl is following.
mysql tables are below.
Is there anything wrong with my settings:(? |
The error you are getting is specific to your MySQL backend Hopefully this is reproducible with the default setting in the https://hub.docker.com/_/percona/ image. |
You will need to set |
I see. I changed the configuration of cluster from active/active to active/standby. I confirmed the operation of the ten steps you presented. |
https://mariadb.com/kb/en/library/get_lock/ - get lock works on a server wide bases which is broken in all active clusters. Here is also a good wiki that was written on the issue by another company that ran into this limitation https://docs.ubersmith.com/display/UbersmithDocumentation/Clustering+Databases. |
Thanks for description:+1: |
|
||
// Get or set MySQL server address. Defaults to localhost and default port(3306) | ||
address, ok := conf["address"] | ||
if !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this block doesn't have a closing bracket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ha thanks should have looked here a few minutes ago while I was getting fun errors from go fmt like this.
mschuett-C02TC2VYGTDY:vault schuettm$ go fmt ./physical/mysql/mysql.go
physical/mysql/mysql.go:257:37: missing ',' in argument list
physical/mysql/mysql.go:259:2: missing ',' in argument list
physical/mysql/mysql.go:260:3: expected operand, found 'return'
physical/mysql/mysql.go:261:3: missing ',' before newline in argument list
physical/mysql/mysql.go:262:21: expected '==', found '='
physical/mysql/mysql.go:263:2: expected operand, found 'return'
physical/mysql/mysql.go:264:1: expected operand, found '}'
physical/mysql/mysql.go:268:2: missing ',' in argument list
physical/mysql/mysql.go:270:24: missing ',' before newline in argument list
physical/mysql/mysql.go:271:2: expected operand, found 'defer'
physical/mysql/mysql.go:273:9: missing ',' in argument list
exit status 2
Just an update on this. This is running in prod now and I haven't seen any issues with Vault using this for the past month. It's under fairly decent load at all times and I still have it set to failover to a new leader every 8 hours per the max connection time in mysql which breaks the lock just to ensure all the code around acquiring locks is working as expected. Ran into some other issues but that was just because I didn't read the docs around periodic tokens well enough 😆 😭 . Let me know if I can do anything to help push this along. Thanks, |
Thanks! |
@jefferai thanks a ton for merging this in! |
Thanks for submitting it! Sorry about the delay, we're just in mad crunch mode for 0.11 |
we have problems with HA with Mysql backend. could you please review the logs ? |
Hey, thanks for using the MySQL HA backend. Could you open a ticket and just ping me on it and I'll take a look. Some other information for others filing tickets against this in the future that is helpful is the MySQL flavor that you are using as well as the environment that you are running in (docker, aws, vagrant, whatever). If you have a docker-compose that is reproducible that is even better. From briefly looking at your log this looks like you have exhausted some resource on your machine though. |
Here is a two docker and one maria db database. |
This PR adds the ability for MySQL to work as an HA Backend. A large portion of this code is lifted from the Zookeeper HA Locking however it seems most of the HA Backends are using functionally the same code with small pieces changed around how to check the underlying locking mechanism. In this case it is using the following built in MySQL functions GET_LOCK, RELEASE_LOCK, and IS_USED_LOCK.
Some things to note about this... The
monitorLock
function has to poll the DB. I think 5 seconds is a reasonable time between checks but I'm not sure how others feel. The only time this should find that the lock is not open is if the connection gets lost in which case another vault instance should have already grabbed the lock or someone is in the DB playing with mysql internals.If you would like to poke around at the locks you can run this in MySQL.
Then you can execute the following to return all active locks.
Here is the output from running the local unit tests.
I manually tested this using the following configs files and test plan.
Config 1
Config 2 - Replace ip to conform to your environment.
For quickly starting the MySQL backend for this test I used this docker-compose file.
I tried to match the existing go code style in the repo as best as I could but I haven't written much go in the past year and am a bit rusty.
Reference Issue: #4665