Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySQL HA Backend Support #4686

Merged
merged 17 commits into from
Aug 13, 2018
Merged

MySQL HA Backend Support #4686

merged 17 commits into from
Aug 13, 2018

Conversation

michaeljs1990
Copy link
Contributor

@michaeljs1990 michaeljs1990 commented Jun 3, 2018

This PR adds the ability for MySQL to work as an HA Backend. A large portion of this code is lifted from the Zookeeper HA Locking however it seems most of the HA Backends are using functionally the same code with small pieces changed around how to check the underlying locking mechanism. In this case it is using the following built in MySQL functions GET_LOCK, RELEASE_LOCK, and IS_USED_LOCK.

Some things to note about this... The monitorLock function has to poll the DB. I think 5 seconds is a reasonable time between checks but I'm not sure how others feel. The only time this should find that the lock is not open is if the connection gets lost in which case another vault instance should have already grabbed the lock or someone is in the DB playing with mysql internals.

If you would like to poke around at the locks you can run this in MySQL.

UPDATE performance_schema.setup_instruments SET enabled = 'YES' WHERE name = 'wait/lock/metadata/sql/mdl';

Then you can execute the following to return all active locks.

SELECT * FROM performance_schema.metadata_locks WHERE OBJECT_TYPE='USER LEVEL LOCK';

Here is the output from running the local unit tests.

$ make test TEST=github.com/hashicorp/vault/physical/mysql
==> Checking that code complies with gofmt requirements...
==> Checking that build is using go version >= 1.10...
go generate 
ok  	github.com/hashicorp/vault/physical/mysql	0.009s

I manually tested this using the following configs files and test plan.

  1. Start the first instance and unlock it ensuring it has entered the active state
  2. Start a second instance and unlock it ensuring it has entered the standby state
  3. Write some keys and read them back just for a sanity check
  4. Kill the active instance and ensure that the node in standby becomes active
  5. Start the node that was just killed again and make sure it has come back up in standby
  6. Read the keys written before to ensure everything is happy
  7. Run step-down on first node and ensure the second becomes the active immediately.
  8. Run step-down on the second node and ensure it fails back over to the first.
  9. Did this a bunch of times to try and poke at any race conditions this might have.
  10. Write keys to the standby node and ensure they are passed to the active leader.

Config 1

storage "mysql" {
  database = "vault"
  table    = "vault"
  username = "root"
  password = "vault"
  address = "0.0.0.0:3306"
}

listener "tcp" {
  address = "127.0.0.1:8200"
  tls_disable = 1
}

default_lease_ttl = "168h"
max_lease_ttl = "720h"

api_addr = "http://127.0.0.1:8200"

Config 2 - Replace ip to conform to your environment.

storage "mysql" {
  database = "vault"
  table    = "vault"
  username = "root"
  password = "vault"
  address = "0.0.0.0:3306"
}

listener "tcp" {
  address = "10.0.2.5:8200"
  tls_disable = 1
}

default_lease_ttl = "168h"
max_lease_ttl = "720h"

api_addr = "http://10.0.2.5:8200"

For quickly starting the MySQL backend for this test I used this docker-compose file.

I tried to match the existing go code style in the repo as best as I could but I haven't written much go in the past year and am a bit rusty.

Reference Issue: #4665

This adds support for HA mode when using the MySQL backend. This
needs some work still to clean up some ugly code and some tests
added. However in it's current state this will failover fine to
a secondary node.
Due to the need for checking if a lock is held even in the case when
you don't have a lock open I have moved the code for this up into the
MySQLHABackend struct.
This lets the world know that you can use MySQL as an HA backend! It also
removes some code I commented out when I was getting this working and
forgot to delete.
This correctly only passes in one argument which is the lock name to
the mysql RELEASE_LOCK function.
@michaeljs1990
Copy link
Contributor Author

It looks like the one test is consistently failing however I am unable to tell what exactly is going on. I ran the make testtravis command locally and it passes however that seems to be the one failing when I pull down the last failing log with curl -O https://api.travis-ci.org/v3/job/387312331/log.txt.

Caught this when testing the step-down function and ensuring that it works.
It seems that about 50% of the time the lock will be returned to the current
node. I think the 5 second sleep may be the reason for this although I would
still expect the stanby node to take over since after step-down the former
leader sleeps for 10 seconds.
This closes the MySQL connection instead of trying to release the lock. For whatever
reason when calling RELEASE_LOCK mysql would just hang even when setting a timeout. I could
find no documentation on this behavior online. Closing the connection completely will 100%
of the time free the lock.
This removes some useless comments and also adds comments for code
that is not super clear or is very specific to MySQL.
@jefferai jefferai added this to the 0.10.3 milestone Jun 3, 2018
@michaeljs1990
Copy link
Contributor Author

Found a bug since my local testing missed this where I'm not returning the leader UUID. Will push up a change to address this shortly.

@michaeljs1990 michaeljs1990 changed the title MySQL HA Backend Support [Work In Progress] MySQL HA Backend Support Jun 4, 2018
This properly find the node that is the leader and forwards requests to it.
Previously requests would not be forwarded to the active leader and would
just 500.
@michaeljs1990 michaeljs1990 changed the title [Work In Progress] MySQL HA Backend Support MySQL HA Backend Support Jun 4, 2018
@michaeljs1990
Copy link
Contributor Author

The latest commit fixes an issue since I didn't fully understand how the Value() HA function was working. This now properly forwards requests onto the leader. This has a fair amount of cleaning up to do but would like to get some eyes on it as this turned out to be a bit larger PR than I thought it would be 😆

This will wait until after we have given it a decent try and removing
the active leader and then close the connection...
Don't log to stdout anymore... cleanup debug info used to track
why a lock wasn't being found before.
@michaeljs1990
Copy link
Contributor Author

Seems the unit tests were working... as is usually the case :)

@michaeljs1990
Copy link
Contributor Author

I believe that I need to move my table that I created for the lock to use the default datastore that is encrypted.

@jefferai
Copy link
Member

jefferai commented Jun 8, 2018

The lock definitely needs to be encrypted! Vault should do this automatically for you though...?

@michaeljs1990
Copy link
Contributor Author

That is correct I wasn't thinking when I wrote that code though :| so i'll fix that up before wasting anyones time looking over this.

@michaeljs1990 michaeljs1990 changed the title MySQL HA Backend Support [Changes Planned] MySQL HA Backend Support Jun 8, 2018
@jefferai jefferai modified the milestones: 0.10.3, 0.10.4 Jun 11, 2018
@michaeljs1990
Copy link
Contributor Author

michaeljs1990 commented Jun 21, 2018

Just to keep people in the loop I am still going to make the changes mentioned above but since I need to use this at work at the same time for another project the current posted diff is actively being used by me in a pre-prod environment. Hope to get the changes above done shortly.

This was causing a large load on the DB since it was executed very often and
kept having to prepair the query.
@michaeljs1990
Copy link
Contributor Author

Took a look last night on how to refactor this so we are storing the leader uuid encrypted as well. You mentioned encrypting the lock which although it may be possible i'm not sure how much it will buy us. The lock is just a simple key i believe core/lock or something similar that is stored in the general mysql system tables since you can't use GET_LOCK and it's sister functions with a specific table.

@michaeljs1990
Copy link
Contributor Author

michaeljs1990 commented Jun 27, 2018

It also seems like the Value() Lock function gets called a lot i'm seeing 27 selects per second just from this call alone in a cluster size of three does that seem correct or possibly I messed something up in my implementation.

EDIT: never mind me I see it's a call I am making :) fixing it now.

@michaeljs1990 michaeljs1990 changed the title [Changes Planned] MySQL HA Backend Support MySQL HA Backend Support Jul 1, 2018
@michaeljs1990
Copy link
Contributor Author

I'm sure you all are busy but just want to make sure that you aren't waiting on me for anything currently.

Small update this code is running in prod now for the past week without issue.

@jefferai
Copy link
Member

From my perspective the nicest thing would be to get other MySQL users to test it out and/or review the code -- it's not one we support. We can do a quick look-see and see if it seems sane to us and merge it, but a lot of the HA stuff ends up being subtle, and physical.ExerciseHABackend only goe so far. You may want to post on the mailing list that you created this and see if other users want to review/test.

@michaeljs1990
Copy link
Contributor Author

Thanks @jefferai i'll send it out to the mailing list.

@cappyzawa
Copy link

cappyzawa commented Jul 13, 2018

Hi, @michael-k
I found Vault MySQL HA Backend in google group.

I wanted to cooperate with the test because I was longing for this function!

@michaeljs1990
Copy link
Contributor Author

Awesome glad to hear. Let me know if you need help with anything.

@cappyzawa
Copy link

cappyzawa commented Jul 14, 2018

@michaeljs1990

I tried to your function.
But error occurred.

  1. checkout out mysql-ha-mode branch.
  2. create Dockerfile like the following.
FROM golang:1.10.3-alpine3.7

EXPOSE 8200
RUN apk --update add \
    && apk --no-cache add make git bash

COPY . $GOPATH/src/github.com/hashicorp/vault
WORKDIR $GOPATH/src/github.com/hashicorp/vault
RUN make bootstrap \
    && make

CMD ./bin/vault server -config=instance1.hcl

instance1.hcl is following.

storage "mysql" {
  username = "ha"
  password = "xxxxxxxxxxxxxxxxxx"
  address = "vault-ha-xxx.com:3306"
  database = "vault"
}

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 1
}

default_lease_ttl = "168h"
max_lease_ttl = "720h"

api_addr = "http://0.0.0.0:8200"
  1. build dockerfile.
docker build -t vault-ha .
  1. run docker container
$ docker run --name vault-ha1 -d -p 8200:8200 vault-ha:latest 
  1. Instance is running. But error occurred.
    The log is below.
$ export VAULT_SKIP_VERIFY=true
$ export VAULT_ADDR=http://127.0.0.1:8200
$ vault status
Key                Value
---                -----
Seal Type          shamir
Sealed             true
Total Shares       5
Threshold          3
Unseal Progress    0/3
Unseal Nonce       n/a
Version            0.10.1
HA Enabled         true
$ ./bin/vault operator unseal
Unseal Key (will be hidden): 
Key                Value
---                -----
Seal Type          shamir
Sealed             true
Total Shares       5
Threshold          3
Unseal Progress    1/3
Unseal Nonce       0ff8ceb9-7c4f-ec6b-76f5-711260c7c02c
Version            0.10.1
HA Enabled         true
$ ./bin/vault operator unseal
Unseal Key (will be hidden): 
Key                Value
---                -----
Seal Type          shamir
Sealed             true
Total Shares       5
Threshold          3
Unseal Progress    2/3
Unseal Nonce       0ff8ceb9-7c4f-ec6b-76f5-711260c7c02c
Version            0.10.1
HA Enabled         true
$ ./bin/vault operator unseal
Unseal Key (will be hidden): 
Error checking leader status: Error making API request.

URL: GET http://127.0.0.1:8200/v1/sys/leader
Code: 500. Errors:

* sql: no rows in result set

$ docker logs vault-ha1
==> Vault server configuration:

             Api Address: http://0.0.0.0:8200
                     Cgo: disabled
         Cluster Address: https://0.0.0.0:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: true
                 Storage: mysql (HA available)
                 Version: Vault v0.10.1
             Version Sha: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+CHANGES

==> Vault server started! Log data will stream in below:

2018-07-14T17:25:06.311Z [INFO ] core: vault is unsealed
2018-07-14T17:25:06.312Z [INFO ] core: entering standby mode
2018-07-14T17:25:06.765Z [ERROR] core: failed to acquire lock: error="Error 1105: Percona-XtraDB-Cluster prohibits use of GET_LOCK with pxc_strict_mode = ENFORCING"

mysql tables are below.

mysql> show tables;
+-----------------+
| Tables_in_vault |
+-----------------+
| vault           |
| vault_lock      |
+-----------------+
2 rows in set (0.20 sec)

mysql> desc vault_lock;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| node_job       | varchar(512) | NO   | PRI | NULL    |       |
| current_leader | varchar(512) | YES  |     | NULL    |       |
+----------------+--------------+------+-----+---------+-------+
2 rows in set (0.07 sec)

mysql> select * from vault_lock;
Empty set (0.07 sec)

Is there anything wrong with my settings:(?

@michaeljs1990
Copy link
Contributor Author

michaeljs1990 commented Jul 14, 2018

The error you are getting is specific to your MySQL backend Percona XtraDB Cluster. MySQL, MariaDB, and RDS (Aurora) is all that I tested with. I don't think any code changes need to happen but I'll add some docs for getting this to work with Percona since it's specific to that flavor of MySQL.

Hopefully this is reproducible with the default setting in the https://hub.docker.com/_/percona/ image.

@michaeljs1990
Copy link
Contributor Author

michaeljs1990 commented Jul 14, 2018

You will need to set pxc_strict_mode = MASTER on your node. If you are using a cluster in which multiple nodes are allowed to write you could run into issues however. If this is not the case you will be fine.

@cappyzawa
Copy link

cappyzawa commented Jul 15, 2018

I see.
Is the problem that it is possible to write to multiple Nodes and that LockTable was unusable?

I changed the configuration of cluster from active/active to active/standby.
And I tried again.
Then, I succeeded in unsealing.

I confirmed the operation of the ten steps you presented.
LGTM:+1:

@michaeljs1990
Copy link
Contributor Author

michaeljs1990 commented Jul 15, 2018

https://mariadb.com/kb/en/library/get_lock/ - get lock works on a server wide bases which is broken in all active clusters. Here is also a good wiki that was written on the issue by another company that ran into this limitation https://docs.ubersmith.com/display/UbersmithDocumentation/Clustering+Databases.

@cappyzawa
Copy link

Thanks for description:+1:

@chrishoffman chrishoffman modified the milestones: 0.10.4, 0.10.5 Jul 19, 2018

// Get or set MySQL server address. Defaults to localhost and default port(3306)
address, ok := conf["address"]
if !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this block doesn't have a closing bracket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha thanks should have looked here a few minutes ago while I was getting fun errors from go fmt like this.

mschuett-C02TC2VYGTDY:vault schuettm$ go fmt ./physical/mysql/mysql.go
physical/mysql/mysql.go:257:37: missing ',' in argument list
physical/mysql/mysql.go:259:2: missing ',' in argument list
physical/mysql/mysql.go:260:3: expected operand, found 'return'
physical/mysql/mysql.go:261:3: missing ',' before newline in argument list
physical/mysql/mysql.go:262:21: expected '==', found '='
physical/mysql/mysql.go:263:2: expected operand, found 'return'
physical/mysql/mysql.go:264:1: expected operand, found '}'
physical/mysql/mysql.go:268:2: missing ',' in argument list
physical/mysql/mysql.go:270:24: missing ',' before newline in argument list
physical/mysql/mysql.go:271:2: expected operand, found 'defer'
physical/mysql/mysql.go:273:9: missing ',' in argument list
exit status 2

@michaeljs1990
Copy link
Contributor Author

Just an update on this. This is running in prod now and I haven't seen any issues with Vault using this for the past month. It's under fairly decent load at all times and I still have it set to failover to a new leader every 8 hours per the max connection time in mysql which breaks the lock just to ensure all the code around acquiring locks is working as expected. Ran into some other issues but that was just because I didn't read the docs around periodic tokens well enough 😆 😭 .

Let me know if I can do anything to help push this along.

Thanks,

@jefferai
Copy link
Member

Thanks!

@jefferai jefferai merged commit 88fe0fa into hashicorp:master Aug 13, 2018
@michaeljs1990
Copy link
Contributor Author

@jefferai thanks a ton for merging this in!

@jefferai
Copy link
Member

Thanks for submitting it! Sorry about the delay, we're just in mad crunch mode for 0.11

@killermoon
Copy link

we have problems with HA with Mysql backend. could you please review the logs ?
ewzcfin_vault_acc.log
ewzcfin_vault_err_acc.log

@michaeljs1990
Copy link
Contributor Author

michaeljs1990 commented Mar 13, 2019

Hey, thanks for using the MySQL HA backend.

Could you open a ticket and just ping me on it and I'll take a look. Some other information for others filing tickets against this in the future that is helpful is the MySQL flavor that you are using as well as the environment that you are running in (docker, aws, vagrant, whatever). If you have a docker-compose that is reproducible that is even better. From briefly looking at your log this looks like you have exhausted some resource on your machine though.

@killermoon
Copy link

Here is a two docker and one maria db database.
We have exhausted some resource on our machine ? what do you like to say with this ?

@michaeljs1990 michaeljs1990 deleted the mysql-ha-mode branch June 21, 2019 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants