Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add rejectReadOnly option (to fix AWS Aurora failover) #604

Merged
merged 6 commits into from
May 31, 2017
Merged

add rejectReadOnly option (to fix AWS Aurora failover) #604

merged 6 commits into from
May 31, 2017

Conversation

songgao
Copy link
Contributor

@songgao songgao commented May 29, 2017

Description

Hi there, we use this (awesome) driver with AWS Aurora's MySQL-compatible Cluster. In some recent manual tests with Aurora's automatic failover, we noticed that there might be a race condition in Aurora's failover implementation, where our servers (mysql client) get connected to a read-only replica after failover.

This should be a rare case, since normally during a failover the primary is unreachable before failover to the read replica, and reconnecting should only succeed with the new primary (writable) instance). However, in an event where the old primary (new read-only) replica recovers before failover completes, this could still happen. In addition, if we manually trigger a failover from AWS's web control panel, both replicas are still accessible, making this is very easy to reproduce.

This PR adds a rejectReadOnly option to DSN, which when set to true, causes the driver to drop the connection if error 1792 (ER_CANT_EXECUTE_IN_READ_ONLY_TRANSACTION) is received. This way, the connection pool (from database/sql) purges the connection, and makes a new one for the next statement, bringing the storage layer back to healthy state.

This is apparently an AWS bug, and they should be fixing it. We are making this PR because 1) it's a much longer process to get AWS to fix the bug (assuming they acknowledge), while adding an option to the driver is a quick fix for everybody; 2) if the same behavior happens with other service providers, this can be a useful remedy too.

Checklist

  • Code compiles correctly
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary
  • Added myself / the copyright holder to the AUTHORS file

@coveralls
Copy link

coveralls commented May 29, 2017

Coverage Status

Coverage decreased (-0.6%) to 73.943% when pulling 83d56b7 on keybase:master into 382e13d on go-sql-driver:master.

Copy link
Member

@julienschmidt julienschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be awesome if you could add a test, i.e. by manually (temporarily) setting the MySQL server to read-only mode.

README.md Outdated
```


RejectreadOnly causes mysql driver to reject read-only connections. This is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/mysql driver/the driver/

Please keep the explanation general, i.e. independent from AWS Aurora, which can however be mentioned as an example. For example explain that failover instances might be read-only and that this option purges them from the connection pool. Please also don't mention internals like "returning a driver.ErrBadConnection".

@songgao
Copy link
Contributor Author

songgao commented May 29, 2017

@julienschmidt Thanks for the prompt response! Please take another look.

@coveralls
Copy link

coveralls commented May 29, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.531% when pulling 36053f8 on keybase:master into 382e13d on go-sql-driver:master.

@coveralls
Copy link

coveralls commented May 29, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.531% when pulling a2e9438 on keybase:master into 382e13d on go-sql-driver:master.

@coveralls
Copy link

coveralls commented May 29, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.531% when pulling be8ab32 on keybase:master into 382e13d on go-sql-driver:master.

@coveralls
Copy link

coveralls commented May 29, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.531% when pulling d2fefe8 on keybase:master into 382e13d on go-sql-driver:master.

@methane
Copy link
Member

methane commented May 30, 2017

(This is not comment about PR, just an advice)
@songgao Do you use DB.SetConnMaxLifetime()?
Without it, database/sql can keep connection for very long time.
Very slow failover is one of many troubles of long living connections.

I recommend to set max lifetime to 10sec~1min.

@songgao
Copy link
Contributor Author

songgao commented May 30, 2017

Thanks @methane; that's good advice! I don't think we set it for now; we should as you said.

In case of AWS Aurora though, the Aurora cluster terminates all connections during the failover; we just get connected to the wrong replica right after the failover :( But yeah, setting the max life time would definitely prevent the service from being stuck in a read-only connection forever.

@coveralls
Copy link

coveralls commented May 30, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.846% when pulling fb11b2c on keybase:master into 44fa292 on go-sql-driver:master.

Copy link
Member

@julienschmidt julienschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for some minor documentation stuff this PR seems fine to me now.

README.md Outdated
```


RejectreadOnly causes the driver to reject read-only connections. This is for a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/RejectreadOnly/rejectReadOnly/

README.md Outdated
Note that this should be a fairly rare case, as an automatic failover normally
happens when the primary is down, and the race condition shouldn't happen
unless it comes back up online as soon as the failover is kicked off. On the
other hand, when this happens, an mysql application can get stuck on a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/mysql/MySQL/

dsn.go Outdated
@@ -53,6 +53,7 @@ type Config struct {
InterpolateParams bool // Interpolate placeholders into query string
MultiStatements bool // Allow multiple statements in one query
ParseTime bool // Parse time values to time.Time
RejectReadOnly bool // Reject read-only connections; see README.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the "; see README.md", all of these params are explained in the README.md

@coveralls
Copy link

coveralls commented May 31, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.846% when pulling 91f7309 on keybase:master into 44fa292 on go-sql-driver:master.

@songgao
Copy link
Contributor Author

songgao commented May 31, 2017

(did a force push to trigger travis build)

@coveralls
Copy link

coveralls commented May 31, 2017

Coverage Status

Coverage decreased (-0.04%) to 74.846% when pulling bf068b4 on keybase:master into 44fa292 on go-sql-driver:master.

@julienschmidt julienschmidt merged commit b644770 into go-sql-driver:master May 31, 2017
@julienschmidt
Copy link
Member

Thanks!

@songgao
Copy link
Contributor Author

songgao commented May 31, 2017

@julienschmidt Thanks for accepting the patch!

@julienschmidt julienschmidt added this to the v1.4 milestone May 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants