Ensure sqldata thread safety. #760
Conversation
…ef counter is simply not good enough.)
Codecov Report
@@ Coverage Diff @@
## master #760 +/- ##
==========================================
+ Coverage 89.95% 89.96% +<.01%
==========================================
Files 78 78
Lines 10804 10809 +5
Branches 1118 1120 +2
==========================================
+ Hits 9719 9724 +5
Misses 959 959
Partials 126 126
Continue to review full report at Codecov.
We have now run coverage across our entire code base multiple times with this fix, and it works. We have not seen any more failures.
We have now run with this fix for over 10 days in "production" and have not seen any failures. We have done 100+ runs at 9 computing hours each.
Thanks, this looks good. I wrote a test:
This PR makes this test pass. I wish the other linked issues had the same failures, so we had some confidence that this fixes them also.
But there's something I don't understand: without this PR, the error is very clear and should always happen in multi-threaded code: "SQLite objects created in a thread can only be used in that same thread." I have tests of multi-threaded coverage measurement, and they don't raise that error. Why is that?
Oh, right: in a multi-threaded program, the threads all write to the same dictionary, which is then written to the CoverageData object at the end of the program. So now I'm wondering how you produced errors when running coverage?
@nedbat We had race conditions appearing around the counter. If the Python interpreter decided to switch to the other thread after incrementing or decrementing the counter but before executing the DB call, we would get failures due to inconsistent state. It happened about 1% of the time, but that meant that our test suite always failed, since we have 100+ parallel workers. ;-)
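The failure mode described above can be sketched in a few lines. This is a toy model with hypothetical names (`RacyNestCounter`, `demo`), not coverage.py's actual code: the counter update and the guarded action happen in separate steps, and a deliberate pause between them plays the role of an unlucky thread switch, producing lost updates.

```python
import threading
import time

class RacyNestCounter:
    """Toy model: a nesting counter whose read and write are separated
    by a window in which the interpreter may switch threads."""

    def __init__(self):
        self.nest = 0

    def enter(self):
        current = self.nest        # step 1: read the counter
        time.sleep(0.001)          # deliberate window for a thread switch
        self.nest = current + 1    # step 2: write it back (update may be lost)

def demo(n_threads=8, iterations=25):
    counter = RacyNestCounter()

    def worker():
        for _ in range(iterations):
            counter.enter()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.nest, n_threads * iterations

final, expected = demo()
# Because updates collide, final ends up well below expected.
print(final, expected)
```

In real code the window is only a few bytecodes wide, which is why the failure showed up in roughly 1% of runs rather than every time.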
In your test results, note that the "Could not use a SQLite object from another thread" error is a bit of a red herring. You can allow objects to be accessed across threads and the error frequency decreases, but in the end it is only a symptom. The
I really wish I had a way to reproduce the problems, however crude.
You probably can't, since the bug depends on Python's thread switching. In Python 2 you could specify the switch interval in terms of instructions, and setting it to -1 would always switch. But in Python 3.2 they changed that to a time interval in seconds (to support async), and so this opportunity is gone. You could monkey patch the DB operations and simulate thread switching in them. Or you could make the counter an int-like class with side effects that simulate a thread switch.
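The two knobs mentioned above can be sketched as follows. The first part shows the Python 3 time-based switch interval (`sys.setswitchinterval`), which can only make switches more frequent, not deterministic; the second part is a hypothetical test aid (`switchy` is an invented name) that wraps a DB operation so every call yields to other threads, widening any race window.

```python
import sys
import time

# Python 2 had sys.setcheckinterval(instructions); since Python 3.2 only
# a time-based knob exists, so switches cannot be forced per bytecode.
old_interval = sys.getswitchinterval()
sys.setswitchinterval(1e-6)   # request switches as often as possible
try:
    pass  # run the multi-threaded stress test here
finally:
    sys.setswitchinterval(old_interval)

def switchy(db_op):
    """Hypothetical test aid: wrap a DB operation so each call gives the
    scheduler a chance to switch threads before the operation runs."""
    def wrapper(*args, **kwargs):
        time.sleep(0)   # hint the scheduler to switch threads
        return db_op(*args, **kwargs)
    return wrapper
```

Neither approach reproduces the bug deterministically; they only raise the odds that a switch lands inside the vulnerable window.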
Can I run your test suite that originally caused you the headaches?
@nedbat Not easily, since it is our internal test suite. We could find a day/time to meet up and work on it together. That way you could use my laptop and some customized
@strichter: what about the simpler #723 as a fix? Can we tease apart the different sorts of issues to choose the right solutions?
How does the upsert pattern help here? There is a race condition problem due to threading. You can either implement the locking yourself or have the library manage it, since it already does that using a file lock. I feel that the changeset in this PR is as minimal as it gets.
    self._file_map[path] = id

def _connect(self):
-   if self._db is None:
+   if get_thread_id() not in self._dbs:
        if os.path.exists(self.filename):
@strichter I'll merge this today, but I'm wondering about a race condition on the creation of the database. There's a window here between checking if the file exists, and writing the initial metadata into the database. Is there a reason we don't see failures due to that race?
Good call! I think that coverage starts collecting so early that it always hits the initialization case before any threads are spawned. That is at least true in our setup, where we run coverage via a shell script. In other words, the first time this condition is hit, we are in the main thread.
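The window being discussed is a classic check-then-act race on file creation. A minimal sketch, with hypothetical function names and a made-up `meta` table (not coverage.py's schema): the first version mirrors the racy pattern, the second shows one idempotent way to close the window so it is harmless if two threads both reach initialization.

```python
import os
import sqlite3

def init_db_racy(path):
    """Check-then-act: between the exists() check and the schema write,
    another thread or process may also decide to initialize."""
    if not os.path.exists(path):                         # check
        con = sqlite3.connect(path)                      # ...window...
        con.execute("CREATE TABLE meta (k TEXT, v TEXT)")  # act
        con.commit()
        return con
    return sqlite3.connect(path)

def init_db_idempotent(path):
    """One way to close the window: make initialization idempotent,
    so concurrent initializers do no harm (a sketch, not the PR's code)."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS meta (k TEXT, v TEXT)")
    con.commit()
    return con
```

As noted above, the race never bites in practice here because initialization happens in the main thread before any workers start, but the idempotent form would remove the dependence on that timing.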
When running coverage with SQLite on our large code base (which starts many processes and threads), we noticed failures due to race conditions. It could happen that the `nest` counter had already been increased or decreased but the associated action had not yet completed, and thus another thread started work under wrong assumptions. This PR changes the behavior by providing every thread with a separate connection. The sqlite3 library already handles cases where database files are locked for writing, which we know works well with forked processes. Thus, this PR basically provides the same connection separation for threads that it already provided for forked processes.
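The per-thread-connection idea can be sketched in a few lines. This is an illustration with an invented class name (`PerThreadConnections`) and an added lock for the shared dict, not the PR's actual code: each thread that asks for a connection gets its own sqlite3 connection, keyed by its thread id.

```python
import sqlite3
import threading

class PerThreadConnections:
    """Sketch: keep one sqlite3 connection per thread, keyed by thread id,
    so no connection is ever used from a thread other than its creator."""

    def __init__(self, filename):
        self.filename = filename
        self._dbs = {}                   # thread id -> connection
        self._lock = threading.Lock()    # guards the dict (illustration)

    def _connect(self):
        tid = threading.get_ident()
        with self._lock:
            if tid not in self._dbs:
                self._dbs[tid] = sqlite3.connect(self.filename)
        return self._dbs[tid]
```

Cross-process write contention on the shared database file is then left to sqlite3's own file locking, matching how forked processes were already handled.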
BTW, the `nest` attribute is still needed for cases where the context manager is nested inside the same thread.