
rust: switch to stable compilers #752

Merged

merged 4 commits from the rust-deps branch into ankitects:master on Sep 7, 2020
Conversation

@cyphar (Contributor) commented Sep 2, 2020

The only dependency which required nightly compilers was pyo3, which
supports stable Rust (1.39.0 and later) since version 0.11.0. Supporting
stable Rust makes it easier to package Anki for distributions.

In addition, we can't link against libzip2.so.1 when building manylinux
wheels, but the only dependency which required this was "zip" and we
don't appear to make use of bzip2 in Anki.

One additional problem is that Anki was inadvertently violating Rust's aliasing
rules (it was implicitly passing out multiple &mut Backend references) and
the pyo3 update revealed this behaviour. The solution is to make all of the
methods take &self -- removing all PyBorrowMutErrors -- and then clean up
anything that failed to compile. Since Backend already had plenty of internal
locking (further pointing to aliasing violations) this wasn't too complicated
to do.

Signed-off-by: Aleksa Sarai [email protected]
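
As a rough sketch of the shape of the change (the names below are illustrative stand-ins, not Anki's exact API), every method exposed through pyo3 now takes &self, with any mutation going through an internal lock:

```rust
use std::sync::Mutex;

use pyo3::prelude::*;

// Stand-in for Anki's real collection type.
struct Collection {
    path: String,
}

#[pyclass]
struct Backend {
    // Interior mutability: pyo3 only ever needs to hand out `&self`,
    // so concurrent calls from Python cannot raise PyBorrowMutError.
    col: Mutex<Option<Collection>>,
}

#[pymethods]
impl Backend {
    fn open_collection(&self, path: String) -> PyResult<()> {
        *self.col.lock().unwrap() = Some(Collection { path });
        Ok(())
    }
}
```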

@cyphar cyphar force-pushed the rust-deps branch 2 times, most recently from e96d8bc to 850ae85 on September 2, 2020 at 12:35
@dae (Member) commented Sep 3, 2020

I have no fundamental objections to switching to stable, but currently 'make check' is failing - either another way of excluding that file needs to be found, or it needs to be automatically formatted when it's created

@cyphar (Contributor, Author) commented Sep 3, 2020

The simplest solution might be to support compiling with stable but use cargo +nightly fmt for make check -- which isn't really used outside of development. Though requiring both compilers be installed would be a pain, so I'll see if there's another way to exclude the generated file from cargo fmt (or just format only that file after generation but before the check).

@cyphar (Contributor, Author) commented Sep 3, 2020

Alright, build.rs will now automatically rustfmt the generated protobuf code.
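
For reference, the approach looks roughly like this in build.rs (the proto paths and the generated file name are assumptions here, since prost-build names its output after the protobuf package):

```rust
// build.rs (sketch): run rustfmt over the prost-generated code so that a
// later `cargo fmt -- --check` on a stable toolchain doesn't trip over it.
use std::env;
use std::path::PathBuf;
use std::process::Command;

fn main() {
    prost_build::compile_protos(&["proto/backend.proto"], &["proto/"])
        .expect("protobuf code generation failed");

    // Best effort: if rustfmt isn't on PATH, skip formatting rather than
    // failing the whole build.
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    let _ = Command::new("rustfmt")
        .arg(out_dir.join("backend.rs"))
        .status();
}
```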

@dae (Member) commented Sep 3, 2020

Thanks for doing that. Have you used this in testing for any period of time yet? I seem to recall issues using a newer pyo3 - it was previously rolled back in 17edbd1

@dae (Member) commented Sep 3, 2020

An easy way to trigger it is to use Tools>Check Database.

@cyphar (Contributor, Author) commented Sep 3, 2020

Yup, I just managed to trigger a PyBorrowMutError (I could've sworn I tried it before sending a PR and it worked, but I might've tested an old build). I'll take a look at what is needed to fix the BorrowMut problems -- there is a migration document for PyO3 which mentioned this issue (I did try to check if this would be a problem before opening the PR but I think I misunderstood exactly what it meant when talking about PyCell -- I haven't used pyo3 before).

@dae (Member) commented Sep 3, 2020

I fear there's no easy fix for this. We can't chuck a mutex at it because we need to be able to run multiple calls in parallel. Changing the code to run things like syncing on a background thread is probably the long term fix, but it will be some time before I have a chance to look into that.

prost-build doesn't generate rustfmt-safe code, so we had to add it to
the ignore list for rustfmt on "make check". However, the ignore list
isn't supported by stable rustfmt, so we have to work around this some
other way -- in this case, just do "rustfmt" on the generated file in
"build.rs" (this way, formatting errors in checked-in source code are
still caught but generated code doesn't cause spurious errors).

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar (Contributor, Author) commented Sep 4, 2020

tl;dr I think I found the easy fix for this.

I started working on it yesterday and I have a version which doesn't trigger PyBorrowMutError errors, and doesn't require putting everything behind Mutex<Backend>. In fact it currently only adds a Mutex for the sync_abort in the backend -- and I actually think that the sync_abort backend handling could be done in a nicer fashion. I've pushed a WIP patch which implements this, but I'll give a quick explanation for my thought process:

The main change which makes this entire procedure much simpler is to simply switch BackendService to take &self everywhere -- this entirely removes all PyBorrowMutError errors because we now no longer need mutable borrows (and it eliminates a whole class of runtime errors). In fact almost all of the &mut usage is actually not needed because self.col, self.progress_state and self.state are all behind internal Mutexes anyway!

Making this minor change reveals that there are only two things that aren't Mutexed:

  1. self.runtime which is used as a tokio::runtime::Handle cache. Putting this behind a Mutex does solve things, but it's less than ideal -- we only ever take a clone of the handle on each self.runtime_handle() call anyway (so we only ever need immutable &T access)! In order to keep the current lazy-initialisation behaviour I just opted to use once_cell which gives us only immutable references and uses less heavy synchronisation than a full-on Mutex.

  2. self.sync_abort is a bit uglier to hack around. The solution I went with was to just wrap it in a Mutex (and then only hold the lock when we set or clear it) -- this works and should not cause /too much/ lock contention. I also decided to make this use an RAII-style guard but I can switch this back to being open-coded.
     
    However while implementing this, I noticed that sync_abort doesn't handle multiple sync operations being queued -- meaning that if you have an existing AbortHandle registered it will be cleared if another thread tries to trigger a sync at the same time. The code now can detect this and will trigger a panic! (not ideal, just did it this way to make it simpler to show a PoC) but a nicer solution would be to make sync_abort a Vec<AbortHandle> -- and then when we do the abort we abort all existing sync operations (or we abort any old sync operation if a new one is requested).

With these changes, I can now build Anki and it no longer has any PyBorrowMutErrors -- and as far as I can tell Anki works perfectly fine (no performance problems, synchronisation works perfectly fine, and Check Database works without a hitch). I'm not sure exactly how extensively you would like me to test things, but it does appear to work with the latest patch I've pushed.

(As an aside, this is probably also a general memory safety issue, because the need for locking behind a &mut makes little sense and seems to indicate that there are actually memory safety issues (aliasing &mut references is undefined behaviour, and I think that's what's happening right now -- hence the breaking change in pyo3). I imagine that if you actually had two threads in Python calling operations which involve sync_abort while the GIL is released, you would probably get lots of data races, because they would write to the same memory without any locks.)
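
For concreteness, a minimal sketch of the resulting layout (field and method names approximate the ones discussed above; this is not the merged code verbatim):

```rust
use std::sync::Mutex;

use futures::future::AbortHandle;
use once_cell::sync::OnceCell;
use tokio::runtime::{Handle, Runtime};

struct Backend {
    // Lazily-initialised runtime: OnceCell only ever hands out shared
    // references, so no &mut self (and no Mutex) is needed to get a Handle.
    runtime: OnceCell<Runtime>,
    // Set when a sync starts, cleared when it finishes; the lock is only
    // held for those brief moments.
    sync_abort: Mutex<Option<AbortHandle>>,
}

impl Backend {
    fn runtime_handle(&self) -> Handle {
        self.runtime
            .get_or_init(|| Runtime::new().expect("failed to create tokio runtime"))
            .handle()
            .clone()
    }

    fn set_sync_abort(&self, handle: AbortHandle) {
        let mut guard = self.sync_abort.lock().unwrap();
        if guard.is_some() {
            // Two syncs queued at once: don't silently clobber the first.
            eprintln!("warning: a sync abort handle was already registered");
        }
        *guard = Some(handle);
    }
}
```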

@cyphar cyphar force-pushed the rust-deps branch 2 times, most recently from 4ea905c to 9fda265 on September 4, 2020 at 07:00
@cyphar: this comment has been minimized.

@dae (Member) commented Sep 5, 2020

Appreciate you looking into it - that didn't turn out to be as bad as I was expecting.

Rather than having the runtime created as needed, I'd been planning to pass a handle in at backend creation time, so that a single runtime can be shared by multiple collections & configured by the callers. It's not urgent though, so happy to run with once_cell for now if you don't feel like changing it.

A vec or hashmap of abort handles would be better, but until then, I'd prefer the code doesn't panic - just logging to the console should suffice.

@cyphar (Contributor, Author) commented Sep 5, 2020

In addition to the following, I'll remove BackendAbortGuard and simply use scopeguard to avoid cluttering the code with our own RAII guard implementation for something as trivial as this.
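
Roughly what that looks like with scopeguard (a sketch under the same assumed field names as above, not the final code):

```rust
use std::sync::Mutex;

use futures::future::{AbortHandle, Abortable};
use scopeguard::guard;

struct Backend {
    sync_abort: Mutex<Option<AbortHandle>>,
}

impl Backend {
    async fn full_sync(&self) {
        let (abort_handle, abort_reg) = AbortHandle::new_pair();
        *self.sync_abort.lock().unwrap() = Some(abort_handle);

        // scopeguard clears the handle on every exit path, so no bespoke
        // RAII type is needed for something this small.
        let _clear = guard((), |_| {
            *self.sync_abort.lock().unwrap() = None;
        });

        // The real sync future would be wrapped in Abortable here.
        let _ = Abortable::new(async { /* network sync */ }, abort_reg).await;
    }
}
```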

Rather than having the runtime created as needed, I'd been planning to pass a handle in at backend creation time, so that a single runtime can be shared by multiple collections & configured by the callers. It's not urgent though, so happy to run with once_cell for now if you don't feel like changing it.

I'll take a look but if it ends up being a bit odd to implement, I can do that in a follow-up PR.

A vec or hashmap of abort handles would be better, but until then, I'd prefer the code doesn't panic - just logging to the console should suffice.

Yup, I'll just make it log for now -- the slightly tricky thing is that AbortHandle and AbortRegistration don't implement Eq (or Hash) so we'd need to generate our own unique IDs each time and that'll just be a bit too hairy to do in the middle of this PR. I'd be happy to look into it in the future though.

Though it should be noted that media_sync_abort handles this by just making sync_media_inner a no-op...


One other thing I wanted to mention, which I noticed while looking a bit more at the abort code, is that there appear to be three different ways of aborting an operation:

  • progress_state.wants_abort which appears to be a mechanism to tell a program that uses the progress_fn that the user has requested that they stop doing work.
  • state.media_sync_abort and sync_abort, which are split into handles for aborting media synchronisation and collection synchronisation (but not media?) respectively.

I won't touch these in this PR (since it's not really relevant), but I do wonder whether this could be simplified a little bit -- is there a case where a user or plugin would want to abort media synchronisation but not any other kind of synchronisation that is happening at the same time?

In addition (maybe I'm wrong here), shouldn't the whole Future abort system avoid the need for progress_state.wants_abort? If you also abort the future which would call progress_fn and be told it should stop doing work, then it won't ever read wants_abort. Or do long-running threads get spawned which need an explicit mechanism to say "stop working"?

Thanks.

@cyphar (Contributor, Author) commented Sep 5, 2020

I'll take a look but if it ends up being a bit odd to implement, I can do that in a follow-up PR.

I'll leave the backend configuration for a separate PR since we'd need to update the protobuf schema in order to add the necessary information so that Python can configure the runtime -- and if you wanted to share the same runtime among Backends you'd actually need to have a corresponding #[pyclass] of some kind to represent a Runtime. But I can take a look at that some other time.
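
If it goes that way, one plausible shape (purely hypothetical, not something that exists in Anki today) would be a small #[pyclass] owning the shared runtime, constructed once on the Python side and handed to each Backend:

```rust
use std::sync::Arc;

use pyo3::prelude::*;
use tokio::runtime::Runtime;

// Hypothetical wrapper: Python would create one of these and pass it to
// every Backend so that all open collections share a single tokio runtime.
#[pyclass]
struct SharedRuntime {
    rt: Arc<Runtime>,
}

#[pymethods]
impl SharedRuntime {
    #[new]
    fn new() -> Self {
        SharedRuntime {
            rt: Arc::new(Runtime::new().expect("failed to create tokio runtime")),
        }
    }
}
```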

The previous implementation had some slightly questionable memory safety
properties (older versions of PyO3 didn't uphold the Rust aliasing rules
and would thus create multiple &mut references to #[pyclass] objects).
This explains why Backend has internal Mutex<T>s even though all of its
methods took &mut self.

The solution is to simply make all methods take &self, which luckily
doesn't pose too many issues -- most of the code inside Backend already
has sufficient locking. The only two things which needed to be
explicitly handled were:

1. "self.runtime" which was fairly easy to handle. All usages of
   the Runtime only require an immutable reference to create a new
   Handle, so we could switch to OnceCell which provides
   lazy-initialisation semantics without needing a more heavy-handed
   Mutex<tokio::runtime::Handle>.

2. "self.sync_abort" was simply wrapped in a Mutex<>, though some of the
   odd semantics of sync_abort (not being able to handle multiple
   processes synchronising at the same time) become pretty obvious with
   this change (for now we just log a warning in that case). In
   addition, switch to an RAII-style guard to make sure we don't forget
   to clear the abort_handle.

As a result, we now no longer break Rust's aliasing rules and we can
build with newer versions of PyO3 which have runtime checks for these
things (and build on stable Rust).

Signed-off-by: Aleksa Sarai <[email protected]>
We can't link against libzip2.so.1 when building manylinux wheels, but
the only dependency which required this was "zip" and we don't appear to
make use of bzip2 in Anki.

This fixes the following "make build" error on Linux:

  x maturin failed
    Caused by: Failed to ensure manylinux compliance
    Caused by: Your library is not manylinux compliant because it links the following forbidden libraries: ["libbz2.so.1"]

Signed-off-by: Aleksa Sarai <[email protected]>
The only dependency which required nightly compilers was pyo3, which
supports stable Rust (1.39.0 and later) since version 0.11.0. Supporting
stable Rust makes it easier to package Anki for distributions. No other
code changes were required.

Signed-off-by: Aleksa Sarai <[email protected]>
@dae dae merged commit 0f1e277 into ankitects:master Sep 7, 2020
@dae (Member) commented Sep 7, 2020

Thank you for your work on this!

The two separate methods of cancelling are ugly, but they address different needs - the progress callback can be used to abort long-running operations on the current thread, like "check media" and the media folder scan at the start of a media sync, while the abort handle is used to abort network operations that may hang for long periods of time without calling the progress handler. When the media checking code was written we hadn't yet bought into any of the async ecosystem, and while the code could theoretically be moved to a blocking future pool, AFAIK tokio can not actually terminate code running in the blocking pool - so an abort handle would return control, but the code would continue running in the background until it completed, with its computed results being discarded.
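
To make the first mechanism concrete, cooperative cancellation boils down to the long-running code polling a flag that the progress callback can set (a simplified sketch; Anki's actual progress state lives behind its own lock rather than an atomic):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified stand-in for progress_state.wants_abort: CPU-bound work on the
// current thread checks the flag periodically and bails out when it is set.
fn check_media(wants_abort: &AtomicBool) -> Result<u32, &'static str> {
    let mut checked = 0;
    for _file in 0..10_000 {
        if wants_abort.load(Ordering::Relaxed) {
            return Err("aborted by user");
        }
        // ... examine one media file ...
        checked += 1;
    }
    Ok(checked)
}
```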

By hashmap, I was thinking of the key identifying which kind of operation it is, eg "media_sync" or an integer constant for each different operation.
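
Presumably something along these lines (an illustrative sketch of the keyed registry, not merged code):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use futures::future::AbortHandle;

// One registry for all cancellable operations, keyed by which operation it
// is ("media_sync", "collection_sync", ...) rather than a single Option.
struct AbortHandles {
    handles: Mutex<HashMap<&'static str, AbortHandle>>,
}

impl AbortHandles {
    fn register(&self, op: &'static str, handle: AbortHandle) {
        self.handles.lock().unwrap().insert(op, handle);
    }

    fn abort(&self, op: &'static str) {
        if let Some(handle) = self.handles.lock().unwrap().remove(op) {
            handle.abort();
        }
    }
}
```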

@cyphar cyphar deleted the rust-deps branch September 7, 2020 07:11
dae added a commit that referenced this pull request Sep 22, 2020
rust: switch to stable compilers
@timokau commented Dec 21, 2020

Just in case you aren't aware, an Anki build currently still needs nightly Rust to build the orjson dependency. This has been blocking the update for us at NixOS. Since one of the stated goals of this PR was to make packaging easier, I thought you might appreciate a heads-up.

In any case, thank you for your efforts! This PR is already a big step.

@cyphar: this comment has been minimized.

@cyphar (Contributor, Author) commented Dec 21, 2020

Oh, I misunderstood -- orjson is a Python dependency. Yeah that sucks, but aside from moving away from orjson there's not much Anki can do about it.

@dae (Member) commented Dec 21, 2020

When I tested it in the past, orjson was significantly faster than the stock json library for bulk DB operations. You could patch the compatibility wrapper for the latter back into Anki if you want to avoid the pre-built orjson binary on PyPI, but your users would be getting a slower version of Anki.

Doesn't look like orjson's in a hurry to move away from nightly unfortunately: ijl/orjson#108

@timokau commented Dec 21, 2020

You could patch the compatibility wrapper for the latter back into Anki if you want to avoid the pre-built orjson binary on PyPI

Thanks for the suggestion. Pre-built binaries can be problematic, not just from a purity standpoint but also from a pragmatic one (supporting different architectures for example). Slowness is probably better than being stuck on an old version. Can the standard library version be used as a drop-in replacement for Anki's purposes?

@dae (Member) commented Dec 21, 2020

Yes: b0aedd6

@timokau commented Dec 21, 2020

Great, thank you!

@dae (Member) commented Jan 6, 2021

I've added the wrapper back to the latest git: f9e939a

@timokau commented Jan 7, 2021

Thanks again :)
