Add spellchecking #51
Conversation
There is a symspell package in Rust: https://github.com/reneklacan/symspell. Can we benchmark it and check whether the memory requirements are satisfied? What are the hard bounds on memory you're looking to support for this library? And maybe we could integrate that package optionally, if its speed improvement justifies it?
Hi, I have to admit that I did not do a "proper" benchmark of SymSpell vs the FST approach. One of the problems with SymSpell is that the precomputes need to be either stored in the binary (which blows up its size) or precomputed every time the spellchecker is loaded.
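For reference, this is roughly how the linked symspell crate is used, going by memory of its README - treat the exact dictionary file and signatures as assumptions. The expensive precomputation happens in `load_dictionary`:

```rust
use symspell::{AsciiStringStrategy, SymSpell, Verbosity};

// building the index is the expensive, memory-hungry step
let mut symspell: SymSpell<AsciiStringStrategy> = SymSpell::default();
symspell.load_dictionary("data/frequency_dictionary_en_82_765.txt", 0, 1, " ");

// lookup itself is very fast once the index exists
let suggestions = symspell.lookup("spelchecking", Verbosity::Top, 2);
println!("{:?}", suggestions);
```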
There are no hard bounds on memory. Up until now I have decided trade-offs between speed and memory on a case-by-case basis, with a tendency to prefer speed (also, notably, there are not that many places where this is relevant; often the best solution is both speed- and memory-efficient). One issue with this library currently is that there are two very different use cases:
So far I've tried to find a middle ground between these two where it was relevant. For spellchecking, the FST-based approach is better for (1), while the SymSpell approach might be better for (2).
I think this is the best solution. There is an open issue on modularizing the crate (#50) in a way where integrating a second spellchecker would come quite naturally. So, in short:
The API for spellchecking is fully implemented now, this is how it will look:

```rust
use nlprule::{Rules, Tokenizer};

let tokenizer = Tokenizer::new("path/to/tokenizer.bin").unwrap();

// `Rules` now takes an `Arc<Tokenizer>` as second argument, like in the Python API,
// and does not require passing the tokenizer to the other methods anymore
let mut rules = Rules::new("path/to/rules.bin", tokenizer.into()).unwrap();

// by default, spellchecking is disabled
// the spellchecker can be accessed by `.spell()` and `.spell_mut()`
// and contains `SpellOptions` which can be accessed analogously
println!("{:#?}", rules.spell().options()); // prints the default options

// enables spellchecking. `Variant` is a string newtype denoting valid variants
// the `.variant` method on the spellchecker gets the variant for a language code
// or returns an error if none exists
rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());

// now `rules.correct` and `rules.suggest` also find spelling mistakes!

// there are some other options like a whitelist for words (the whitelist is actually a `HashSet<String>`)
rules
    .spell_mut()
    .options_mut()
    .whitelist
    .insert("nlprule".into());

// and that's it! there are also some new utility methods like `.variants()` on the spellchecker
// which are not too important
```

```python
import pytest
from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

# rules.spell is the spellchecker object
# it does not do anything useful besides providing the ability to set and get options at the moment
print(rules.spell)

# the used variant is `None` by default, i.e. spellchecking is disabled
assert rules.spell.options.variant is None

# setting the variant to one which does not exist prints a helpful error message
with pytest.raises(ValueError):
    rules.spell.options.variant = "en_INVALID"

# setting the variant to a valid variant enables spellchecking
rules.spell.options.variant = "en_GB"

# any iterable works here, but it has to be assigned with =, the returned object is not mutable
rules.spell.options.whitelist = ["nlprule"]

assert rules.correct("nlprule has spelllchcking!") == "nlprule has spellchecking!"
```

I am quite happy with how this turned out. I'd appreciate some feedback regarding the API from people who were previously interested in this (@theelgirl, @drahnr) and of course from anyone else who is reading this. I might have missed something. As usual, implementing this took more effort than anticipated (LT uses quite a bit more than just one wordlist for spellchecking) but I'm excited to add this functionality! There are some smaller things still left to do to get this over the finish line (see the checklist above). I hope to merge this PR & release next weekend.
```
for ending in {
    "S": ["s"],
    "N": ["n"],
    "E": ["e"],
    "F": ["in"],
    "A": ["e", "er", "es", "en", "em"],
    "Ä": ["r", "s", "n", "m"],
```
Use the language-specific affix file instead of hardcoding it; it will save you and a few other people a lot of pain :)
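A rough sketch of what reading from the affix file could look like (a hypothetical helper, not code from this PR; it only picks up the `add` column of hunspell `SFX` rule lines and ignores stripping, conditions and morphological fields):

```rust
use std::collections::HashMap;
use std::fs;

/// Collect the suffixes declared in a hunspell .aff file, grouped by flag.
/// Rule lines look like `SFX <flag> <strip> <add> <condition>`.
fn suffixes_from_affix_file(path: &str) -> std::io::Result<HashMap<String, Vec<String>>> {
    let content = fs::read_to_string(path)?;
    let mut suffixes: HashMap<String, Vec<String>> = HashMap::new();

    for line in content.lines() {
        let fields: Vec<&str> = line.split_whitespace().collect();
        // header lines (`SFX <flag> <Y/N> <count>`) have only 4 fields and are skipped here
        if let ["SFX", flag, _strip, add, _condition] = fields.as_slice() {
            suffixes
                .entry((*flag).to_string())
                .or_default()
                .push((*add).to_string());
        }
    }

    Ok(suffixes)
}
```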
```diff
-let rules = Rules::new(opts.rules).unwrap();
+let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
+let mut rules = Rules::new(opts.rules, tokenizer.clone()).unwrap();
+rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());
```
I find this a very, very un-ergonomic API.

A few particular points to avoid:

- mutable options in the public API
- structs in the public lib API
- circular borrow fun (needing `rules.spell_mut()` and `rules.spell()` at the same time)
- combining artifacts that don't seem necessarily connected (spellcheck rules vs grammar / tokenizer rules)

A more rustic API imho would be:

```rust
// explicit type annotations for clarity
let spellcheck: Spellcheck = Spellcheck::new(path_to_spellcheck_rules)?;
// alt:
let spellcheck: Spellcheck = Spellcheck::from_reader(fs::File::open(path_to_spellcheck_rules)?)?;
spellcheck.add_hunspell_dict

let previous: Option<Spellcheck> = rules.set_spell(spellcheck);
// and
rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?
```

Was there a particular reason not to split the spelling rules from the tokenization rules? A third file would not hurt and would avoid (cargo-spellcheck 🎩) issues with deployment size - 900kB extra is a lot when the upper limit is 10MB.
I'll think about this. In my opinion a mutable `SpellOptions` struct, where all mutations are guaranteed to be valid and the spellchecker is updated when it is dropped, is more ergonomic than something like `rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?`. E.g. it seamlessly allows all `HashSet` operations on the whitelist; otherwise those would have to be reimplemented as `*_whitelist` methods.

> mutable options in public API

Do you have some more information / a source on this? I wouldn't have thought that this is bad.

> circular borrow fun (need rules.spell_mut() and rules.spell() at the same time)

This wouldn't compile. You need the speller to get a `Variant` (the `Variant` is owned, it doesn't need the spellchecker), then you can use the `Variant` to set the options. But I'm also not perfectly happy with that part.
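Continuing from the Rust example above, a minimal sketch of why that two-step flow compiles (assuming, as described, that `Variant` is an owned value):

```rust
// the immutable borrow of `rules` ends after this statement,
// because `variant` is owned and carries no reference into `rules`
let variant = rules.spell().variant("en_GB").unwrap();

// so the mutable borrow afterwards is fine; `rules.spell()` and
// `rules.spell_mut()` are never live at the same time
rules.spell_mut().options_mut().variant = Some(variant);
```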
> combine artifacts that don't seem necessarily connected (spellcheck rules vs grammar / tokenizer rules)

This is a very valid point. The reason is that `.correct()` and `.suggest()` on the `Rules` should easily be configurable to do both grammar and spelling correction. The best fix would probably be some meta-struct that combines any number of suggestion providers (where `Rules` and `Spellcheck` are each one suggestion provider), which is strongly related to #50. Defining and implementing that would take significantly more work than I want to do before nlprule can do spellchecking 🙂

> A third file would not hurt

A third binary and a function `Rules.set_spell` really does seem like a good idea. But I would argue that it decreases ergonomics, at least as long as the `Spell` structure is still part of the `Rules`.

> when the upper limit is 10MB.

I'd just ask the crates.io team to increase the size limit so you don't have to worry about this (especially since (I think) you distribute some hunspell data as well). They do it on a case-by-case basis (rust-lang/crates.io#195). I understand the concern, but I'd argue that it is already quite impressive to store multiple hundred thousand valid words + lemmas + part-of-speech tags, rules, a noun chunker etc. in < 10MB. An extra 900kB for spellchecking should very rarely be an issue.
There's one alternative API moving some of the changes anticipated for #50 to this PR without going completely out of scope:

A new trait `Suggest`:

```rust
trait Suggest {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>; // or iterator

    fn correct(&self, text: &str, tokenizer: &Tokenizer) -> String {
        // default impl to just apply all the suggestions
    }
}
```
Then `Rules` and a new separate struct `Spell` implement `Suggest`. The spell struct implements `from_reader`, `new`, etc. like the `Rules`, but with an additional argument to denote the variant: `Spell::new("path/to/bin", "en_GB")`.

There is one new struct `Correcter` with the fields:

```rust
struct Correcter {
    suggesters: Vec<Box<dyn Suggest>>
}
```
that can be constructed with

```rust
let correcter = Correcter::new(vec![
    Rules::from_reader(...).into(),
    Spell::from_reader(..., "en_GB").into(),
])?; // implements `FromIterator` too
```

and implements `Suggest` as well. Getting the inner suggesters would then have to be done with downcast-rs and look like:

```rust
let rules = correcter.get::<Rules>(0)?;
// and mutably:
let spell = correcter.get_mut::<Spell>(1)?;
// I'm not 100% sure if this works with downcast-rs but I think it should
```
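For what it's worth, the downcast-rs wiring would presumably look roughly like this - a sketch going by the crate's documented `Downcast` trait and `impl_downcast!` macro, with a hypothetical `get` accessor matching the `correcter.get::<Rules>(0)` call above:

```rust
use downcast_rs::{impl_downcast, Downcast};

// the trait has to opt into downcasting
trait Suggest: Downcast {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>;
}
impl_downcast!(Suggest);

impl Correcter {
    /// Hypothetical accessor: returns the suggester at `index` if it has type `T`.
    fn get<T: Suggest>(&self, index: usize) -> Option<&T> {
        self.suggesters.get(index)?.downcast_ref::<T>()
    }
}
```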
Extending the whitelist etc. would stay the same as before:

```rust
spell.options_mut().whitelist.push("nlprule".into());
```

But it would not need the guard anymore (so no `Drop` impl - I agree that mutable options with some guard are problematic because they hide cost from the user; this way I really don't see an issue). Setting the variant is fallible and would thus not be part of the options:

```rust
spell.set_variant("en_US")?;
```

The Python API would of course adjust accordingly, keeping it in sync when possible.
The nice thing about all of this is that nothing is a breaking change, since the `Rules` keeps the same methods; there are just additional `Correcter` and `Spell` structs. There are multiple problems though:

- `.correct` and `.suggest` need the `Tokenizer` as second argument. These will by far be the most commonly used methods, so I'd like them to be as simple as possible, e.g. `fn correct(text: &str) -> String` and `fn suggest(text: &str) -> Vec<Suggestion>` like in the current PR, without having to explicitly keep the tokenizer around if you don't need it.
- The downcast magic happening in the `Correcter` is not ideal, but I believe it cannot be avoided.
- A third binary hurts ergonomics, especially because including a binary is quite verbose at the moment.
- Spellchecking suggestions should have lower priority than suggestions from the grammar rules. This is not trivial to enforce with the `Correcter` and might require some notion of priority on the `Suggestion`.

What do you think?
```diff
@@ -19,8 +21,8 @@ fn main() {
     env_logger::init();
     let opts = Opts::parse();

-    let tokenizer = Tokenizer::new(opts.tokenizer).unwrap();
-    let rules_container = Rules::new(opts.rules).unwrap();
+    let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
```
Iff the above API changes are implemented, the `Arc` should be avoidable. Imho `Spellcheck` should parse the file and maybe use a `Tokenizer` reference only when extracting items.
```rust
let prefix = parts.next().unwrap();
let suffix = parts.next().unwrap();
```
I'd strongly suggest using `debug_assert_eq!(parts.clone().count(), )` and `.expect()` instead.
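A minimal sketch of that pattern (the input line, the separator, and the expected count of 2 are placeholders, since the original comment leaves them open):

```rust
let line = "prefix suffix"; // placeholder input
let mut parts = line.split_whitespace();

// cheap sanity check that only runs in debug builds
debug_assert_eq!(parts.clone().count(), 2);

// `.expect()` documents the invariant instead of a bare `.unwrap()`
let prefix = parts.next().expect("line must have a prefix");
let suffix = parts.next().expect("line must have a suffix");
println!("{prefix} {suffix}");
```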
```diff
@@ -81,6 +89,8 @@ pub enum Error {
     Unimplemented(String),
     #[error("error parsing to integer: {0}")]
     ParseError(#[from] ParseIntError),
+    #[error("nlprule error: {0}")]
```
With `#[from]`, adding an explicit `{0}` to the message would duplicate the error, since the backtrace will also contain the cause, which is implicit with `#[from]`.

So either `#[error(transparent)]` or `#[error("nlprule error")]`.
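A small sketch of the two suggested alternatives using thiserror (the variant name and the inner `nlprule::Error` type are made up for illustration):

```rust
use std::num::ParseIntError;
use thiserror::Error;

#[derive(Error, Debug)]
pub enum Error {
    #[error("error parsing to integer: {0}")]
    ParseError(#[from] ParseIntError),

    // option 1: forward `Display` to the inner error entirely
    #[error(transparent)]
    Nlprule(#[from] nlprule::Error),

    // option 2 (instead of the above): a fixed message; the inner error
    // stays reachable through `source()` because of `#[from]`
    // #[error("nlprule error")]
    // Nlprule(#[from] nlprule::Error),
}
```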
Note that the review is a first glance and by no means complete; I'll have another go at the impl details tomorrow 🤞
```diff
         .rev()
         .flatten()
         .collect();
-    candidates.sort_by(|(_, a), (_, b)| a.cmp(b));
-    if candidates.is_empty() {
-        None
-    } else {
-        Some(candidates.remove(0).0)
-    }
+    candidates.sort_unstable();
+    candidates.dedup();
```
Looks a bit like https://docs.rs/itertools/0.10.0/itertools/fn.kmerge.html could be applied 🙃
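For context, `kmerge` merges already-sorted iterators into one sorted iterator, so a separate sort pass over the combined candidates could potentially be dropped. A generic illustration (not this PR's data):

```rust
use itertools::Itertools;

// three already-sorted candidate lists
let lists = vec![vec![1, 4, 7], vec![2, 5], vec![3, 6, 8]];

// merged lazily into one sorted sequence, no extra sort pass needed
let merged: Vec<_> = lists.into_iter().kmerge().collect();
assert_eq!(merged, vec![1, 2, 3, 4, 5, 6, 7, 8]);
```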
```diff
 pub struct PosReplacer {
     pub(crate) matcher: PosMatcher,
 }

 impl PosReplacer {
-    fn apply(&self, text: &str, tokenizer: &Tokenizer) -> Option<String> {
+    pub fn apply(&self, text: &str, tokenizer: &Tokenizer) -> Vec<String> {
```
A lot of the API returns collected values that could also just be an `impl Iterator` at the cost of a few lifetime parameters. In the long run this can make a huge difference, since most `collect` calls could be avoided, each of which costs an extra pass over N items plus an additional allocation.
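A generic illustration of the trade-off (not nlprule code): the lazy variant needs a lifetime but avoids the intermediate allocation and the extra pass.

```rust
// eager: allocates a Vec and walks the input just to collect it
fn words_collected(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_owned).collect()
}

// lazy: no allocation, the caller drives the iteration
fn words_lazy(text: &str) -> impl Iterator<Item = &str> + '_ {
    text.split_whitespace()
}

fn main() {
    let text = "nlprule has spellchecking";
    assert_eq!(words_collected(text).len(), 3);
    assert_eq!(words_lazy(text).count(), 3);
}
```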
Hi, thanks for taking a look!
I was asking about the public API (thanks for your feedback on that!); the PR isn't fully ready for review yet.
Ah, overlooked that. Let me know once you want a more introspective review!
I will hold off on merging this for some time to explore better modularization first, see #50 (comment).
Will there be a way to load from in-memory data rather than a path? I find it both inconvenient and irritating that crates like ttspico don't let me bundle the dependencies into the binary with ..., especially when Cargo's solutions for making sure that non-embedded dependencies wind up where the thing looks for them, across all the different ways of building and running artifacts, aren't the greatest.
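Something along these lines would cover that use case, assuming the `from_reader` constructors mentioned earlier in this thread accept any reader; the file paths and exact signatures here are assumptions, not the finalized API:

```rust
use std::io::Cursor;
use nlprule::{Rules, Tokenizer};

// bundle the binaries into the executable at compile time
static TOKENIZER_BIN: &[u8] = include_bytes!("../data/en_tokenizer.bin");
static RULES_BIN: &[u8] = include_bytes!("../data/en_rules.bin");

fn load_embedded() -> Result<Rules, nlprule::Error> {
    let tokenizer = Tokenizer::from_reader(Cursor::new(TOKENIZER_BIN))?;
    Ok(Rules::from_reader(Cursor::new(RULES_BIN), tokenizer.into())?)
}
```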
This PR implements spellchecking using the algorithm described in Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction. I tried multiple other methods:
I ended up with the error-tolerant FST retrieval algorithm because it plays well with the portability goals of nlprule by needing very little memory while still being quite fast.
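For intuition, this is the general shape of error-tolerant lookup over an FST, shown with the `fst` crate's Levenshtein automaton - a generic illustration of the idea, not the exact algorithm from the paper or this PR's internal code (requires the crate's `levenshtein` feature):

```rust
use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // the word list is stored as a compact FST (input must be sorted)
    let words = Set::from_iter(vec!["spellchecking", "tokenizer"])?;

    // an automaton accepting everything within edit distance 2 of the query
    let query = Levenshtein::new("spelchecking", 2)?;

    // intersecting it with the FST yields the candidate corrections
    let candidates = words.search(query).into_stream().into_strs()?;
    assert_eq!(candidates, vec!["spellchecking"]);
    Ok(())
}
```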
There are a couple of issues with the current implementation:
- The distance used is not the true Damerau-Levenshtein distance but the optimal string alignment distance.
- `action=ignore_spelling` in the disambiguation rules needs to be ignored.

I still want to address the issues marked with a box in this PR. The others are out of scope.
It took quite some exploration of different solutions to come up with something I'm satisfied with: now the spellchecking relevant code is isolated in one module (which makes #50 easier) and is not too heavy on binary size (adds ~2.1MB, 900kb xzipped for English).