Add spellchecking #51
base: main
Changes from all commits
New language options file (German variants):

```diff
@@ -0,0 +1,8 @@
+{
+  "variants": [
+    "de_AT",
+    "de_DE",
+    "de_CH"
+  ],
+  "split_hyphens": true
+}
```
Updated options file (adds `split_hyphens`):

```diff
@@ -3,5 +3,6 @@
   "ignore_ids": [
     "GRAMMAR/PRP_MD_NN/2",
     "TYPOS/VERB_APOSTROPHE_S/3"
-  ]
+  ],
+  "split_hyphens": true
 }
```
New language options file (English variants):

```diff
@@ -0,0 +1,8 @@
+{
+  "variants": [
+    "en_GB",
+    "en_US",
+    "en_AU"
+  ],
+  "split_hyphens": true
+}
```
New language options file (no variants):

```diff
@@ -0,0 +1,4 @@
+{
+  "variants": [],
+  "split_hyphens": true
+}
```
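For context, `split_hyphens` presumably controls whether hyphenated compounds are spellchecked as their individual parts rather than as one token. A rough std-only sketch of that behavior, with a hypothetical function name not taken from the PR:

```rust
// Hypothetical sketch of what a `split_hyphens` option might do:
// when enabled, a hyphenated token is checked part by part.
fn tokens_to_check(token: &str, split_hyphens: bool) -> Vec<String> {
    if split_hyphens && token.contains('-') {
        // check each hyphen-separated part separately
        token.split('-').map(str::to_string).collect()
    } else {
        // check the token as a whole
        vec![token.to_string()]
    }
}

fn main() {
    assert_eq!(tokens_to_check("spell-checker", true), vec!["spell", "checker"]);
    assert_eq!(tokens_to_check("spell-checker", false), vec!["spell-checker"]);
    println!("ok");
}
```

This is only an illustration of the option's likely semantics; the actual lookup logic lives inside nlprule's speller.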
Example binary printing suggestions:

```diff
@@ -1,3 +1,5 @@
+use std::sync::Arc;
+
 use clap::Clap;
 use nlprule::{rules::Rules, tokenizer::Tokenizer};
 
@@ -18,11 +20,12 @@ fn main() {
     env_logger::init();
     let opts = Opts::parse();
 
-    let tokenizer = Tokenizer::new(opts.tokenizer).unwrap();
-    let rules = Rules::new(opts.rules).unwrap();
+    let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
+    let mut rules = Rules::new(opts.rules, tokenizer.clone()).unwrap();
+    rules.spell_mut().options_mut().variant = Some(rules.spell().variant("en_GB").unwrap());
```
**Review comment** (on the `spell_mut` line):

I find this a very un-ergonomic API. A few particular points to avoid: …

A more rustic API imho would be:

```rust
// explicit type annotations for clarity
let spellcheck: Spellcheck = Spellcheck::new(path_to_spellcheck_rules)?;
// alt:
let spellcheck: Spellcheck = Spellcheck::from_reader(fs::File::open(path_to_spellcheck_rules)?)?;
spellcheck.add_hunspell_dict

let previous: Option<Spellcheck> = rules.set_spell(spellcheck);
// and
rules.modify_spellcheck(move |spellcheck| Ok(spellcheck.extend_whitelist(words_iter)))?
```

Was there a particular reason to not split the spelling rules from the tokenization rules? A third file would not hurt and would avoid having to (…

**Reply:** I'll think about this. In my opinion a mutable …

**Reply:** Do you have some more information / source on this? I wouldn't have thought that this is bad.

**Reply:** This wouldn't compile, you need the speller to get a …

**Reply:** This is a very valid point. The reason is that …

**Reply:** A third binary and a function …

**Reply:** I'd just ask the crates.io team to increase the size limit so you don't have to worry about this (especially since (I think) you distribute some hunspell data as well). They do it on a case-by-case basis (rust-lang/crates.io#195). I understand the concern, but I'd argue that it is already quite impressive to store multiple hundred thousand valid words + lemmas + part-of-speech tags, rules, a noun chunker etc. in < 10 MB. An extra 900 kB for spellchecking should very rarely be an issue.

**Review comment:** There's one alternative API, moving some of the changes anticipated for #50 to this PR without going completely out of scope: a new trait

```rust
trait Suggest {
    fn suggest(&self, text: &str, tokenizer: &Tokenizer) -> Vec<Suggestion>; // or iterator

    fn correct(&self, text: &str, tokenizer: &Tokenizer) -> String {
        // default impl to just apply all the suggestions
    }
}
```

then … There is one new struct

```rust
struct Correcter {
    suggesters: Vec<Box<dyn Suggest>>
}
```

that can be constructed with

```rust
let correcter = Correcter::new(vec![
    Rules::from_reader(...).into(),
    Spell::from_reader(..., "en_GB").into()
])?; // implements `FromIterator` too
```

and implements

```rust
let rules = correcter.get::<Rules>(0)?;
// and mutably:
let spell = correcter.get_mut::<Spell>(1)?;
// I'm not 100% sure if this works with downcast-rs but I think it should
```

Extending the whitelist etc. would stay the same as before:

```rust
spell.options_mut().whitelist.push("nlprule".into());
```

but would not need the guard anymore (so no …):

```rust
spell.set_variant("en_US")?;
```

The Python API would of course adjust accordingly, keeping it in sync where possible. The nice thing about all of this is that nothing is a breaking change, since the …

What do you think?
The hunk continues after the review thread:

```diff
 
     let tokens = tokenizer.pipe(&opts.text);
 
     println!("Tokens: {:#?}", tokens);
-    println!("Suggestions: {:#?}", rules.suggest(&opts.text, &tokenizer));
+    println!("Suggestions: {:#?}", rules.suggest(&opts.text));
 }
```
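The reviewer's `Suggest`/`Correcter` proposal above can be sketched as a compilable std-only toy. All types here are simplified stand-ins (no `Tokenizer` parameter, a minimal `Suggestion`, a hard-coded suggester), not the actual nlprule API:

```rust
// Simplified stand-in for nlprule's Suggestion type.
struct Suggestion {
    start: usize,
    end: usize,
    replacement: String,
}

trait Suggest {
    fn suggest(&self, text: &str) -> Vec<Suggestion>;

    // Default impl applies suggestions back-to-front so earlier offsets stay valid.
    fn correct(&self, text: &str) -> String {
        let mut out = text.to_string();
        let mut suggestions = self.suggest(text);
        suggestions.sort_by_key(|s| s.start);
        for s in suggestions.iter().rev() {
            out.replace_range(s.start..s.end, &s.replacement);
        }
        out
    }
}

// Chains several suggesters, as in the proposed `Correcter`.
struct Correcter {
    suggesters: Vec<Box<dyn Suggest>>,
}

impl Correcter {
    fn correct(&self, text: &str) -> String {
        self.suggesters
            .iter()
            .fold(text.to_string(), |t, s| s.correct(&t))
    }
}

// Toy suggester replacing every "teh" with "the".
struct TehFixer;

impl Suggest for TehFixer {
    fn suggest(&self, text: &str) -> Vec<Suggestion> {
        text.match_indices("teh")
            .map(|(i, m)| Suggestion {
                start: i,
                end: i + m.len(),
                replacement: "the".to_string(),
            })
            .collect()
    }
}

fn main() {
    let correcter = Correcter {
        suggesters: vec![Box::new(TehFixer)],
    };
    assert_eq!(correcter.correct("teh cat ate teh fish"), "the cat ate the fish");
    println!("ok");
}
```

The design point being illustrated: with trait objects, grammar rules and spellchecking become interchangeable suggestion sources behind one interface, which is why the reviewer argues it avoids breaking changes.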
Example binary printing runnable rules:

```diff
@@ -1,3 +1,5 @@
+use std::sync::Arc;
+
 use clap::Clap;
 use nlprule::{rules::Rules, tokenizer::Tokenizer};
 
@@ -19,8 +21,8 @@ fn main() {
     env_logger::init();
     let opts = Opts::parse();
 
-    let tokenizer = Tokenizer::new(opts.tokenizer).unwrap();
-    let rules_container = Rules::new(opts.rules).unwrap();
+    let tokenizer = Arc::new(Tokenizer::new(opts.tokenizer).unwrap());
+    let rules_container = Rules::new(opts.rules, tokenizer.clone()).unwrap();
     let rules = rules_container.rules();
 
     println!("Runnable rules: {}", rules.len());
```

**Review comment** (on the `Arc::new` line): Iff the above API changes are implemented, the `Arc` should be able to be avoided. Imho …
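The `Arc` being debated exists so that `Rules` can hold the tokenizer while the caller keeps using it too. A minimal std-only illustration of that ownership pattern, with stand-in types rather than the real nlprule ones:

```rust
use std::sync::Arc;

// Stand-ins for the nlprule types, just to show the ownership pattern.
struct Tokenizer;

impl Tokenizer {
    fn pipe<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().collect()
    }
}

struct Rules {
    tokenizer: Arc<Tokenizer>,
}

impl Rules {
    fn new(tokenizer: Arc<Tokenizer>) -> Self {
        Rules { tokenizer }
    }

    // Stand-in for `suggest`: uses the shared tokenizer internally,
    // which is why the new API no longer takes a tokenizer argument.
    fn suggest(&self, text: &str) -> usize {
        self.tokenizer.pipe(text).len()
    }
}

fn main() {
    let tokenizer = Arc::new(Tokenizer);
    let rules = Rules::new(tokenizer.clone());
    // The caller can still use the tokenizer directly alongside `rules`.
    assert_eq!(tokenizer.pipe("a b c").len(), 3);
    assert_eq!(rules.suggest("a b c"), 3);
    println!("ok");
}
```

The reviewer's point is that if suggesters received the tokenizer by reference (as in the `Suggest` trait proposal), this shared ownership, and hence the `Arc`, would be unnecessary.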
**Review comment:** Use the language-specific affix file instead of hardcoding it; it saves yourself and a few other people a lot of pain :)