Releases: bminixhofer/nlprule
Release 0.4.5
New features
- A `transform` function in `nlprule-build` to transform binaries immediately after acquiring them. Suited for e.g. compressing the binaries before caching them.
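The idea of transforming a binary before caching it can be illustrated outside of Rust. The following is a minimal Python sketch of a gzip round trip, not the `nlprule-build` API; `compress_binary`, `decompress_binary`, and the payload bytes are hypothetical names and data chosen for illustration.

```python
import gzip


def compress_binary(data: bytes) -> bytes:
    """Compress raw bytes with gzip before they are written to a cache."""
    return gzip.compress(data)


def decompress_binary(data: bytes) -> bytes:
    """Restore the original bytes when the cached file is read back."""
    return gzip.decompress(data)


# Stand-in payload; a real tokenizer or rules binary would be read from disk.
payload = b"nlprule placeholder bytes " * 1000
cached = compress_binary(payload)
restored = decompress_binary(cached)
```

Whatever transformation is applied before caching has to be reversed when the cached binary is loaded, which is why the sketch pairs the two helpers.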
Fixes
- Require `srx=^0.1.2` to include a patch for out-of-bounds access.
Release 0.4.4
Breaking changes
This is a patch release but there are some small breaking changes to the public API:
- `from_reader` and `new` methods of the `Tokenizer` and `Rules` now return an `nlprule::Error` instead of `bincode::Error`.
- `tag_store` and `word_store` methods of the `Tagger` are now private.
New features
- The `nlprule-build` crate now has a `postprocess` method to allow e.g. compression of the produced binaries (#32, thanks @drahnr!).
Internal improvements
- Newtypes for `PosIdInt` and `WordIdInt` to clarify use of ids in the tagger (#31).
- Newtype for indices into the match graph (`GraphId`). All graph ids are validated at build time now (this also fixed an error where invalid graph ids in the XML files were ignored) (#31).
- Reduced size of the English tokenizer through better serialization of the chunker. From 15MB (7.7MB gzipped) to 11MB (6.9MB gzipped).
- Reduce allocations through making more use of iterators internally (#30). Improves speed but there is no significant benchmark improvement on my machine.
- Improve error handling by propagating more errors in the `compile` module instead of panicking and better build-time validation. Reduces `unwrap`s from ~80 to ~40.
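The newtype idea for ids can be sketched in Python with `typing.NewType`. This is only an analogy for the Rust newtypes named above: the `lookup_tag` helper and the tag table are hypothetical, and Python enforces the distinction only through a static type checker, not at runtime.

```python
from typing import Dict, NewType

# Distinct id types so a word id cannot be passed where a POS id is expected
# (checked by a static type checker such as mypy, not at runtime).
PosIdInt = NewType("PosIdInt", int)
WordIdInt = NewType("WordIdInt", int)

# Hypothetical tag table for illustration only.
TAGS: Dict[PosIdInt, str] = {PosIdInt(0): "NN", PosIdInt(1): "VB"}


def lookup_tag(pos_id: PosIdInt) -> str:
    """Return the tag string for a POS id."""
    return TAGS[pos_id]


tag = lookup_tag(PosIdInt(1))
# lookup_tag(WordIdInt(1)) would be flagged by a type checker.
```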
Release 0.4.3
Breaking changes
- `nlprule` does sentence segmentation internally now using srx. The Python API has changed, removing the `SplitOn` class and the `*_sentence` methods:
```python
from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)
rules.correct("He wants that you send him an email.")  # this takes an arbitrary text
```
- `new_from` is now called `from_reader` in the Rust API (thanks @drahnr!)
- `Token.text` and `IncompleteToken.text` are now called `Token.sentence` / `IncompleteToken.sentence` to avoid confusion with `Token.word.text`.
- `Tokenizer.tokenize` is now private. Use `Tokenizer.pipe` instead (also does sentence segmentation).
New features
- Support for Spanish (experimental).
- A new multiword tagger improves tagging of e.g. named entities for English and Spanish.
- Adds the `nlprule-build` crate which makes using the correct binaries in Rust easier (thanks @drahnr for the suggestion and discussion!)
- Scripts and docs in `build/README.md` to make creating the nlprule build directories easier and more reproducible.
- Full support for LanguageTool unifications.
- Binary size of the `Tokenizer` improved a lot. Now roughly x6 smaller for German and x2 smaller for English.
- New iterator helpers for `Rules` (thanks @drahnr!)
- A method `.sentencize` on the `Tokenizer` which does only sentence segmentation and nothing else.
Release 0.4.0
fix build.rs recommendation
Release 0.3.0
BREAKING: `suggestion.text` is now more accurately called `suggestion.replacements`.
Lots of speed improvements: NLPRule is now roughly 2.5x faster for German and 5x faster for English.
Rules have more information in the public API now: See #5
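To illustrate how the renamed `replacements` field might be consumed, here is a minimal sketch that applies one suggestion to a sentence. It does not call nlprule: `Suggestion` and `apply_first_replacement` are hypothetical stand-ins, and the offsets are assumed to be character indices with an exclusive end. The sample values are taken from the 0.2.0 example in these notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Suggestion:
    """Hypothetical stand-in mirroring the fields named in these notes."""

    start: int
    end: int
    replacements: List[str]


def apply_first_replacement(text: str, suggestion: Suggestion) -> str:
    """Replace text[start:end] with the first proposed replacement."""
    first = suggestion.replacements[0]
    return text[: suggestion.start] + first + text[suggestion.end :]


sentence = "She was not been here since Monday."
suggestion = Suggestion(start=4, end=16, replacements=["was not", "has not been"])
corrected = apply_first_replacement(sentence, suggestion)
```

Picking the first replacement is an arbitrary choice for the sketch; a caller could just as well present all entries of `replacements` to the user.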
0.2.2
0.2.1
Fix precedence of Rule IDs over Rule Group IDs.
0.2.0
- Updated to LT version 5.2.
- Suggestions now have a `message` and `source` attribute (#5):
```python
suggestions = rules.suggest_sentence("She was not been here since Monday.")
for s in suggestions:
    print(s.start, s.end, s.text, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?
```
- NLPRule is parallelized by default now. Parallelism can be turned off by setting the `NLPRULE_PARALLELISM` environment variable to `false`.
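Turning parallelism off from Python might look like the sketch below. Setting the variable before nlprule is imported is an assumption here; the notes do not say when the variable is read.

```python
import os

# Assumption: set the variable before nlprule is imported or used,
# since the notes do not state when it is read.
os.environ["NLPRULE_PARALLELISM"] = "false"

parallelism = os.environ["NLPRULE_PARALLELISM"]
```

The variable can equally be exported in the shell before the process starts.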
Release 0.1.9
Testing new distribution of binaries.
Release 0.1.8
Testing new distribution of binaries.