
Token as returned by pipe() is relative to the sentence boundaries #53

Closed
drahnr opened this issue Mar 17, 2021 · 6 comments · Fixed by #54

drahnr (Contributor) commented Mar 17, 2021

// Token<'_>
    pub char_span: (usize, usize),
    pub byte_span: (usize, usize),

fn pipe() returns a set of tokens whose spans are relative to the containing sentence, but there seems to be no trivial way of mapping them back to spans within the original text passed to pipe.

Suggestion: use a Range<usize> instead of a tuple for the relevant range of bytes/characters, for easier usage, and make the spans relative to the input text. For single-sentence inputs this does not change the semantics; for multi-sentence inputs it does.

It would also make sense to add the sentence's own bounds in bytes and chars (or to replace the sentence field entirely).

pub sentence: &'t str,
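
Concretely, the proposed shape could look like the following sketch (names and fields are illustrative, not the crate's actual API). With absolute Range spans, slicing the original input is a plain index operation:

```rust
use std::ops::Range;

// Hypothetical Token shape after the change: spans are absolute
// into the input text, not relative to the sentence.
struct Token<'t> {
    text: &'t str,
    char_span: Range<usize>,
    byte_span: Range<usize>,
    // byte range of the containing sentence within the input
    sentence_span: Range<usize>,
}

fn main() {
    let input = "A täst. Another test.";
    // "Another" starts at byte 9 ("ä" is two bytes) and char 8.
    let token = Token {
        text: "Another",
        char_span: 8..15,
        byte_span: 9..16,
        sentence_span: 9..22,
    };
    // Absolute byte spans index the input directly:
    assert_eq!(&input[token.byte_span.clone()], "Another");
    assert_eq!(&input[token.sentence_span.clone()], "Another test.");
    // Char spans select the same token by char index:
    let by_chars: String = input
        .chars()
        .skip(token.char_span.start)
        .take(token.char_span.len())
        .collect();
    assert_eq!(by_chars, token.text);
    println!("{}", token.text);
}
```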

Related cargo spellcheck issue drahnr/cargo-spellcheck#162

bminixhofer (Owner) commented
Thanks. That should definitely be fixed. The char_span and byte_span fields are used in a couple of places internally, but shifting them by a constant amount shouldn't make a difference.

Similarly, the sentence is also needed internally so it can't just be removed. I think the best solution would be making .pipe return an iterator over Sentences, where each sentence is an iterator over Tokens and stores some metadata (e.g. sentence text, tagger). This would also make it possible to make the annoying SENT_START token invisible to the user.
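
A minimal sketch of what that API shape could look like (hypothetical names, not a committed design): pipe() would yield Sentences, each owning its tokens plus metadata such as the sentence text, so SENT_START never reaches the user.

```rust
// Hypothetical types for illustration only.
struct Token<'t> {
    text: &'t str,
    byte_span: (usize, usize), // absolute into the input text
}

struct Sentence<'t> {
    text: &'t str,
    tokens: Vec<Token<'t>>,
}

// Each sentence is itself an iterator over its tokens.
impl<'t> IntoIterator for Sentence<'t> {
    type Item = Token<'t>;
    type IntoIter = std::vec::IntoIter<Token<'t>>;
    fn into_iter(self) -> Self::IntoIter {
        self.tokens.into_iter()
    }
}

fn main() {
    let sentence = Sentence {
        text: "A test.",
        tokens: vec![
            Token { text: "A", byte_span: (0, 1) },
            Token { text: "test", byte_span: (2, 6) },
        ],
    };
    println!("sentence: {}", sentence.text);
    for token in sentence {
        println!("{} {:?}", token.text, token.byte_span);
    }
}
```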

bminixhofer self-assigned this Mar 18, 2021
drahnr (Contributor, Author) commented Mar 18, 2021

Assuming you are planning to implement this since you self-assigned (thanks!) - do you have an idea how long this will take? I'd be happy to take this on tonight, since it's required to resolve some production fallout with cargo-spellcheck's current (poor) tokenization.

bminixhofer (Owner) commented Mar 18, 2021

Yes, I've been thinking about a Sentence wrapper around tokens (and a related struct for token slices) for some time, and since changing char_span and byte_span is a breaking change, I'd like to fix both in one go.

I can prioritize this; in that case I'd estimate it lands by Monday at the latest - if it's not working by then, I'll just fix char_span and byte_span and release.

In the meantime you can work around it in a semi-hacky way:

let text = "A täst. Another test.";

for sentence in tokenizer.pipe(text) {
    for token in sentence {
        // `token.sentence` is a subslice of `text`, so the pointer
        // difference is the sentence's byte offset within `text`.
        let offset_bytes = token.sentence.as_ptr() as usize - text.as_ptr() as usize;
        let offset_chars = text[..offset_bytes].chars().count();

        let fixed_char_span = (
            token.char_span.0 + offset_chars,
            token.char_span.1 + offset_chars,
        );
        let fixed_byte_span = (
            token.byte_span.0 + offset_bytes,
            token.byte_span.1 + offset_bytes,
        );
        println!(
            "{} {:?} {:?}",
            token.word.text.as_ref(),
            fixed_char_span,
            fixed_byte_span
        );
    }
}
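
The trick above works because token.sentence borrows directly from text, so subtracting the two as_ptr addresses yields the sentence's byte offset in the input. A self-contained sketch of the same offset arithmetic, with a plain subslice standing in for token.sentence:

```rust
fn main() {
    let text = "A täst. Another test.";
    // A subslice borrowed from `text`, standing in for `token.sentence`:
    let sentence: &str = &text[9..];
    // Byte offset of the subslice within the original string:
    let offset_bytes = sentence.as_ptr() as usize - text.as_ptr() as usize;
    // Char offset: counting chars in the prefix is O(n) per token,
    // which is the complexity concern mentioned below.
    let offset_chars = text[..offset_bytes].chars().count();
    assert_eq!(offset_bytes, 9);
    assert_eq!(offset_chars, 8); // "ä" is one char but two bytes
    println!("{} {}", offset_bytes, offset_chars);
}
```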

drahnr (Contributor, Author) commented Mar 18, 2021

It bumps the complexity quite a bit, especially for a fn that is called quite frequently - but a good band-aid for now! Thanks!

drahnr (Contributor, Author) commented Mar 25, 2021

I'll have some limited time on the weekend if you don't get around to it before then 🙃

bminixhofer (Owner) commented
Hi, sorry for not sticking to the timeline I gave above; I was busy this week. There's now a PR for this: #54. There are some things still to do, but I should be done soon. I'd definitely appreciate a review (once it's ready)!
