
Token as returned by pipe() is relative to the sentence boundaries #53

Closed
drahnr opened this issue Mar 17, 2021 · 6 comments · Fixed by #54

drahnr (Contributor) commented Mar 17, 2021

// Token<'_>
    pub char_span: (usize, usize),
    pub byte_span: (usize, usize),

fn pipe() returns a set of tokens whose spans are relative to the containing sentence, but there seems to be no trivial way of mapping them back to spans within the original text passed to pipe.

Suggestion: use a Range<usize> instead of a tuple for the relevant range of bytes/characters, for easier usage, and make the spans relative to the input text. For single-sentence inputs this does not change the semantics; for multi-sentence inputs it does.

It would also make sense to add the sentence's own bounds in bytes and chars (or to replace the sentence field entirely).

pub sentence: &'t str,
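
Concretely, the proposed shape could look like the following sketch (names and fields are illustrative, not the crate's actual API). With absolute Range spans, slicing the original input is a plain index operation:

```rust
use std::ops::Range;

// Hypothetical Token shape after the change: spans are absolute
// into the input text, not relative to the sentence.
struct Token<'t> {
    text: &'t str,
    char_span: Range<usize>,
    byte_span: Range<usize>,
    // byte range of the containing sentence within the input
    sentence_span: Range<usize>,
}

fn main() {
    let input = "A täst. Another test.";
    // "Another" starts at byte 9 ("ä" is two bytes) and char 8.
    let token = Token {
        text: "Another",
        char_span: 8..15,
        byte_span: 9..16,
        sentence_span: 9..22,
    };
    // Absolute byte spans index the input directly:
    assert_eq!(&input[token.byte_span.clone()], "Another");
    assert_eq!(&input[token.sentence_span.clone()], "Another test.");
    // Char spans select the same token by char index:
    let by_chars: String = input
        .chars()
        .skip(token.char_span.start)
        .take(token.char_span.len())
        .collect();
    assert_eq!(by_chars, token.text);
    println!("{}", token.text);
}
```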

Related cargo spellcheck issue drahnr/cargo-spellcheck#162

bminixhofer (Owner) commented
Thanks. That should definitely be fixed. The char_span and byte_span fields are used in a couple of places internally, but shifting them by a constant amount shouldn't make a difference.

Similarly, the sentence is also needed internally so it can't just be removed. I think the best solution would be making .pipe return an iterator over Sentences, where each sentence is an iterator over Tokens and stores some metadata (e.g. sentence text, tagger). This would also make it possible to make the annoying SENT_START token invisible to the user.
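
A minimal sketch of what that API shape could look like (hypothetical names, not a committed design): pipe() would yield Sentences, each owning its tokens plus metadata such as the sentence text, so SENT_START never reaches the user.

```rust
// Hypothetical types for illustration only.
struct Token<'t> {
    text: &'t str,
    byte_span: (usize, usize), // absolute into the input text
}

struct Sentence<'t> {
    text: &'t str,
    tokens: Vec<Token<'t>>,
}

// Each sentence is itself an iterator over its tokens.
impl<'t> IntoIterator for Sentence<'t> {
    type Item = Token<'t>;
    type IntoIter = std::vec::IntoIter<Token<'t>>;
    fn into_iter(self) -> Self::IntoIter {
        self.tokens.into_iter()
    }
}

fn main() {
    let sentence = Sentence {
        text: "A test.",
        tokens: vec![
            Token { text: "A", byte_span: (0, 1) },
            Token { text: "test", byte_span: (2, 6) },
        ],
    };
    println!("sentence: {}", sentence.text);
    for token in sentence {
        println!("{} {:?}", token.text, token.byte_span);
    }
}
```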

bminixhofer self-assigned this Mar 18, 2021
drahnr (Contributor, Author) commented Mar 18, 2021

Assuming you are planning to implement this since you self-assigned (thanks!) - do you have an idea how long this will take? I'd be happy to take this on tonight, since it's required to resolve some production fallout with cargo-spellcheck's current (poor) tokenization.

bminixhofer (Owner) commented Mar 18, 2021

Yes, I've been thinking about a Sentence wrapper around tokens (and a related struct for token slices) for some time, and since changing char_span and byte_span is a breaking change, I'd like to fix both in one go.

I can prioritize this; in that case I'd estimate it lands by Monday at the latest - if it's not working by then, I'll just fix char_span and byte_span and release.

In the meantime you can work around it in a semi-hacky way:

let text = "A täst. Another test.";

for sentence in tokenizer.pipe(text) {
    for token in sentence {
        // `token.sentence` is a subslice of `text`, so the pointer
        // difference is the sentence's byte offset within `text`.
        let offset_bytes = token.sentence.as_ptr() as usize - text.as_ptr() as usize;
        let offset_chars = text[..offset_bytes].chars().count();

        let fixed_char_span = (
            token.char_span.0 + offset_chars,
            token.char_span.1 + offset_chars,
        );
        let fixed_byte_span = (
            token.byte_span.0 + offset_bytes,
            token.byte_span.1 + offset_bytes,
        );
        println!(
            "{} {:?} {:?}",
            token.word.text.as_ref(),
            fixed_char_span,
            fixed_byte_span
        );
    }
}
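
The trick above works because token.sentence borrows directly from text, so subtracting the two as_ptr addresses yields the sentence's byte offset in the input. A self-contained sketch of the same offset arithmetic, with a plain subslice standing in for token.sentence:

```rust
fn main() {
    let text = "A täst. Another test.";
    // A subslice borrowed from `text`, standing in for `token.sentence`:
    let sentence: &str = &text[9..];
    // Byte offset of the subslice within the original string:
    let offset_bytes = sentence.as_ptr() as usize - text.as_ptr() as usize;
    // Char offset: counting chars in the prefix is O(n) per token,
    // which is the complexity concern mentioned below.
    let offset_chars = text[..offset_bytes].chars().count();
    assert_eq!(offset_bytes, 9);
    assert_eq!(offset_chars, 8); // "ä" is one char but two bytes
    println!("{} {}", offset_bytes, offset_chars);
}
```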

drahnr (Contributor, Author) commented Mar 18, 2021

It bumps the complexity quite a bit, especially for a fn that is called quite frequently - but a good band-aid for now! Thanks!

drahnr (Contributor, Author) commented Mar 25, 2021

I'll have some limited time on the weekend if you don't get around to it before then 🙃

bminixhofer (Owner) commented
Hi, sorry for not sticking to the timeline I gave above; I was busy this week. There's now a PR for this: #54. There are some things still to do, but I should be done soon. I'd definitely appreciate a review (once it's ready)!
