Improve the error when passing an empty sentence to the tokenizer #40

Merged: tomaarsen merged 4 commits into main from ux/better_empty_sentence_error on Oct 31, 2023

@tomaarsen (Owner) commented on Oct 31, 2023

Closes #33

Hello!

Pull Request overview

  • Improve the error when passing an empty sentence to the tokenizer
    • test

Details

Previously, passing an empty sentence to the tokenizer during training or prediction produced this error:

from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
entities = model.predict([""])
[sic]\span_marker\tokenizer.py:205: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Traceback (most recent call last):
  File "[sic]\demo_38.py", line 3, in <module>
    entities = model.predict([""])
  File "[sic]\span_marker\modeling.py", line 459, in predict
    tokenizer_dict = self.tokenizer(
  File "[sic]\span_marker\tokenizer.py", line 205, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer
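
The final `ValueError` in this traceback is raised by Python's `int()`, not by SpanMarker, which is why it says nothing about the input. Below is a minimal, NumPy-free sketch of the failing computation. It assumes that an empty sentence produces only special tokens, whose word IDs are `None`; that assumption and the variable names are illustrative and not taken from the library.

```python
import math

# Assumption: an empty sentence is tokenized to special tokens only,
# and special tokens have no word ID (None).
word_ids = [None, None]

# Mirrors the float cast in tokenizer.py: None becomes NaN.
as_floats = [math.nan if w is None else float(w) for w in word_ids]

# Mirrors np.nanmax: ignore NaN entries; the result is NaN if nothing is left.
valid = [x for x in as_floats if not math.isnan(x)]
max_word_id = max(valid) if valid else math.nan

try:
    num_words = int(max_word_id) + 1
except ValueError as exc:
    message = str(exc)
    print(message)
```

The message gives no hint that the input sentence was empty, which is the confusion this pull request addresses.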

Now, you get the following error instead:

[sic]\span_marker\tokenizer.py:209: RuntimeWarning: All-NaN slice encountered
  max_word_ids = np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))
Traceback (most recent call last):
  File "[sic]\demo_38.py", line 3, in <module>
    entities = model.predict([""])
  File "[sic]\span_marker\modeling.py", line 459, in predict
    tokenizer_dict = self.tokenizer(
  File "[sic]\span_marker\tokenizer.py", line 211, in __call__
    raise ValueError("The `SpanMarkerTokenizer` detected an empty sentence, please remove it.")
ValueError: The `SpanMarkerTokenizer` detected an empty sentence, please remove it.
  • Tom Aarsen
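
For illustration, the guard can be sketched roughly as follows. This is not the code from `tokenizer.py`: the traceback only shows the `max_word_ids` computation and the `raise`, so the NaN check between them is an assumption, and `count_words` and `drop_empty` are hypothetical helper names. The sketch also avoids NumPy so that it runs on its own.

```python
import math
from typing import List, Optional

EMPTY_SENTENCE_ERROR = (
    "The `SpanMarkerTokenizer` detected an empty sentence, please remove it."
)


def count_words(word_ids: List[Optional[int]]) -> int:
    """Hypothetical stand-in for the guarded word count in tokenizer.py."""
    valid = [w for w in word_ids if w is not None]
    # NumPy-free equivalent of np.nanmax over the float-cast word IDs.
    max_word_id = float(max(valid)) if valid else math.nan
    # Assumed check: fail early with a descriptive message instead of int(NaN).
    if math.isnan(max_word_id):
        raise ValueError(EMPTY_SENTENCE_ERROR)
    return int(max_word_id) + 1


def drop_empty(sentences: List[str]) -> List[str]:
    """Caller-side filter: remove empty or whitespace-only sentences."""
    return [s for s in sentences if s.strip()]


print(count_words([None, 0, 1, 1, 2, None]))  # 3
print(drop_empty(["Paris is nice.", "", "   "]))
```

On the caller side, a filter like `drop_empty` applied before `model.predict(...)` follows the advice in the new message. Dropping whitespace-only strings as well is an extra assumption here; the pull request only mentions empty sentences.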

@tomaarsen tomaarsen merged commit 38bee88 into main Oct 31, 2023
8 checks passed
@tomaarsen tomaarsen deleted the ux/better_empty_sentence_error branch October 31, 2023 10:31
Development

Successfully merging this pull request may close these issues.

Confusing error thrown when tokens is empty