Improve the error when passing an empty sentence to the tokenizer #40

tomaarsen · 2023-10-31T09:37:29Z

Closes #33

Hello!

Pull Request overview

- Improve the error when passing an empty sentence to the tokenizer.
- Add a test case for the new error.

Details

Previously, if you passed an empty sentence to the tokenizer during training or prediction, you would get this unhelpful error:

from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
entities = model.predict([""])

[sic]\span_marker\tokenizer.py:205: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Traceback (most recent call last):
  File "[sic]\demo_38.py", line 3, in <module>
    entities = model.predict([""])
  File "[sic]\span_marker\modeling.py", line 459, in predict
    tokenizer_dict = self.tokenizer(
  File "[sic]\span_marker\tokenizer.py", line 205, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer
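
The cause of this traceback can be reproduced in isolation. The sketch below is a pure-Python stand-in for the `np.nanmax(...)` line shown above, not the library's code; `naive_num_words` is a hypothetical name. It assumes an empty sentence tokenizes to special tokens only, for which `word_ids` reports `None`, so every entry becomes NaN and the `int(...)` conversion fails.

```python
import math
from typing import List, Optional


def naive_num_words(word_ids: List[Optional[int]]) -> int:
    """Hypothetical stand-in for the word count computed in the traceback."""
    # None (special tokens) becomes NaN, mirroring np.array(..., dtype=float).
    as_floats = [math.nan if i is None else float(i) for i in word_ids]
    # Like np.nanmax: ignore NaN entries, but yield NaN if nothing else is left.
    non_nan = [x for x in as_floats if not math.isnan(x)]
    max_word_id = max(non_nan) if non_nan else math.nan
    # int(NaN) raises "ValueError: cannot convert float NaN to integer".
    return int(max_word_id) + 1


# A two-word sentence: special tokens at both ends, three word pieces between.
print(naive_num_words([None, 0, 1, 1, None]))

# An empty sentence (assumed to contain only special tokens) hits the error.
try:
    naive_num_words([None, None])
except ValueError as exc:
    print(f"ValueError: {exc}")
```

The resulting message says nothing about the input being empty, which is what this pull request addresses.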

Now, you get the following error instead. The NumPy `RuntimeWarning` is still emitted, but the `ValueError` names the actual cause:

[sic]\span_marker\tokenizer.py:209: RuntimeWarning: All-NaN slice encountered
  max_word_ids = np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))
Traceback (most recent call last):
  File "[sic]\demo_38.py", line 3, in <module>
    entities = model.predict([""])
  File "[sic]\span_marker\modeling.py", line 459, in predict
    tokenizer_dict = self.tokenizer(
  File "[sic]\span_marker\tokenizer.py", line 211, in __call__
    raise ValueError("The `SpanMarkerTokenizer` detected an empty sentence, please remove it.")
ValueError: The `SpanMarkerTokenizer` detected an empty sentence, please remove it.
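
Since the new message asks callers to remove empty sentences, one way to do so while keeping results aligned with the original inputs is sketched below. Both helper names are hypothetical and not part of `span_marker`. The sketch assumes that `predict` takes a list of strings and returns one list of entities per string, and it treats whitespace-only strings as empty, which is an assumption rather than documented behavior.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_non_empty(sentences: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Return the original indices and texts of the non-empty sentences."""
    kept_indices: List[int] = []
    kept_sentences: List[str] = []
    for idx, sentence in enumerate(sentences):
        # Assumption: whitespace-only input is treated like an empty sentence.
        if sentence.strip():
            kept_indices.append(idx)
            kept_sentences.append(sentence)
    return kept_indices, kept_sentences


def predict_skipping_empty(
    predict: Callable[[List[str]], List[List[Any]]],
    sentences: Sequence[str],
) -> List[List[Any]]:
    """Call `predict` on non-empty sentences only; empty ones get no entities."""
    kept_indices, kept_sentences = split_non_empty(sentences)
    results: List[List[Any]] = [[] for _ in sentences]
    if kept_sentences:
        for idx, entities in zip(kept_indices, predict(kept_sentences)):
            results[idx] = entities
    return results
```

With a loaded model, this would be called as `predict_skipping_empty(model.predict, sentences)`, so that `[""]` yields `[[]]` instead of raising.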

Tom Aarsen

tomaarsen added 4 commits October 31, 2023 10:32

Add clearer error when an empty sentence is passed

bb7a37e

Merge branch 'main' of https://github.com/tomaarsen/SpanMarkerNER int…

4666676

…o ux/better_empty_sentence_error

Update changelog

fc224ea

Add test case

ed41bc3

tomaarsen merged commit 38bee88 into main Oct 31, 2023
8 checks passed

tomaarsen deleted the ux/better_empty_sentence_error branch October 31, 2023 10:31
