Confusing error thrown when tokens is empty #33

Closed
tomaarsen opened this issue Sep 22, 2023 · 0 comments · Fixed by #40
Labels
enhancement New feature or request

Comments

@tomaarsen (Owner)

When one of the elements in the training set is empty, a confusing error is thrown:

Label normalizing the train dataset: 100%|██████████████████████████████████████████████████████████████████████| 8324/8324 [00:00<00:00, 34016.14 examples/s]
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1665.71 examples/s]c:\code\span-marker-ner\span_marker\tokenizer.py:204: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1612.60 examples/s] 
This SpanMarker model will ignore 3.181189% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 5 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 18798 total entities:
- 203 missed entities with 6 words (1.079902%)
- 81 missed entities with 7 words (0.430897%)
- 58 missed entities with 8 words (0.308543%)
- 29 missed entities with 9 words (0.154272%)
- 5 missed entities with 10 words (0.026599%)
- 9 missed entities with 11 words (0.047877%)
- 8 missed entities with 12 words (0.042558%)
- 1 missed entities with 13 words (0.005320%)
- 1 missed entities with 14 words (0.005320%)
- 1 missed entities with 15 words (0.005320%)
- 2 missed entities with 16 words (0.010639%)
- 1 missed entities with 17 words (0.005320%)
Additionally, a total of 199 (1.058623%) entities were missed due to the maximum input length.
Traceback (most recent call last):
  File "c:\code\span-marker-ner\demo_conll2002.py", line 83, in <module>
    main()
  File "c:\code\span-marker-ner\demo_conll2002.py", line 72, in main
    trainer.train()
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1553, in train
    return inner_training_loop(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1567, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 423, in get_train_dataloader
    self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 241, in preprocess_dataset
    dataset = dataset.map(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "c:\code\span-marker-ner\span_marker\tokenizer.py", line 204, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer
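
This happens because, for an empty `tokens` list, the tokenizer emits only special tokens, so batch_encoding.word_ids(sample_idx) contains nothing but None. Cast to float, that becomes an all-NaN array; np.nanmax then returns NaN (hence the "All-NaN slice encountered" RuntimeWarning above), and int(NaN) raises the ValueError. A minimal sketch of the failure in isolation (the sample values are mine, for illustration):

import numpy as np

# word_ids(...) for an empty sample contains only special tokens,
# i.e. only None (e.g. [CLS] and [SEP]); these become NaN as floats.
word_ids = [None, None]
num_words = int(np.nanmax(np.array(word_ids, dtype=float))) + 1
# RuntimeWarning: All-NaN slice encountered
# ValueError: cannot convert float NaN to integer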

Perhaps a cleaner error can be designed here.
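One option would be a guard in span_marker/tokenizer.py that fails fast with a descriptive message. A rough sketch (the helper name and error wording are mine, not necessarily what the eventual fix in #40 does):

import numpy as np

def num_words_from_word_ids(word_ids):
    """Return the word count for a sample, rejecting samples with no words."""
    ids = np.array(word_ids, dtype=float)
    if np.isnan(ids).all():
        # No real words in this sample: raise a clear error up front instead
        # of letting int(NaN) fail later with a cryptic ValueError.
        raise ValueError(
            "Detected a sample without tokens; please remove empty samples "
            "from the dataset before training."
        )
    return int(np.nanmax(ids)) + 1

Alternatively, empty samples could simply be filtered out during preprocessing.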

Tom Aarsen