Confusing error thrown when tokens is empty #33

Closed
tomaarsen opened this issue Sep 22, 2023 · 0 comments · Fixed by #40
Labels
enhancement New feature or request

Comments

@tomaarsen (Owner)

When one of the elements in the training set is empty, a confusing error is thrown:

Label normalizing the train dataset: 100%|██████████████████████████████████████████████████████████████████████| 8324/8324 [00:00<00:00, 34016.14 examples/s]
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1665.71 examples/s]c:\code\span-marker-ner\span_marker\tokenizer.py:204: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1612.60 examples/s] 
This SpanMarker model will ignore 3.181189% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 5 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 18798 total entities:
- 203 missed entities with 6 words (1.079902%)
- 81 missed entities with 7 words (0.430897%)
- 58 missed entities with 8 words (0.308543%)
- 29 missed entities with 9 words (0.154272%)
- 5 missed entities with 10 words (0.026599%)
- 9 missed entities with 11 words (0.047877%)
- 8 missed entities with 12 words (0.042558%)
- 1 missed entities with 13 words (0.005320%)
- 1 missed entities with 14 words (0.005320%)
- 1 missed entities with 15 words (0.005320%)
- 2 missed entities with 16 words (0.010639%)
- 1 missed entities with 17 words (0.005320%)
Additionally, a total of 199 (1.058623%) entities were missed due to the maximum input length.
Traceback (most recent call last):
  File "c:\code\span-marker-ner\demo_conll2002.py", line 83, in <module>
    main()
  File "c:\code\span-marker-ner\demo_conll2002.py", line 72, in main
    trainer.train()
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1553, in train
    return inner_training_loop(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1567, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 423, in get_train_dataloader
    self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 241, in preprocess_dataset
    dataset = dataset.map(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "c:\code\span-marker-ner\span_marker\tokenizer.py", line 204, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer
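
This happens because, for an empty `tokens` list, the tokenizer emits only special tokens, so batch_encoding.word_ids(sample_idx) contains nothing but None. Cast to float, that becomes an all-NaN array; np.nanmax then returns NaN (hence the "All-NaN slice encountered" RuntimeWarning above), and int(NaN) raises the ValueError. A minimal sketch of the failure in isolation (the sample values are mine, for illustration):

import numpy as np

# word_ids(...) for an empty sample contains only special tokens,
# i.e. only None (e.g. [CLS] and [SEP]); these become NaN as floats.
word_ids = [None, None]
num_words = int(np.nanmax(np.array(word_ids, dtype=float))) + 1
# RuntimeWarning: All-NaN slice encountered
# ValueError: cannot convert float NaN to integer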

Perhaps a cleaner error can be designed here.
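One option would be a guard in span_marker/tokenizer.py that fails fast with a descriptive message. A rough sketch (the helper name and error wording are mine, not necessarily what the eventual fix in #40 does):

import numpy as np

def num_words_from_word_ids(word_ids):
    """Return the word count for a sample, rejecting samples with no words."""
    ids = np.array(word_ids, dtype=float)
    if np.isnan(ids).all():
        # No real words in this sample: raise a clear error up front instead
        # of letting int(NaN) fail later with a cryptic ValueError.
        raise ValueError(
            "Detected a sample without tokens; please remove empty samples "
            "from the dataset before training."
        )
    return int(np.nanmax(ids)) + 1

Alternatively, empty samples could simply be filtered out during preprocessing.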

Tom Aarsen