When one of the elements in the training set is empty, training ends up failing with a confusing error:
```
Label normalizing the train dataset: 100%|██████████████████████████████████████████████████████████████████████| 8324/8324 [00:00<00:00, 34016.14 examples/s]
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉ | 8000/8324 [00:04<00:00, 1665.71 examples/s]
c:\code\span-marker-ner\span_marker\tokenizer.py:204: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉ | 8000/8324 [00:04<00:00, 1612.60 examples/s]
This SpanMarker model will ignore 3.181189% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 5 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 18798 total entities:
- 203 missed entities with 6 words (1.079902%)
- 81 missed entities with 7 words (0.430897%)
- 58 missed entities with 8 words (0.308543%)
- 29 missed entities with 9 words (0.154272%)
- 5 missed entities with 10 words (0.026599%)
- 9 missed entities with 11 words (0.047877%)
- 8 missed entities with 12 words (0.042558%)
- 1 missed entities with 13 words (0.005320%)
- 1 missed entities with 14 words (0.005320%)
- 1 missed entities with 15 words (0.005320%)
- 2 missed entities with 16 words (0.010639%)
- 1 missed entities with 17 words (0.005320%)
Additionally, a total of 199 (1.058623%) entities were missed due to the maximum input length.
Traceback (most recent call last):
  File "c:\code\span-marker-ner\demo_conll2002.py", line 83, in <module>
    main()
  File "c:\code\span-marker-ner\demo_conll2002.py", line 72, in main
    trainer.train()
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1553, in train
    return inner_training_loop(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1567, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 423, in get_train_dataloader
    self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 241, in preprocess_dataset
    dataset = dataset.map(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "c:\code\span-marker-ner\span_marker\tokenizer.py", line 204, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer
```
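The underlying failure can be reproduced outside of the trainer. Here is a minimal sketch of what `span_marker/tokenizer.py` does at line 204, assuming an empty sample whose `word_ids()` consist solely of `None` (the `word_ids` list below is illustrative; only the quoted line is the library's actual code):

```python
import numpy as np

# For an empty sample, BatchEncoding.word_ids() contains only None entries
# (the special tokens), which become NaN when cast to float.
word_ids = [None, None]  # assumed result of word_ids() for an empty input

# np.nanmax over an all-NaN array returns NaN and emits
# "RuntimeWarning: All-NaN slice encountered" ...
num_words = int(np.nanmax(np.array(word_ids, dtype=float))) + 1
# ... and int(NaN) then raises "ValueError: cannot convert float NaN to integer".
```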
Perhaps a cleaner error can be designed here.
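As one possible direction (a sketch only; the guard, message, and placement are assumptions rather than an agreed design), the tokenizer could detect the all-`None` case before the `nanmax` call and name the offending sample:

```python
word_ids = batch_encoding.word_ids(sample_idx)
if all(word_id is None for word_id in word_ids):
    # Hypothetical guard: fail early with an actionable message instead of
    # letting int(np.nanmax(...)) raise "cannot convert float NaN to integer".
    raise ValueError(
        f"Sample {sample_idx} in this batch tokenizes to zero words, which "
        "usually means the dataset contains an empty sentence. Please remove "
        "empty samples before training."
    )
num_words = int(np.nanmax(np.array(word_ids, dtype=float))) + 1
```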
Tom Aarsen