Tokenizer doesn't respect combined_electra-large's max_length #1294
Yes, this is a known issue. We either need to use a transformer that
allows a bigger window, or in some way combine representations to get
decent results from a longer sentence. The biggest reason we can't simply
use two consecutive iterations of the transformer is that the second half of
the sequence would treat a word in the middle of the sentence as the start of
the sentence, given the way transformer positional encodings work.
We hope to address this by the end of the year, but there are several
things in our task list which need handling. A simple enough stopgap might
be to fall back to the non-transformer model for sentences which are too
long. In the meantime, you might consider discarding sentences which are
that long.
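The positional-encoding issue described above can be illustrated with a small self-contained sketch (plain Python, not Stanza code; the window size and encoding dimension are assumptions for illustration). With absolute sinusoidal encodings, the encoding depends only on the position index, so a token that starts a second window gets exactly the same encoding as a true sentence-initial token:

```python
import math

def sinusoidal_encoding(pos, d_model=8):
    """Absolute sinusoidal positional encoding for a single position."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

WINDOW = 512  # assumed transformer window size

# A token at absolute position 512, fed in as the first token of a second
# window, is encoded as position 0 -- indistinguishable from the encoding
# the real first word of the sentence received in the first window.
restarted = sinusoidal_encoding(512 % WINDOW)
sentence_start = sinusoidal_encoding(0)
assert restarted == sentence_start
```

This is why naively running two consecutive windows through the transformer would make the model treat a mid-sentence word as sentence-initial.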
I'd be fine with truncating or discarding long sentences for now, but unfortunately I can't tell that they're too long until after the text is tokenized. Is there an easy built-in way to truncate sentences mid-pipeline, or will I need to add a custom processor?
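One possible workaround, sketched below, is a two-pass approach rather than a custom processor: run a tokenize-only pipeline first, partition sentences by token count, then feed only the short ones to the full pipeline as pre-tokenized input. This is not a built-in Stanza feature; `MAX_LEN` and the plain-list sentence representation here are assumptions for illustration, and the Stanza calls in the comments are only a rough sketch.

```python
MAX_LEN = 512  # assumed transformer window size

def split_by_length(sentences, max_len=MAX_LEN):
    """Partition tokenized sentences (lists of word strings) into
    (keep, dropped) by token count."""
    keep, dropped = [], []
    for tokens in sentences:
        (keep if len(tokens) <= max_len else dropped).append(tokens)
    return keep, dropped

# In Stanza this would look roughly like (not executed here):
#   tok = stanza.Pipeline('en', processors='tokenize')
#   sents = [[w.text for w in s.words] for s in tok(text).sentences]
#   keep, dropped = split_by_length(sents)
#   nlp = stanza.Pipeline('en', tokenize_pretokenized=True)
#   doc = nlp(keep)
```

The `dropped` list can be logged or truncated to `max_len` tokens instead of discarded, depending on how much of the long sentence is worth keeping.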
I've got a rather high priority thing to work on today and tomorrow, but I
can try to have something ready by Friday that at least avoids the crash.
Thanks, that's very generous! But you can hold off -- I'll take a stab at it first and come back next week if I can't make it work.
Hi, I also get this error. Any updates or workarounds on this?
Let me see if I can get to it this winter break
…hat the transformer can digest, even if it isn't necessarily going to give great results for the later tokens in the sentence. Addresses #1294
I'm still running into this in stanza 1.8.2. An offending text fragment is:
Obviously I'm not expecting to get a useful parse of that. I'd just like the stream to not crash so I can continue processing text chunks.
Are you getting a different exception, though? I get the following log & traceback:
Oh, you're right! I didn't look closely enough. First and last lines are the same but it's a different assertion that's failing. Sorry about that. |
Describe the bug
When parsing a long text using the latest "combined_electra-large" model, I get the error:
Environment (please complete the following information):