Tokenizer doesn't respect combined_electra-large's max_length #1294

Open
rmalouf opened this issue Oct 9, 2023 · 9 comments
rmalouf commented Oct 9, 2023

Describe the bug
When parsing a long text using the latest "combined_electra-large" model, I get the error:

Token indices sequence length is longer than the specified maximum sequence length for this model (630 > 512). Running this sequence through the model will result in indexing errors
Exception in thread parse_chunks:
Traceback (most recent call last):
  File "/home1/malouf/.pyenv/versions/3.11.3/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/home1/malouf/batch/treebank/threadpipe.py", line 113, in run
    for tag, result in zip(tags, self.function(items)):
  File "/home1/malouf/batch/treebank/parse.py", line 125, in parse_chunks
    for doc_id, doc in zip(
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/core.py", line 456, in stream
    batch = self.bulk_process(batch, *args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/core.py", line 433, in bulk_process
    return self.process(docs, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/core.py", line 422, in process
    doc = process(doc)
          ^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
    self.process(combined_doc) # annotations are attached to sentence objects
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/pos_processor.py", line 84, in process
    batch.doc.set([doc.UPOS, doc.XPOS, doc.FEATS], [y for x in preds for y in x])
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/models/common/doc.py", line 254, in set
    assert (to_token and self.num_tokens == len(contents)) or self.num_words == len(contents), \
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Contents must have the same length as the original file.

Environment (please complete the following information):

  • OS: MacOS 14.0
  • Python version: Python 3.11.3
  • Stanza version: 1.6.1 (and transformers 4.34.0)
@rmalouf rmalouf added the bug label Oct 9, 2023
Collaborator

AngledLuffa commented Oct 9, 2023 via email

Author

rmalouf commented Oct 9, 2023

I'd be fine with truncating or discarding long sentences for now, but unfortunately I can't tell that they're too long until after the text is tokenized. Is there an easy built-in way to truncate sentences mid-pipeline, or will I need to add a custom processor?
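
One way to approach the truncation question is a two-pass setup: tokenize first, then split or drop any sentence whose subword count exceeds the model limit before it reaches the tagger. The sketch below covers only the splitting step and is not part of Stanza. The function name `split_by_subword_budget` and the `count_subwords` callback are hypothetical; the callback is assumed to return how many subword pieces the transformer's tokenizer produces for one word.

```python
from typing import Callable, Iterator, List, Sequence


def split_by_subword_budget(
    words: Sequence[str],
    count_subwords: Callable[[str], int],
    max_subwords: int = 512,
    reserved: int = 2,
) -> Iterator[List[str]]:
    """Greedily split one tokenized sentence into chunks that fit the budget.

    count_subwords is a hypothetical callback (not a Stanza API) giving the
    number of subword pieces for a single word. reserved leaves room for
    special tokens such as [CLS] and [SEP].
    """
    budget = max_subwords - reserved
    if budget <= 0:
        raise ValueError("max_subwords must be larger than reserved")
    chunk: List[str] = []
    used = 0
    for word in words:
        cost = count_subwords(word)
        # Close the current chunk if adding this word would overflow it.
        if chunk and used + cost > budget:
            yield chunk
            chunk, used = [], 0
        # A single word costing more than the budget still gets its own
        # chunk; the caller has to decide whether to drop it.
        chunk.append(word)
        used += cost
    if chunk:
        yield chunk
```

With a Hugging Face tokenizer, `count_subwords` could plausibly be `lambda w: len(tokenizer.tokenize(w))`, though per-word counts may differ slightly from what the tokenizer produces for the full sentence. The resulting chunks could then be passed as pretokenized input to a second pipeline built with `tokenize_pretokenized=True`.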

Collaborator

AngledLuffa commented Oct 10, 2023 via email

Author

rmalouf commented Oct 10, 2023

Thanks, that's very generous! But you can hold off -- I'll take a stab at it first and come back next week if I can't make it work.

@BLKSerene
Contributor

Hi, I also get this error. Any updates or workarounds on this?

Collaborator

AngledLuffa commented Dec 19, 2023 via email

AngledLuffa added a commit that referenced this issue Feb 2, 2024
…hat the transformer can digest, even if it isn't necessarily going to give great results for the later tokens in the sentence. Addresses #1294
AngledLuffa added a commit that referenced this issue Feb 3, 2024
…hat the transformer can digest, even if it isn't necessarily going to give great results for the later tokens in the sentence. Addresses #1294
AngledLuffa added a commit that referenced this issue Feb 24, 2024
…hat the transformer can digest, even if it isn't necessarily going to give great results for the later tokens in the sentence. Addresses #1294
Author

rmalouf commented May 22, 2024

I'm still running into this in stanza 1.8.2. An offending text fragment is:

Call Government Securities TOTAL . 31 December 1845 3,590,014 563,072 628,500 1,039,745 2,231,317 31 December 1846 3,280,864 634,575 423,060 938,717 1,996,352 31 December 1847 2,733,753 7,231,325 350,108 791,899 1,863,332 30 June 1848 3,170,118 588,871 159,724 1,295,047 2,043,642 31 December 1848 3,089,659 645,468 176,824 1,189,213 2,011,505 30 June 1849 3,392,857 552,642 246,494 964,800 1,763,936 31 December 1849 3,680,623 686,761 264,577 973,691 1,224,029 30 June 1850 3,821,022 654,649 258,177 972,055 1,884,881 31 December 1850 3,969,648 566,039 334,982 1,089,794 1,990,815 30 June 1851 4,414,179 691,719 424,195 1,054,018 2,169,932 31 December 1851 4,677,298 653,946 378,337 1,054,018 2,080,301 30 June 1852 5,245,135 861,778 136,687 1,054,018 2,122,483 31 December 1852 5,581,706 855,057 397,087 1,119,477 2,371,621 30 June 1853 6,219,817 904,252 499,467 1,218,852 2,622,571 31 December 1853 6,259,540 791,699 677,392 1,468,902 2,937,993 30 June 1854 6,892,470 827,397 917,557 1,457,415 3,202,369 31 December 1854 7,177,244 694,309 486,400 1,451,074 2,631,783 30 June 1855 8,166,553 722,243 483,890 1,754,074 2,960,207 31 December 1855 8,744,095 847,856 451,575 1,949,074 3,248,505 30 June 1856 11,170,010 906,876 601,800 1,980,489 3,489,165 31 December 1856 11,438,461 1,119,591 432,000 2,922,625 4,474,216 30 June 1857 13,913,058 967,078 687,730 3,353,179 5,007,987 31 December 1857 113,889,021 2,226,441 1,115,883 3,582,797 6,923,121 1191

Obviously I'm not expecting to get a useful parse of that. I'd just like the stream to not crash so I can continue processing text chunks.
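
Until the underlying problem is fixed, a stopgap for keeping a long-running job alive is to process chunks one at a time and skip any that trip the assertion. This is only a sketch under assumptions: `process_skipping_failures` is a hypothetical helper, and `process` stands for any callable that takes a text and returns a document (a `stanza.Pipeline` object is callable this way).

```python
import logging
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

logger = logging.getLogger(__name__)


def process_skipping_failures(
    process: Callable[[str], Any],
    texts: Iterable[str],
) -> Iterator[Tuple[int, Optional[Any]]]:
    """Yield (index, result) for each text; result is None if it failed."""
    for index, text in enumerate(texts):
        try:
            result = process(text)
        except AssertionError as err:
            # Log and move on instead of letting one chunk kill the stream.
            logger.warning("skipping chunk %d: %s", index, err)
            result = None
        yield index, result
```

The trade-off is that calling the pipeline once per text gives up the batching that `stream` provides, so this is likely slower on large corpora.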

@AngledLuffa
Collaborator

Are you getting a different exception, though? I get the following log & traceback:

Token indices sequence length is longer than the specified maximum sequence length for this model (715 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/stanza/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/home/john/stanza/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/home/john/stanza/stanza/pipeline/pos_processor.py", line 91, in process
    dataset.doc.set([doc.UPOS, doc.XPOS, doc.FEATS], [y for x in preds for y in x])
  File "/home/john/stanza/stanza/models/common/doc.py", line 303, in set
    assert (to_token and self.num_tokens == len(contents)) or self.num_words == len(contents), \
AssertionError: Contents must have the same length as the original file.

Author

rmalouf commented May 24, 2024

Oh, you're right! I didn't look closely enough. First and last lines are the same but it's a different assertion that's failing. Sorry about that.

Jemoka pushed a commit that referenced this issue Jul 16, 2024
…hat the transformer can digest, even if it isn't necessarily going to give great results for the later tokens in the sentence. Addresses #1294