Tokenizer doesn't respect combined_electra-large's max_length #1294
Yes, this is a known issue. We either need to use a transformer that
allows a bigger window, or in some way combine representations to get
decent results from a longer sentence. The biggest reason we can't simply
use two consecutive iterations of the transformer is that the second half of
the sequence would treat a word in the middle of the sentence as the start of
the sentence, given the way transformer positional encodings work.
We hope to address this by the end of the year, but there are several
things in our task list which need handling. A simple enough stopgap might
be to fall back to the non-transformer model for sentences which are too
long. In the meantime, you might consider discarding sentences which are
that long.
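The positional-encoding issue described above can be illustrated with a small self-contained sketch (plain Python, not Stanza code; the window size and encoding dimension are assumptions for illustration). With absolute sinusoidal encodings, the encoding depends only on the position index, so a token that starts a second window gets exactly the same encoding as a true sentence-initial token:

```python
import math

def sinusoidal_encoding(pos, d_model=8):
    """Absolute sinusoidal positional encoding for a single position."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

WINDOW = 512  # assumed transformer window size

# A token at absolute position 512, fed in as the first token of a second
# window, is encoded as position 0 -- indistinguishable from the encoding
# the real first word of the sentence received in the first window.
restarted = sinusoidal_encoding(512 % WINDOW)
sentence_start = sinusoidal_encoding(0)
assert restarted == sentence_start
```

This is why naively running two consecutive windows through the transformer would make the model treat a mid-sentence word as sentence-initial.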
I'd be fine with truncating or discarding long sentences for now, but unfortunately I can't tell that they're too long until after the text is tokenized. Is there an easy built-in way to truncate sentences mid-pipeline, or will I need to add a custom processor?
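One possible workaround, sketched below, is a two-pass approach rather than a custom processor: run a tokenize-only pipeline first, partition sentences by token count, then feed only the short ones to the full pipeline as pre-tokenized input. This is not a built-in Stanza feature; `MAX_LEN` and the plain-list sentence representation here are assumptions for illustration, and the Stanza calls in the comments are only a rough sketch.

```python
MAX_LEN = 512  # assumed transformer window size

def split_by_length(sentences, max_len=MAX_LEN):
    """Partition tokenized sentences (lists of word strings) into
    (keep, dropped) by token count."""
    keep, dropped = [], []
    for tokens in sentences:
        (keep if len(tokens) <= max_len else dropped).append(tokens)
    return keep, dropped

# In Stanza this would look roughly like (not executed here):
#   tok = stanza.Pipeline('en', processors='tokenize')
#   sents = [[w.text for w in s.words] for s in tok(text).sentences]
#   keep, dropped = split_by_length(sents)
#   nlp = stanza.Pipeline('en', tokenize_pretokenized=True)
#   doc = nlp(keep)
```

The `dropped` list can be logged or truncated to `max_len` tokens instead of discarded, depending on how much of the long sentence is worth keeping.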
I've got a rather high priority thing to work on today and tomorrow, but I
can try to have something ready by Friday that at least avoids the crash.
Thanks, that's very generous! But you can hold off -- I'll take a stab at it first and come back next week if I can't make it work.
Hi, I also get this error. Any updates or workarounds on this?
Let me see if I can get to it this winter break
…hat the transformer can digest, even if it isn't necessarily going to give great results for the later tokens in the sentence. Addresses #1294
I'm still running into this in stanza 1.8.2. An offending text fragment is:
Obviously I'm not expecting to get a useful parse of that. I'd just like the stream to not crash so I can continue processing text chunks.
Are you getting a different exception, though? I get the following log & traceback:
Oh, you're right! I didn't look closely enough. First and last lines are the same but it's a different assertion that's failing. Sorry about that. |
Describe the bug
When parsing a long text using the latest "combined_electra-large" model, I get the error:
Environment (please complete the following information):