Weird tokenization in Spanish #1440
Comments
This is an interesting / weird one. There are 3500 instances of "como" as an ADJ, SCONJ, or CCONJ in the training data, and 3 of it as a first person verb. So, ultimately I don't really see any way of fixing it, since the data is so heavily biased and there aren't that many first person verbs for any verb in the training data to begin with. We can keep it in mind as something that needs fixing, though.
Maybe we could try adding 10 different sentences with it as a verb and see if that helps...
If I can support with data, let me know. I am running through A LOT of LLM-generated sentences and could capture them for you guys. My flow goes: so I can capture structured data for certain annotations with a simple
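A tag imbalance like the one described above can be checked directly against a CoNLL-U training file. The following is only a minimal sketch: `count_upos` is a hypothetical helper, not part of Stanza, and `SAMPLE` is a two-sentence toy stand-in for the real training data, which is not shown in this thread.

```python
from collections import Counter


def count_upos(conllu_text, form):
    """Count the UPOS tags assigned to `form` in CoNLL-U formatted text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        # Skip sentence separators and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id, token_form, upos = cols[0], cols[1], cols[3]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        if "-" in token_id or "." in token_id:
            continue
        if token_form.lower() == form.lower():
            counts[upos] += 1
    return counts


# Toy data (assumption): one verbal and one non-verbal use of "como".
SAMPLE = (
    "# text = yo como carne\n"
    "1\tyo\tyo\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tcomo\tcomer\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tcarne\tcarne\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
    "# text = tan alto como ella\n"
    "1\ttan\ttan\tADV\t_\t_\t2\tadvmod\t_\t_\n"
    "2\talto\talto\tADJ\t_\t_\t0\troot\t_\t_\n"
    "3\tcomo\tcomo\tSCONJ\t_\t_\t4\tmark\t_\t_\n"
    "4\tella\tella\tPRON\t_\t_\t2\tadvcl\t_\t_\n"
    "\n"
)

upos_counts = count_upos(SAMPLE, "como")
print(upos_counts)
```

Run on the actual training file, the same function would show how few VERB readings of "como" the tagger gets to see.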
I see that the model gets
Maybe what we could do would be
Is that something already available via your LLM work? If not, I could probably find something similar. We can start with 10 - I don't know if 10 will be enough, but probably it won't outweigh any of the other typical word senses for
Can you show me what the model input format looks like?
Raw sentences could work, and I could send back the processing and you could tell me if it makes sense, or you could output the sentences with
The basic format would then look like the conll output I posted above, but it's not necessary to make it by hand. I think it'd be pretty straightforward to use the doc formatting on sentences with
I just pushed out a new version, but this particular error still occurs. It's possible to update the models for the new version, though, if you have a few of the relevant sentences to add to the training data.
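For reference, here is a rough sketch of what one such sentence could look like in CoNLL-U form. `to_conllu` is a hypothetical hand-rolled helper for illustration, not Stanza's own doc formatting, and the annotation for "yo como carne" is the intended analysis (with "como" as a VERB), not model output. Morphological features are left as "_" for brevity.

```python
def to_conllu(text, tokens):
    """Render one sentence as CoNLL-U.

    `tokens` is a list of (form, lemma, upos, head, deprel) tuples;
    heads are 1-based token indices, with 0 marking the root.
    """
    lines = [f"# text = {text}"]
    for i, (form, lemma, upos, head, deprel) in enumerate(tokens, start=1):
        # Ten tab-separated columns:
        # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


example = to_conllu(
    "yo como carne",
    [
        ("yo", "yo", "PRON", 2, "nsubj"),
        ("como", "comer", "VERB", 0, "root"),
        ("carne", "carne", "NOUN", 2, "obj"),
    ],
)
print(example, end="")
```

In practice the model output would be the starting point, and only the wrong fields (here the lemma, UPOS, and dependency relation of "como") would need correcting by hand.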
Describe the bug
In "yo como carne", "como" is identified as upos SCONJ, while it should be VERB.
I am running this pipeline:
and get out:
The JSON format is due to me using this repo (mine):
https://github.com/vivalence/dockerized-stanza-nlp
It's really just a shallow wrapper.
The interesting lines are probably these:
https://github.com/vivalence/dockerized-stanza-nlp/blob/main/script.py#L115-L116