Weird tokenization in Spanish #1440

LazerJesus · 2024-12-18T22:51:45Z

Describe the bug
In "Yo como carne.", "como" is tagged with upos SCONJ, but it should be VERB.

I am running this pipeline:

{
  "text": "Yo como carne.",
  "processors": "tokenize,mwt,pos,lemma,depparse",
  "language": "es"
}

and get this output:

[
        {
          "index": 1,
          "token": "Yo",
          "lemma": "yo",
          "xpos": "pp1csn00",
          "upos": "PRON",
          "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
          "start_char": 0,
          "end_char": 2
        },
        {
          "index": 2,
          "token": "como",
          "lemma": "como",
          "xpos": "cs",
          "upos": "SCONJ",
          "feats": null,
          "start_char": 3,
          "end_char": 7
        },
        {
          "index": 3,
          "token": "carne",
          "lemma": "carne",
          "xpos": "ncfs000",
          "upos": "NOUN",
          "feats": "Gender=Fem|Number=Sing",
          "start_char": 8,
          "end_char": 13
        },
        {
          "index": 4,
          "token": ".",
          "lemma": ".",
          "xpos": "fp",
          "upos": "PUNCT",
          "feats": "PunctType=Peri",
          "start_char": 13,
          "end_char": 14
        }
      ]

The JSON format comes from my own wrapper repo:
https://github.com/vivalence/dockerized-stanza-nlp
It's really just a shallow wrapper.
The interesting lines are probably these:
https://github.com/vivalence/dockerized-stanza-nlp/blob/main/script.py#L115-L116

AngledLuffa · 2024-12-19T00:33:22Z

This is an interesting / weird one. There are about 3500 instances of "como" as an ADJ, SCONJ, or CCONJ in the training data, and only 3 of it as a first-person verb. So, ultimately, I don't really see any way of fixing it, since the data is so heavily biased and there aren't that many first-person forms of any verb in the training data to begin with. We can keep it in mind as something that needs fixing, though.
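A count like the one above can be reproduced with a few lines over a treebank file. The following is only a sketch: it assumes the training data is available as CoNLL-U text, the function name `count_upos` is made up for illustration, and the sample rows are hand-written toy data, not taken from GSD or AnCora.

```python
from collections import Counter


def count_upos(conllu_text, form):
    """Count the UPOS tags assigned to `form` (case-insensitive) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Word rows have 10 tab-separated columns: ID, FORM, LEMMA, UPOS, ...
        if len(cols) != 10:
            continue
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        if cols[1].lower() == form.lower():
            counts[cols[3]] += 1
    return counts


# Toy input written by hand for illustration only.
sample = "\n".join([
    "# text = Yo como carne",
    "1\tYo\tyo\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tcomo\tcomer\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tcarne\tcarne\tNOUN\t_\t_\t2\tobj\t_\t_",
    "",
    "# text = Es como tú",
    "1\tEs\tser\tAUX\t_\t_\t3\tcop\t_\t_",
    "2\tcomo\tcomo\tSCONJ\t_\t_\t3\tmark\t_\t_",
    "3\ttú\ttú\tPRON\t_\t_\t0\troot\t_\t_",
])

print(count_upos(sample, "como"))
```

Running the same function over the real training files would show how lopsided the distribution is for any ambiguous form.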

AngledLuffa · 2024-12-19T00:36:17Z

Maybe we could try adding 10 different sentences with it as a verb and see if that helps...

LazerJesus · 2024-12-19T10:04:47Z

If I can support with data, let me know. I am running through A LOT of LLM-generated sentences and could capture them for you guys.
I also know the verb lemma, tense, person, and other annotations I am prompting the LLM with.

My flow goes:
identify the verb annotation I want to practice
-> prompt the LLM to generate a sentence
-> throw the sentence into Stanza to get annotated tokens for every word.

So I can capture structured data for certain annotations with a simple if(annotation.matches(AngledLuffasCriteria)) appendToFile({sentence,promptedAnnotation})
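The capture step described above could look roughly like this in Python. It is a sketch under assumptions: the token dictionaries use the `token` and `upos` keys from the wrapper's JSON output shown earlier, while `find_mismatch`, `capture`, and the record fields are hypothetical names chosen for illustration.

```python
import json


def find_mismatch(tokens, target, expected_upos):
    """Return the first token equal to `target` whose upos differs from
    `expected_upos`, or None if the tagger agrees with the prompt."""
    for tok in tokens:
        if tok["token"].lower() == target.lower() and tok["upos"] != expected_upos:
            return tok
    return None


def capture(path, sentence, tokens, target, expected_upos):
    """Append one JSON line to `path` when the tagger disagrees with the prompt.

    Returns True if a record was written, False otherwise.
    """
    bad = find_mismatch(tokens, target, expected_upos)
    if bad is None:
        return False
    record = {
        "sentence": sentence,
        "target": target,
        "expected_upos": expected_upos,
        "predicted_upos": bad["upos"],
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True


# Tokens trimmed down from the output in the bug report.
tokens = [
    {"token": "Yo", "upos": "PRON"},
    {"token": "como", "upos": "SCONJ"},
    {"token": "carne", "upos": "NOUN"},
]

# The prompt asked for "como" as a verb, so this reports the SCONJ token.
print(find_mismatch(tokens, "como", "VERB"))
```

Writing one JSON object per line keeps the file easy to append to and easy to filter later.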

AngledLuffa · 2024-12-19T16:05:08Z

I see that the model gets quiero correct, or so it appears to me

# text = Yo quiero carne
# sent_id = 0
1       Yo      yo      PRON    pp1csn00        Case=Nom|Number=Sing|Person=1|PronType=Prs      2       nsubj   _       start_char=0|end_char=2|ner=O
2       quiero  querer  VERB    vmip1s0 Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   0       root    _       start_char=3|end_char=9|ner=O
3       carne   carne   NOUN    ncfs000 Gender=Fem|Number=Sing  2       obj     _       start_char=10|end_char=15|ner=O|SpaceAfter=No

Maybe what we could do would be:

1. take 10 sentences with como used as a verb
2. replace como with quiero in those sentences and annotate them with Stanza
3. swap como back into those annotations and include them in the training data

Is that something already available via your LLM work? If not, I could probably find something similar. We can start with 10 - I don't know if 10 will be enough, but probably it won't outweigh any of the other typical word senses for como, which as mentioned total about 3500 in the GSD and Ancora treebanks.
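The last step of that procedure, swapping como back into the quiero annotations, could be sketched as below. This is an illustration, not an existing Stanza utility: `swap_verb` is a hypothetical helper, it assumes tab-separated CoNLL-U input, and it relies on quiero and como sharing the same tags and features as first-person singular present forms. Because the two words differ in length, the sketch drops `start_char`/`end_char` from the MISC column instead of recomputing them, and it does not handle a capitalized sentence-initial "Quiero".

```python
import re


def swap_verb(conllu_text, old_form="quiero", old_lemma="querer",
              new_form="como", new_lemma="comer"):
    """Replace one verb form/lemma with another in CoNLL-U text."""
    out = []
    for line in conllu_text.splitlines():
        if line.startswith("# text = "):
            # Whole-word replacement in the raw sentence text.
            pattern = r"\b" + re.escape(old_form) + r"\b"
            out.append(re.sub(pattern, new_form, line))
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            if cols[1] == old_form and cols[2] == old_lemma:
                cols[1], cols[2] = new_form, new_lemma
            # Character offsets are stale once word lengths change, so drop them.
            misc = [m for m in cols[9].split("|")
                    if not m.startswith(("start_char=", "end_char="))]
            cols[9] = "|".join(misc) or "_"
        # Comment and blank lines have a single column and pass through unchanged.
        out.append("\t".join(cols))
    return "\n".join(out)


# The quiero annotation from the comment above, with tabs restored.
annotated = "\n".join([
    "# text = Yo quiero carne",
    "# sent_id = 0",
    "1\tYo\tyo\tPRON\tpp1csn00\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t2\tnsubj\t_\tstart_char=0|end_char=2|ner=O",
    "2\tquiero\tquerer\tVERB\tvmip1s0\tMood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin\t0\troot\t_\tstart_char=3|end_char=9|ner=O",
    "3\tcarne\tcarne\tNOUN\tncfs000\tGender=Fem|Number=Sing\t2\tobj\t_\tstart_char=10|end_char=15|ner=O|SpaceAfter=No",
])

print(swap_verb(annotated))
```

The swapped sentences would still need a manual check, since any tagging or parsing error elsewhere in the quiero sentence is carried over unchanged.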

LazerJesus · 2024-12-19T19:28:18Z

Can you show me what the model input format looks like?
I can give you a lot of data from my system. I run through maybe 500+ sentences a day.

AngledLuffa · 2024-12-19T19:31:45Z

Raw sentences could work, and I could send back the processing and you could tell me if it makes sense, or you could output the sentences with

print("{:C}".format(doc))

AngledLuffa · 2024-12-19T19:59:10Z

The basic format would then look like the CoNLL-U output I posted above, but it's not necessary to make it by hand. I think it'd be pretty straightforward to use the doc formatting on sentences with quiero in place of como, then switch out the verbs. Probably shortish sentences, so that it isn't too onerous to check and so that errors elsewhere in the sentence are less likely. Hopefully not all of the format "I eat ---", though!

AngledLuffa · 2024-12-23T07:22:55Z

I just pushed out a new version, but this particular error still occurs. It's possible to update the models for the new version, though, if you have a few of the relevant sentences to add to the training data.

LazerJesus added the bug label Dec 18, 2024
