New validator rule: leaf-det (and det vs. nmod) #1059
Comments
The errors in Hebrew are due to things like
# x- so the RTL text doesn't make this unreadable
32 x-ה x-ה DET art PronType=Art 33 det _ Gloss=the|Ref=GEN_19.8
33 x-אֲנָשִׁ֤ים x-אישׁ NOUN subs Gender=Masc|Number=Plur 38 obl _ Gloss=man|Ref=GEN_19.8
34-35 x-הָאֵל֙ x-_ _ _ _ _ _ _ _
34 x-הָ x-ה DET art PronType=Art 35 det _ Gloss=the|Ref=GEN_19.8
35 x-אֵל֙ x-אל PRON prde Number=Plur|PronType=Dem 33 det _ Gloss=these|Ref=GEN_19.8
where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.) |
@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew) |
If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word. |
I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things. |
I have one remaining error:
The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception? |
Repetition for emphasis: would The validator currently allows |
This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an How should we handle this better? |
No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others". |
What about Here two examples in one sentence from Roman tragedies in UD Latin-CIRCSE: |
For spoken data, we need three relations to be added to the validator:
|
In Latvian, we have several expressions considered as compound pronouns in Latvian traditional grammar which consist of one particle and one pronoun. For example, kaut kāds where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs. We would like to annotate these expressions as Would you please consider allowing |
@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule? |
I think that this new rule is fine, even though, while correcting, my colleagues and I have encountered a couple of cases that really do not look reducible to a trivial correction like all the others.
To summarise the above discussion, my two proposals are to deactivate this validation rule if:
|
We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:
Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the |
This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3. Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.) |
@ftyers @jonorthwash Is there a way to get around Pronoun det with appos in (). This is something that might show up in a text «his (John's) text is strange.» I would have: det(text, his) appos(his, John's)
Also, in Latvian we struggle with constructions similar to "such a high price that nobody could afford it" from the original post as well. |
Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit, here.
`some like him (Stepan Ivanich) had gotten older...' obl(syrelgadstʹ, ladso) This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency obl(syrelgadstʹ, sonze) Departing from a ‹det› dependency, however, we could approach English(, but this is not what EWT does). His friends come from all over. In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g., His (Fred's) friends come from all over. Authors themselves [their very selves], might do the same thing with commas: Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm? Here is an example of Swedish
In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as the same phenomenon as the Czech possessive determiners. `possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich' https://universaldependencies.org/ru/dep/nmod.html https://universaldependencies.org/en/dep/nmod.html So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns and genitive personal pronouns. There is disparity within the Russian corpora alongside a consistent Czech. |
211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work. |
In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?
|
@nschneid, hi! the UD_Finnish-FTB has an interesting construction
ilmeisen harvoja valtiomiehiä sininen ‹blue› |
@jpiitula sent_id = j7hnk-6227 is problematic. See UniversalDependencies/docs#1059
The rule in the validator script is something like this:
if I understand correctly we are allowed to use only |
Thank you @johnnymoretti but I think that |
@KoichiYasuoka For sure, I'm not going into detail about the language, I've just reported what the rule says. At the moment |
Why not |
In Latvian we have an occasional subordinate clause problem as well - tās somas, ko atrada vakar 'those bags which were found yesterday', because in this situation we might as well talk about various kinds of bags, some that were found yesterday, and some not. We struggle with applying the concept of determiners to Latvian in general, but this seems to be a determiner situation, right? |
But eng. my is not a pronoun... actually, I do not understand how my office can use The case you report
looks very similar to the latin one I discussed: you have one element referring to the But we might be up to something regarding elements adding |
I just added an exception to the test because I think we can hardly require that the demonstrative's case is attached as a second copy of the case to the head. I'm not particularly happy with it, as it also opens the door for other cases that are really errors (there was one Chinese example in this thread). Ideally the exception should be more focused and maybe specific to Classical Armenian. |
FWIW, this is similar to the structure we decided to go with for the Bavarian treebank, e.g. "in Lutha seina Iwasetzung" ("Luther's translation") is annotated as follows: It might make sense to discuss adnominal possessive (dative) constructions in a separate Github issue, since lots of Germanic languages appear to have similar constructions (albeit with some idiosyncrasies). The UD annotations are different for every language. (Norwegian "[possessor_PROPN] sin/sitt [possessed]" is annotated as "[possessor] <-obl- sin <-det- [possessed]", and Afrikaans "[possessor] se [possessed]" as "case([possessor], se); nmod([possessed], [possessor])" with se tagged as PART. The Dutch treebanks contain few cases of "[possessor] z'n/d'r [possessed]" and those are annotated differently (one as "[possessor] <-amod- z'n <-nmod:poss- [possessed]" and one as separate phrases ("[possessor]" and "z'n [possessed]") that are independently attached to the root.)) |
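To make the comparison above easier to scan, here is a rough Python sketch that writes the three quoted analyses as (head, deprel, dependent) triples and reports which ones contain a det word that has its own dependent. It assumes the arrows in the comment point from head to dependent, uses POSSESSOR and POSSESSED as placeholder node names, and reduces the rule to "a det word has no children". None of this is taken from the validator itself, and the exceptions the validator allows are ignored.

```python
# Hypothetical encoding of the analyses quoted in the comment above.
# Triples are (head, deprel, dependent); POSSESSOR / POSSESSED are placeholders.
ANALYSES = {
    "Norwegian": [("POSSESSED", "det", "sin"), ("sin", "obl", "POSSESSOR")],
    "Afrikaans": [("POSSESSOR", "case", "se"), ("POSSESSED", "nmod", "POSSESSOR")],
    "Dutch (one variant)": [("POSSESSED", "nmod:poss", "z'n"), ("z'n", "amod", "POSSESSOR")],
}


def det_nodes_with_dependents(edges):
    """Return the words attached as det (any subtype) that head another edge."""
    det_words = {dep for _, rel, dep in edges if rel.split(":")[0] == "det"}
    return sorted({head for head, _, _ in edges if head in det_words})


for language, edges in ANALYSES.items():
    print(language, det_nodes_with_dependents(edges))
```

Under these assumptions only the Norwegian-style analysis has a det word with a dependent; the Afrikaans and Dutch variants avoid it by not using det for the linking element.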
@sylvainkahane I am not sure I understand. If the |
Now I realize that here, too, we shouldn't have a problem because the first determiner should be attached to the kiosk as a a , I do n't know how to call that , a kiosk
|
It is not a complete phrase but I would still find it natural to apply the UD rules for ellipsis (=> for incomplete phrases), promote last and draw the relation
I understand your concern, although I would claim that the UD ellipsis policy provides a possible solution here, too:
Right now I think I prefer the ellipsis solution sketched above, but |
@dan-zeman My mistake! Your validator doesn't forbid |
Or maybe |
@dan-zeman thank you very much for the update.
Would the use of the "dislocated" relation leading from the head to the copy adposition be entirely out of the question here? |
The |
I wrote a summary here. I'm leaving this issue open as a reminder that after the release, the warnings should be turned back to errors (but with the exceptions I have added). |
Is it acceptable? According to the guidelines, But in this case, the annotation of at least as |
I want to make clear that I do not want that page removed. There was an implication: if nmod:poss was created to describe the Saxon genitive, then we must not have a universal page for it. Of course people see this tag and decide to use it in their treebanks, and then the problem arises that language-specific subtypes must make clear that they are language-specific. So In short, my point was that we need clear policies about subtype labels. And for those already existing, a reworking might be needed, as with |
Because this is not a kind of "lexical crystallisation", but it is in fact a sort of derivational, to the limit of inflectional (if there is any difference...), process that can in principle be applied to any word. The exact effect of reduplication can be different (I'd have to look for more literature about that, but for Latin there does not seem to be so much), but it does look to be systematic. Latin just does not seem to use it so often as other languages. Here a quantitative meaning leads to a distributional reading. The deprel More general questions: in languages which express plurality by means of reduplication, what relation would you use for this kind of "extra-word inflection"? In my opinion, surely not Anyway, I think it is just logical to allow any horizontal relation in this context, if one of them is accepted. |
What do you think of the second from seen as an |
As also @sylvainkahane pointed out, it seems that |
I am definitely not saying that every type of reduplication should be But if a fixed sequence of words works like a function word, e.g., a determiner, then I find it appropriate, and no proof of "lexical crystalization" is necessary; the fact that the multiword expression has different function than the members individually seems enough to me. After all, I also remember a discussion whose result was that more than should be a fixed expression. |
As of now, I think |
Yes. |
Why should it be more appropriate for function words than for content ones? I do not see a difference. A reduplicated pluralising expression, too, as a multiword has a different function than its corresponding simple one. This is not the "end phase of a process of grammaticalization", it is just a tool that the language can use. We are very much in the direction of an iconic sequence. On the contrary, quot quot is really keeping everything of quot intact, just implying a distributive reading which descends from its (indefinite) quantitative value. As for more than, I do not think it has analogies with this case. But to put it very briefly I do not understand its treatment as |
Because it is what |
I can understand this point of view, but it does not seem to be explicit from the guidelines. There, the quality of being functional is referred to the output of the I mean, in this particular case there is no grammaticalisation: quot was functional and still stays functional when reduplicated. So the logic behind Also, it is productive and regular (even if possibly confined to a very small class), which again does not seem to be represented by |
It is as explicit as the sentence I cited; but it is not defined sharply, and I suppose it is meant to encompass less functional items such as the "light adverbials" that can modify function words. But while the boundary may be blurry, I think it is clear that the fixed expression should not be the main predicate of a clause, or a referential nominal.
Exactly. I did not say that the nature of the components matters.
Fair enough. I tend to take |
FWIW, I just noted the Luxembourgish LuxBank also adopts this analysis (their Fig 6) Dan objects to the analysis proposed for Lower Saxon by pointing out that both are in fact coreferential, and the nmod:poss relation would be misleading here. One could perhaps save that analysis by annotating it as an apposition (given that a classical criterion for appositions is that they refer to the same entity), although it would mean that the apposition precedes the head. As for the Dutch examples, they are clearly a mess, and need to be corrected ;-) |
I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.
Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:
det + nmod, e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?
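For anyone who wants to look for this pattern in their own data, below is a minimal Python sketch of the kind of check being discussed, run on the "at least some reports" example. It is not the validator's code: the rule is reduced to "a word attached as det should have no dependents", exceptions are ignored, and the CoNLL-U fragment is hand-written with only ID, FORM, HEAD and DEPREL filled in (the attachment of "at" as case is an assumption made for the illustration).

```python
# Hypothetical, simplified check: report dependents of any word attached as det.
# Only ID, FORM, HEAD and DEPREL are filled in; other CoNLL-U columns are "_".
# The attachment of "at" as case(least, at) is an assumption for this example.
SAMPLE = """\
1\tat\t_\t_\t_\t_\t2\tcase\t_\t_
2\tleast\t_\t_\t_\t_\t3\tnmod\t_\t_
3\tsome\t_\t_\t_\t_\t4\tdet\t_\t_
4\treports\t_\t_\t_\t_\t0\troot\t_\t_
"""


def parse_words(conllu):
    """Read syntactic words from CoNLL-U text, keeping only the columns we need."""
    words = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (e.g. 34-35) and empty nodes
        words.append({"id": int(cols[0]), "form": cols[1],
                      "head": int(cols[6]), "deprel": cols[7]})
    return words


def det_with_children(words):
    """Return (det form, child deprel, child form) for every child of a det word."""
    by_id = {w["id"]: w for w in words}
    hits = []
    for w in words:
        parent = by_id.get(w["head"])
        if parent is not None and parent["deprel"].split(":")[0] == "det":
            hits.append((parent["form"], w["deprel"], w["form"]))
    return hits


print(det_with_children(parse_words(SAMPLE)))  # [('some', 'nmod', 'least')]
```

Switching "least" to advmod would still be reported by this sketch, because it flags every child of a det word; whether a given child relation should be accepted is exactly what this issue asks.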