New validator rule: leaf-det (and det vs. nmod) #1059

Closed
nschneid opened this issue Oct 8, 2024 · 153 comments

Comments

@nschneid
Contributor

nschneid commented Oct 8, 2024

I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.

Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:

  • det + nmod e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
  • "such"/det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun in a case like sufficient flour. In such a high price that nobody could afford it, I suppose "such" should have an advcl dependent?
@mr-martian
Contributor

The errors in Hebrew are due to things like

# x- so the RTL text doesn't make this unreadable
32	x-ה	x-ה	DET	art	PronType=Art	33	det	_	Gloss=the|Ref=GEN_19.8
33	x-אֲנָשִׁ֤ים	x-אישׁ	NOUN	subs	Gender=Masc|Number=Plur	38	obl	_	Gloss=man|Ref=GEN_19.8
34-35	x-הָאֵל֙	x-_	_	_	_	_	_	_	_
34	x-הָ	x-ה	DET	art	PronType=Art	35	det	_	Gloss=the|Ref=GEN_19.8
35	x-אֵל֙	x-אל	PRON	prde	Number=Plur|PronType=Dem	33	det	_	Gloss=these|Ref=GEN_19.8

where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)

@amir-zeldes
Contributor

@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)

@mr-martian
Contributor

If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.

@amir-zeldes
Contributor

I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.

@colinbatchelor
Contributor

I have one remaining error:
[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)

The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?
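For anyone triaging many such messages, the parent and child relations can be pulled out of the final parenthesis with a regular expression. This is a minimal sketch, assuming every leaf-det-clf message ends in the same `(id:form:deprel --> id:form:deprel)` shape as the single message quoted above; `parse_leaf_det_message` is a hypothetical helper, not part of the validator.

```python
import re

# Assumed shape of the end of the message, inferred only from the one
# example quoted above: (parent_id:form:deprel --> child_id:form:deprel)
EDGE_RE = re.compile(
    r"\((\d+):(.*):(\w+(?::\w+)?) --> (\d+):(.*):(\w+(?::\w+)?)\)\s*$"
)

def parse_leaf_det_message(message):
    """Return parent/child details from one validator message, or None."""
    match = EDGE_RE.search(message)
    if match is None:
        return None
    pid, pform, pdeprel, cid, cform, cdeprel = match.groups()
    return {
        "parent_id": int(pid), "parent_form": pform, "parent_deprel": pdeprel,
        "child_id": int(cid), "child_form": cform, "child_deprel": cdeprel,
    }

MESSAGE = (
    "[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: "
    "[L3 Syntax leaf-det-clf] 'det' not expected to have children "
    "(79:a:det --> 81:h-uile:compound)"
)
print(parse_leaf_det_message(MESSAGE)["child_deprel"])
```

Counting the `child_deprel` values over a whole validation log would show which child relations account for most of the failures in a treebank.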

@nschneid
Contributor Author

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

@LeonieWeissweiler
Contributor

LeonieWeissweiler commented Oct 10, 2024

This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an ADP and the second is a DET that depends on it with the case relation.

How should we handle this better?

@nschneid
Contributor Author

unter anderem is sometimes treated as a fixed expression. Here is a case triggering the error:

[image]

I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?

@amir-zeldes
Contributor

No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".

@FedeIure

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

What about flat:redup to mark repetition for emphasis?

Here two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:

[image: flat_redup_Latin_CIRCSE]

@sylvainkahane
Contributor

For spoken data, we need three relations to be added to the validator:

  • discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.
  • dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.

@lrituma

lrituma commented Oct 15, 2024

In Latvian, we have several expressions that traditional Latvian grammar considers compound pronouns, consisting of one particle and one pronoun. For example, kaut kāds, where kaut is a particle and kāds is a pronoun (this expression roughly means 'some kind of'). Currently, we annotate the particle as discourse, dependent on the pronoun, and the pronoun occasionally becomes det if the expression modifies a noun. This leads to a validation error.

The particles in these expressions usually are kaut, diez, diezin, nez, nezin, and they all have very fuzzy, hard to pin down semantics so we feel uncomfortable annotating them as adverbs.

We would like to annotate these expressions as compound (instead of fixed) because the pronoun is the second element in the phrase and we feel that it is the head of the phrase: the pronoun inflects together with the noun and carries most of the semantic meaning of the expression.

Would you please consider allowing compound in this construction or is there any other option appropriate here?

@nschneid
Contributor Author

@dan-zeman What about relaxing the error to a warning while we figure out the contours of the rule?

@Stormur
Contributor

Stormur commented Oct 17, 2024

I think that this new rule is fine, even if, while correcting, my colleagues and I have encountered a couple of cases which really do not look reducible to a trivial correction like all the others.

  1. The already mentioned reduplication, which is treated through flat:redup in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense as in 'for each possible one...' (this expression is sometimes even univerbated). I think to annotate them separately, each depending on the head, is not the right way to deal with them: here we do not have two or more different terms, but really the same one "clonating" itself. On the other hand, flat is really the closest relation we have to fixed, which would cause no problem, but is not a correct choice (well, in my opinion it is never the correct choice)
    • Problem: horizontal relation
  2. The phrase nostra qui remansissemus caede 'the murder of us who are left (behind)', but more literally 'our who are left murder', since nostra is the inflected possessive determiner for the 1st person plural. What happens here is that the possessive adds a nominal person, as it were, and this person is another referent beyond the noun caede 'murder' in this phrase; as such, the relative can target it (or at least, Cicero pleases himself in doing so). We could not really justify an analysis where we shift the relative under the head noun, since the murder is not one of its arguments.
    • Problem: the relative clause dependent of the determiner cannot be traced back to the referent of its head

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

  1. the child of det is a flat relation
  2. the head element has the feature Person, at least for acl:relcl

@amir-zeldes
Contributor

We have something similar to the case in 1. in Coptic where a word is repeated for distributive meaning:

  1. one one = "one by one"
  2. two two = "two by two, in pairs"
  3. color color = "color for color, every color"

Etc. 1-2 also work fine in modern Hebrew BTW, and 3. would work in the plural. What we did in UD Coptic was interpret them as nominal modifiers without a preposition (i.e. "one one" is the same as "one by one" with the word "by" suppressed). We then used the nmod:unmarked relation, which is a subtype of nmod used without a case marker.

@jasiewert
Contributor

This new rule invalidates an analysis in my Low Saxon dataset that I just presented last spring in my LREC-COLING paper and discussed with other UD people at the conference, even with @dan-zeman himself, if I remember correctly. It is explained in Section 5.1 here: https://aclanthology.org/2024.lrec-main.1388.pdf The gloss and translation of the sentence can be found in Section 4.3.

Attaching the possessor in dative case to the possessee instead of the determiner does not represent the way this construction works because 1) the dative possessor cannot be attached to the possessee without the determiner and 2) the possessee can be dropped while the determiner cannot. E.g., in the example in my paper, "In der Gemoene iarem." (literally "in the parish hers") is a valid answer to a specification question in whose service the person stands. (A note to German speakers: Masculine and neuter nouns show that this is indeed a dative, not a genitive.)
The alternative of changing the determiners' tags to PRON in Low Saxon would go against UD's own definition of determiners. I would therefore join @nschneid in asking you to relax the error to a warning, or ask for language-specific exceptions to the rule.

nschneid referenced this issue in UniversalDependencies/UD_Erzya-JR Oct 21, 2024
@ftyers @jonorthwash Is there a way to get around Pronoun det with appos in (). This is something that might show up in a text «his (John's) text is strange.» I would have: det(text, his) appos(his, John's)
@lauma
Contributor

lauma commented Oct 21, 2024

Also, in Latvian we struggle with constructions similar to "such a high price that nobody could afford it" from the original post as well.

@rueter
Contributor

rueter commented Oct 21, 2024

Yes, @nschneid, I think the problem encountered in UD_Erzya-JR should be made explicit, here.
In Erzya (myv), Moksha (mdf) and Skolt Saami (sms), genitive forms of personal pronouns are regularly connected to their possessa with a ‹det› dependency.

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

`some like him (Stepan Ivanich) had gotten older...'

obl(syrelgadstʹ, ladso)
det(ladso, sonze)
appos(sonze, Stepan)

This could also be dealt with as a postposition, where the noun ‹lad› `way' in the Inessive case would contribute to the same ‹obl› dependency

obl(syrelgadstʹ, sonze)
case(sonze, ladso)
appos(sonze, Stepan)

Departing from a ‹det› dependency, however, we could approach English(, but this is not what EWT does).

His friends come from all over.
det(friends, his)

In linguistics, such a sentence might be quoted with an inserted identifier for contextual clarity, e.g.,

His (Fred's) friends come from all over.
det(friends, his)
appos(His, Fred's)

Authors themselves [their very selves], might do the same thing with commas:
His, Fred's, friends come from all over.
det(friends, his)
appos(His, Fred's)

Since the validator does not allow words with a ‹det› dependency to take children, one might opt to follow a Swedish lead and change all instances of genitive-case personal pronoun ‹det› to ‹nmod:poss/nmod:det›, but wouldn't that go against the established norm?

Here is an example of Swedish hennes ‹her› given with ‹nmod:poss› dependency
The genitive form of a third person singular personal pronoun 'her'

# sent_id = sv-ud-dev-78
# text = Börjar hennes jobb att delas av den moderne mannen?
1	Börjar	börja	VERB	VB|PRS|AKT	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
2	hennes	hon	PRON	PS|UTR/NEU|SIN/PLU|DEF	Definite=Def|Poss=Yes|PronType=Prs	3	nmod:poss	3:nmod:poss	_
3	jobb	jobb	NOUN	NN|NEU|SIN|IND|NOM	Case=Nom|Definite=Ind|Gender=Neut|Number=Sing	1	nsubj	1:nsubj|5:nsubj	_
4	att	att	PART	IE	_	5	mark	5:mark	_
5	delas	dela	VERB	VB|INF|SFO	VerbForm=Inf|Voice=Pass	1	xcomp	1:xcomp	_
6	av	av	ADP	PP	_	9	case	9:case	_
7	den	den	DET	DT|UTR|SIN|DEF	Definite=Def|Gender=Com|Number=Sing|PronType=Art	9	det	9:det	_
8	moderne	modern	ADJ	JJ|POS|MAS|SIN|DEF|NOM	Case=Nom|Definite=Def|Degree=Pos|Gender=Com|Number=Sing	9	amod	9:amod	_
9	mannen	man	NOUN	NN|UTR|SIN|DEF|NOM	Case=Nom|Definite=Def|Gender=Com|Number=Sing	5	obl:agent	5:obl:agent	SpaceAfter=No
10	?	?	PUNCT	MAD	_	1	punct	1:punct	_

In Swedish, the first and second person pronouns are associated with distinct determiners that are called pronouns in UD: vår, min, er, din. These words inflect according to their possessa, and therefore they might be seen as analogous to the Czech possessive determiners.

`possessive determiners (which modify a nominal) (note that some languages use PRON for similar words): [cs] můj, tvůj, jeho, její, náš, váš, jejich'
See also
https://universaldependencies.org/cs/dep/nmod.html
The Czech is consistent.

https://universaldependencies.org/ru/dep/nmod.html
I note that Russian also ‹его карта› amod(карта, его)
translated as English ‹his card› amod(card, his)
Syntag appears to contradict this in ‹его мнению› `his opinion' det(мнению, его) but also `в его (и не только его, но и нашем) случае' ‹в его случае› `in his case' nmod(случае, его)

https://universaldependencies.org/en/dep/nmod.html
I note that the English provides ‹my office› nmod:poss(office, my)
which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

There is disparity within the Russian corpora alongside a consistent Czech.

@johnnymoretti

211 treebanks are invalidated by this new rule, and we need guidance on what to do before the freeze!!! Please provide brief and clear instructions, as aligning the treebanks with this rule requires a lot of work.

@KoichiYasuoka
Contributor

In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?

# sent_id = KR2b0041_018_par8_1550-1557
# text = 訂彼此兵不得過關
1	訂	訂	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=settle|SpaceAfter=No
2	彼	彼	PRON	n,代名詞,指示,*	PronType=Dem	4	det	_	Gloss=that|SpaceAfter=No
3	此	此	PRON	n,代名詞,指示,*	PronType=Dem	2	flat	_	Gloss=this|SpaceAfter=No
4	兵	兵	NOUN	n,名詞,人,役割	_	7	nsubj	_	Gloss=soldier|SpaceAfter=No
5	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	6	advmod	_	Gloss=not|SpaceAfter=No
6	得	得	AUX	v,助動詞,可能,*	Mood=Pot	7	aux	_	Gloss=must|SpaceAfter=No
7	過	過	VERB	v,動詞,行為,移動	_	1	ccomp	_	Gloss=pass|SpaceAfter=No
8	關	關	NOUN	n,名詞,固定物,建造物	Case=Loc	7	obj	_	Gloss=bar|SpaceAfter=No

[image: those_and_these_soldiers]
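
To see why this sentence is flagged, it helps to list every word whose parent is attached as det. The sketch below is illustrative only: the `(id, form, head, deprel)` tuples are copied from the CoNLL-U rows above, and `det_children` is a hypothetical helper, not validator code.

```python
# (id, form, head, deprel) copied from the CoNLL-U rows above
ROWS = [
    (1, "訂", 0, "root"),
    (2, "彼", 4, "det"),
    (3, "此", 2, "flat"),
    (4, "兵", 7, "nsubj"),
    (5, "不", 6, "advmod"),
    (6, "得", 7, "aux"),
    (7, "過", 1, "ccomp"),
    (8, "關", 7, "obj"),
]

def det_children(rows):
    """Yield (parent, child) row pairs where the parent is attached as det."""
    by_id = {row[0]: row for row in rows}
    for child in rows:
        parent = by_id.get(child[2])  # head 0 (root) has no row, so this is None
        if parent is not None and parent[3] == "det":
            yield parent, child

for parent, child in det_children(ROWS):
    print(f"{parent[1]} (det) has child {child[1]} ({child[3]})")
# prints: 彼 (det) has child 此 (flat)
```

The only pair found is 彼 with its flat child 此, which is the attachment the rule objects to.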

@rueter
Contributor

rueter commented Oct 24, 2024

@nschneid, hi! the UD_Finnish-FTB has an interesting construction

# sent_id = j7hnk-6227
«Viron presidentti Lennart Meri on yksi niitä [ilmeisen harvoja valtiomiehiä], jotka laativat puheensa itse.»
«The President of Estonia, Lennart Meri, is one of the [apparently few statesmen] who write their own speeches.»

8       ilmeisen        ilmeinen        ADJ     A,Sg,Gen        Case=Gen|Number=Sing    9       amod    _       _
9       harvoja harva   DET     Pron,Qnt,Pl,Par Case=Par|Number=Plur|PronType=Ind       10      det     _       _
10      valtiomiehiä    valtiomies      NOUN    N,Pl,Par        Case=Par|Number=Plur    0       root    _       _

ilmeisen harvoja valtiomiehiä
The genitive-case adjective ‹apparent› modifies the determiner ‹few›.
This same construction with a genitive-case adjective is observed with expressions of color, e.g.

sininen ‹blue›
vaalean + sininen ‹light + blue›
tumman + punainen ‹dark + red›
NB! in Finnish these are written as one word, i.e., vaaleansininen, tummanpunainen, except, perhaps, when saying ‹especially dark red› erittäin tumman punainen
amod(red, dark)
advmod(dark, especially)
The Finnish grammar might make reference to an instructive case in -n, but this instance ilmeisen harvoja valtiomiehiä does not seem to fall into that category: https://kaino.kotus.fi/visk/sisallys.php?p=389.
What do you think @fginter, @flammie, @jpiitula?

rueter added a commit to UniversalDependencies/UD_Finnish-FTB that referenced this issue Oct 24, 2024
@jpiitula
sent_id = j7hnk-6227
is problematic.
See UniversalDependencies/docs#1059
@johnnymoretti

The rule in the validator script is something like this:

if re.match(r"^(det|clf)$", pdeprel) and not re.match(r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$",cdeprel) :

If I understand correctly, we are allowed to use only obl and not obl:cmp, right? If so, why? Shouldn't the main dependency relation also cover its subtypes?
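
The effect of the anchors in that pattern can be checked in isolation. The sketch below takes the quoted condition literally; whether the validator hands it the full relation or only the universal part is not shown in this thread, so `child_allowed` and `universal` are hypothetical helpers that only illustrate the difference.

```python
import re

# Allow-list copied from the condition quoted above.
ALLOWED = r"^(advmod|obl|goeswith|fixed|reparandum|conj|cc|punct)$"

def child_allowed(pdeprel, cdeprel):
    """Mirror the quoted condition: True if the rule would not complain."""
    if not re.match(r"^(det|clf)$", pdeprel):
        return True  # the rule only concerns children of det/clf
    return re.match(ALLOWED, cdeprel) is not None

def universal(deprel):
    """Strip a subtype, e.g. obl:cmp -> obl."""
    return deprel.split(":", 1)[0]

print(child_allowed("det", "obl"))                 # True
print(child_allowed("det", "obl:cmp"))             # False
print(child_allowed("det", universal("obl:cmp")))  # True
```

Because the pattern is anchored with `^` and `$`, a subtyped relation only passes if the subtype is stripped before matching or listed explicitly.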

@KoichiYasuoka
Contributor

Thank you @johnnymoretti but I think that det and clf cannot be treated in the same way. In Thai clf can be modified by ADJ or PRON (whose). On the other hand det can be linked by flat or conj...

@johnnymoretti

@KoichiYasuoka For sure, I'm not going into detail about the language, I've just reported what the rule says. At the moment det and clf are in the same rule.

@Stormur
Contributor

Stormur commented Oct 24, 2024

In Classical Chinese 彼此兵 (those and these soldiers) is invalidated by this new rule. How do we solve it?

Why not conj here?

@lauma
Contributor

lauma commented Oct 24, 2024

In Latvian we have an occasional subordinate clause problem as well: tās somas, ko atrada vakar 'those bags which were found yesterday', because in this situation we might as well be talking about various kinds of bags, some of which were found yesterday and some not. We struggle to apply the concept of determiners to Latvian in general, but this seems to be a determiner situation, right?

@Stormur
Contributor

Stormur commented Oct 24, 2024

https://universaldependencies.org/en/dep/nmod.html I note that the English provides ‹my office› nmod:poss(office, my) which is the same coding as in EWT.

So it looks like there might be a Swedish–English consensus for nmod:poss use with possessive pronouns, and genitive personal pronouns.

But eng. my is not a pronoun... actually, I do not understand how my office can use nmod in English in the current standard.


The case you report

sent_id = EKS:2011:39:15:ČesnokovF
Конат-конат сонзэ (Степан Иваныч) ладсо сырелгадсть...
Konat-konat    sonze    (Stepan Ivanych)  ladso    syrelgadstʹ...
such-such.Pl  his/her  (St. I.)                    in.way   become.older.3Pl

looks very similar to the Latin one I discussed: you have one element referring to the Person which is in сонзэ and which cannot go with ладсо. A kind of "double referent" phrase.


But we might be on to something regarding elements adding Persons. The case you report makes me wonder if indeed any element like this warrants nmod even when it looks like other DETs.

@dan-zeman
Member

@pkocharov : In Classical Armenian, prepositions and articles can be repeated with modifiers, including demonstrative pronominal adjectives, within NP, cf.

i kʻarancʻ y ayscʻanē from stone.ABL.PL from this.ABL.PL "from these (from) stones"

det(kʻarancʻ, ayscʻanē) case(kʻarancʻ, i) case(ayscʻanē, y)

Shall I change the UPOS of demonstrative pronominal adjectives from DET to PRON or ADJ and replace det with nmod or amod, accordingly? Would not this contradict the guidelines for annotating demonstratives?

I just added an exception to the test because I think we can hardly require that the demonstrative's case is attached as a second copy of the case to the head. I'm not particularly happy with it, as it also opens the door for other cases that are really errors (there was one Chinese example in this thread). Ideally the exception should be more focused and maybe specific to Classical Armenian.

@verenablaschke
Member

@dan-zeman

@jasiewert forgive me if I forgot the discussion of this example at LREC (I am not saying it did not occur!) But attaching Gemoene as nmod:poss of iarem seems odd to me because it suggests that the parish is the possessor of "she", while in fact (if I understand it correctly), the parish and she are coreferential. Attaching them both directly to Denste would not seem so bad to me (and if the possessee is dropped, it could be solved using the standard approach to ellipsis: promotion).

FWIW, this is similar to the structure we decided to go with for the Bavarian treebank, e.g. "in Lutha seina Iwasetzung" ("Luther's translation") is annotated as follows:
[image]

It might make sense to discuss adnominal possessive (dative) constructions in a separate Github issue, since lots of Germanic languages appear to have similar constructions (albeit with some idiosyncrasies). The UD annotations are different for every language. (Norwegian "[possessor_PROPN] sin/sitt [possessed]" is annotated as "[possessor] <-obl- sin <-det- [possessed]", and Afrikaans "[possessor] se [possessed]" as "case([possessor], se); nmod([possessed], [possessor])" with se tagged as PART. The Dutch treebanks contain few cases of "[possessor] z'n/d'r [possessed]" and those are annotated differently (one as "[possessor] <-amod- z'n <-nmod:poss- [possessed]" and one as separate phrases ("[possessor]" and "z'n [possessed]") that are independently attached to the root.))

@dan-zeman
Member

If you want to keep a constrained rule, we can allow discourse only when the determiner is a reparandum.

@sylvainkahane I am not sure I understand. If the discourse child is attached to a parent that itself is a reparandum, we do not have a problem at all (regardless whether the UPOS tag of the parent is DET). We would have a problem only if the determiner parent were attached as det.

@dan-zeman
Member

  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.

Is attaching the parenthetical to the first determiner better than attaching it to the noun (kiosk)? Apart from the non-projectivity – I realize that there is similarity between this and the discourse point above.

Yes it is similar to the discourse marker case and I propose the same solution. Moreover, in this case, "a, I don't know how to call that" forms a kind of semantic and prosodic unit, which is not the case of "I don't know how to call that, a kiosk". I really want to attach the parenthesis to the first determiner.

Now I realize that here, too, we shouldn't have a problem because the first determiner should be attached to the kiosk as a reparandum:

a , I do n't know how to call that , a kiosk

reparandum(kiosk, a-1)
det(kiosk, a-12)
parataxis(a-1, know)
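
The point can be checked mechanically: a dependent only matters to the rule if its parent's own incoming relation is det. The sketch below is illustrative only; the triples are the three relations listed above, and `leaf_det_hits` is a hypothetical helper, not validator code.

```python
# (relation, head, dependent) triples from the analysis above
EDGES = [
    ("reparandum", "kiosk", "a-1"),
    ("det", "kiosk", "a-12"),
    ("parataxis", "a-1", "know"),
]

def leaf_det_hits(edges):
    """Return the edges whose head is itself attached as det."""
    incoming = {dep: rel for rel, _head, dep in edges}
    return [edge for edge in edges if incoming.get(edge[1]) == "det"]

print(leaf_det_hits(EDGES))  # []

# For contrast: attach the first "a" as det instead of reparandum.
AS_DET = [("det", "kiosk", "a-1")] + EDGES[1:]
print(leaf_det_hits(AS_DET))  # [('parataxis', 'a-1', 'know')]
```

With the first determiner attached as reparandum, the parenthetical under it does not hang from a det node, so nothing is reported.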

dan-zeman added a commit to UniversalDependencies/tools that referenced this issue Nov 3, 2024
@dan-zeman
Member

The problem is not with the last day, the reparandum starts from day. The problem is the analysis of the last, the false start itself. I don't like the idea of analyzing it as a correct phrase, for instance with det(last,the). I want to keep the information that it is a false start and not a complete phrase. That is why we chose dep(the, last), but I am ok with using another relation.

It is not a complete phrase but I would still find it natural to apply the UD rules for ellipsis (=> for incomplete phrases), promote last and draw the relation det(last, the). I think the information that it is a false start is already encoded in the incoming reparandum relation (which could be further subtyped to reparandum:falsestart if it is needed).

I give you another example of false start, which I would like to analyze similarly:

les gens qu'on, qu'on voit pour la première fois
'people who we, who we see for the first time' (In French the relative pronoun cannot be omitted)

Here we have two pronouns qu' and on, which form the false start together, and I don't see what could be the link between them apart from dep.

I understand your concern, although I would claim that the UD ellipsis policy provides a possible solution here, too: orphan(on, qu').

Maybe a dedicated relation such as flat:disfluency?

Right now I think I prefer the ellipsis solution sketched above, but flat would probably still be better than dep. None of the two solutions should trigger the leaf-det validation test, so I think further discussion, if desirable, can take place in a new issue.

@sylvainkahane
Contributor

@dan-zeman My mistake! Your validator doesn't forbid discourse, parataxis, or orphan depending on a DET which is reparandum. And it is good like this.

@dan-zeman
Member

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

  1. the child of det is a flat relation
  2. the head element has the feature Person, at least for acl:relcl

Or maybe Poss=Yes would be more to the point than non-empty Person.

@pkocharov
Contributor

@dan-zeman thank you very much for an update.

I just added an exception to the test because I think we can hardly require that the demonstrative's case is attached as a second copy of the case to the head. I'm not particularly happy with it, as it also opens the door for other cases that are really errors (there was one Chinese example in this thread). Ideally the exception should be more focused and maybe specific to Classical Armenian.

Would the use of "dislocated" relation leading from the head to the copy adposition be entirely out of question here?
Otherwise, shall I revert "DET...amod" back to "DET...det" before the release?

@dan-zeman
Member

The dislocated relation is intended for dependents of clauses, which is not the case here (and we would need an exception for it anyway). As for modifying the data during data freeze – I will answer that via e-mail in order not to spam the readers of this thread.

@dan-zeman
Member

It would be really, really beneficial if, when the situation stabilizes, there were some kind of summary of the results of all this. With 100 comments I got lost in the end, but the core part of this discussion was very useful, at least for me, for understanding det better. Maybe even clarifications/additions in the guidelines?

I wrote a summary here. I'm leaving this issue open as a reminder that after the release, the warnings should be turned back to errors (but with the exceptions I have added).

@Stormur
Contributor

Stormur commented Nov 4, 2024

  • det + nmod e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.

I don't think at least is a nominal. Its head is not a NOUN, nor can it be replaced by a noun. If it is not a nominal, then it probably should not have nmod as its incoming relation. Least is an adjective and it is acceptable to use adjectives as advmod, so I would be inclined to use advmod here. The validator should be happy with it.

Is it acceptable? According to the guidelines, advmod should entail ADV, and vice versa. So this seems to be tolerated by the validator, but it is not standard, and in fact avoided by many (most?) treebanks. Or has something changed in the meantime? Not that I am against allowing adjectives to take advmod in certain conditions...

But in this case, the annotation of at least as advmod opens a can of worms. It would mean that as a consequence we can decide to annotate any kind of adverbial expression as advmod even if it is nominal like this one (for very good reasons: least is nominal; it is introduced by an adposition; it can take modifiers like in at the very least, etc.). We already have the triplet obl/advcl/advmod which covers these different cases: obl can be explained as "adverbial nominal phrase". Otherwise there is no reason not to annotate expressions such as by force as advmod; let's be careful.

@Stormur
Contributor

Stormur commented Nov 4, 2024

We shouldn't have a universal page for nmod:poss.

In hindsight, I tend to agree. It was me who created the page (as well as the universal page for det:poss). I did this when I was introducing the requirement that all deprel subtypes are documented and I wanted to reduce the number of treebanks that would immediately become invalid, so for some widely adopted subtypes I picked a language-specific documentation page and made it available at the universal level. That's also the reason why I'm not going to remove these pages now: Many treebanks would become invalid.

It would be better if somebody proposed an improvement of the universal page that will explain the misunderstandings discovered in this thread. Because non-existence of the universal documentation of nmod:poss and det:poss will not prevent people from using it. They have seen it in English or elsewhere, so they assume they should use it too, whenever they find anything "possessive" in their language.

I want to make clear that I do not want that page removed. There was an implication: if nmod:poss was created to describe the Saxon genitive, then we must not have a universal page for it.

Of course people see this tag and decide to use it in their treebanks, and then the problem arises that language-specific subtypes must make clear that they are language-specific. So :poss as a subtype cannot be used for Saxon's genitive because "possessive" is a generic concept that is realised by any language in many different ways, and then the use of this subtype across treebanks becomes incommensurable (as is already the case with many misleadingly similar or generic tags).

In short, my point was that we need clear policies about subtype labels. And for those already existing, a reworking might be needed, as with nmod:poss. Is it then the case that maintainers of a treebank decide they cannot mark every possessive construction? Good, they will give up this specific tag, possibly create a very language-specific one, and the whole documentation will gain in clarity.

@Stormur
Contributor

Stormur commented Nov 4, 2024

@Stormur : 1. The already mentioned reduplication, which is treated through flat:redup in Latin treebanks. One example is quot quot from quot: while the latter means 'as many as', the reduplication has a distributive sense as in 'for each possible one...' (this expression is sometimes even univerbated). I think to annotate them separately, each depending on the head, is not the right way to deal with them: here we do not have two or more different terms, but really the same one "clonating" itself. On the other hand, flat is really the closest relation we have to fixed, which would cause no problem, but is not a correct choice (well, in my opinion it is never the correct choice)

  • Problem: horizontal relation

Why is fixed not the correct choice? It looks like the correct choice to me. (Although I'm not in principle against relaxing the test also for flat.)

Because this is not a kind of "lexical crystallisation", but it is in fact a sort of derivational, to the limit of inflectional (if there is any difference...), process that can in principle be applied to any word. The exact effect of reduplication can be different (I'd have to look for more literature about that, but for Latin there does not seem to be so much), but it does look to be systematic. Latin just does not seem to use it so often as other languages. Here a quantitative meaning leads to a distributional reading. The deprel fixed, on the contrary, points to an idiosyncratic "merging" of the two words (by the way, I do not see it so distant from a dep...).

More general questions: in languages which express plurality by means of reduplication, what relation would you use for this kind of "extra-word inflection"? In my opinion, surely not fixed.

Anyway, I think it is just logical to allow any horizontal relation in this context, if one of them is accepted.

@Stormur
Contributor

Stormur commented Nov 4, 2024

@pkocharov : In Classical Armenian, prepositions and articles can be repeated with modifiers, including demonstrative pronominal adjectives, within NP, cf.
i kʻarancʻ y ayscʻanē from stone.ABL.PL from this.ABL.PL "from these (from) stones"
det(kʻarancʻ, ayscʻanē) case(kʻarancʻ, i) case(ayscʻanē, y)
Shall I change the UPOS of demonstrative pronominal adjectives from DET to PRON or ADJ and replace det with nmod or amod, accordingly? Would not this contradict the guidelines for annotating demonstratives?

I just added an exception to the test because I think we can hardly require that the demonstrative's case is attached as a second copy of the case to the head. I'm not particularly happy with it, as it also opens the door for other cases that are really errors (there was one Chinese example in this thread). Ideally the exception should be more focused and maybe specific to Classical Armenian.

What do you think of the second from seen as an expl, a needed repetition close to the head if this is separated from the adposition by other material? While the first one is treated as the "main" adposition. It seems to be implied that this is a regular occurrence.
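For reference, the Classical Armenian example quoted above can be written out as a small dependency table to show what a leaf-det check would see. This is only a sketch: the tuple layout, the `det_children` helper and the `tolerated` parameter are hypothetical and are not taken from the validator in UniversalDependencies/tools, and attaching the noun to the root is an assumption made only to keep the toy tree complete.

```python
# Tokens as (id, form, head, deprel), following the relations given above:
# det(kʻarancʻ, ayscʻanē) case(kʻarancʻ, i) case(ayscʻanē, y)
TOKENS = [
    (1, "i", 2, "case"),
    (2, "kʻarancʻ", 0, "root"),   # assumed attachment, for illustration only
    (3, "y", 4, "case"),
    (4, "ayscʻanē", 2, "det"),
]


def det_children(tokens, tolerated=frozenset()):
    """Return (form, deprel) for every word attached to a det word,
    skipping children whose universal relation is in `tolerated`."""
    det_ids = {tid for tid, _, _, rel in tokens if rel.split(":")[0] == "det"}
    return [
        (form, rel)
        for _, form, head, rel in tokens
        if head in det_ids and rel.split(":")[0] not in tolerated
    ]


# Without any exception, the repeated preposition is reported.
print(det_children(TOKENS))
# With a case child tolerated, nothing is reported.
print(det_children(TOKENS, tolerated={"case"}))
```

The second call illustrates the concern raised above: an exception keyed only on the relation name is broad, so it would also accept case children of det in languages where they are genuine errors.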

@Stormur
Contributor

Stormur commented Nov 4, 2024

To summarise the above discussion, my two proposals are to deactivate this validation rule if:

  1. the child of det is a flat relation
  2. the head element has the feature Person, at least for acl:relcl

Or maybe Poss=Yes would be more to the point than non-empty Person.

As also @sylvainkahane pointed out, it seems that Poss=Yes is quite redundant. Are there cases where we have Poss=Yes but the Person is not marked / cannot be specified? Because in the cases that were discussed we always have an explicit person that is pointed to by the dependent.
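The two proposals can be sketched as a filter on the children of det. This is a minimal illustration under stated assumptions, not the actual rule in UniversalDependencies/tools: the `Word` class and `leaf_det_violations` function are hypothetical, and reading proposal 2 as "an acl:relcl child is allowed when the det word itself carries Person" is one possible interpretation.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    id: int
    form: str
    head: int
    deprel: str
    feats: dict = field(default_factory=dict)


def udeprel(deprel):
    """Strip the subtype, e.g. 'flat:redup' -> 'flat'."""
    return deprel.split(":")[0]


def leaf_det_violations(words):
    """List (det id, child id, child deprel) for children of det words
    that are not covered by either proposed exception."""
    by_id = {w.id: w for w in words}
    violations = []
    for child in words:
        parent = by_id.get(child.head)
        if parent is None or udeprel(parent.deprel) != "det":
            continue
        # Proposal 1: the child of det is a flat relation.
        if udeprel(child.deprel) == "flat":
            continue
        # Proposal 2 (assumed reading): the det word has Person and the
        # child is a relative clause.
        if child.deprel == "acl:relcl" and "Person" in parent.feats:
            continue
        violations.append((parent.id, child.id, child.deprel))
    return violations


# Toy input: reduplicated "quot quot" with a placeholder noun "X".
redup = [
    Word(1, "quot", 3, "det"),
    Word(2, "quot", 1, "flat:redup"),
    Word(3, "X", 0, "root"),
]
print(leaf_det_violations(redup))
```

Swapping the `"Person" in parent.feats` test for `parent.feats.get("Poss") == "Yes"` would give the Poss=Yes variant discussed above.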

@dan-zeman
Member

Why is fixed not the correct choice? It looks like the correct choice to me. (Although I'm not in principle against relaxing the test also for flat.)

Because this is not a kind of "lexical crystallisation", but it is in fact a sort of derivational, to the limit of inflectional (if there is any difference...), process that can in principle be applied to any word. The exact effect of reduplication can be different (I'd have to look for more literature about that, but for Latin there does not seem to be so much), but it does look to be systematic. Latin just does not seem to use it so often as other languages. Here a quantitative meaning leads to a distributional reading. The deprel fixed, on the contrary, points to an idiosyncratic "merging" of the two words (by the way, I do not see it so distant from a dep...).

More general questions: in languages which express plurality by means of reduplication, what relation would you use for this kind of "extra-word inflection"? In my opinion, surely not fixed.

I am definitely not saying that every type of reduplication should be fixed; certainly not plural nouns (in Indonesian, these are even written as one token, http://hdl.handle.net/11346/PMLTQ-YXCN).

But if a fixed sequence of words works like a function word, e.g., a determiner, then I find it appropriate, and no proof of "lexical crystallisation" is necessary; the fact that the multiword expression has a different function than its members individually seems enough to me. After all, I also remember a discussion whose result was that more than should be a fixed expression.

@dan-zeman
Member

@pkocharov : In Classical Armenian, prepositions and articles can be repeated with modifiers, including demonstrative pronominal adjectives, within NP, cf.
i kʻarancʻ y ayscʻanē from stone.ABL.PL from this.ABL.PL "from these (from) stones"
det(kʻarancʻ, ayscʻanē) case(kʻarancʻ, i) case(ayscʻanē, y)
Shall I change the UPOS of demonstrative pronominal adjectives from DET to PRON or ADJ and replace det with nmod or amod, accordingly? Would not this contradict the guidelines for annotating demonstratives?

I just added an exception to the test because I think we can hardly require that the demonstrative's case is attached as a second copy of the case to the head. I'm not particularly happy with it, as it also opens the door for other cases that are really errors (there was one Chinese example in this thread). Ideally the exception should be more focused and maybe specific to Classical Armenian.

What do you think of the second from seen as an expl, a needed repetition close to the head if this is separated from the adposition by other material? While the first one is treated as the "main" adposition. It seems to be implied that this is a regular occurrence.

As of now, I think expl is used only for pronouns in UD. The taxonomy of UD relations places it among non-core dependents of clauses, and the first sentence of its definition says that it is for nominals. That does not fit adpositions.

@dan-zeman
Member

Or maybe Poss=Yes would be more to the point than non-empty Person.

As also @sylvainkahane pointed out, it seems that Poss=Yes is quite redundant. Are there cases where we have Poss=Yes but the Person is not marked / cannot be specified?

Yes.

@Stormur
Contributor

Stormur commented Nov 4, 2024

Why is fixed not the correct choice? It looks like the correct choice to me. (Although I'm not in principle against relaxing the test also for flat.)

Because this is not a kind of "lexical crystallisation", but it is in fact a sort of derivational, to the limit of inflectional (if there is any difference...), process that can in principle be applied to any word. The exact effect of reduplication can be different (I'd have to look for more literature about that, but for Latin there does not seem to be so much), but it does look to be systematic. Latin just does not seem to use it so often as other languages. Here a quantitative meaning leads to a distributional reading. The deprel fixed, on the contrary, points to an idiosyncratic "merging" of the two words (by the way, I do not see it so distant from a dep...).
More general questions: in languages which express plurality by means of reduplication, what relation would you use for this kind of "extra-word inflection"? In my opinion, surely not fixed.

I am definitely not saying that every type of reduplication should be fixed; certainly not plural nouns (in Indonesian, these are even written as one token, http://hdl.handle.net/11346/PMLTQ-YXCN).

But if a fixed sequence of words works like a function word, e.g., a determiner, then I find it appropriate, and no proof of "lexical crystallisation" is necessary; the fact that the multiword expression has a different function than its members individually seems enough to me. After all, I also remember a discussion whose result was that more than should be a fixed expression.

Why should it be more appropriate for function words than for content ones? I do not see a difference. A reduplicated pluralising expression, too, as a multiword has a different function than its corresponding simple one. This is not the "end phase of a process of grammaticalization", it is just a tool that the language can use. We are very much in the direction of an iconic sequence. On the contrary, fixed seems to imply some kind of non-transparency which cannot be found here.

quot quot is really keeping everything of quot intact, just implying a distributive reading which descends from its (indefinite) quantitative value.

As for more than, I do not think it has analogies with this case. But to put it very briefly I do not understand its treatment as fixed, especially not in the light of this discussion.

@dan-zeman
Member

Why should it be more appropriate for function words than for content ones?

Because it is what fixed was designed for. Its documentation also says "Such expressions tend to behave like function words."

@Stormur
Contributor

Stormur commented Nov 4, 2024

Why should it be more appropriate for function words than for content ones?

Because it is what fixed was designed for. Its documentation also says "Such expressions tend to behave like function words."

I can understand this point of view, but it does not seem to be explicit from the guidelines. There, the quality of being functional is referred to the output of the fixed expression, not to its components, which are very often content words. On the contrary, there does not seem to be the implication that flat combinations of function words require fixed (I don't think quack is a content word 🙂 ).

I mean, in this particular case there is no grammaticalisation: quot was functional and still stays functional when reduplicated. So the logic behind fixed does not seem applicable here.

Also, it is productive and regular (even if possibly confined to a very small class), which again does not seem to be represented by fixed, which seems to imply "each grammaticalisation has its own story".

@dan-zeman
Member

I can understand this point of view, but it does not seem to be explicit from the guidelines.

It is as explicit as the sentence I cited; but it is not defined sharply, and I suppose it is meant to encompass less functional items such as the "light adverbials" that can modify function words. But while the boundary may be blurry, I think it is clear that the fixed expression should not be the main predicate of a clause, or a referential nominal.

There, the quality of being functional is referred to the output of the fixed expression, not to its components,

Exactly. I did not say that the nature of the components matters.

which are very often content words. On the contrary, there does not seem to be the implication that flat combinations of function words require fixed (I don't think quack is a content word 🙂 ).

I mean, in this particular case there is no grammaticalisation: quot was functional and still stays functional when reduplicated. So the logic behind fixed does not seem applicable here.

Also, it is productive and regular (even if possibly confined to a very small class), which again does not seem to be represented by fixed, which seems to imply "each grammaticalisation has its own story".

Fair enough. I tend to take fixed as a convenient tool when there is a functional expression which needs to be kept as one constituent, even without a clear grammaticalization path, but I realize that such a stance is at odds with the recommendation in the description of fixed.

dan-zeman added a commit to UniversalDependencies/tools that referenced this issue Nov 4, 2024
@gossebouma
Contributor

It might make sense to discuss adnominal possessive (dative) constructions in a separate Github issue, since lots of Germanic languages appear to have similar constructions (albeit with some idiosyncrasies). The UD annotations are different for every language. (Norwegian "[possessor_PROPN] sin/sitt [possessed]" is annotated as "[possessor] <-obl- sin <-det- [possessed]", and Afrikaans "[possessor] se [possessed]" as "case([possessor], se); nmod([possessed], [possessor])" with se tagged as PART. The Dutch treebanks contain few cases of "[possessor] z'n/d'r [possessed]" and those are annotated differently (one as "[possessor] <-amod- z'n <-nmod:poss- [possessed]" and one as separate phrases ("[possessor]" and "z'n [possessed]") that are independently attached to the root.))

FWIW, I just noted the Luxembourgish LuxBank also adopts this analysis (their Fig 6)

Dan objects to the analysis proposed for Lower Saxon by pointing out that both are in fact coreferential, and the nmod:poss relation would be misleading here. One could perhaps save that analysis by annotating it as an apposition (given that a classical criterion for appositions is that they refer to the same entity), although it would mean that the apposition precedes the head.

As for the Dutch examples, they are clearly a mess, and need to be corrected ;-)
