Spans and error markup do still go badly together in some cases (Bugzilla Bug 1147) #2
Comments
Comment 5150
Date: 2011-09-22 07:21:23 +0200

To repeat: Convert a text document with the following content:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

Given result:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen døde 5. januar 1885, nesten 73 år gammel.

Expected result:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen døde 5. januar 1885, nesten 73 år gammel.

The text cited above can be found in the document: $GTFREE/goldstandard/orig/nob/ficti/nob-006-asbjmoe.correct.txt
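For readers unfamiliar with the inline error markup used in the report, here is a minimal Python sketch of how a `word$(attributes|correction)` annotation could be turned into an XML element without losing the surrounding text. It is not the code from Corpus.pm or convert2xml.pl: the element name `error`, the attribute names `correct` and `attrs`, and the function name `markup_errors` are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches e.g. "Asbjørsen$(prop,typo|Asbjørnsen)": a word, then "$(",
# a comma-separated attribute list, "|", the correction, and ")".
ERROR_RE = re.compile(
    r"(?P<word>\S+)\$\((?P<attrs>[^|()]*)\|(?P<correct>[^()]*)\)"
)


def markup_errors(text):
    """Return a <p> element where each inline error is wrapped in a
    hypothetical <error> element; all other text is kept unchanged."""
    p = ET.Element("p")
    last = None  # most recently added child, whose tail receives text
    pos = 0

    def add_text(chunk):
        # Text before the first child goes in p.text, later text in a tail.
        if last is None:
            p.text = (p.text or "") + chunk
        else:
            last.tail = (last.tail or "") + chunk

    for m in ERROR_RE.finditer(text):
        add_text(text[pos:m.start()])
        err = ET.SubElement(
            p, "error", correct=m.group("correct"), attrs=m.group("attrs")
        )
        err.text = m.group("word")
        last = err
        pos = m.end()
    add_text(text[pos:])  # never truncate: keep everything after the last error
    return p


sample = ('I 1864 ga han ut boka "Fornuftigt Madstel". '
          'Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, '
          'nesten 73 år gammel.')
paragraph = markup_errors(sample)
print(ET.tostring(paragraph, encoding="unicode"))
```

The point of the sketch is only that the text before and after each annotation is carried over in full, so a later processing step still sees the whole paragraph.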
Comment 5153
Date: 2011-09-22 15:08:49 +0200

It is text_cat that removes the error markup. I use this input:

Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

If I remove lang rec (lines 113-116) from convert2xml.pl, the result is:
If I move lang rec in front of error markup (to line 97), the result is:
So, there is something wrong in both text_cat and the error markup function.
Comment 5154
Date: 2011-09-22 15:24:03 +0200

And lang rec without error markup? What is the output then?
Comment 5155
Date: 2011-09-22 15:36:35 +0200

lang rec w/o error markup gives this:

Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.
Comment 5156
Date: 2011-09-22 18:44:35 +0200

How do you know that it is text_cat? text_cat only returns the probable language of the text you feed it; it does not produce any XML. To me it looks like the span conversion code is the culprit. That code does include a call to text_cat, since the language of the span can be different from that of the surrounding paragraph. There is no easy way out of this issue. We can expect to meet all of the following scenarios:
The only thing that will not happen (hopefully) is a span that contains the start of an error whose markup closes after the span (i.e. crossing domains). This is not convertible to XML, and we must ignore such cases. If we do find them, we will have to reconsider the error markup (since it is not representable in XML in any case). I suggest you keep the order as it is now (error markup, then span conversion + text_cat), and make sure that you do not destroy any existing XML elements that could be part of the paragraph text in which you are looking for spans. The pseudocode could be something like the following:
The crucial thing is to save not only the text of each paragraph fragment, but the whole XML structure. But I guess you know all this. Sorry for being so wordy for something that perhaps is obvious.
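To make the advice in this comment concrete, the following Python sketch shows one way to wrap quoted stretches of a paragraph in a span while keeping any elements that already sit inside the paragraph. It is an illustration under stated assumptions, not the code in text_cat or Corpus.pm: the helper names `flatten`, `build` and `mark_quotes`, the `span type="quote"` output and the `error` element in the example are all hypothetical, quotes are assumed to be plain `"` characters, quote characters inside an existing element's own text are not inspected, and the language detection step is left out.

```python
import copy
import xml.etree.ElementTree as ET


def flatten(p):
    """Turn mixed content into a list of str and Element items.
    Elements are copied with their tails detached, so the whole XML
    structure of each fragment is kept, not only its text."""
    items = [p.text or ""]
    for child in p:
        tail = child.tail or ""
        child = copy.deepcopy(child)
        child.tail = None
        items.extend([child, tail])
    return items


def build(tag, items, attrib=None):
    """Rebuild an element from a list of str and Element items."""
    el = ET.Element(tag, dict(attrib or {}))
    last = None
    for item in items:
        if isinstance(item, str):
            if last is None:
                el.text = (el.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        else:
            el.append(item)
            last = item
    return el


def mark_quotes(p, quote='"'):
    """Return a copy of p with each quoted stretch wrapped in a span."""
    out, inside = [], None  # inside: items seen since an opening quote
    for item in flatten(p):
        if not isinstance(item, str):
            (out if inside is None else inside).append(item)
            continue
        for i, part in enumerate(item.split(quote)):
            if i > 0:  # a quote character was crossed
                if inside is None:
                    inside = []  # opening quote
                else:  # closing quote: wrap what was collected
                    out.append(build("span", [quote, *inside, quote],
                                     {"type": "quote"}))
                    inside = None
            (out if inside is None else inside).append(part)
    if inside is not None:  # unbalanced quote: leave the content unwrapped
        out.extend([quote, *inside])
    return build(p.tag, out, p.attrib)


source = ET.fromstring(
    '<p>Han ga ut "Fornuftigt <error correct="Madstell">Madstel</error>" '
    'i 1864.</p>'
)
marked = mark_quotes(source)
print(ET.tostring(marked, encoding="unicode"))
```

Because the paragraph is handled as a sequence of text pieces and whole elements, an existing element that falls inside a quotation is moved into the span intact rather than being flattened to text.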
Comment 5157
Date: 2011-09-22 22:19:52 +0200

text_cat parses an XML document (our XML) in the function read_xml, and adds langs for p's. text_cat also has a function, mark_span, that adds a span around probable quotations and tries to detect the language at the same time. So text_cat both consumes and produces XML. When the error markup is produced first and the result is then run through text_cat, text_cat removes the error markup. When the order is reversed, the error markup code swallows parts of the markup that text_cat produced. The code in both text_cat and the error markup has to be fixed.
Comment 5159
Date: 2011-09-23 14:57:39 +0200

I have now fixed the error markup code in Corpus.pm so that it doesn't truncate the input and swallow tags. Next up is text_cat.
Comment 5160
Date: 2011-09-23 16:46:27 +0200

text_cat was fixed in r46563.
This issue was created automatically with bugzilla2github
Bugzilla Bug 1147
Date: 2011-09-22T07:21:23+02:00
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
To: Børre Gaup <<borre.gaup>>
CC: ciprian.gerstenberger, sjur.n.moshagen, trond.trosterud
Last updated: 2011-09-23T16:46:27+02:00