Spans and error markup do still go badly together in some cases (Bugzilla Bug 1147) #2

albbas · 2011-09-22T05:21:23Z

This issue was created automatically with bugzilla2github

Bugzilla Bug 1147

Date: 2011-09-22T07:21:23+02:00
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
To: Børre Gaup <<borre.gaup>>
CC: ciprian.gerstenberger, sjur.n.moshagen, trond.trosterud

Last updated: 2011-09-23T16:46:27+02:00

albbas · 2011-09-22T05:21:23Z

Comment 5150

Date: 2011-09-22 07:21:23 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

To repeat:

Convert a text document with the following content:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

Given result:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen døde 5. januar 1885, nesten 73 år gammel.

Expected result:

I 1864 ga han ut boka "Fornuftigt Madstel". <errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen</errorort> døde 5. januar 1885, nesten 73 år gammel.

The text cited above can be found in the document:

$GTFREE/goldstandard/orig/nob/ficti/nob-006-asbjmoe.correct.txt

albbas · 2011-09-22T13:08:49Z

Comment 5153

Date: 2011-09-22 15:08:49 +0200
From: Børre Gaup <<borre.gaup>>

It is text_cat that removes the error markup.

I use this input:

Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

If I remove lang rec (lines 113-116) from convert2xml.pl, the result is:

<p>Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". <errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen</errorort> døde 5. januar 1885, nesten 73 år gammel.

If I move lang rec in front of error markup (to line 97), the result is:

<p>. <errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen</errorort> døde 5. januar 1885, nesten 73 år gammel.

So, there is something wrong in both text_cat and the error markup function.
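
For illustration, here is a minimal sketch of an error markup conversion that replaces only the markup itself and leaves the rest of the paragraph untouched. It is written in Python and is not the Perl code in Corpus.pm. The markup shape word$(pos,errtype|correction) and the errorort element are taken from the example above; the regular expression and the function name convert_error_markup are assumptions made for this sketch.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed markup shape, based on the example in this report:
#   word$(pos,errtype|correction)
ERROR_MARKUP = re.compile(
    r"(?P<error>[^\s<>$]+)"      # the misspelled word, no whitespace or tags
    r"\$\((?P<pos>[^,|)]+),"     # part of speech, e.g. prop
    r"(?P<errtype>[^,|)]+)\|"    # error type, e.g. typo
    r"(?P<correct>[^)]+)\)"      # the correction
)


def convert_error_markup(paragraph: str) -> str:
    """Replace each error markup with an <errorort> element.

    Only the matched markup is rewritten; all surrounding text (and any
    tags already present) is passed through unchanged.
    """
    def to_element(match):
        return "<errorort correct={} errtype={} pos={}>{}</errorort>".format(
            quoteattr(match.group("correct")),
            quoteattr(match.group("errtype")),
            quoteattr(match.group("pos")),
            escape(match.group("error")),
        )

    return ERROR_MARKUP.sub(to_element, paragraph)


text = ('I 1864 ga han ut boka "Fornuftigt Madstel". '
        'Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, '
        'nesten 73 år gammel.')
print(convert_error_markup(text))
```

Because the substitution touches only the matched markup, the text in front of the error cannot be truncated, which is the symptom seen when lang rec runs first.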

albbas · 2011-09-22T13:24:03Z

Comment 5154

Date: 2011-09-22 15:24:03 +0200
From: Trond Trosterud <<trond.trosterud>>

And what about lang rec without error markup? What is the output then?

albbas · 2011-09-22T13:36:35Z

Comment 5155

Date: 2011-09-22 15:36:35 +0200
From: Børre Gaup <<borre.gaup>>

lang rec w/o error markup gives this:

Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

albbas · 2011-09-22T16:44:35Z

Comment 5156

Date: 2011-09-22 18:44:35 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

How do you know that it is text_cat? text_cat only returns the probable language of the text you feed it; it does not produce any XML.

To me this looks like it is the span conversion code that is the culprit. This code does include a call to text_cat, since the language of the span can be different from the surrounding paragraph.

There is no easy way out for this issue. We can expect to meet all of the following scenarios:

paragraph with span followed by error markup
paragraph with error markup followed by span
paragraph with several spans with error markup in between
paragraph with several error markups with spans between
spans with error markup inside (a misspelling within the span)
error markup with spans inside (e.g. a syntactic error spanning a full sentence, including some words that should be converted to spans)

The only thing that will (hopefully) not happen is a span that contains the start of an error whose markup closes after the span (i.e. crossing domains). Such a structure cannot be converted to XML, and we must ignore such cases. If we do find them, we will have to reconsider the error markup, since it is not representable in XML in any case.
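
A minimal Python sketch of how such crossing cases could be detected before conversion. It assumes that spans and error markups have already been located and are represented as (start, end) character offsets within the paragraph; that representation and the names crosses and find_crossings are assumptions for illustration, not part of the existing code.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) character offsets, end exclusive


def crosses(a: Range, b: Range) -> bool:
    """True if the two ranges overlap without one containing the other."""
    (first_start, first_end), (second_start, second_end) = sorted([a, b])
    # After sorting, first_start <= second_start. The ranges cross when the
    # second one starts strictly inside the first and ends after it.
    return first_start < second_start < first_end < second_end


def find_crossings(spans: List[Range],
                   errors: List[Range]) -> List[Tuple[Range, Range]]:
    """Return every (span, error) pair that cannot be nested in XML."""
    return [(s, e) for s in spans for e in errors if crosses(s, e)]


# A span at 0-10 and an error at 5-15 cross; nested or disjoint ranges do not.
print(find_crossings([(0, 10)], [(5, 15)]))
print(find_crossings([(0, 10)], [(2, 5), (12, 20)]))
```

Paragraphs for which find_crossings returns a non-empty list would be the ones to skip or report.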

I suggest you keep the order as it is now (error markup, then span conversion + text_cat), and make sure that you do not destroy any existing xml elements that could be part of the paragraph text in which you are looking for spans.

The pseudocode could be something like the following:

look for spans (matching pairs of span identifiers) within a paragraph
if found, store the xml fragment in front of the span (i.e. both text and elements), and do the same for the xml fragment following the span
store the xml fragment within the span
send the text of the span to text_cat
construct a span with the language from text_cat, using the full xml fragment from within the span as the content of the new element
add the preceding xml fragment
process the remaining xml fragment

The crucial thing is to save not only the text of each paragraph fragment, but the whole xml structure.
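
The pseudocode above could be sketched roughly as follows, in Python rather than the Perl of the actual code. Several things are assumptions made only for this sketch: the double quote is used as the span identifier, guess_language is a stand-in for the call to text_cat and returns a fixed value, and the span element gets its language in an xml:lang attribute. Spans are looked for only in the direct content of the paragraph, and an unmatched quote is left as plain text.

```python
import re
import xml.etree.ElementTree as ET

QUOTE = '"'  # assumed span identifier
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang


def guess_language(text):
    """Hypothetical stand-in for text_cat; always answers 'und'."""
    return "und"


def flatten(p):
    """Return the direct content of p as a list of strings and elements."""
    items = [p.text or ""]
    for child in list(p):
        tail, child.tail = child.tail or "", None
        items += [child, tail]
    return items


def rebuild(parent, items):
    """Replace the content of parent with the given strings and elements."""
    parent.text = None
    for child in list(parent):
        parent.remove(child)
    last = None
    for item in items:
        if isinstance(item, str):
            if last is None:
                parent.text = (parent.text or "") + item
            else:
                last.tail = (last.tail or "") + item
        else:
            parent.append(item)
            last = item


def mark_spans(p, guess=guess_language):
    """Wrap quoted stretches of p in <span>, keeping elements inside intact."""
    tokens = []
    for item in flatten(p):
        if isinstance(item, str):
            # The capturing group keeps each quote character as its own token.
            tokens += [t for t in re.split('(")', item) if t]
        else:
            tokens.append(item)
    out, i = [], 0
    while i < len(tokens):
        close = None
        if tokens[i] == QUOTE:
            close = next((k for k in range(i + 1, len(tokens))
                          if tokens[k] == QUOTE), None)
        if close is None:
            out.append(tokens[i])  # ordinary text, element, or unmatched quote
            i += 1
            continue
        span = ET.Element("span")
        rebuild(span, tokens[i:close + 1])  # whole fragment, not only its text
        span.set(XML_LANG, guess("".join(span.itertext())))
        out.append(span)
        i = close + 1
    rebuild(p, out)
    return p


p = ET.fromstring(
    '<p>I 1864 ga han ut boka "Fornuftigt Madstel". '
    '<errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen'
    '</errorort> døde 5. januar 1885.</p>')
print(ET.tostring(mark_spans(p), encoding="unicode"))
```

Only the text of the span is handed to the language guesser, while the span element itself is built from the complete fragment, so an errorort element before, after, or inside a span survives the conversion.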

But I guess you know all this. Sorry for being so wordy for something that perhaps is obvious.

albbas · 2011-09-22T20:19:52Z

Comment 5157

Date: 2011-09-22 22:19:52 +0200
From: Børre Gaup <<borre.gaup>>

text_cat parses an XML document (our XML) in the function read_xml and adds langs to the p elements. text_cat also has a function, mark_span, that adds a span around probable quotations and tries to detect the language at the same time.

So, text_cat both consumes and produces xml.

text_cat removes the error markup: when the error markup is produced first and then run through text_cat, the error markup is removed. If the order is reversed, the error markup code swallows parts of the XML that text_cat produced.

The code in both text_cat and error markup has to be fixed.

albbas · 2011-09-23T12:57:39Z

Comment 5159

Date: 2011-09-23 14:57:39 +0200
From: Børre Gaup <<borre.gaup>>

I have now fixed the error markup code in Corpus.pm so that it no longer truncates the input or swallows tags. Next up is text_cat.

albbas · 2011-09-23T14:46:27Z

Comment 5160

Date: 2011-09-23 16:46:27 +0200
From: Børre Gaup <<borre.gaup>>

text_cat was fixed in r46563

albbas closed this as completed Sep 23, 2011

