Spans and error markup do still go badly together in some cases (Bugzilla Bug 1147) #2

Closed
albbas opened this issue Sep 22, 2011 · 8 comments
Labels
bug Something isn't working major

Comments

albbas commented Sep 22, 2011

This issue was created automatically with bugzilla2github

Bugzilla Bug 1147

Date: 2011-09-22T07:21:23+02:00
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
To: Børre Gaup <<borre.gaup>>
CC: ciprian.gerstenberger, sjur.n.moshagen, trond.trosterud

Last updated: 2011-09-23T16:46:27+02:00

albbas commented Sep 22, 2011

Comment 5150

Date: 2011-09-22 07:21:23 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

To repeat:

Convert a text document with the following content:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

Given result:

I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen døde 5. januar 1885, nesten 73 år gammel.

Expected result:

I 1864 ga han ut boka "Fornuftigt Madstel". <errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen</errorort> døde 5. januar 1885, nesten 73 år gammel.

The text cited above can be found in the document:

$GTFREE/goldstandard/orig/nob/ficti/nob-006-asbjmoe.correct.txt
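
To make the expected conversion concrete, here is a minimal Python sketch. It is not the Perl code in convert2xml.pl or Corpus.pm. It assumes the markup always has the single-word shape WORD$(pos,errtype|correction) seen in the example, and the name convert_error_markup is hypothetical. The element and attribute names are the ones that appear in the converter output quoted in this thread.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed shape of the plain-text error markup, inferred from the example:
# WORD$(POS,ERRTYPE|CORRECTION). Only this single-word form is handled.
ERROR_MARKUP = re.compile(
    r"(?P<error>\w+)\$\((?P<pos>\w+),(?P<errtype>\w+)\|(?P<correct>[^)]+)\)"
)


def convert_error_markup(text: str) -> str:
    """Replace WORD$(pos,errtype|correct) with an <errorort> element.

    Text outside the markup is XML-escaped and otherwise copied unchanged,
    so nothing before or after an error is dropped.
    """
    parts = []
    last = 0
    for m in ERROR_MARKUP.finditer(text):
        # Keep everything between the previous match and this one.
        parts.append(escape(text[last:m.start()]))
        parts.append(
            "<errorort correct=%s errtype=%s pos=%s>%s</errorort>"
            % (
                quoteattr(m.group("correct")),
                quoteattr(m.group("errtype")),
                quoteattr(m.group("pos")),
                escape(m.group("error")),
            )
        )
        last = m.end()
    # Keep the tail after the last match.
    parts.append(escape(text[last:]))
    return "".join(parts)
```

Calling convert_error_markup on the sentence above should leave the quoted book title untouched and wrap only "Asbjørsen" in the element.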

albbas commented Sep 22, 2011

Comment 5153

Date: 2011-09-22 15:08:49 +0200
From: Børre Gaup <<borre.gaup>>

It is text_cat that removes the error markup.

I use this input:

Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

If I remove lang rec (lines 113-116) from convert2xml.pl, the result is:

<p>Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". <errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen</errorort> døde 5. januar 1885, nesten 73 år gammel.

If I move lang rec in front of error markup (to line 97), the result is:

<p>. <errorort correct="Asbjørnsen" errtype="typo" pos="prop">Asbjørsen</errorort> døde 5. januar 1885, nesten 73 år gammel.

So, there is something wrong in both text_cat and the error markup function.

albbas commented Sep 22, 2011

Comment 5154

Date: 2011-09-22 15:24:03 +0200
From: Trond Trosterud <<trond.trosterud>>

... and lang rec without error markup? What is the output then?

albbas commented Sep 22, 2011

Comment 5155

Date: 2011-09-22 15:36:35 +0200
From: Børre Gaup <<borre.gaup>>

lang rec w/o error markup gives this:

Asbjørnsen ble forstmester i Trøndelag 1860-64, leder av statens torvdriftsundersøkelse 1864-76. I 1863 begynte han å skrive noen bøker om skogskjøtsel og torvdrift. I 1864 ga han ut boka "Fornuftigt Madstel". Asbjørsen$(prop,typo|Asbjørnsen) døde 5. januar 1885, nesten 73 år gammel.

albbas commented Sep 22, 2011

Comment 5156

Date: 2011-09-22 18:44:35 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

How do you know that it is text_cat? text_cat only returns the probable language of the text you feed it; it does not produce any XML.

To me this looks like it is the span conversion code that is the culprit. This code does include a call to text_cat, since the language of the span can be different from the surrounding paragraph.

There is no easy way out for this issue. We can expect to meet all of the following scenarios:

  • paragraph with span followed by error markup
  • paragraph with error markup followed by span
  • paragraph with several spans with error markup in between
  • paragraph with several error markups with spans between
  • spans with error markup inside (a misspelling within the span)
  • error markup with spans inside (e.g. a syntactic error spanning a full sentence, including some words that should be converted to spans)

The only thing that (hopefully) will not happen is a span containing the start of an error whose markup closes after the span (i.e. crossing domains). Such a case cannot be converted to XML, and we must ignore it. If we do find such cases, we will have to reconsider the error markup (since it is not representable in XML in any case).
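
A crossing case like this could be detected with a simple stack over the open and close tags. The following is only a Python sketch under assumptions: it presumes the span and error markup have already been turned into tag-like markers, the function name find_crossing is hypothetical, and the tag regex is deliberately naive (it does not cope with ">" inside attribute values).

```python
import re

# Naive tag matcher: optional "/" for close tags, the element name,
# any attributes, and an optional "/" for self-closing tags.
TAG = re.compile(r"<(/?)(\w+)[^>]*?(/?)>")


def find_crossing(xml_fragment: str):
    """Return (innermost_open_tag, offending_close_tag) for the first
    crossing pair of elements, or None if the tags nest properly.

    Only crossing is reported; tags left unclosed at the end are ignored.
    """
    stack = []
    for m in TAG.finditer(xml_fragment):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if self_closing:
            continue  # <foo/> neither opens nor closes anything
        if not closing:
            stack.append(name)
        elif not stack or stack[-1] != name:
            # The close tag does not match the innermost open element.
            return (stack[-1] if stack else None, name)
        else:
            stack.pop()
    return None
```

A paragraph for which this returns a pair would be one of the cases to ignore (or to report so that the error markup can be reconsidered).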

I suggest you keep the order as it is now (error markup, then span conversion + text_cat), and make sure that you do not destroy any existing xml elements that could be part of the paragraph text in which you are looking for spans.

The pseudocode could be something like the following:

  1. look for spans (matching pairs of span identifiers) within a paragraph
  2. if found, store the xml fragment in front of the span (i.e. both text and elements), and do the same for the text following the span
  3. store the xml fragment within the span
  4. send the text of the span to text_cat
  5. construct a span with the language from text_cat, using the full xml fragment from within the span as the content of the new element
  6. add the preceding xml fragment
  7. process the remaining xml fragment

The crucial thing is to save not only the text of each paragraph fragment, but the whole xml structure.
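
The steps above can be sketched roughly as follows. This is a Python illustration, not the actual span conversion code. It assumes that the span identifiers are plain double quotes, that the language is stored in an xml:lang attribute, and that guess_language is a hypothetical stand-in for the call to text_cat. The names mark_spans and quote_positions are likewise made up for the example.

```python
import re
from typing import Callable, List

# A tag is matched as a whole, so quotes inside attribute values
# (e.g. correct="...") are never mistaken for span identifiers.
TAG_OR_QUOTE = re.compile(r'<[^>]+>|"')
TAG = re.compile(r"<[^>]+>")


def quote_positions(fragment: str) -> List[int]:
    """Offsets of double quotes that occur in text, not inside a tag."""
    return [m.start() for m in TAG_OR_QUOTE.finditer(fragment) if m.group() == '"']


def mark_spans(fragment: str, guess_language: Callable[[str], str]) -> str:
    """Wrap each quoted stretch in a <span>, keeping all existing elements."""
    quotes = quote_positions(fragment)          # step 1: find matching pairs
    out = []
    last = 0
    # Pair the quotes left to right; an unmatched final quote is left alone.
    for start, end in zip(quotes[0::2], quotes[1::2]):
        inner = fragment[start:end + 1]         # step 3: XML inside the span
        text_only = TAG.sub("", inner).strip('"')
        lang = guess_language(text_only)        # step 4: text only to text_cat
        out.append(fragment[last:start])        # steps 2 and 6: preceding XML
        out.append('<span xml:lang="%s">%s</span>' % (lang, inner))  # step 5
        last = end + 1
    out.append(fragment[last:])                 # step 7: remaining XML
    return "".join(out)
```

Because the fragments are sliced out of the original string rather than rebuilt from text, an errorort element before, after, or inside a quotation survives unchanged. The sketch does not check for crossing elements, so a quotation that contains only one half of an element would still produce ill-formed output.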

But I guess you know all this. Sorry for being so wordy for something that perhaps is obvious.

albbas commented Sep 22, 2011

Comment 5157

Date: 2011-09-22 22:19:52 +0200
From: Børre Gaup <<borre.gaup>>

text_cat parses an xml document (our xml) in the function read_xml, and adds langs for p's. text_cat also has a function, mark_span, that adds a span around probable quotations, and it tries to detect the language at the same time.

So, text_cat both consumes and produces xml.

text_cat removes the error markup: when the error markup is produced first and the result is then run through text_cat, the error markup is removed. If the order is turned around, the error markup code swallows parts of the markup that text_cat produced.

The code in both text_cat and error markup has to be fixed.

albbas commented Sep 23, 2011

Comment 5159

Date: 2011-09-23 14:57:39 +0200
From: Børre Gaup <<borre.gaup>>

I have now fixed the error markup code in Corpus.pm so that it doesn't truncate the input and swallow tags. Next up is text_cat.

albbas commented Sep 23, 2011

Comment 5160

Date: 2011-09-23 16:46:27 +0200
From: Børre Gaup <<borre.gaup>>

text_cat was fixed in r46563
