Fix treebank detokenizer #2575

orsharir · 2020-07-28T20:54:49Z

Fix a few issues with how the Treebank's word detokenizer mishandled quotes, punctuation marks, and parentheses, including #2295 and #2220.

Currently, the detokenizer doesn’t handle quotes well. Signed-off-by: Or Sharir <[email protected]>

Python’s double-quote raw strings cannot represent a single double-quote. Use single-quote raw strings instead (though a regular string would suffice too). Signed-off-by: Or Sharir <[email protected]>
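The raw-string point can be checked with a few lines of plain Python. This is only an illustrative sketch, not code from the PR; the variable names are made up for the example.

```python
import re

double_raw = r"\""   # two characters: a backslash and a double quote
single_raw = r'"'    # one character: the double quote itself
regular = "\""       # a regular string gives the same single character

print(len(double_raw), len(single_raw), single_raw == regular)  # 2 1 True

# Used as a re.sub replacement, the extra backslash leaks into the output.
print(re.sub("``", double_raw, "``hi"))  # \"hi
print(re.sub("``", single_raw, "``hi"))  # "hi
```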

* `` and '' should always be replaced with double quotes, regardless of spaces.
* Starting quotes should remove padding just on the right.
* Ending quotes should remove padding just on the left.
* The way the tokenizer works, you first convert starting/ending quotes to their respective symbols and then add padding as needed. The detokenizer should work the same way and replace the special tokens back to double quotes only at the last regex.

Signed-off-by: Or Sharir <[email protected]>
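The ordering described above can be sketched with three substitutions. This is a minimal illustration using only `re`, under the assumption that the input is a single space-joined string; `detokenize_quotes` is a hypothetical helper, not the regexes in `TreebankWordDetokenizer`.

```python
import re

def detokenize_quotes(text):
    """Hypothetical helper: undo Treebank-style quote tokens in a string."""
    # Starting quotes: drop the padding on the right only.
    text = re.sub(r"``\s*", "``", text)
    # Ending quotes: drop the padding on the left only.
    text = re.sub(r"\s*''", "''", text)
    # Last step: turn both special tokens back into a plain double quote.
    text = re.sub(r"``|''", '"', text)
    return text

print(detokenize_quotes("She said `` hello world ''"))  # She said "hello world"
```

Keeping the special tokens until the final substitution means the padding rules can still tell a starting quote from an ending one, which a bare `"` would not allow.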

The tokenizer checks for non-digits because we don’t want to add padding inside numbers and time formats (e.g., “1,000” or “9:30”). However, the detokenizer’s input will never have a colon/comma with spaces around if it was part of a number/time string. Thus, at this last point we just need to remove left padding in all cases, instead of checking for non-digit or whitespace (to handle cases where there’s a comma followed by double-quotes). Signed-off-by: Or Sharir <[email protected]>
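A small sketch of the difference, assuming two simplified patterns (`STRICT` and `RELAXED` are illustrative names and do not reproduce the library's actual regexes):

```python
import re

# Assumed simplification: STRICT needs whitespace on both sides of the mark,
# RELAXED only needs the left padding.
STRICT = re.compile(r"\s([:,])\s")
RELAXED = re.compile(r"\s([:,])")

text = '"Wait ," she said at 9:30 , then left'

# The comma directly followed by a double quote is missed by STRICT.
print(STRICT.sub(r"\1 ", text))   # "Wait ," she said at 9:30, then left
# RELAXED fixes both commas and leaves 9:30 alone, since it has no padding.
print(RELAXED.sub(r"\1", text))   # "Wait," she said at 9:30, then left
```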

In most cases we shouldn’t match whitespace that we don’t remove, because requiring it prevents matching at the end/start of a line, e.g., when a line ends on a right parenthesis we should still remove the left padding. Signed-off-by: Or Sharir <[email protected]>
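The end/start-of-line issue can be reproduced with two hypothetical helpers. Both are sketches for illustration only; the patterns are assumptions and not the ones in the NLTK source.

```python
import re

def fix_parens_strict(text):
    # Assumed "before" behaviour: whitespace is required on both sides.
    text = re.sub(r"\s([\[({<])\s", r" \1", text)
    text = re.sub(r"\s([\])}>])\s", r"\1 ", text)
    return text

def fix_parens(text):
    # Only match the padding that is actually removed.
    text = re.sub(r"([\[({<])\s", r"\1", text)
    text = re.sub(r"\s([\])}>])", r"\1", text)
    return text

line = "( see page 3 )"
print(fix_parens_strict(line))  # ( see page 3 )
print(fix_parens(line))         # (see page 3)
```

The strict version leaves the line unchanged because there is no whitespace before the opening parenthesis at the start of the line or after the closing one at the end.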

(I keep them in code as a reminder for the reversed order in the tokenizer) Signed-off-by: Or Sharir <[email protected]>

stevenbird · 2020-12-12T21:05:41Z

Thanks @orsharir. Sorry for the delay.

orsharir added 11 commits July 28, 2020 21:59

Add tests for TreebankWordDetokenizer

27ab10f

Currently, the detokenizer doesn’t handle quotes well. Signed-off-by: Or Sharir <[email protected]>

Fix quotes substitution

3388385

Python’s double-quote raw strings cannot represent a single double-quote. Use single-quote raw strings instead (though a regular string would suffice too). Signed-off-by: Or Sharir <[email protected]>

Add my name to author’s list

ed83735

Signed-off-by: Or Sharir <[email protected]>

Added more tests for some edge cases.

50889af

Signed-off-by: Or Sharir <[email protected]>

Fix case of quotes followed by punctuation

2875824

Signed-off-by: Or Sharir <[email protected]>

Comment out unneeded lines

f7e971f

(I keep them in code as a reminder for the reversed order in the tokenizer) Signed-off-by: Or Sharir <[email protected]>

Remove redundant + sign. Correct comment.

ae7862a

Signed-off-by: Or Sharir <[email protected]>

Add test for parentheses at end/start of line.

654c65e

Signed-off-by: Or Sharir <[email protected]>

orsharir mentioned this pull request Aug 9, 2020

Bookcorpus data contains pretokenized text huggingface/datasets#486

Closed

Merge branch 'develop' into fix_treebank_detokenizer

fa40753

stevenbird merged commit 99ba8db into nltk:develop Dec 12, 2020
