Fix treebank detokenizer #2575

Merged
merged 12 commits into from
Dec 12, 2020

Conversation

orsharir
Contributor

Fix a few issues with how the Treebank's word detokenizer mishandled quotes, punctuation marks, and parentheses, including #2295 and #2220.

orsharir added 11 commits July 28, 2020 21:59
Currently, the detokenizer doesn’t handle quotes well.

Signed-off-by: Or Sharir <[email protected]>
Python’s double-quote raw strings cannot represent a single double-quote. Use single-quote raw strings instead (though a regular string would suffice too).

Signed-off-by: Or Sharir <[email protected]>
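The raw-string point above can be checked directly. This is a minimal sketch; the variable names are chosen here for illustration and do not come from the NLTK source.

```python
# In a double-quoted raw string, the backslash needed to escape the
# quote stays in the value, so the result is two characters long.
double_quoted_raw = r"\""

# A single-quoted raw string (or a regular string) yields just the quote.
single_quoted_raw = r'"'
regular = "\""

print(len(double_quoted_raw), len(single_quoted_raw), len(regular))
```

The first length is 2 (backslash plus quote), while the other two are 1, which is why the single-quoted form is the safer choice inside a regex replacement.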
* `` and '' should always be replaced with double quotes, regardless of spaces.
* Starting quotes should remove padding just on the right.
* Ending quotes should remove padding just on the left.
* The tokenizer first converts starting/ending quotes to their respective symbols and then adds padding as needed. The detokenizer should work the same way in reverse, replacing the special tokens with double quotes only in the last regex.

Signed-off-by: Or Sharir <[email protected]>
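The ordering described in the list above can be sketched as follows. This is not the regex list used in NLTK; `detokenize_quotes` is a hypothetical helper written only to illustrate the three rules and the "convert last" step.

```python
import re


def detokenize_quotes(text: str) -> str:
    """Hypothetical helper: undo Treebank-style `` and '' quote tokens."""
    # Starting quotes: remove padding on the right only.
    text = re.sub(r"`` +", "``", text)
    # Ending quotes: remove padding on the left only.
    text = re.sub(r" +''", "''", text)
    # Last step: replace both special tokens with a plain double quote,
    # whether or not any spaces surrounded them.
    text = re.sub(r"``|''", '"', text)
    return text


print(detokenize_quotes("He said , `` hello there '' ."))
```

Keeping the special tokens until the final substitution means the earlier rules can still tell an opening quote from a closing one, which is no longer possible once both are a plain `"`.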
The tokenizer checks for non-digits because we don’t want to add padding inside numbers and time formats (e.g., “1,000” or “9:30”). However, the detokenizer’s input will never have a colon/comma with spaces around if it was part of a number/time string. Thus, at this last point we just need to remove left padding in all cases, instead of checking for non-digit or whitespace (to handle cases where there’s a comma followed by double-quotes).

Signed-off-by: Or Sharir <[email protected]>
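A small sketch of the difference, under the assumption that the padding is a run of plain spaces. Both patterns below are illustrative and are not copied from NLTK; the "strict" one stands in for any rule that inspects the character after the punctuation mark.

```python
import re

# Illustrative patterns (assumptions, not the NLTK originals).
# Strict: only strip the left padding if whitespace or end of string follows.
STRICT = re.compile(r" +([,:])(?=\s|$)")
# Unconditional: always strip the left padding before a comma or colon.
UNCONDITIONAL = re.compile(r" +([,:])")


def strip_left_padding(text: str, pattern: re.Pattern = UNCONDITIONAL) -> str:
    """Remove the spaces that precede a comma or colon."""
    return pattern.sub(r"\1", text)


sample = 'Yes , "really" : at 9:30 , 1,000 people'
print(strip_left_padding(sample))

# A comma directly followed by a double quote defeats the strict rule.
print(strip_left_padding('done ,"', STRICT))
print(strip_left_padding('done ,"'))
```

Numbers and times such as `1,000` and `9:30` are untouched by either pattern because they never contain a space before the comma or colon.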
In most cases we shouldn’t match whitespace that we don’t intend to remove, because requiring it prevents a match at the start or end of a line. For example, when a line ends with a right parenthesis, we should still remove its left padding.

Signed-off-by: Or Sharir <[email protected]>
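The end-of-line case can be reproduced with two illustrative patterns. Neither is taken from NLTK, and `detokenize_parens` is a hypothetical helper.

```python
import re

line = "( see above )"

# Requiring a whitespace character after ")" means the pattern cannot
# match when ")" is the last character of the line.
with_trailing_ws = re.sub(r" +\)\s", ") ", line)

# Matching only the padding that is actually removed works everywhere.
left_only = re.sub(r" +\)", ")", line)


def detokenize_parens(text: str) -> str:
    """Hypothetical helper: strip padding inside parentheses."""
    text = re.sub(r"\( +", "(", text)  # right padding of "("
    text = re.sub(r" +\)", ")", text)  # left padding of ")"
    return text


print(repr(with_trailing_ws))
print(repr(left_only))
print(repr(detokenize_parens(line)))
```

The first result is unchanged from the input, which is the failure the commit message describes.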
(I keep them in code as a reminder for the reversed order in the tokenizer)

Signed-off-by: Or Sharir <[email protected]>
@stevenbird stevenbird merged commit 99ba8db into nltk:develop Dec 12, 2020
@stevenbird
Member

Thanks @orsharir. Sorry for the delay.

2 participants