Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed regex for trimming multiple spaces #58

Merged
merged 2 commits into from
Aug 26, 2015

Conversation

aqum
Copy link
Contributor

@aqum aqum commented Aug 18, 2015

\s matches for any whitespace including line breaks. This removed new lines when there was two or more and broke preserveLineBreaks option.

@dbashford
Copy link
Owner

Could you add a test for this?

@aqum
Copy link
Contributor Author

aqum commented Aug 19, 2015

Sure, test added for whole function.

But what is the reason behind replaceTextChars? Are other extractors behave in the same way and this is only for normalization? NON_ASCII_CHARS regex removes most of non-english letters and makes tool useless in other languages. F.e. in Polish there are ąśćźńóę and they all gets stripped out.

@dbashford
Copy link
Owner

It may be that replaceTextChars is no longer necessary given the work that takes place in cleaseText. I'll need some time to play with it and to get some tests in places to make sure removing it doesn't regress anything.

The whitelist will preserve the letters you are talking about.

@dbashford
Copy link
Owner

Going to merge this and repub. But when I find time will need to track down whether or not replaceTextChars is necessary. See #60.

dbashford added a commit that referenced this pull request Aug 26, 2015
Fixed regex for trimming multiple spaces
@dbashford dbashford merged commit b4e22e1 into dbashford:master Aug 26, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants