Update pdf.py PageObject.extractText() #334
These changes allow for an optional text separator for the TJ and Tj operators. The source alterations were originally suggested by DSM on Stack Overflow at http://stackoverflow.com/questions/11017379/pypdf-ignores-newlines-in-pdf-file. I'm just passing along the good suggestion in hopes that the change may become standard in some future version.
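To illustrate the idea, here is a minimal, self-contained sketch of what an optional separator for text-showing operators could look like. It is not the actual patch to pdf.py: the function name `join_shown_text`, the `separator` parameter, and the `(operands, operator)` tuple layout are assumptions made for this example, and the operators are plain Python strings rather than parsed PDF objects.

```python
def join_shown_text(operations, separator=""):
    """Concatenate text shown by Tj and TJ, appending `separator` after each.

    `operations` is a hypothetical list of (operands, operator) tuples that
    stands in for a parsed content stream.
    """
    parts = []
    for operands, operator in operations:
        if operator == "Tj":
            # Tj shows a single string.
            parts.append(operands[0])
            parts.append(separator)
        elif operator == "TJ":
            # TJ takes an array mixing strings and numeric kerning offsets;
            # only the strings contribute text.
            for item in operands[0]:
                if isinstance(item, str):
                    parts.append(item)
            parts.append(separator)
    return "".join(parts)


# Hypothetical content stream: one Tj, one TJ with a kerning offset, then ET.
ops = [
    (["Name:"], "Tj"),
    ([["Jo", -20, "hn"]], "TJ"),
    ([], "ET"),
]

print(repr(join_shown_text(ops)))        # no separator
print(repr(join_shown_text(ops, "\n")))  # newline after each text operator
```

With the default empty separator the two text runs are fused together, which is the behaviour the linked Stack Overflow question complains about; passing `"\n"` keeps each shown string on its own line.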
Do you have an example where something other than a single whitespace would be desired?
By the way: sorry that it took so long to react! I do realize that you probably don't even remember this PR. Also, don't worry about the failing tests; that is expected for this PR.
Yeah, this was a while ago... OK, I resurrected the project I was working on. I was trying to extract text from some form-formatted PDF pages which had newlines separating the text I was interested in, so I used
Features:
- Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (#612)
- Add whitespace between words for extractText() (#569, #334)
- "invalid escape sequence" SyntaxError (#522)
- Avoid error when printing warning in pythonw (#486)
- Stream operations can be List or Dict (#665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):
- Add GitHub Action which automatically runs unit tests via pytest and static code analysis with Flake8 (#660)
- Add several unit tests (#661, #663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (#288)
- Use setuptools instead of distutils (#599)
- Improvements for the PyPI page (#644)
- Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0