Update pdf.py PageObject.extractText() #334

Merged
merged 3 commits into from
Apr 7, 2022

jusdino
Contributor

@jusdino jusdino commented Mar 19, 2017

These changes allow for an optional text separator for TJ and Tj operators.

These source alterations were originally suggested in StackOverflow at:
http://stackoverflow.com/questions/11017379/pypdf-ignores-newlines-in-pdf-file
by DSM

I'm just passing along the good suggestion in hopes that the change may become standard in some future version.

jusdino and others added 2 commits March 19, 2017 10:41
@MartinThoma MartinThoma added PdfReader The PdfReader component is affected Feature labels Apr 6, 2022
@MartinThoma
Member

Do you have an example where something other than a single whitespace would be desired?

@MartinThoma
Member

By the way: Sorry that it took so long to react! I do realize that you probably don't even remember this PR.

Also: Don't worry about the failing tests; that is expected for this PR.

@jusdino
Contributor Author

jusdino commented Apr 7, 2022

Yeah, this was a while ago... Ok, I resurrected the project I was working on.

I was trying to extract text from some form-formatted PDF pages in which newlines separated the text I was interested in, so I used `page.extractText(Tj_sep='\n')` to get it organized the way I needed.
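To illustrate the idea behind that call, here is a minimal sketch of joining text from Tj and TJ operators with an optional separator. It is a toy model, not PyPDF2's code: the function `join_text_operations`, its parameters `tj_sep` and `tj_array_sep`, and the list-of-tuples stand-in for a content stream are all assumptions made for this example.

```python
# Toy model of a PDF content stream: a list of (operator, operand) pairs.
# This is NOT PyPDF2's implementation; every name here is hypothetical.

def join_text_operations(operations, tj_sep="", tj_array_sep=" "):
    """Concatenate text shown by Tj and TJ operators.

    A separator is appended after each operator's text, so passing
    tj_sep="\n" puts every Tj string on its own line.
    """
    parts = []
    for operator, operand in operations:
        if operator == "Tj":
            # Tj shows a single string.
            parts.append(operand + tj_sep)
        elif operator == "TJ":
            # TJ takes an array mixing strings with numeric spacing
            # adjustments; only the strings contribute text.
            text = "".join(item for item in operand if isinstance(item, str))
            parts.append(text + tj_array_sep)
    return "".join(parts)


# Hypothetical operations resembling a form-style page.
operations = [
    ("Tj", "Name:"),
    ("Tj", "Jane Doe"),
    ("TJ", ["Da", -20, "te:"]),
]

print(repr(join_text_operations(operations)))
print(repr(join_text_operations(operations, tj_sep="\n")))
```

With the default empty separator the two Tj strings run together, while a newline separator keeps each form field on its own line, which is the behavior described above.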

@MartinThoma MartinThoma merged commit 12c7047 into py-pdf:master Apr 7, 2022
MartinThoma added a commit that referenced this pull request Apr 7, 2022
Features:

 - Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):

 - Fix formatWarning for filename without slash (#612)
 - Add whitespace between words for extractText() (#569, #334)
 - "invalid escape sequence" SyntaxError (#522)
 - Avoid error when printing warning in pythonw (#486)
 - Stream operations can be List or Dict (#665)

Documentation (DOC):

 - Added Scripts/pdf-image-extractor.py
 - Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):

 - Add GitHub Action which automatically runs unit tests via pytest and
   static code analysis with Flake8 (#660)
 - Add several unit tests (#661, #663)
 - Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):

 - Pre commit: Developers can now `pre-commit install` to avoid tiny issues
               like trailing whitespace

Miscellaneous:

 - Add the LICENSE file to the distributed packages (#288)
 - Use setuptools instead of distutils (#599)
 - Improvements for the PyPI page (#644)
 - Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0
@jusdino jusdino deleted the patch-1 branch April 8, 2022 02:32
@MartinThoma MartinThoma added is-feature A feature request and removed Feature labels Jun 26, 2022