New line character missing and URLs adding periods and space #1974

Closed
AlexNguyen124 opened this issue Jul 17, 2023 · 2 comments
Labels
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.

Comments

AlexNguyen124 commented Jul 17, 2023

I have two issues to report. I'm not sure whether these are bugs or features.

First, words at the end of a line are often concatenated with the words at the beginning of the next line.
For example, I ran pypdf on the following PDFs (the same occurs in other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf

In the first few lines of the output for the Air Canada report, we see:

Citizens of the World
2021 Corporate Sustainability ReportCitizens of the World 2021
Corporate Sustainability Report 2Contents
INTRODUCTION  3
 —About our report 3
• Reporting framework 4
• Third-party assurance 4
 —Corporate sustainability at Air Canada 5

Immediately, there are a few inaccuracies:

  • 2nd line: "Report" and "Citizens" should be separated
  • 3rd line: "2" and "Contents" should be separated

The page we are trying to convert has many columns, and I suspect a newline character is missing.

Second, spaces are added to URLs. Consider what I found in the output:
"www. aircanada. com/ citizensoftheworld"
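Until the extraction itself is fixed, one possible workaround is to post-process the extracted text. The sketch below is not part of pypdf: `repair_urls` is a hypothetical helper that removes a single space directly after a "." or "/" inside anything that starts with `www.` or `http(s)://`, but only when a lowercase letter or digit follows. That heuristic is an assumption and can still merge a URL with a following lowercase word after a trailing period.

```python
import re

# Hypothetical helper, not a pypdf API: marks where a URL-like token begins.
_URL_START = re.compile(r"(?:https?://|www\.)", re.IGNORECASE)


def repair_urls(text: str) -> str:
    """Remove stray single spaces after '.' or '/' inside URL-like tokens."""
    out = []
    pos = 0
    for match in _URL_START.finditer(text):
        if match.start() < pos:
            continue  # this start lies inside a URL we already consumed
        out.append(text[pos:match.start()])
        i = match.start()
        url = []
        while i < len(text):
            ch = text[i]
            if ch.isspace():
                follows_separator = bool(url) and url[-1] in "./"
                has_next = i + 1 < len(text)
                # Skip the space only if it follows "." or "/" and is
                # followed by a lowercase letter or a digit.
                if (ch == " " and follows_separator and has_next
                        and (text[i + 1].islower() or text[i + 1].isdigit())):
                    i += 1
                    continue
                break  # any other whitespace ends the URL
            url.append(ch)
            i += 1
        out.append("".join(url))
        pos = i
    out.append(text[pos:])
    return "".join(out)


print(repair_urls("www. aircanada. com/ citizensoftheworld"))
```

A capitalized word after a period is treated as the start of a new sentence, so "www. aircanada. com. Next" keeps the space before "Next".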

I hope this helps.

Environment

Google Colab

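import os
from pypdf import PdfReader

# path_to_pdf, txt_path, and fname are defined earlier in the notebook
doc = PdfReader(path_to_pdf)
path_to_txt = os.path.join(txt_path, "pypdf", fname) + ".txt"
print(path_to_txt)
# Concatenate the extracted text of every page
text = ""
for page in doc.pages:
    text += page.extract_text()
# Write the combined text to the output file
with open(path_to_txt, "w") as out:
    out.write(text)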
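One thing worth ruling out: the loop above appends each page's text with no separator, so the last word of one page is glued to the first word of the next. Whether that explains "ReportCitizens" is only a guess, and it would not explain concatenation inside a single page. A minimal sketch of joining pages with a newline, where `join_pages` is a hypothetical helper and not a pypdf function:

```python
def join_pages(page_texts, separator="\n"):
    """Join per-page texts so that pages never run into each other."""
    # Strip trailing newlines first so the separator is not doubled.
    return separator.join((t or "").rstrip("\n") for t in page_texts)


# With pypdf this would be used roughly as:
#   text = join_pages(page.extract_text() for page in doc.pages)
sample = ["2021 Corporate Sustainability Report", "Citizens of the World 2021"]
print(join_pages(sample))
```

If the fused words still appear after this change, the missing newline comes from the extraction within a page rather than from the page loop.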
MartinThoma added the workflow-text-extraction and whitespace labels on Jul 18, 2023

MartinThoma (Member) commented

All whitespace issues are notoriously hard to deal with. This might not get resolved any time soon (or not at all).

stefan6419846 (Collaborator) commented

According to a comment on #2882, this has just been fixed.
