New line character missing and URLs adding periods and space #1974

AlexNguyen124 · 2023-07-17T23:00:44Z

Two issues to report. I'm not sure whether these are bugs or features.

First, words at the end of a line are often concatenated with the first word of the next line.
For example:
I used pypdf on the following PDFs (the same occurs in other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf

In the first few lines of the output for the Air Canada report we see:

Citizens of the World
2021 Corporate Sustainability ReportCitizens of the World 2021
Corporate Sustainability Report 2Contents
INTRODUCTION  3
 —About our report 3
• Reporting framework 4
• Third-party assurance 4
 —Corporate sustainability at Air Canada 5

Immediately, there are a few inaccuracies:

2nd line: "Report" and "Citizens" should be separated
3rd line: "2" and "Contents" should be separated

The page we are trying to convert has many columns, and I suspect a newline character is missing.

Second, spaces are added to URLs. Consider what I found in the output:
"www. aircanada. com/ citizensoftheworld"
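Until the extraction itself handles this, a post-processing step can rejoin such URLs. The following is only a minimal sketch: `repair_url_spaces` is a hypothetical helper, not part of pypdf, and it assumes the stray spaces always follow a `.` or `/` and precede a lowercase letter or digit, as in the example above. It can wrongly merge a URL that ends a sentence with a following lowercase word.

```python
import re

# A URL start, then any run of non-space characters, also allowing a single
# space when it directly follows "." or "/" and precedes a lowercase
# letter or digit (the pattern seen in the extracted text).
_URL_WITH_GAPS = re.compile(
    r"(?:https?://|www\.)(?:[^\s]|(?<=[./]) (?=[a-z0-9]))+"
)


def repair_url_spaces(text: str) -> str:
    """Remove spaces inside URL-like spans; leave all other text untouched."""
    return _URL_WITH_GAPS.sub(lambda m: m.group(0).replace(" ", ""), text)


sample = "See www. aircanada. com/ citizensoftheworld for details"
repaired = repair_url_spaces(sample)
print(repaired)
```

Restricting the repair to spans that start with `www.` or `http(s)://` keeps ordinary sentence spacing elsewhere in the text intact.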

I hope this helps.

Environment

Google Colab

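import os

from pypdf import PdfReader

# path_to_pdf, txt_path and fname are defined elsewhere (not shown)
doc = PdfReader(path_to_pdf)
text = ""
path_to_txt = os.path.join(txt_path, "pypdf", fname) + ".txt"
print(path_to_txt)
for page in doc.pages:
    # page texts are appended back to back, with no separator between pages
    text += page.extract_text()
with open(path_to_txt, "w") as out:  # create a text output
    out.write(text)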
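One thing worth ruling out: the loop above appends each page's text directly to the previous one, so the last line of one page and the first line of the next are fused regardless of what the extraction does. The sketch below shows the effect with made-up page strings. The page split shown is an assumption for illustration, not verified against the PDF, and `join_page_texts` is a hypothetical helper.

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page strings with an explicit separator between pages."""
    return separator.join(text.rstrip("\n") for text in page_texts)


# Hypothetical per-page results, standing in for page.extract_text()
pages = [
    "Citizens of the World\n2021 Corporate Sustainability Report",
    "Citizens of the World 2021\nCorporate Sustainability Report 2",
    "Contents",
]

fused = "".join(pages)              # what `text += page.extract_text()` builds
separated = join_page_texts(pages)  # the same pages with a newline in between

print(fused)
print("---")
print(separated)
```

With a real document, the equivalent would be to pass `[page.extract_text() for page in doc.pages]` to the helper. If the fused words persist within a single page's text, the cause lies in the extraction rather than in the loop.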


MartinThoma · 2023-07-18T10:18:48Z

All whitespace issues are notoriously hard to deal with. This might not get resolved any time soon (or not at all).

stefan6419846 · 2024-10-03T13:14:59Z

According to #2882 (comment), this has just been fixed.

MartinThoma added the workflow-text-extraction and whitespace labels Jul 18, 2023

MartinThoma mentioned this issue Jul 29, 2023

ENH: Extract LaTeX characters #2016

Merged

ssjkamei mentioned this issue Oct 2, 2024

BUG: Issue in text extraction (spaces) (#1153) #2882

Merged

stefan6419846 closed this as completed Oct 3, 2024
