New line character missing and URLs adding periods and space #1974
Labels
whitespace
While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction
From a user's perspective, text extraction is the affected feature/workflow
I have two issues to report. I am not sure whether these are bugs or features.
First, words at the end of a line are often concatenated with the words at the beginning of the next line.
For example:
I used pypdf on the following PDFs (the same thing occurs in other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf
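A minimal reproduction might look like the sketch below. It assumes the PDF has already been downloaded to a local path and that the first page is the one of interest; the helper name `extract_page_text` and both of those choices are illustrative assumptions, since the report does not state them.

```python
def extract_page_text(pdf_path: str, page_index: int = 0) -> str:
    """Return the text pypdf extracts from one page of a local PDF.

    pdf_path and page_index are assumptions for illustration; the
    report does not say which page or local filename was used.
    """
    # Imported here so the helper can be defined even where pypdf
    # is not installed; calling it still requires pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return reader.pages[page_index].extract_text()
```

Calling `extract_page_text("2021-cs-report.pdf")` on a downloaded copy and printing the result should make it possible to compare the extracted text against the page as rendered.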
In the first few lines of the output we see:
Immediately, there are a few inaccuracies:
The page we are trying to convert has many columns, and I suspect a newline character is missing.
Second, spaces are added to URLs. Consider what I found in the output:
"www. aircanada. com/ citizensoftheworld"
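Until the extraction itself is fixed, one possible workaround is to post-process the extracted text. The sketch below is a heuristic and not part of pypdf: the function name `repair_urls` and the rule it applies (remove spaces inside a lowercase run that starts with `www.` and is joined by `.` or `/`) are assumptions based only on the single example above.

```python
import re

# A URL-like run: "www." followed by lowercase tokens that are joined by
# "." or "/", each optionally preceded by one spurious space.
# Tokens are restricted to lowercase letters, digits and hyphens so that a
# capitalised word starting the next sentence is not swallowed.
_SPACED_URL = re.compile(r"\bwww\.(?: ?[a-z0-9-]+[./])* ?[a-z0-9-]+")


def repair_urls(text: str) -> str:
    """Remove spaces inside URL-like runs that start with 'www.'."""
    return _SPACED_URL.sub(lambda m: m.group(0).replace(" ", ""), text)


print(repair_urls("www. aircanada. com/ citizensoftheworld"))
```

This is deliberately conservative, and it can still misfire: if a URL ends a sentence and the next word is lowercase, that word would be fused onto the URL.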
I hope this helps.
Environment
Google Colab