After a pdfplumber program is packaged into an exe (py2exe), the content of some PDFs is not recognized #615

Closed
StruggleYang opened this issue Mar 2, 2022 · 5 comments
@StruggleYang

StruggleYang commented Mar 2, 2022

Describe the bug

  • When I run my pdfplumber program from source in an IDE (PyCharm or VS Code), it extracts the entire content accurately, including tables. This works on both macOS and Windows 10, and the behavior is consistent.
  • When I package the program with py2exe and send it to a friend to run, something strange happens: some files that were extracted correctly from source are no longer recognized when the exe runs. For example, all the text in a PDF is extracted when running from source, but the exe fails to extract it, tables included. Strangest of all, this does not happen with every file.
  • I can't just send my friend the source code, because he is not a programmer and would not know how to run it.
  • Where am I going wrong? Is the package missing a dependency file, or is the problem something else?

Code to reproduce the problem

My code looks like most pdfplumber programs.
I don't see a problem in the code itself, since it runs fine from source. The problem is the inconsistent behavior after packaging.

  with pdfplumber.open(file) as pdf:
      logger.info('start read file:%s' % file)
      for page in pdf.pages:
          all_content = page.extract_text(x_tolerance=0, y_tolerance=0)
          if DEBUG:
              logger.info('DEBUG===========%s-page%d,page object[%d]' % (file, page.page_number, len(page.objects)))
              logger.info(all_content)

PDF file

I'm sorry, the file contains private information and I can't share it publicly. If a maintainer follows up on this issue, I'd be willing to email it to them.

Expected behavior

For the same file, I expect the program to produce the same results whether it runs from source or from the packaged exe.

Actual behavior

As described above, for some files the extraction results differ between running from source and running the packaged exe. It is worth mentioning that everything works normally when packaging with py2app; only py2exe has this problem.

Screenshots

  • This is the result of running from source. The log format follows the code above. Page 1 has so much content that the log for the next page does not even fit in the screenshot.

image

  • This is the same file processed by the exe packaged with py2exe. Only the title is extracted on the first page, and nothing else.

image

Environment

  • pdfplumber version: v0.6.0
  • Python version: 3.7.3
  • OS: Windows 10

Additional context

I also tried the exe both on my personal Windows 10 computer and in a Windows 10 virtual machine running on macOS. The results were equally disappointing.

@jsvine
Owner

jsvine commented Mar 3, 2022

Hi @StruggleYang, and thanks for your interest in this library. I'm not very familiar with py2exe, and so I don't have many ideas about what could be happening. But I wonder if part of the issue is the logging. What if, instead of logging, you write the output to a text file? Do you see the same problem?
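
One way to try this, as a minimal sketch: collect the text of each page and write it to a UTF-8 file, so that console or log encoding cannot affect what you see. The helper names `write_pages` and `dump_pdf_text` are illustrative and not part of pdfplumber.

```python
def write_pages(page_texts, out_path):
    """Write one block per page to a UTF-8 text file; return the number of pages written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for number, text in enumerate(page_texts, start=1):
            fh.write("=== page %d ===\n" % number)
            # extract_text() may give None or an empty string for a page without text
            fh.write((text or "") + "\n")
            count += 1
    return count


def dump_pdf_text(pdf_path, out_path):
    """Extract text with pdfplumber and dump it to out_path (requires pdfplumber)."""
    import pdfplumber  # imported here so write_pages stays usable without it

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed inside the with-block, while the PDF is still open.
        return write_pages((page.extract_text() for page in pdf.pages), out_path)
```

Running `dump_pdf_text` once from source and once from the packaged exe, with different output paths, gives two files that can be compared in any diff tool.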

@StruggleYang
Author

Hi @jsvine, thanks for getting back to me. Like you, I also suspected a log encoding issue, so I tried sending the output to stdout and to a file instead of the logger, but the problem is the same. I have written a dedicated test script that depends only on pdfplumber, to rule out interference from other dependencies.

Below is my new test code:

# coding:utf-8
import time
import pdfplumber
import os

if __name__ == '__main__':
    print("hello")
    user_path = os.path.expanduser('~')
    print(user_path)
    with pdfplumber.open(os.path.join(user_path, "working", "pdftest", "test.pdf")) as pdf:
        for page in pdf.pages:
            all_content = page.extract_text(x_tolerance=0, y_tolerance=0)
            tables = page.extract_table()
            print(page.page_number)
            print(tables)
            print(all_content)
    print("end")
    time.sleep(10)  # Keep the console window open long enough to read the output of the packaged exe

  • I ran a new test. This is the result of running the above code in PyCharm:

image

  • This is the result of running the packaged exe. Both runs read the same file, but the results are quite different.

image

I have the full test code, the packaging script, and a PDF that reproduces the problem. If you can help, I can email them to you as a package. Because the document contains private information, I can't post it publicly. If that is not possible, I will have to look for alternatives, which might mean starting over.

@jsvine
Owner

jsvine commented Mar 4, 2022

Thank you for sharing those details. One other thing to test: Do you see the same problems when using pdfminer.six? That's the library that pdfplumber uses to extract information about all the characters in the PDF.

One other way to get closer to the answer: What is the result of len(pdf.pages[0].chars) (a) in the standard script and (b) in the py2exe-packaged script?
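
The comparison above could be sketched as follows: record the number of characters per page to a JSON file in each environment, then list the pages whose counts differ. `page.chars` and `page.page_number` are pdfplumber attributes; the function names and the JSON layout are assumptions made for illustration.

```python
import json


def collect_char_counts(pdf_path):
    """Return {page_number: number of chars} for every page (requires pdfplumber)."""
    import pdfplumber  # imported here so the helpers below work without it

    with pdfplumber.open(pdf_path) as pdf:
        return {page.page_number: len(page.chars) for page in pdf.pages}


def save_counts(counts, out_path):
    """Store the counts as JSON (JSON object keys must be strings)."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump({str(k): v for k, v in counts.items()}, fh, indent=2)


def load_counts(path):
    """Load counts written by save_counts, restoring integer page numbers."""
    with open(path, encoding="utf-8") as fh:
        return {int(k): v for k, v in json.load(fh).items()}


def differing_pages(counts_a, counts_b):
    """Return (page, count_a, count_b) for each page whose counts differ."""
    pages = sorted(set(counts_a) | set(counts_b))
    return [
        (p, counts_a.get(p), counts_b.get(p))
        for p in pages
        if counts_a.get(p) != counts_b.get(p)
    ]
```

Call `save_counts(collect_char_counts(path), "source.json")` from the standard script and the same with `"exe.json"` from the packaged one, then pass both loaded dictionaries to `differing_pages`. For the pdfminer.six check, `pdfminer.high_level.extract_text(pdf_path)` returns the text of the whole document as one string, which can be compared between the two environments in the same way.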

@StruggleYang
Author

StruggleYang commented Mar 5, 2022

Thank you for your reply. I will run the tests you suggested and report the results. However, it is worth mentioning that I just switched the packaging tool to PyInstaller, and the resulting exe runs normally, so I have reason to suspect that the problem is caused by the packaging tool.

Perhaps, because the packaging tools work on different principles, py2exe leaves out some dependency when packaging. I will compare the two when I have time. After running the tests you suggested, I may close this issue, because the problem is probably not in this repository (although more testing is needed to pin down the real cause). I may open an issue with py2exe instead. Thank you again; it has been a pleasure talking with you.
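
One way to test the missing-dependency hypothesis, as a rough sketch: check at runtime whether a package's data directory really exists on disk beside its modules. The assumption here, which is not confirmed anywhere in this thread, is that pdfminer.six loads its CMap files (used for some CJK fonts) from a `cmap` directory next to its own modules, and that a packager which stores modules inside a zip archive may leave that directory out. The helper names are illustrative.

```python
import os
import sys


def is_frozen():
    """True when running from a py2exe/PyInstaller-style bundle."""
    return bool(getattr(sys, "frozen", False))


def has_data_dir(module_file, subdir):
    """Check whether `subdir` exists as a real directory beside `module_file`."""
    return os.path.isdir(os.path.join(os.path.dirname(module_file), subdir))


def report(module_file, subdir):
    """Build a one-line summary suitable for printing or logging."""
    return "frozen=%s module=%s %s/ present=%s" % (
        is_frozen(),
        module_file,
        subdir,
        has_data_dir(module_file, subdir),
    )


# Hypothetical usage inside the test script (requires pdfminer.six):
#     import pdfminer
#     print(report(pdfminer.__file__, "cmap"))
```

If the line printed from source says the directory is present and the line printed from the py2exe build says it is not, that would support the idea that data files are being left out of the bundle. It would not prove that this is the cause of the missing text.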

@jsvine
Owner

jsvine commented Jul 20, 2022

Closing this issue due to inactivity and factors that appear to be outside of pdfplumber's control. Feel free to continue the discussion, however, especially if there is new information to add. Thanks again.

@jsvine jsvine closed this as completed Jul 20, 2022