
Ensure pg_dump_splitsort.py compatibility with PyPy #31

Open
akaihola opened this issue Apr 24, 2024 · 1 comment · May be fixed by #36
@akaihola (Owner)

In #13, @oldcai writes:

I tested it with a 444G .sql file and it finished within 523m22.189s.

Let's make sure pg_dump_splitsort.py runs with PyPy, and document the performance gain.

@akaihola akaihola added this to the 1.1.1 milestone Apr 24, 2024
@akaihola akaihola self-assigned this Apr 24, 2024
@akaihola (Owner, Author)

Traceback (most recent call last):
  File "bin/pg_dump_splitsort", line 8, in <module>
    sys.exit(main())
  File "pgtricks/pg_dump_splitsort.py", line 156, in main
    split_sql_file(args.sql_filepath, args.max_memory)
  File "pgtricks/pg_dump_splitsort.py", line 109, in split_sql_file
    sorted_data_lines = MergeSort(
  File "pgtricks/mergesort.py", line 28, in __init__
    self._memory_counter: int = sys.getsizeof(self._buffer)
TypeError: getsizeof(...)
    getsizeof(object, default) -> int

    Return the size of object in bytes.

sys.getsizeof(object, default) will always return default on PyPy, and raise a TypeError if default is not provided.

First note that the CPython documentation says that this function may raise a TypeError, so if you are seeing it, it means that the program you are using is not correctly handling this case.

On PyPy, though, it always raises TypeError. Before looking for alternatives, please take a moment to read the following explanation as to why it is the case. What you are looking for may not be possible.

A memory profiler using this function is most likely to give results inconsistent with reality on PyPy. It would be possible to have sys.getsizeof() return a number (with enough work), but that may or may not represent how much memory the object uses. It doesn't even really make sense to ask how much one object uses in isolation from the rest of the system.

For example, instances have maps, which are often shared across many instances; in this case the maps would probably be ignored by an implementation of sys.getsizeof(), but their overhead is important in some cases if there are many instances with unique maps. Conversely, equal strings may share their internal string data even if they are different objects, and empty containers may share parts of their internals as long as they are empty.

Even stranger, some lists create objects as you read them; if you try to estimate the size in memory of range(10**6) as the sum of all items' sizes, that operation will by itself create one million integer objects that never existed in the first place.
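Given the behavior described above, a portable workaround is to always pass the `default` argument: CPython returns the real size, while PyPy returns the supplied default unconditionally. A minimal sketch (the helper name and default value are hypothetical, not from the pgtricks codebase) might look like:

```python
import sys


def getsizeof_or_default(obj, default=64):
    """Return the size of ``obj`` in bytes, or ``default`` where unavailable.

    On CPython, sys.getsizeof() computes the actual size and the default is
    only used for objects that cannot report one. On PyPy, the call always
    returns the default, so memory accounting becomes a rough estimate
    rather than a measurement.
    """
    return sys.getsizeof(obj, default)


# Example: on CPython this reports the real size of an empty list;
# on PyPy it reports the estimate of 64 bytes.
buffer_size = getsizeof_or_default([])
```

`MergeSort.__init__` could then initialize `self._memory_counter` through such a wrapper instead of calling `sys.getsizeof(self._buffer)` directly, at the cost of the `max_memory` limit being approximate under PyPy.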

@akaihola akaihola linked a pull request Apr 25, 2024 that will close this issue