gh-94526: getpath_dirname() no longer encodes the path #97645
Well. In fact, the issue is broader: not only _bootstrap_python is affected, any
I rebased and updated the PR to clarify that this issue affects the Python path configuration (sys.path creation).
Sadly, There are getpath_methods which are injected inside a namespace (dict) by funcs_to_dict() function. It may be interesting to convert it to a regular extension module (
Misc/NEWS.d/next/Core and Builtins/2022-09-29-15-19-29.gh-issue-94526.wq5m6T.rst
-    const char *path;
-    if (!PyArg_ParseTuple(args, "s", &path)) {
+    PyObject *path;
+    if (!PyArg_ParseTuple(args, "U", &path)) {
BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.
Why cannot they be implemented in Python?
> BTW, I would use METH_O and PyArg_Parse() in these functions, but this is another issue.
I tried to minimize the changes.
> Why cannot they be implemented in Python?
Ask @zooba who designed this. Maybe it can be changed?
Perf, mostly. These trivial ones probably could be, but don't fall into the trap of trying to port the full ntpath/posixpath implementations into getpath - we don't have a lot of the functionality needed to handle those at this stage (e.g. no codecs, no os module).
Fix the Python path configuration used to initialize sys.path at Python startup. Paths are no longer encoded to UTF-8/strict, to avoid encoding errors if they contain surrogate characters (bytes paths are decoded with the surrogateescape error handler). getpath_basename() and getpath_dirname() functions no longer encode the path to UTF-8/strict, but work directly on Unicode strings. These functions now use PyUnicode_FindChar() and PyUnicode_Substring() on the Unicode path, rather than strrchr() on the encoded bytes string.
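To make the failure mode concrete, here is a minimal Python sketch of what happens to a bytes path that is not valid UTF-8. The path itself is a made-up example, not taken from the original bug report; the point is only that a string produced with the surrogateescape error handler cannot be encoded back with the strict handler.

```python
# Hypothetical path: byte 0xE9 (Latin-1 "e acute") is not valid UTF-8 here.
raw = b"/opt/caf\xe9/bin/python"

# Undecodable bytes are mapped to lone surrogates (0xE9 becomes U+DCE9).
path = raw.decode("utf-8", "surrogateescape")

# Encoding with the default strict handler rejects the lone surrogate.
try:
    path.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the matching error handler restores the original bytes.
roundtrip = path.encode("utf-8", "surrogateescape")
```

Any code path that encodes such a string with UTF-8/strict fails in the same way, which is why working on the Unicode string directly sidesteps the problem.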
I rephrased the NEWS entry to omit function names. Is it better? I only named functions in the commit message.
Although this fixes the issue, it is a bit fragile: it can break again if any other function uses the utf8 handler for encoding. This PR avoids that case; can you add a comment that utf8 should be avoided here?
Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
Sorry @vstinner, I had trouble checking out the
Yes, a regression can be introduced again tomorrow. Well, we can fix it in this case :-)
I'm not sure about the intent of a comment explaining that UTF-8 should not be used, since the modified functions now use Unicode (no encode/decode).
GH-97677 is a backport of this pull request to the 3.11 branch.
…H-97645) Fix the Python path configuration used to initialize sys.path at Python startup. Paths are no longer encoded to UTF-8/strict, to avoid encoding errors if they contain surrogate characters (bytes paths are decoded with the surrogateescape error handler). getpath_basename() and getpath_dirname() functions no longer encode the path to UTF-8/strict, but work directly on Unicode strings. These functions now use PyUnicode_FindChar() and PyUnicode_Substring() on the Unicode path, rather than strrchr() on the encoded bytes string. (cherry picked from commit 9f2f1dd) Co-authored-by: Victor Stinner
Fix the Python path configuration used to initialize sys.path at
Python startup. getpath_basename() and getpath_dirname() functions no
longer encode the path to UTF-8/strict to avoid encoding errors if it
contains surrogate characters (created by decoding a bytes path with
the surrogateescape error handler).
The functions now use PyUnicode_FindChar() and PyUnicode_Substring()
on the Unicode path, rather than strrchr() on the encoded bytes
string.
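As a rough illustration of the approach described in the commit message, the following Python sketch mirrors a reverse separator search followed by a substring, operating on the Unicode string only. The names dirname_sketch, basename_sketch and SEP are hypothetical, only "/" is treated as a separator, and the handling of a path without a separator is an assumption rather than a statement about the actual C code.

```python
SEP = "/"  # assumption: a single POSIX-style separator


def dirname_sketch(path: str) -> str:
    # Reverse search, analogous in spirit to PyUnicode_FindChar() with direction -1.
    pos = path.rfind(SEP)
    if pos < 0:
        return ""  # assumed behaviour when no separator is present
    # Slice up to the separator, analogous to PyUnicode_Substring(path, 0, pos).
    return path[:pos]


def basename_sketch(path: str) -> str:
    pos = path.rfind(SEP)
    if pos < 0:
        return path  # assumed behaviour when no separator is present
    return path[pos + 1:]


# A path containing a lone surrogate is handled without any encoding step.
surrogate_path = "/opt/caf\udce9/bin/python"
parent = dirname_sketch(surrogate_path)
name = basename_sketch(surrogate_path)
```

Because neither helper encodes the string, lone surrogates pass through untouched, which is the property the fix relies on.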