-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zopen
changes: forbid implicit binary/text mode
, signature change, default UTF-8 encoding
in text mode, drop .z
support after one-year
#730
zopen
changes: forbid implicit binary/text mode
, signature change, default UTF-8 encoding
in text mode, drop .z
support after one-year
#730
Conversation
WalkthroughThe pull request introduces significant enhancements to file handling and encoding practices across multiple modules in the codebase. Key changes include the introduction of an Changes
Assessment against linked issues
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #730 +/- ##
==========================================
+ Coverage 82.57% 82.71% +0.14%
==========================================
Files 27 27
Lines 1584 1597 +13
Branches 284 288 +4
==========================================
+ Hits 1308 1321 +13
Misses 215 215
Partials 61 61 ☔ View full report in Codecov by Sentry. |
zopen
explicitly use UTF-8 encoding
in text modezopen
explicitly use UTF-8 encoding
in text mode, and disallow implicit binary/text mode
zopen
explicitly use UTF-8 encoding
in text mode, and disallow implicit binary/text modezopen
explicitly use UTF-8 encoding
in text mode, and disallow implicit binary/text mode
zopen
explicitly use UTF-8 encoding
in text mode, and disallow implicit binary/text mode
zopen
changes: forbit implicit binary/text mode
, signature change, default UTF-8 encoding
in text mode
zopen
changes: forbit implicit binary/text mode
, signature change, default UTF-8 encoding
in text modezopen
changes: forbid implicit binary/text mode
, signature change, default UTF-8 encoding
in text mode
77171ff
to
3d17a62
Compare
3d17a62
to
671e8d4
Compare
zopen
changes: forbid implicit binary/text mode
, signature change, default UTF-8 encoding
in text modezopen
changes: forbid implicit binary/text mode
, signature change, default UTF-8 encoding
in text mode, drop .z
support after one-year
b2e5e67
to
40ac109
Compare
893c914
to
39d622b
Compare
This comment was marked as outdated.
This comment was marked as outdated.
23c1c02
to
4bbab69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
src/monty/io.py (1)
28-33
: Consider makingmode
a required argument to enforce explicit usageCurrently,
mode
has a default value ofNone
, and a warning is issued when it's not specified. To encourage explicit specification and prevent future breaking changes, consider removing the default value and makingmode
a required argument. This enforces best practices now and aligns with the planned deprecation of defaultmode
.tasks.py (1)
Line range hint
141-146
: Consider error handling for pyproject.toml operationsWhile the UTF-8 encoding is correct, consider adding error handling for file operations.
- with open("pyproject.toml", encoding="utf-8") as f: - contents = f.read() - contents = re.sub(r"version = ([\.\d\"]+)", f'version = "{version}"', contents) - - with open("pyproject.toml", "w", encoding="utf-8") as f: - f.write(contents) + try: + with open("pyproject.toml", encoding="utf-8") as f: + contents = f.read() + contents = re.sub(r"version = ([\.\d\"]+)", f'version = "{version}"', contents) + + with open("pyproject.toml", "w", encoding="utf-8") as f: + f.write(contents) + except IOError as e: + raise RuntimeError(f"Failed to update version in pyproject.toml: {e}") from etests/test_tempfile.py (1)
22-22
: Add tests for non-ASCII contentWhile the UTF-8 encoding addition is good, the tests only write ASCII content ("write"). Consider adding tests with non-ASCII content to verify encoding handling.
+ def test_with_non_ascii_content(self): + test_content = "Hello, 世界! 🌍" + with ScratchDir(self.scratch_root) as d: + with open("unicode_test.txt", "w", encoding="utf-8") as f: + f.write(test_content) + with open("unicode_test.txt", "r", encoding="utf-8") as f: + content = f.read() + assert content == test_contentAlso applies to: 30-30, 52-52, 63-63, 77-77, 86-86, 112-112, 131-131
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (7)
tests/test_files/myfile.gz
is excluded by!**/*.gz
tests/test_files/myfile_bz2.bz2.gz
is excluded by!**/*.gz
tests/test_files/myfile_gz.gz
is excluded by!**/*.gz
tests/test_files/myfile_lzma.lzma.gz
is excluded by!**/*.gz
tests/test_files/myfile_txt.gz
is excluded by!**/*.gz
tests/test_files/myfile_xz.xz
is excluded by!**/*.xz
tests/test_files/myfile_xz.xz.gz
is excluded by!**/*.gz
📒 Files selected for processing (12)
src/monty/io.py
(1 hunks)src/monty/json.py
(3 hunks)src/monty/re.py
(1 hunks)src/monty/serialization.py
(4 hunks)tasks.py
(3 hunks)tests/test_files/myfile
(0 hunks)tests/test_files/myfile_txt
(0 hunks)tests/test_io.py
(8 hunks)tests/test_os.py
(1 hunks)tests/test_serialization.py
(1 hunks)tests/test_shutil.py
(9 hunks)tests/test_tempfile.py
(7 hunks)
💤 Files with no reviewable changes (2)
- tests/test_files/myfile
- tests/test_files/myfile_txt
🔇 Additional comments (14)
src/monty/re.py (1)
52-52
: Explicitly specifying UTF-8 encoding ensures consistent text handling
Adding encoding="utf-8"
when opening the file in text mode with zopen
promotes consistent behavior across different environments and prevents potential decoding errors with non-ASCII characters.
tests/test_serialization.py (1)
85-85
: Adding explicit UTF-8 encoding enhances test robustness
Specifying encoding="utf-8"
when opening the JSON file ensures the test behaves consistently, especially in environments with different default encodings, and properly handles any Unicode characters.
tasks.py (2)
70-72
: LGTM: Proper encoding for index.md modifications
The UTF-8 encoding ensures proper handling of the table of contents and layout configurations.
42-42
: LGTM: Consistent UTF-8 encoding for documentation files
The addition of UTF-8 encoding for documentation files is appropriate, especially for handling special characters in API documentation.
Let's verify if there are any special characters in the documentation that would benefit from this change:
Also applies to: 62-62
✅ Verification successful
Adding UTF-8 encoding is necessary and beneficial
The verification confirms that many documentation files contain non-ASCII characters (smart quotes in license text and special characters in documentation). Using UTF-8 encoding is the correct approach to handle these characters properly.
- Found non-ASCII characters in multiple documentation files including
docs/monty.termcolor.md
- The change ensures consistent and correct handling of special characters across all documentation files
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for non-ASCII characters in documentation files
rg -l "[^\x00-\x7F]" "docs/*.md"
Length of output: 84
Script:
#!/bin/bash
# First, let's find the correct documentation files location
fd -t f ".md$"
# Then check for any non-ASCII characters in these files
fd -t f ".md$" --exec rg -l "[^\x00-\x7F]" {}
Length of output: 1315
Script:
#!/bin/bash
# Let's check what kind of non-ASCII characters are present in one of these files
# to confirm they're actually documentation content
rg --no-line-number "[^\x00-\x7F]" "./docs/monty.termcolor.md" -C 2
Length of output: 713
src/monty/serialization.py (2)
70-70
: LGTM: Consistent encoding for file loading
The addition of UTF-8 encoding for text file loading is appropriate and aligns with the project's encoding standardization goals.
123-125
: Verify type casting necessity
The type casting of fp
to TextIO
might be redundant as zopen
with "wt" mode should already return a text file object.
Let's check the zopen implementation:
✅ Verification successful
The zopen
implementation shows that it returns a union type IO | bz2.BZ2File | gzip.GzipFile | lzma.LZMAFile
and in its docstring it specifies the return type as TextIO | BinaryIO | bz2.BZ2File | gzip.GzipFile | lzma.LZMAFile
. Since the function can return different types based on the file extension, and the type system might not be able to narrow down the specific return type when used with "wt" mode, the type casting to TextIO
is actually necessary for type safety.
Type casting to TextIO is necessary
The explicit cast helps the type checker understand that we're dealing with a text file handle, as zopen
can return multiple different file types and the type system might not narrow it down automatically based on the "wt" mode.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check zopen's return type
ast-grep --pattern 'def zopen($$$) -> $_: $$$'
Length of output: 4910
tests/test_shutil.py (3)
29-31
: LGTM: Explicit UTF-8 encoding added consistently.
The changes properly specify UTF-8 encoding for all file operations in the TestCopyR class, aligning with the PR's objective to enforce explicit encoding.
Also applies to: 34-36, 56-56
73-73
: LGTM: UTF-8 encoding consistently applied to compression tests.
The changes properly specify UTF-8 encoding across all file operations in the TestCompressFileDir class, maintaining consistency with the PR's encoding requirements.
Also applies to: 88-88, 120-120
130-132
: LGTM: Comprehensive UTF-8 encoding coverage in gzip tests.
The changes properly specify UTF-8 encoding for all file operations in the TestGzipDir class, including both read and write operations, ensuring consistent encoding behavior.
Also applies to: 155-155, 166-166, 176-176
tests/test_io.py (4)
32-32
: LGTM: UTF-8 encoding added to reverse readline tests.
The changes properly specify UTF-8 encoding for all file operations in the TestReverseReadline class, ensuring consistent encoding behavior.
Also applies to: 42-42, 64-64
78-80
: LGTM: Proper handling of line endings with UTF-8 encoding.
The changes correctly implement UTF-8 encoding while maintaining proper line ending handling for both Unix/Mac and Windows environments.
Also applies to: 83-83, 93-98, 101-101
181-204
: LGTM: Comprehensive test coverage for file operations.
The new parameterized test method provides thorough coverage of:
- Multiple file extensions (.txt, .bz2, .gz, .xz, .lzma)
- Both text and binary modes
- Proper UTF-8 encoding specification
- Read and write operations
227-263
: LGTM: Comprehensive warning behavior testing.
The new test method thoroughly verifies the warning system for:
- Default UTF-8 encoding behavior
- Implicit text/binary mode usage
- Default mode usage
- Proper handling across different file extensions
src/monty/json.py (1)
449-449
: LGTM: UTF-8 encoding added to JSON file operations.
The changes properly specify UTF-8 encoding for JSON file read/write operations, ensuring consistent encoding behavior for serialization and deserialization.
Also applies to: 502-502
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
src/monty/io.py (3)
25-25
: Consider adding version-specific implementation.The current implementation backports
EncodingWarning
from Python 3.10, but it might be better to use the built-in class when available.Consider this implementation:
-class EncodingWarning(Warning): ... # Added in Python 3.10 +import sys +if sys.version_info >= (3, 10): + from warnings import EncodingWarning +else: + class EncodingWarning(Warning): ... # Backport from Python 3.10
28-33
: Consider using more specific return type hint.The current return type hint uses a union of generic
IO
with specific file types. For better type safety, consider usingUnion[BinaryIO, TextIO]
instead of genericIO
.-) -> IO | bz2.BZ2File | gzip.GzipFile | lzma.LZMAFile: +) -> Union[BinaryIO, TextIO, bz2.BZ2File, gzip.GzipFile, lzma.LZMAFile]:
58-90
: Consider centralizing deadline management.The deadline is currently hardcoded in the function. Consider moving it to a module-level constant for easier maintenance and consistency.
+# At module level +IMPLICIT_MODE_DEADLINE = "2025-06-01" - _deadline: str = "2025-06-01" + _deadline: str = IMPLICIT_MODE_DEADLINE
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/monty/io.py
(1 hunks)
🔇 Additional comments (2)
src/monty/io.py (2)
99-108
:
Implement proper handling for .z files.
As discussed in previous reviews, the current implementation doesn't properly handle .z files, which could lead to incorrect behavior since Python's gzip module cannot handle LZW compression.
Consider raising an exception instead:
if ext == ".z":
- # TODO: drop ".z" extension support after 2026-01-01
- warnings.warn(
- "Python gzip is not able to (de)compress LZW-compressed files. "
- "You should rename it to .gz. Support for the '.z' extension will "
- "be removed after 2026-01-01.",
- category=FutureWarning,
- stacklevel=2,
- )
- return gzip.open(filename, mode, **kwargs)
+ raise NotImplementedError(
+ "Support for '.z' files is not implemented. "
+ "Python's gzip module cannot decompress LZW-compressed '.z' files. "
+ "Please use an alternative method to handle '.z' files or rename to .gz "
+ "if the file is actually gzip compressed."
+ )
83-90
: Verify impact of UTF-8 default encoding.
The change to use UTF-8 as default encoding aligns with PEP 686, but we should verify this doesn't break existing code that might rely on system default encoding.
✅ Verification successful
Let me try a different approach to gather the necessary information.
Based on the search results, I can now generate the final response:
UTF-8 default encoding change is safe to proceed
The codebase already consistently uses UTF-8 encoding across all file operations:
- All explicit file operations in tests use
encoding="utf-8"
- Text mode operations through
zopen()
insrc/monty/re.py
andsrc/monty/serialization.py
already specify UTF-8 - No instances found of file operations relying on system default encoding
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for potential encoding-sensitive file operations
rg -l "encoding.*=(?!.*utf)" . | while read -r file; do
echo "=== $file ==="
rg "encoding.*=(?!.*utf)" "$file" -A 2
done
# Search for potential binary file operations that might be affected
ast-grep --pattern 'zopen($_, "r")'
Length of output: 384
Script:
#!/bin/bash
# Search for file operations with explicit encoding
rg -A 2 'encoding\s*='
# Search for open operations in text mode
ast-grep --pattern 'open($_, "r")'
ast-grep --pattern 'open($_, "rt")'
# Search for potential file read operations
rg -A 2 '\.read\(\)'
Length of output: 11034
category=EncodingWarning, | ||
stacklevel=2, | ||
) | ||
kwargs["encoding"] = "utf-8" | ||
|
||
_name, ext = os.path.splitext(filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess splittext
got updated at some point? It works directly with str
as path now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
src/monty/io.py (3)
25-25
: Consider enhancing the EncodingWarning documentationWhile the implementation is correct, adding a docstring would help clarify its purpose and usage, especially since it appears to be a backport of Python 3.10's EncodingWarning.
-class EncodingWarning(Warning): ... # Added in Python 3.10 +class EncodingWarning(Warning): + """Warning category for encoding-related warnings. + + This is a backport of Python 3.10's EncodingWarning to maintain + compatibility with earlier Python versions. + """ + ...
58-58
: Consider using a datetime object for the deadlineThe deadline is currently a hardcoded string. Consider using a datetime object for better validation and comparison capabilities.
-_deadline: str = "2025-06-01" +from datetime import datetime +_deadline: datetime = datetime(2025, 6, 1)
95-110
: Consider using a dictionary for extension mappingThe current implementation uses multiple if statements for extension mapping. Consider using a dictionary for better maintainability and performance.
+_COMPRESSION_HANDLERS = { + ".bz2": bz2.open, + ".gz": gzip.open, + ".xz": lzma.open, + ".lzma": lzma.open, +} -if ext == ".bz2": - return bz2.open(filename, mode, **kwargs) -if ext == ".gz": - return gzip.open(filename, mode, **kwargs) -if ext == ".z": - # TODO: drop ".z" extension support after 2026-01-01 - warnings.warn( - "Python gzip is not able to (de)compress LZW-compressed files. " - "You should rename it to .gz. Support for the '.z' extension will " - "be removed after 2026-01-01.", - category=FutureWarning, - stacklevel=2, - ) - return gzip.open(filename, mode, **kwargs) -if ext in {".xz", ".lzma"}: - return lzma.open(filename, mode, **kwargs) +handler = _COMPRESSION_HANDLERS.get(ext) +if handler: + return handler(filename, mode, **kwargs)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
src/monty/io.py
(1 hunks)tests/test_files/real_lzw_file.txt.Z
(1 hunks)tests/test_io.py
(8 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/test_files/real_lzw_file.txt.Z
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_io.py
🔇 Additional comments (4)
src/monty/io.py (4)
17-22
: LGTM: Well-organized imports with proper type hints
The new imports are properly organized and the type hints are correctly conditioned with TYPE_CHECKING for runtime optimization.
28-56
: LGTM: Well-designed function signature with clear documentation
The updated signature with positional-only parameter and comprehensive type hints improves clarity. The docstring effectively communicates the upcoming breaking changes and usage requirements.
82-90
: LGTM: Proper implementation of UTF-8 default encoding
The implementation correctly follows PEP 686 by defaulting to UTF-8 encoding and warning users about implicit encoding usage.
99-108
:
Consider immediate deprecation of .z extension
Since Python's gzip module cannot properly handle LZW-compressed .z files, continuing to support them might lead to silent corruption or incorrect behavior. Consider raising a NotImplementedError instead of just warning.
if ext == ".z":
- # TODO: drop ".z" extension support after 2026-01-01
- warnings.warn(
- "Python gzip is not able to (de)compress LZW-compressed files. "
- "You should rename it to .gz. Support for the '.z' extension will "
- "be removed after 2026-01-01.",
- category=FutureWarning,
- stacklevel=2,
- )
- return gzip.open(filename, mode, **kwargs)
+ raise NotImplementedError(
+ "Support for '.z' files is not implemented as Python's gzip module "
+ "cannot handle LZW-compressed files. Please use '.gz' extension instead."
+ )
Summary
encoding
as default in text mode whenencoding
is not given, follow up Explicit UTF-8 encoding for VASP input files withzopen
, andopen
for other text files materialsproject/pymatgen#4218, to fix For band structure, how to correctly display the symbol of the \Gamma point materialsproject/pymatgen#4214mode
, the plan is giving a half-year grace time (I prefer to enforce a shorter grace time as this is not deprecation but more a fix for a dangerous behaviour), after that passing amode
withoutb/t
would directly trigger an exceptionzopen
signature changepymatgen
: commit, CI logSignature change for
zopen
file/filename
as positional, AFAIK, there is no usage aswith zopen(filename=)
(i.e.filename
as kwarg).open
(file
) andgzip/bz2/lzma.open
(filename
)filename/mode
keyword only.filename
andmode
, passing as positional args would give very unpredictable results:Forbid implicit binary/text mode for
zopen
mode
is not set, orb/t
is not given inmode
Rationale:
Using implicit binary/text
mode
(nob/t
inmode
) withzopen
would cause different behaviour for text file (file.txt
, asTextIO
) vs compressed file (file.gz
, asBinaryIO
), as default mode foropen
is "text" while default mode for compressedopen
is binary, as such we would get different data types from the stream (string vs bytes).A
read
example:A
write
example:Default "UTF-8"
encoding
in text modeRationale see materialsproject/pymatgen#4218 (comment)
Drop support for
.z
extension after one-year grace periodI don't think Python's built-in
gzip
could (de)compress a real LZW-algorithm file (not a file (de)compressed bygzip
DEFLATE algorithm and named to.z
):For example:
However user might have already used something like
with zopen("text.z", mode="wt") as file:
to generate such a file (looks like LZW but actually DEFLATE), and as such we wouldn't drop.z
extension directly here, a grace period of one-year would be added.