-
Notifications
You must be signed in to change notification settings - Fork 526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perf: remove redundant checks on data integrity #4433
Conversation
📝 Walkthrough📝 WalkthroughWalkthroughThe changes in the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant DeepmdDataSystem
participant DataLoader
User->>DeepmdDataSystem: Initialize with systems, batch_size, test_size as percentage
DeepmdDataSystem->>DeepmdDataSystem: Check if test_size is string
DeepmdDataSystem->>DeepmdDataSystem: Call _make_auto_ts to calculate test size
DeepmdDataSystem->>DataLoader: Load test data if not already loaded
DataLoader-->>DeepmdDataSystem: Return loaded test data
DeepmdDataSystem-->>User: Provide test data
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
deepmd/utils/data_system.py (2)
Line range hint
516-538
: Add proper deprecation warning mechanism.While the method is marked as deprecated in the comment, it should use Python's built-in deprecation warning system to notify users.
Apply this change:
def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1): # depreciated + import warnings + warnings.warn( + "get_test is deprecated and will be removed in a future version.", + DeprecationWarning, + stacklevel=2 + ) """Get test data from the the data systems.
Line range hint
673-680
: Add input validation and documentation.The new method would benefit from:
- Input validation for the percentage value
- Proper docstring documentation
Consider applying these improvements:
def _make_auto_ts(self, percent): + """Calculate test size for each system based on percentage. + + Parameters + ---------- + percent : int + Percentage of data to use for testing (0-100) + + Returns + ------- + list + List of test sizes for each system + """ + if not 0 <= percent <= 100: + raise ValueError("Percentage must be between 0 and 100") ts = [] for ii in range(self.nsystems): ni = self.batch_size[ii] * self.nbatches[ii] tsi = int(ni * percent / 100) ts.append(tsi) return ts
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
deepmd/utils/data_system.py
(1 hunks)
🔇 Additional comments (2)
deepmd/utils/data_system.py (2)
758-758
: LGTM: Clear and descriptive error message.
The error message clearly communicates the issue when no valid data systems are found.
Line range hint 12-24
: LGTM: Enhanced test size configuration.
The changes provide more flexibility in specifying test sizes:
- As a percentage of system data
- Per-system individually
- With proper validation and error handling
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## devel #4433 +/- ##
==========================================
- Coverage 84.64% 83.75% -0.90%
==========================================
Files 614 667 +53
Lines 57138 61516 +4378
Branches 3487 3487
==========================================
+ Hits 48367 51525 +3158
- Misses 7646 8864 +1218
- Partials 1125 1127 +2 ☔ View full report in Codecov by Sentry. |
Co-authored-by: Jinzhe Zeng <[email protected]> Signed-off-by: Chun Cai <[email protected]>
I don't get your point. It seems one checks |
@njzjz What I mean is that those files are read in the init process of deepmd-kit/deepmd/utils/data.py Lines 72 to 79 in e695a91
Thus, the data integrity has been checked in this process. |
I got what you mean. Please also check the usage of deepmd-kit/deepmd/pt/entrypoints/main.py Lines 122 to 124 in e695a91
|
Nice catch. The systems are instantiated at deepmd-kit/deepmd/pt/entrypoints/main.py Lines 152 to 153 in a852aa9
deepmd-kit/deepmd/pt/utils/dataloader.py Lines 90 to 91 in a852aa9
deepmd-kit/deepmd/pt/utils/dataset.py Line 28 in a852aa9
So it'll be OK. |
I read data.py and I don't think it checks whether the given path exists or not. |
deepmd-kit/deepmd/utils/data.py Lines 668 to 675 in e695a91
@njzjz It will raise an exception here if some path does not exist. |
The error message seems to be less clear, which needs to be improved. >>> import h5py
>>> with h5py.File("abc.hdf5", "w") as f:
... pass
...
>>> from deepmd.utils.data import DeepmdData
>>> DeepmdData("abc.hdf5#azsffd")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/njzjz/codes/deepmd-kit/deepmd/utils/data.py", line 65, in __init__
raise FileNotFoundError(f"No {set_prefix}.* is found in {sys_path}")
FileNotFoundError: No set.* is found in abc.hdf5#azsffd
>>> from deepmd.utils.data_system import process_systems
>>> process_systems(["abc.hdf5#azsffd"])
dir abc.hdf5#azsffd is not a valid dir
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/njzjz/codes/deepmd-kit/deepmd/utils/data_system.py", line 770, in process_systems
raise OSError(msg, help_msg)
OSError: [Errno dir abc.hdf5#azsffd is not a valid dir] Please check your setting for data systems |
@njzjz You mean the error message from |
I mean the error message from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
deepmd/utils/path.py (3)
170-172
: Enhance error message and consider path creation scenariosThe path existence check is good for early validation, but consider:
- Making the error message more descriptive by including the intended operation
- Adding a parameter to optionally skip the existence check when the path will be created later
def __init__(self, path: Union[str, Path], mode: str = "r") -> None: super().__init__() self.mode = mode self.path = Path(path) - if not self.path.exists(): - raise FileNotFoundError(f"{self.path} not found") + if not self.path.exists(): + raise FileNotFoundError( + f"Path '{self.path}' does not exist. " + f"Operation: initializing DPOSPath in {mode} mode" + )
303-304
: Improve error message for HDF5 file validationWhile the file existence check is good, the error message could be more descriptive to help users understand the context.
- if not os.path.isfile(self.root_path): - raise FileNotFoundError(f"{self.root_path} not found") + if not os.path.isfile(self.root_path): + raise FileNotFoundError( + f"HDF5 file '{self.root_path}' does not exist or is not a regular file. " + f"Operation: initializing DPH5Path in {mode} mode" + )
170-172
: Consider separating path validation improvementsWhile the path validation improvements are valuable, they seem tangential to the PR's main objective of removing redundant data integrity checks. Consider:
- Moving these path handling improvements to a separate PR focused on path validation
- This would make the current PR more focused on its performance optimization goal
Also applies to: 303-304
deepmd/utils/data_system.py (2)
103-103
: Enhance the assertion error message for better debugging.While the assertion is a good practice for failing fast, the error message could be more descriptive to help users understand what went wrong.
- assert self.nsystems > 0, "No systems provided" + assert self.nsystems > 0, "No valid systems provided. Please check the 'systems' parameter in your configuration."
Line range hint
484-485
: Add proper deprecation warning.The method is marked as deprecated in the docstring, but it should raise a proper deprecation warning to help users transition to the new API.
def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1): # depreciated + warnings.warn( + "get_test is deprecated and will be removed in a future version.", + DeprecationWarning, + stacklevel=2 + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
deepmd/utils/data_system.py
(2 hunks)deepmd/utils/path.py
(3 hunks)
🔇 Additional comments (2)
deepmd/utils/path.py (2)
17-17
: LGTM: Clean type import addition
The addition of Union
import supports the enhanced type hints in the DPOSPath
class.
161-161
: LGTM: Enhanced type flexibility
The change to accept both str
and Path
types improves API flexibility while maintaining type safety.
Also applies to: 167-167
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chun Cai <[email protected]>
Signed-off-by: Chun Cai <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
deepmd/utils/data_system.py (3)
Line range hint
756-764
: Add input validation for percentage and ensure minimum test sizeThe method should validate the input percentage range and ensure a minimum test size to prevent zero-sized test sets.
def _make_auto_ts(self, percent): + if not 0 <= percent <= 100: + raise ValueError(f"Test size percentage must be between 0 and 100, got {percent}") ts = [] for ii in range(self.nsystems): ni = self.batch_size[ii] * self.nbatches[ii] tsi = int(ni * percent / 100) + # Ensure at least one test sample when percentage is non-zero + if percent > 0 and tsi == 0: + tsi = 1 ts.append(tsi) return ts
Line range hint
684-684
: Add proper deprecation warning with migration pathThe method is marked as deprecated but lacks a proper warning. Also, "depreciated" is misspelled (should be "deprecated").
- def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1): # depreciated + def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1): + """Get test data from the data systems. + + .. deprecated:: 2.0.0 + Use get_test_data() instead. This method will be removed in version 3.0.0. + """ + import warnings + warnings.warn( + "get_test() is deprecated and will be removed in version 3.0.0. " + "Use get_test_data() instead.", + DeprecationWarning, + stacklevel=2 + )
Line range hint
1-764
: Consider additional optimization opportunitiesWhile the current changes improve test data handling, there might be more opportunities to remove redundant data integrity checks as mentioned in the PR objectives. Consider reviewing:
- The type map consistency check in
_check_type_map_consistency
- The batch size checks in the constructor
- The system probability validation in
process_sys_probs
These areas might benefit from similar optimizations to improve performance.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chun Cai <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
deepmd/utils/data_system.py (2)
Line range hint
716-738
: Add proper deprecation warning.The method is marked as deprecated in comments, but it should use Python's built-in deprecation warning mechanism to properly inform users.
Add a deprecation warning:
def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1): # depreciated + import warnings + warnings.warn( + "get_test is deprecated and will be removed in a future version.", + DeprecationWarning, + stacklevel=2 + ) if not hasattr(self, "test_data"): self._load_test(ntests=n_test)
Line range hint
1-766
: Performance optimization approach looks good.The changes successfully remove redundant data integrity checks while maintaining proper error handling where necessary. The new percentage-based test size feature adds flexibility without compromising the core functionality.
Consider these future improvements:
- Add logging for important state changes and error conditions
- Consider implementing a caching strategy for frequently accessed data
- Add metrics collection points to measure the performance impact of these changes
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
deepmd/utils/data_system.py
(1 hunks)
🔇 Additional comments (2)
deepmd/utils/data_system.py (2)
103-104
: LGTM! Proper exception handling implemented.
Good replacement of assertion with proper exception handling, following Python best practices.
Line range hint 757-766
: Consider handling edge cases in test size calculation.
The current implementation might result in zero test size for small datasets when using percentage-based calculation. For example, if batch_size * nbatches = 10
and percent = 5
, the calculation int(10 * 5 / 100)
would yield 0.
Consider adding a minimum test size check:
def _make_auto_ts(self, percent):
ts = []
for ii in range(self.nsystems):
ni = self.batch_size[ii] * self.nbatches[ii]
tsi = int(ni * percent / 100)
+ # Ensure at least one test sample if percentage is non-zero
+ if percent > 0 and tsi == 0:
+ tsi = 1
ts.append(tsi)
return ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
deepmd/utils/data.py (2)
63-64
: LGTM! Consider enhancing the error message.The directory existence check is a good addition that provides early validation. However, the error message could be more descriptive.
Consider this enhanced error message:
- raise FileNotFoundError(f"System {sys_path} is not found!") + raise FileNotFoundError(f"System directory '{sys_path}' does not exist or is not accessible!")
63-64
: Validation sequence is well-structured.The new directory existence check is a good addition that complements the existing validation sequence. Each subsequent check serves a distinct purpose, maintaining a clear separation of concerns:
- Directory existence (basic filesystem)
- Set prefix (data organization)
- Mixed type (data format)
- Atom types (data content)
- PBC settings (simulation parameters)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
deepmd/utils/data.py
(1 hunks)deepmd/utils/path.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/utils/path.py
Systems are aggregated here https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L802 and later initialized here https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L809-L810 This process will instantiate `DeepmdData` class, and it will perform data integrity checks https://github.com/deepmodeling/deepmd-kit/blob/e695a91ca6f7a1c9c830ab1c58b7b7a05db3da23/deepmd/utils/data.py#L80-L82 Besides, the checking process enumerates all items for all ranks, which is unnecessary and quite slow. So this PR removes this check. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced flexibility in defining test sizes by allowing percentage input for the `test_size` parameter. - Introduced a new method to automatically compute test sizes based on the specified percentage of total data. - Improved path handling to accept both string and Path inputs, enhancing usability. - **Bug Fixes** - Improved error handling for invalid paths, ensuring users receive clear feedback when files are not found. - **Deprecation Notice** - The `get_test` method is now deprecated, with new logic implemented for loading test data when necessary. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Chun Cai <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Jinzhe Zeng <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> (cherry picked from commit 3917cf0)
Systems are aggregated here https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L802 and later initialized here https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L809-L810 This process will instantiate `DeepmdData` class, and it will perform data integrity checks https://github.com/deepmodeling/deepmd-kit/blob/e695a91ca6f7a1c9c830ab1c58b7b7a05db3da23/deepmd/utils/data.py#L80-L82 Besides, the checking process enumerates all items for all ranks, which is unnecessary and quite slow. So this PR removes this check. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - **New Features** - Enhanced flexibility in defining test sizes by allowing percentage input for the `test_size` parameter. - Introduced a new method to automatically compute test sizes based on the specified percentage of total data. - Improved path handling to accept both string and Path inputs, enhancing usability. - **Bug Fixes** - Improved error handling for invalid paths, ensuring users receive clear feedback when files are not found. - **Deprecation Notice** - The `get_test` method is now deprecated, with new logic implemented for loading test data when necessary. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Chun Cai <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Jinzhe Zeng <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> (cherry picked from commit 3917cf0)
Systems are aggregated here
deepmd-kit/deepmd/utils/data_system.py
Line 802 in f343a3b
and later initialized here
deepmd-kit/deepmd/utils/data_system.py
Lines 809 to 810 in f343a3b
This process will instantiate
DeepmdData
class, and it will perform data integrity checksdeepmd-kit/deepmd/utils/data.py
Lines 80 to 82 in e695a91
Besides, the checking process enumerates all items for all ranks, which is unnecessary and quite slow. So this PR removes this check.
Summary by CodeRabbit
New Features
test_size
parameter.Bug Fixes
Deprecation Notice
get_test
method is now deprecated, with new logic implemented for loading test data when necessary.