Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf: remove redundant checks on data integrity #4433

Merged
merged 8 commits into from
Dec 2, 2024

Conversation

caic99
Copy link
Member

@caic99 caic99 commented Nov 27, 2024

Systems are aggregated here

systems = process_systems(systems)

and later initialized here
data = DeepmdDataSystem(
systems=systems,

This process will instantiate DeepmdData class, and it will perform data integrity checks

assert (
optional_type_map or self.type_map is not None
), f"System {sys_path} must have type_map.raw in this mode! "

Besides, the checking process enumerates all items for all ranks, which is unnecessary and quite slow. So this PR removes this check.

Summary by CodeRabbit

  • New Features

    • Enhanced flexibility in defining test sizes by allowing percentage input for the test_size parameter.
    • Introduced a new method to automatically compute test sizes based on the specified percentage of total data.
    • Improved path handling to accept both string and Path inputs, enhancing usability.
  • Bug Fixes

    • Improved error handling for invalid paths, ensuring users receive clear feedback when files are not found.
  • Deprecation Notice

    • The get_test method is now deprecated, with new logic implemented for loading test data when necessary.

Copy link
Contributor

coderabbitai bot commented Nov 27, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The changes in the DeepmdDataSystem class within deepmd/utils/data_system.py enhance input validation by requiring at least one system in the constructor and allowing the test_size parameter to accept a string representing a percentage. A new private method, _make_auto_ts, computes test sizes based on this percentage. The get_test method is deprecated and now ensures that test data is loaded when accessed. Additionally, the DPOSPath and DPH5Path classes in deepmd/utils/path.py have improved type handling and error checking for path inputs. The DeepmdData class in deepmd/utils/data.py now includes error handling for invalid directory paths during instantiation.

Changes

File Change Summary
deepmd/utils/data_system.py - Updated constructor to accept test_size as Union[int, str>.
- Added _make_auto_ts method for calculating test sizes based on percentage.
- Deprecated get_test method with new loading logic.
- Simplified error handling with assertions for system presence.
deepmd/utils/path.py - Updated DPOSPath constructor to accept path as Union[str, Path] and added FileNotFoundError for non-existent paths.
- Added similar existence check in DPH5Path for root_path.
deepmd/utils/data.py - Added error handling in DeepmdData constructor to check for valid sys_path directory.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DeepmdDataSystem
    participant DataLoader

    User->>DeepmdDataSystem: Initialize with systems, batch_size, test_size as percentage
    DeepmdDataSystem->>DeepmdDataSystem: Check if test_size is string
    DeepmdDataSystem->>DeepmdDataSystem: Call _make_auto_ts to calculate test size
    DeepmdDataSystem->>DataLoader: Load test data if not already loaded
    DataLoader-->>DeepmdDataSystem: Return loaded test data
    DeepmdDataSystem-->>User: Provide test data
Loading

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
deepmd/utils/data_system.py (2)

Line range hint 516-538: Add proper deprecation warning mechanism.

While the method is marked as deprecated in the comment, it should use Python's built-in deprecation warning system to notify users.

Apply this change:

    def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1):  # depreciated
+        import warnings
+        warnings.warn(
+            "get_test is deprecated and will be removed in a future version.",
+            DeprecationWarning,
+            stacklevel=2
+        )
        """Get test data from the the data systems.

Line range hint 673-680: Add input validation and documentation.

The new method would benefit from:

  1. Input validation for the percentage value
  2. Proper docstring documentation

Consider applying these improvements:

     def _make_auto_ts(self, percent):
+        """Calculate test size for each system based on percentage.
+
+        Parameters
+        ----------
+        percent : int
+            Percentage of data to use for testing (0-100)
+
+        Returns
+        -------
+        list
+            List of test sizes for each system
+        """
+        if not 0 <= percent <= 100:
+            raise ValueError("Percentage must be between 0 and 100")
         ts = []
         for ii in range(self.nsystems):
             ni = self.batch_size[ii] * self.nbatches[ii]
             tsi = int(ni * percent / 100)
             ts.append(tsi)

         return ts
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 36884d2 and beab41c.

📒 Files selected for processing (1)
  • deepmd/utils/data_system.py (1 hunks)
🔇 Additional comments (2)
deepmd/utils/data_system.py (2)

758-758: LGTM: Clear and descriptive error message.

The error message clearly communicates the issue when no valid data systems are found.


Line range hint 12-24: LGTM: Enhanced test size configuration.

The changes provide more flexibility in specifying test sizes:

  • As a percentage of system data
  • Per-system individually
  • With proper validation and error handling

Copy link

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.75%. Comparing base (f343a3b) to head (89a3882).
Report is 12 commits behind head on devel.

Files with missing lines Patch % Lines
deepmd/utils/data.py 50.00% 1 Missing ⚠️
deepmd/utils/data_system.py 50.00% 1 Missing ⚠️
deepmd/utils/path.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4433      +/-   ##
==========================================
- Coverage   84.64%   83.75%   -0.90%     
==========================================
  Files         614      667      +53     
  Lines       57138    61516    +4378     
  Branches     3487     3487              
==========================================
+ Hits        48367    51525    +3158     
- Misses       7646     8864    +1218     
- Partials     1125     1127       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

deepmd/utils/data_system.py Outdated Show resolved Hide resolved
Co-authored-by: Jinzhe Zeng <[email protected]>
Signed-off-by: Chun Cai <[email protected]>
@njzjz
Copy link
Member

njzjz commented Nov 27, 2024

This process will instantiate DeepmdData class, and it will perform data integrity checks

I don't get your point. It seems one checks type.raw and the other one checks type_map.raw.

@caic99
Copy link
Member Author

caic99 commented Nov 28, 2024

It seems one checks type.raw and the other one checks type_map.raw.

@njzjz What I mean is that those files are read in the init process of DeepmdData, including type.raw as shown above, and other files required like type_map.raw.

self.mixed_type = self._check_mode(self.dirs[0])
for set_item in self.dirs[1:]:
assert self._check_mode(set_item) == self.mixed_type, error_format_msg
# load atom type
self.atom_type = self._load_type(root)
self.natoms = len(self.atom_type)
# load atom type map
self.type_map = self._load_type_map(root)

Thus, the data integrity has been checked in this process.

@njzjz
Copy link
Member

njzjz commented Nov 28, 2024

I got what you mean. Please also check the usage of process_systems below

training_systems = process_systems(training_systems)
if validation_systems is not None:
validation_systems = process_systems(validation_systems)

@caic99
Copy link
Member Author

caic99 commented Nov 28, 2024

Please also check the usage of process_systems below

Nice catch.

The systems are instantiated at

train_data_single = DpLoaderSet(
training_systems,

def construct_dataset(system):
return DeepmdDataSetForLoader(

self._data_system = DeepmdData(sys_path=system, type_map=self._type_map)

So it'll be OK.
Merging those two functions into one might me a better choice, but maybe in another PR.

@caic99 caic99 requested a review from njzjz November 29, 2024 02:25
@njzjz
Copy link
Member

njzjz commented Nov 29, 2024

It seems one checks type.raw and the other one checks type_map.raw.

@njzjz What I mean is that those files are read in the init process of DeepmdData, including type.raw as shown above, and other files required like type_map.raw.

self.mixed_type = self._check_mode(self.dirs[0])
for set_item in self.dirs[1:]:
assert self._check_mode(set_item) == self.mixed_type, error_format_msg
# load atom type
self.atom_type = self._load_type(root)
self.natoms = len(self.atom_type)
# load atom type map
self.type_map = self._load_type_map(root)

Thus, the data integrity has been checked in this process.

I read data.py and I don't think it checks whether the given path exists or not.

@caic99
Copy link
Member Author

caic99 commented Nov 29, 2024

I read data.py and I don't think it checks whether the given path exists or not.

def _load_type(self, sys_path: DPPath):
atom_type = (sys_path / "type.raw").load_txt(ndmin=1).astype(np.int32)
return atom_type
def _load_type_mix(self, set_name: DPPath):
type_path = set_name / "real_atom_types.npy"
real_type = type_path.load_numpy().astype(np.int32).reshape([-1, self.natoms])
return real_type

@njzjz It will raise an exception here if some path does not exist.

@njzjz
Copy link
Member

njzjz commented Nov 29, 2024

@njzjz It will raise an exception here if some path does not exist.

The error message seems to be less clear, which needs to be improved.

>>> import h5py
>>> with h5py.File("abc.hdf5", "w") as f:
...     pass
...
>>> from deepmd.utils.data import DeepmdData
>>> DeepmdData("abc.hdf5#azsffd")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/njzjz/codes/deepmd-kit/deepmd/utils/data.py", line 65, in __init__
    raise FileNotFoundError(f"No {set_prefix}.* is found in {sys_path}")
FileNotFoundError: No set.* is found in abc.hdf5#azsffd
>>> from deepmd.utils.data_system import process_systems
>>> process_systems(["abc.hdf5#azsffd"])
dir abc.hdf5#azsffd is not a valid dir
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/njzjz/codes/deepmd-kit/deepmd/utils/data_system.py", line 770, in process_systems
    raise OSError(msg, help_msg)
OSError: [Errno dir abc.hdf5#azsffd is not a valid dir] Please check your setting for data systems

@caic99
Copy link
Member Author

caic99 commented Nov 29, 2024

The error message seems to be less clear, which needs to be improved.

@njzjz You mean the error message from process_systems? Any suggestions?

@njzjz
Copy link
Member

njzjz commented Nov 29, 2024

The error message seems to be less clear, which needs to be improved.

@njzjz You mean the error message from process_systems? Any suggestions?

I mean the error message from DeepmdData, when you remove the error messages from process_systems. It does not report the input path is invalid.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
deepmd/utils/path.py (3)

170-172: Enhance error message and consider path creation scenarios

The path existence check is good for early validation, but consider:

  1. Making the error message more descriptive by including the intended operation
  2. Adding a parameter to optionally skip the existence check when the path will be created later
 def __init__(self, path: Union[str, Path], mode: str = "r") -> None:
     super().__init__()
     self.mode = mode
     self.path = Path(path)
-    if not self.path.exists():
-        raise FileNotFoundError(f"{self.path} not found")
+    if not self.path.exists():
+        raise FileNotFoundError(
+            f"Path '{self.path}' does not exist. "
+            f"Operation: initializing DPOSPath in {mode} mode"
+        )

303-304: Improve error message for HDF5 file validation

While the file existence check is good, the error message could be more descriptive to help users understand the context.

-    if not os.path.isfile(self.root_path):
-        raise FileNotFoundError(f"{self.root_path} not found")
+    if not os.path.isfile(self.root_path):
+        raise FileNotFoundError(
+            f"HDF5 file '{self.root_path}' does not exist or is not a regular file. "
+            f"Operation: initializing DPH5Path in {mode} mode"
+        )

170-172: Consider separating path validation improvements

While the path validation improvements are valuable, they seem tangential to the PR's main objective of removing redundant data integrity checks. Consider:

  1. Moving these path handling improvements to a separate PR focused on path validation
  2. This would make the current PR more focused on its performance optimization goal

Also applies to: 303-304

deepmd/utils/data_system.py (2)

103-103: Enhance the assertion error message for better debugging.

While the assertion is a good practice for failing fast, the error message could be more descriptive to help users understand what went wrong.

-        assert self.nsystems > 0, "No systems provided"
+        assert self.nsystems > 0, "No valid systems provided. Please check the 'systems' parameter in your configuration."

Line range hint 484-485: Add proper deprecation warning.

The method is marked as deprecated in the docstring, but it should raise a proper deprecation warning to help users transition to the new API.

    def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1):  # depreciated
+        warnings.warn(
+            "get_test is deprecated and will be removed in a future version.",
+            DeprecationWarning,
+            stacklevel=2
+        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0e1944a and b093fd7.

📒 Files selected for processing (2)
  • deepmd/utils/data_system.py (2 hunks)
  • deepmd/utils/path.py (3 hunks)
🔇 Additional comments (2)
deepmd/utils/path.py (2)

17-17: LGTM: Clean type import addition

The addition of Union import supports the enhanced type hints in the DPOSPath class.


161-161: LGTM: Enhanced type flexibility

The change to accept both str and Path types improves API flexibility while maintaining type safety.

Also applies to: 167-167

deepmd/utils/data_system.py Outdated Show resolved Hide resolved
deepmd/utils/data_system.py Outdated Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Chun Cai <[email protected]>
Signed-off-by: Chun Cai <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
deepmd/utils/data_system.py (3)

Line range hint 756-764: Add input validation for percentage and ensure minimum test size

The method should validate the input percentage range and ensure a minimum test size to prevent zero-sized test sets.

     def _make_auto_ts(self, percent):
+        if not 0 <= percent <= 100:
+            raise ValueError(f"Test size percentage must be between 0 and 100, got {percent}")
         ts = []
         for ii in range(self.nsystems):
             ni = self.batch_size[ii] * self.nbatches[ii]
             tsi = int(ni * percent / 100)
+            # Ensure at least one test sample when percentage is non-zero
+            if percent > 0 and tsi == 0:
+                tsi = 1
             ts.append(tsi)

         return ts

Line range hint 684-684: Add proper deprecation warning with migration path

The method is marked as deprecated but lacks a proper warning. Also, "depreciated" is misspelled (should be "deprecated").

-    def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1):  # depreciated
+    def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1):
+        """Get test data from the data systems.
+        
+        .. deprecated:: 2.0.0
+           Use get_test_data() instead. This method will be removed in version 3.0.0.
+        """
+        import warnings
+        warnings.warn(
+            "get_test() is deprecated and will be removed in version 3.0.0. "
+            "Use get_test_data() instead.",
+            DeprecationWarning,
+            stacklevel=2
+        )

Line range hint 1-764: Consider additional optimization opportunities

While the current changes improve test data handling, there might be more opportunities to remove redundant data integrity checks as mentioned in the PR objectives. Consider reviewing:

  1. The type map consistency check in _check_type_map_consistency
  2. The batch size checks in the constructor
  3. The system probability validation in process_sys_probs

These areas might benefit from similar optimizations to improve performance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b093fd7 and 7331472.

📒 Files selected for processing (1)
  • deepmd/utils/data_system.py (1 hunks)

deepmd/utils/data_system.py Outdated Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Chun Cai <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
deepmd/utils/data_system.py (2)

Line range hint 716-738: Add proper deprecation warning.

The method is marked as deprecated in comments, but it should use Python's built-in deprecation warning mechanism to properly inform users.

Add a deprecation warning:

     def get_test(self, sys_idx: Optional[int] = None, n_test: int = -1):  # depreciated
+        import warnings
+        warnings.warn(
+            "get_test is deprecated and will be removed in a future version.",
+            DeprecationWarning,
+            stacklevel=2
+        )
         if not hasattr(self, "test_data"):
             self._load_test(ntests=n_test)

Line range hint 1-766: Performance optimization approach looks good.

The changes successfully remove redundant data integrity checks while maintaining proper error handling where necessary. The new percentage-based test size feature adds flexibility without compromising the core functionality.

Consider these future improvements:

  1. Add logging for important state changes and error conditions
  2. Consider implementing a caching strategy for frequently accessed data
  3. Add metrics collection points to measure the performance impact of these changes
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7331472 and fd59449.

📒 Files selected for processing (1)
  • deepmd/utils/data_system.py (1 hunks)
🔇 Additional comments (2)
deepmd/utils/data_system.py (2)

103-104: LGTM! Proper exception handling implemented.

Good replacement of assertion with proper exception handling, following Python best practices.


Line range hint 757-766: Consider handling edge cases in test size calculation.

The current implementation might result in zero test size for small datasets when using percentage-based calculation. For example, if batch_size * nbatches = 10 and percent = 5, the calculation int(10 * 5 / 100) would yield 0.

Consider adding a minimum test size check:

     def _make_auto_ts(self, percent):
         ts = []
         for ii in range(self.nsystems):
             ni = self.batch_size[ii] * self.nbatches[ii]
             tsi = int(ni * percent / 100)
+            # Ensure at least one test sample if percentage is non-zero
+            if percent > 0 and tsi == 0:
+                tsi = 1
             ts.append(tsi)
         return ts

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
deepmd/utils/data.py (2)

63-64: LGTM! Consider enhancing the error message.

The directory existence check is a good addition that provides early validation. However, the error message could be more descriptive.

Consider this enhanced error message:

-            raise FileNotFoundError(f"System {sys_path} is not found!")
+            raise FileNotFoundError(f"System directory '{sys_path}' does not exist or is not accessible!")

63-64: Validation sequence is well-structured.

The new directory existence check is a good addition that complements the existing validation sequence. Each subsequent check serves a distinct purpose, maintaining a clear separation of concerns:

  1. Directory existence (basic filesystem)
  2. Set prefix (data organization)
  3. Mixed type (data format)
  4. Atom types (data content)
  5. PBC settings (simulation parameters)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fd59449 and 89a3882.

📒 Files selected for processing (2)
  • deepmd/utils/data.py (1 hunks)
  • deepmd/utils/path.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/utils/path.py

@caic99 caic99 requested a review from njzjz December 2, 2024 05:25
@njzjz njzjz added this pull request to the merge queue Dec 2, 2024
Merged via the queue into deepmodeling:devel with commit 3917cf0 Dec 2, 2024
60 checks passed
@caic99 caic99 deleted the perf-datasys branch December 3, 2024 01:57
@njzjz njzjz added this to the v3.0.1 milestone Dec 6, 2024
njzjz pushed a commit to njzjz/deepmd-kit that referenced this pull request Dec 22, 2024
Systems are aggregated here

https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L802
and later initialized here

https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L809-L810

This process will instantiate `DeepmdData` class, and it will perform
data integrity checks

https://github.com/deepmodeling/deepmd-kit/blob/e695a91ca6f7a1c9c830ab1c58b7b7a05db3da23/deepmd/utils/data.py#L80-L82

Besides, the checking process enumerates all items for all ranks, which
is unnecessary and quite slow. So this PR removes this check.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced flexibility in defining test sizes by allowing percentage
input for the `test_size` parameter.
- Introduced a new method to automatically compute test sizes based on
the specified percentage of total data.
- Improved path handling to accept both string and Path inputs,
enhancing usability.

- **Bug Fixes**
- Improved error handling for invalid paths, ensuring users receive
clear feedback when files are not found.

- **Deprecation Notice**
- The `get_test` method is now deprecated, with new logic implemented
for loading test data when necessary.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chun Cai <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jinzhe Zeng <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
(cherry picked from commit 3917cf0)
njzjz pushed a commit that referenced this pull request Dec 23, 2024
Systems are aggregated here

https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L802
and later initialized here

https://github.com/deepmodeling/deepmd-kit/blob/f343a3b212edab5525502e0261f3068c0b6fb1f6/deepmd/utils/data_system.py#L809-L810

This process will instantiate `DeepmdData` class, and it will perform
data integrity checks

https://github.com/deepmodeling/deepmd-kit/blob/e695a91ca6f7a1c9c830ab1c58b7b7a05db3da23/deepmd/utils/data.py#L80-L82

Besides, the checking process enumerates all items for all ranks, which
is unnecessary and quite slow. So this PR removes this check.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced flexibility in defining test sizes by allowing percentage
input for the `test_size` parameter.
- Introduced a new method to automatically compute test sizes based on
the specified percentage of total data.
- Improved path handling to accept both string and Path inputs,
enhancing usability.

- **Bug Fixes**
- Improved error handling for invalid paths, ensuring users receive
clear feedback when files are not found.

- **Deprecation Notice**
- The `get_test` method is now deprecated, with new logic implemented
for loading test data when necessary.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chun Cai <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jinzhe Zeng <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
(cherry picked from commit 3917cf0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants