
(fix) Make bias statistics complete for all elements #4496

Draft · wants to merge 10 commits into base: devel

Conversation

SumGuo-88 commented Dec 23, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced functionality to handle and report missing elements in datasets based on atomic types.
    • Improved mapping of element types to frame indices for better data organization.
    • Introduced a new method to retrieve frames for specific element types.
  • Tests

    • Added unit tests for the make_stat_input function to ensure accurate processing of atomic types.

coderabbitai bot commented Dec 23, 2024 (Contributor)

📝 Walkthrough

The pull request introduces modifications in the DeepMD-kit's PyTorch utility modules. In dataset.py, a new private method _build_element_to_frames is added to the DeepmdDataSetForLoader class, which constructs a mapping of element types to their corresponding frame indexes. In stat.py, the make_stat_input function is enhanced to manage missing atomic types by retrieving frame data from datasets, thereby ensuring comprehensive statistical representation across different element types. Additionally, a new test file is created to validate the functionality of the updated make_stat_input function.
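The mapping the walkthrough describes can be sketched in isolation. This is an illustrative stand-alone version, not the PR's exact code: `build_element_to_frames` and the `max_frames_per_element` parameter are hypothetical names, while the real method lives on `DeepmdDataSetForLoader` and reads frames via `get_item_torch`.

```python
# Hypothetical stand-alone sketch of the mapping described above: for each
# frame, record its index under every element type it contains, capped at
# max_frames_per_element entries per element.
def build_element_to_frames(frames_atypes, ntypes, max_frames_per_element=10):
    element_to_frames = {element: [] for element in range(ntypes)}
    for frame_idx, atypes in enumerate(frames_atypes):
        for element in set(atypes):
            if len(element_to_frames[element]) < max_frames_per_element:
                element_to_frames[element].append(frame_idx)
    return element_to_frames

# Three toy frames, each listing the atom types it contains:
mapping = build_element_to_frames([[0, 0, 1], [1, 2], [0, 2, 2]], ntypes=3)
```

With this toy input, element 0 maps to frames [0, 2], element 1 to [0, 1], and element 2 to [1, 2]; stat collection can then fetch frames for any element that ordinary batch sampling missed.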

Changes

  • deepmd/pt/utils/dataset.py: Added private method _build_element_to_frames() to map element types to frame indexes and public method get_frames_for_element(); updated the constructor to call the new method.
  • deepmd/pt/utils/stat.py: Enhanced the make_stat_input() function to identify and collect data for missing atomic types from datasets.
  • source/tests/pt/test_make_stat_input.py: Introduced unit tests for make_stat_input, including classes TestDataset and TestMakeStatInput with relevant test methods.

Possibly related PRs

  • Fix: Atomic stat with multi-system #4370: The changes in this PR enhance the handling of atomic types in the make_stat_input function, which relies on the element_to_frames attribute introduced in the main PR, establishing a direct connection between the two.
  • refactor: simplify dataset construction #4437: This PR introduces a new function construct_dataset in the DpLoaderSet class that creates instances of DeepmdDataSetForLoader, which is directly related to the new methods added in the main PR for managing element frames.

Suggested reviewers

  • njzjz
  • CaRoLZhangxy
  • wanghan-iapcm


coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

🧹 Nitpick comments (1)
deepmd/pt/utils/stat.py (1)

110-119: Avoid potential index-out-of-range errors when concatenating Tensors.
The code checks “isinstance(sys_stat[key], list) and isinstance(sys_stat[key][0], torch.Tensor)” without verifying that the list is non-empty. This might raise an IndexError if the list is empty. Consider adding a length check before accessing sys_stat[key][0].

Possible safe check:

-if isinstance(sys_stat[key], list) and isinstance(sys_stat[key][0], torch.Tensor):
+if isinstance(sys_stat[key], list) and len(sys_stat[key]) > 0 \
+   and isinstance(sys_stat[key][0], torch.Tensor):
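The suggested guard relies on Python's `and` short-circuiting, so the length check runs before `value[0]` is evaluated. A minimal sketch with a stand-in tensor class (`FakeTensor` and `needs_concat` are hypothetical; `torch.Tensor` plays the role of `FakeTensor` in the actual code):

```python
# Sketch of the suggested guard: `and` short-circuits, so the length check
# runs before value[0], and an empty list returns False instead of raising
# IndexError. FakeTensor is a stand-in for torch.Tensor.
class FakeTensor:
    pass

def needs_concat(value, tensor_type=FakeTensor):
    return (
        isinstance(value, list)
        and len(value) > 0
        and isinstance(value[0], tensor_type)
    )

assert needs_concat([FakeTensor(), FakeTensor()]) is True
assert needs_concat([]) is False          # empty list: no IndexError
assert needs_concat("not a list") is False
```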
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfe17a3 and 32da243.

📒 Files selected for processing (2)
  • deepmd/pt/utils/dataset.py (1 hunks)
  • deepmd/pt/utils/stat.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/utils/dataset.py

44-44: SyntaxError: Expected an indented block after function definition

🔇 Additional comments (3)
deepmd/pt/utils/stat.py (3)

91-96: Validate 'atype' presence and handle empty arrays gracefully.
While updating the collected elements, consider explicitly checking that each 'atype' array is non-empty. If a dataset yields an empty array, it might bypass your missing-elements logic.

✅ Verification successful

Let me gather more information about how atype is handled in the codebase.


Let me try again with the correct file type flag for ripgrep.


Based on the code analysis, I can now generate the final response:

Empty arrays are handled safely by numpy.unique and set.update
The code uses numpy.unique() which safely handles empty arrays by returning an empty array, and set.update() which safely handles empty iterables. The logic for missing elements works correctly even with empty arrays since an empty array's unique values would simply not add any elements to the set.
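A quick check of the behavior described: `np.unique` on an empty array returns an empty array, and `set.update` on an empty iterable is a no-op, so empty "atype" arrays cannot corrupt the collected-element set.

```python
import numpy as np

# Empty array: np.unique returns an empty array, set.update adds nothing.
collected_elements = set()
collected_elements.update(np.unique(np.array([], dtype=np.int64)))
assert collected_elements == set()

# Non-empty array: duplicates collapse, each type is recorded once.
collected_elements.update(np.unique(np.array([0, 1, 1, 2])))
assert collected_elements == {0, 1, 2}
```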

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for patterns related to 'atype' handling and array emptiness checks
rg -A 3 "atype" --type python

# Also look for the specific class/function containing this code
ast-grep --pattern $'def $_(self, $_):
    $$$
    collected_elements = set()
    for sys_stat in lst:
        if "atype" in sys_stat:
            collected_elements.update($$$)
    $$$'

Length of output: 248


Script:

#!/bin/bash
# Search for patterns related to 'atype' handling and array emptiness checks
rg -A 3 "atype" -t py

# Look for similar array handling patterns in the codebase
rg -t py "numpy\.unique|\.update.*numpy" -A 2

# Find the specific file we're analyzing
fd "stat.py" --type f

Length of output: 89317


85-90: Consider using logging and verifying that all datasets are included.
You are only collecting the set of elements from the first dataset and simply printing them. If your intent is to gather possible elements across all datasets, you may want to iterate over all datasets. Also, using "log.info" would be more consistent than a raw "print" in a library setting.

Possible fix for the print statement:

-        print('we want', all_elements)
+        log.info(f"We want elements: {all_elements}")

97-109: Ensure partial coverage logic for missing elements is intended.
Here, frames are appended until the number of records reaches 'nbatches'. Verify that the short-circuiting with “break” actually meets your coverage needs for missing elements. If your goal is to gather as many frames as possible across all missing elements, you might need a more explicit limit or a different approach.

✅ Verification successful

Let me gather more context about this code to better understand the coverage logic for missing elements.


Let me search for more context about the initialization and usage of nbatches to better understand the coverage requirements.


Let me search for the initialization of nbatches to understand its purpose in the coverage logic.


Let me search for more context about how nbatches is used and its relationship with the dataset statistics.


Based on the gathered context, I can now provide a final response about the coverage logic for missing elements.

The current partial coverage logic is intentional and correct.

The code's behavior is appropriate because:

  1. nbatches is a parameter that defines the desired number of batches for collecting statistics, used consistently throughout the make_stat_input function
  2. The break condition if len(lst[i]['atype']) >= nbatches ensures each system collects up to the specified number of batches, preventing over-sampling
  3. This is consistent with how regular batches are collected earlier in the function where numb_batches = min(nbatches, len(dataloaders[i])) is used
  4. The goal is to gather a statistically significant sample (controlled by nbatches) rather than exhaustively processing all frames containing missing elements

The current implementation correctly balances between gathering sufficient data for statistics while maintaining consistent batch sizes across all elements.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Let's first look at the broader context of this code
rg -B 10 -A 5 "for missing_element in missing_elements" deepmd/pt/utils/stat.py

# Also search for any related tests that might help understand the intended behavior
fd "test.*stat.*py" -x rg -l "missing_element"

# Look for the initialization of nbatches and lst variables
rg -B 5 "nbatches.*=|lst.*=" deepmd/pt/utils/stat.py

Length of output: 1724


Script:

#!/bin/bash
# Search for the function definition and its parameters
ast-grep --pattern 'def $FUNC($PARAMS) {
  $$$
  nbatches
  $$$
}'  deepmd/pt/utils/stat.py

# Look for any documentation or comments about nbatches
rg -B 3 -A 3 "nbatches" deepmd/pt/utils/stat.py

# Check if there are any tests that verify this behavior
fd "test.*stat.*py" -x cat {} 

Length of output: 66294


Script:

#!/bin/bash
# Look for the function definition and documentation
rg -B 10 "def make_stat_input" deepmd/pt/utils/stat.py

# Search for any tests that verify the behavior with missing elements
rg -B 5 -A 5 "missing_element" deepmd/pt/utils/stat.py

Length of output: 1079


Script:

#!/bin/bash
# Look for the full function implementation and its docstring
rg -B 20 -A 50 "def make_stat_input" deepmd/pt/utils/stat.py

# Search for any tests that verify the statistics collection
fd "test.*stat.*py" -x rg -A 10 "make_stat_input"

# Look for the initialization of lst and how frames are collected
rg -B 5 "lst.*=.*\[\]" deepmd/pt/utils/stat.py

Length of output: 4539

Comment on lines 43 to 54
        def _build_element_to_frames(self):
        """Mapping element types to frame indexes"""
        element_to_frames = {element: [] for element in range(self._ntypes)}
        for frame_idx in range(len(self)):
            frame_data = self._data_system.get_item_torch(frame_idx)

            elements = frame_data["atype"]
            for element in set(elements):
                if len(element_to_frames[element]) < 10:
                    element_to_frames[element].append(frame_idx)
        return element_to_frames


⚠️ Potential issue

Fix indentation error and consider making the max frame limit configurable.
Static analysis flags a SyntaxError likely due to insufficient indentation after defining the method. Ensure this block is indented so that Python recognizes it as part of the method. Also, the hard-coded limit of 10 frames per element may need to be a configurable parameter if you expect variation in your datasets.

Example indentation fix:

-        def _build_element_to_frames(self):
-        """Mapping element types to frame indexes"""
-        element_to_frames = {element: [] for element in range(self._ntypes)} 
+    def _build_element_to_frames(self):
+        """Mapping element types to frame indexes"""
+        element_to_frames = {element: [] for element in range(self._ntypes)}

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

44-44: SyntaxError: Expected an indented block after function definition

coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

🧹 Nitpick comments (2)
deepmd/pt/utils/stat.py (2)

89-89: Remove debug print statement.

The print statement print("we want", all_elements) appears to be debug code that should be removed or replaced with proper logging.

-        print("we want", all_elements)
+        log.debug(f"Required elements for statistics: {all_elements}")

97-111: Optimize nested loops and add error handling.

The nested loops for handling missing elements could be optimized, and error handling should be added for invalid frame indices.

         for missing_element in missing_elements:
             for i, dataset in enumerate(datasets):
                 if hasattr(dataset, "element_to_frames"):
-                    frame_indices = dataset.element_to_frames.get(
-                        missing_element, []
-                    )
+                    try:
+                        frame_indices = dataset.element_to_frames.get(missing_element, [])
+                        if not frame_indices:
+                            continue
+                            
+                        # Pre-check if we need more frames
+                        if len(lst[i]["atype"]) >= nbatches:
+                            break
+                            
+                        # Process frames in batch
+                        for frame_idx in frame_indices:
+                            frame_data = dataset[frame_idx]
+                            if any(key not in lst[i] for key in frame_data):
+                                lst[i].update({key: [] for key in frame_data if key not in lst[i]})
+                            for key in frame_data:
+                                lst[i][key].append(frame_data[key])
+                            if len(lst[i]["atype"]) >= nbatches:
+                                break
+                    except Exception as e:
+                        log.warning(f"Error processing frames for element {missing_element}: {e}")
+                        continue
-                    for frame_idx in frame_indices:
-                        if len(lst[i]["atype"]) >= nbatches:
-                            break
-                        frame_data = dataset[frame_idx]
-                        for key in frame_data:
-                            if key not in lst[i]:
-                                lst[i][key] = []
-                            lst[i][key].append(frame_data[key])
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32da243 and adf2315.

📒 Files selected for processing (2)
  • deepmd/pt/utils/dataset.py (1 hunks)
  • deepmd/pt/utils/stat.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
deepmd/pt/utils/dataset.py

44-44: SyntaxError: Expected an indented block after function definition

🔇 Additional comments (2)
deepmd/pt/utils/dataset.py (2)

43-44: ⚠️ Potential issue

Fix the indentation error in method definition.

The method definition has incorrect indentation which will cause a SyntaxError. It should be aligned with other class methods.

-        def _build_element_to_frames(self):
-        """Mapping element types to frame indexes"""
+    def _build_element_to_frames(self):
+        """Mapping element types to frame indexes"""

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

44-44: SyntaxError: Expected an indented block after function definition


43-53: 🛠️ Refactor suggestion

Make frame limit configurable and enhance documentation.

  1. The hard-coded limit of 10 frames per element should be configurable.
  2. The docstring should be more descriptive about the method's purpose and return value.
-    def _build_element_to_frames(self):
-        """Mapping element types to frame indexes"""
+    def _build_element_to_frames(self, max_frames_per_element: int = 10) -> dict[int, list[int]]:
+        """Build a mapping of element types to their corresponding frame indices.
+        
+        Args:
+            max_frames_per_element: Maximum number of frames to store per element type.
+            
+        Returns:
+            A dictionary mapping element types (int) to lists of frame indices (list[int])
+            where each element type appears.
+        """
         element_to_frames = {element: [] for element in range(self._ntypes)}
         for frame_idx in range(len(self)):
             frame_data = self._data_system.get_item_torch(frame_idx)
 
             elements = frame_data["atype"]
             for element in set(elements):
-                if len(element_to_frames[element]) < 10:
+                if len(element_to_frames[element]) < max_frames_per_element:
                     element_to_frames[element].append(frame_idx)
         return element_to_frames

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff (0.8.2)

44-44: SyntaxError: Expected an indented block after function definition

Comment on lines 85 to 118

all_elements = set()
if datasets and hasattr(datasets[0], "element_to_frames"):
    all_elements.update(datasets[0].element_to_frames.keys())
print("we want", all_elements)

collected_elements = set()
for sys_stat in lst:
    if "atype" in sys_stat:
        collected_elements.update(np.unique(sys_stat["atype"].cpu().numpy()))
missing_elements = all_elements - collected_elements

for missing_element in missing_elements:
    for i, dataset in enumerate(datasets):
        if hasattr(dataset, "element_to_frames"):
            frame_indices = dataset.element_to_frames.get(
                missing_element, []
            )
            for frame_idx in frame_indices:
                if len(lst[i]["atype"]) >= nbatches:
                    break
                frame_data = dataset[frame_idx]
                for key in frame_data:
                    if key not in lst[i]:
                        lst[i][key] = []
                    lst[i][key].append(frame_data[key])

collected_elements = set()
for sys_stat in lst:
    if "atype" in sys_stat:
        collected_elements.update(
            np.unique(sys_stat["atype"].cpu().numpy())
        )

collected_elements = set()
for sys_stat in lst:
if "atype" in sys_stat:
collected_elements.update(
np.unique(sys_stat["atype"].cpu().numpy())
)


💡 Codebase verification

Based on the code analysis, I can now provide a final response:

Add validation check for missing elements after collection

After the final collection of elements, add a validation check to ensure all required elements were found:

# After the final collected_elements update
if all_elements - collected_elements:
    missing = all_elements - collected_elements
    warnings.warn(f"Could not find frames containing elements: {missing}")
  • Add this check at line 118 in deepmd/pt/utils/stat.py after the final collection loop
  • Import warnings module at the top of the file
🔗 Analysis chain

Consider adding validation for collected elements.

After collecting elements, it would be good to verify that all required elements were found and log a warning if any are still missing.


Let me analyze the code further to understand the element collection and validation context.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are any elements that might be consistently missing across datasets

# Search for element type definitions and usages
rg -A 5 "atype|element_to_frames"

Length of output: 84908


Script:

#!/bin/bash
# Search for element validation or error handling related to missing elements
rg -A 5 "missing_elements|element_to_frames|all_elements"

Length of output: 2630

@@ -40,6 +40,18 @@
         b_data["natoms"] = self._natoms_vec
         return b_data

     def _build_element_to_frames(self):

Check warning

Code scanning / CodeQL

Unreachable code Warning

This statement is unreachable.
    collected_elements.update(
        np.unique(sys_stat["atype"].cpu().numpy())
    )

Check notice

Code scanning / CodeQL

Nested loops with same variable Note

Nested for statement uses loop variable 'sys_stat' of enclosing
for statement
.
iProzd marked this pull request as draft December 24, 2024 14:37
coderabbitai bot left a comment (Contributor)

Actionable comments posted: 4

🧹 Nitpick comments (4)
source/tests/pt/test_make_stat_input.py (4)

14-23: Consider using collections.defaultdict for element_to_frames.
You can simplify the nested checks for element presence in the dictionary by using a defaultdict(list), which would eliminate the need for the explicit if atype not in self.element_to_frames: condition.

-from collections import defaultdict

class TestDataset:
    def __init__(self, samples):
        self.samples = samples
-        self.element_to_frames = {}
+        from collections import defaultdict
+        self.element_to_frames = defaultdict(list)
        for idx, sample in enumerate(samples):
            atypes = sample["atype"]
            for atype in atypes:
-                if atype not in self.element_to_frames:
-                    self.element_to_frames[atype] = []
                self.element_to_frames[atype].append(idx)

25-28: Rename the property to better reflect usage.
Using @property but naming it get_all_atype can be confusing. Consider a more descriptive name like all_atypes, since Python properties typically avoid "get_" prefixes.


53-59: Remove or use the assigned lst variable.
The variable lst is assigned but never used, according to static analysis hints. Consider removing it or using it for additional assertions.

 def test_make_stat_input(self):
     nbatches = 1
-    lst = make_stat_input(self.datasets, self.dataloaders, nbatches=nbatches)
+    _ = make_stat_input(self.datasets, self.dataloaders, nbatches=nbatches)
     all_elements = self.system.get_all_atype
     unique_elements = {1, 2}
     self.assertEqual(unique_elements, all_elements, "make_stat_input miss elements")
🧰 Tools
🪛 Ruff (0.8.2)

55-55: Local variable lst is assigned to but never used

Remove assignment to unused variable lst

(F841)


61-62: Optional test runner inclusion.
Having the if __name__ == "__main__": unittest.main() block is fine. You could remove it if tests are run by a dedicated test runner.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adf2315 and dc64307.

📒 Files selected for processing (3)
  • deepmd/pt/utils/dataset.py (4 hunks)
  • deepmd/pt/utils/stat.py (1 hunks)
  • source/tests/pt/test_make_stat_input.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
source/tests/pt/test_make_stat_input.py

55-55: Local variable lst is assigned to but never used

Remove assignment to unused variable lst

(F841)

🔇 Additional comments (4)
source/tests/pt/test_make_stat_input.py (1)

40-52: Test setup looks good.
The dataset creation for testing is straightforward and clear. No issues found.

deepmd/pt/utils/dataset.py (2)

21-24: Docstring clarity is sufficient.
The docstring effectively describes constructor parameters. No corrections needed.


34-34: Initialization of element frames is a good approach.
Storing the result of _build_element_to_frames() in self.element_to_frames and self.get_all_atype reduces redundancy.

deepmd/pt/utils/stat.py (1)

86-94: No immediate issues with collection of atomic types.
Collecting and updating sets is correct.

Comment on lines +45 to +60
    def _build_element_to_frames(self):
        """Build mapping from element types to frame indexes and return all unique element types."""
        element_to_frames = {element: [] for element in range(self._ntypes)}
        all_elements = set()
        all_frame_data = self._data_system.get_batch(self._data_system.nframes)
        all_elements = np.unique(all_frame_data["type"])
        for i in range(len(self)):
            for element in all_elements:
                element_to_frames[element].append(i)
        return element_to_frames, all_elements

    def get_frames_for_element(self, missing_element_name):
        """Get the frames that contain the specified element type."""
        element_index = self._type_map.index(missing_element_name)
        return self.element_to_frames.get(element_index, [])


💡 Codebase verification

Based on the code analysis, I can now provide a final response about the memory usage concern:

Consider using get_item_torch() instead of get_batch() for building element_to_frames mapping

The current implementation loads all frames at once using get_batch(self._data_system.nframes) which can consume significant memory for large datasets. A more memory-efficient approach would be to iterate through frames using get_item_torch(), which is already implemented in the class for single frame access.

  • In _build_element_to_frames(), replace the batch loading with frame-by-frame processing:
all_elements = set()
for i in range(len(self)):
    frame_data = self._data_system.get_item_torch(i)
    frame_elements = np.unique(frame_data["type"])
    all_elements.update(frame_elements)
🔗 Analysis chain

Validate potential large memory usage when calling get_batch(nframes).
Calling self._data_system.get_batch(self._data_system.nframes) may consume significant memory in large datasets. Consider chunking or a more incremental approach if data size is large.


Let me search for more specific information about the data system implementation.


Let me check the implementation of the data system class to understand how the data is loaded and managed.


Let me check the implementation of the data system's get_batch method.


Let me check how the data is loaded and managed in the DeepmdData class since it's the underlying implementation used by DeepmdDataSystem.


Let me check the implementation of the _build_element_to_frames method and its usage.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Identify large dataset usage in codebase
# We will search for all calls to get_batch() to see if partial batch retrieval is possible.
rg -A 3 $'get_batch\\('

Length of output: 15959


Script:

#!/bin/bash
# Let's look at the data system implementation to understand the batch retrieval mechanism
ast-grep --pattern 'class DeepmdDataSystem {
  $$$
  get_batch($$$) {
    $$$
  }
  $$$
}'

# Also check if there are any chunking mechanisms already in place
rg -A 5 "chunk" 

Length of output: 2368


Script:

#!/bin/bash
# Check the implementation of DeepmdDataSystem class
rg -A 10 "class DeepmdDataSystem"

# Also check how nframes is defined/used
rg -A 3 "nframes"

Length of output: 83091


Script:

#!/bin/bash
# Look for the implementation of get_batch in DeepmdDataSystem
rg -A 15 "def get_batch" deepmd/utils/data_system.py

# Also check if there are any batch size related configurations
rg -A 3 "batch_size" deepmd/utils/data_system.py

Length of output: 6071


Script:

#!/bin/bash
# Check the implementation of DeepmdData class
rg -A 10 "class DeepmdData" deepmd/utils/data.py

# Also check how data is loaded and managed
rg -A 10 "def get_batch" deepmd/utils/data.py

Length of output: 744


Script:

#!/bin/bash
# Look for the implementation of _build_element_to_frames
rg -A 10 "_build_element_to_frames" deepmd/pt/utils/dataset.py

# Check where this method is used
rg -B 5 "_build_element_to_frames" deepmd/pt/utils/dataset.py

Length of output: 1675

Comment on lines +104 to +128
    for dd in frame_data:
        if dd == "type":
            continue
        if frame_data[dd] is None:
            sys_stat_new[dd] = None
        elif isinstance(frame_data[dd], np.ndarray):
            if dd not in sys_stat_new:
                sys_stat_new[dd] = []
            frame_data[dd] = torch.from_numpy(frame_data[dd])
            frame_data[dd] = frame_data[dd].unsqueeze(0)
            sys_stat_new[dd].append(frame_data[dd])
        elif isinstance(stat_data[dd], np.float32):
            sys_stat_new[dd] = frame_data[dd]
        else:
            pass
    for key in sys_stat_new:
        if isinstance(sys_stat_new[key], np.float32):
            pass
        elif sys_stat_new[key] is None or sys_stat_new[key][0] is None:
            sys_stat_new[key] = None
        elif isinstance(stat_data[dd], torch.Tensor):
            sys_stat_new[key] = torch.cat(sys_stat_new[key], dim=0)
    dict_to_device(sys_stat_new)
    lst.append(sys_stat_new)


⚠️ Potential issue

stat_data[dd] usage appears out of scope.
Inside this block, stat_data is not defined, which could raise a NameError. Ensure the intended variable is accessible in this scope or replace it with the correct reference.
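The hazard can be reproduced in miniature (names here are illustrative, not the PR's code): referencing a name that is never bound in the function's scope raises NameError the first time the branch is evaluated.

```python
# Minimal reproduction of the scoping bug flagged above: `stat_data` is
# never bound inside the function, so evaluating it raises NameError.
def buggy_merge(frame_data):
    for dd in frame_data:
        if isinstance(stat_data[dd], float):  # NameError: stat_data undefined
            return frame_data[dd]

try:
    buggy_merge({"energy": 1.5})
    caught = ""
except NameError as exc:
    caught = str(exc)
```

The fix is to test the variable that is actually in scope (`frame_data[dd]`) or to pass the intended statistics dictionary into the function explicitly.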

missing_element = all_element - collect_elements
for miss in missing_element:
    for i in datasets:
        if i.element_to_frames.get(miss, []) is not None:

🛠️ Refactor suggestion

Check for dictionary key presence vs. None.
dict.get(key, []) will never be None unless explicitly set as the default. This current check if i.element_to_frames.get(miss, []) is not None: is redundant.

-if i.element_to_frames.get(miss, []) is not None:
+if miss in i.element_to_frames:
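The point holds for any dictionary (the `element_to_frames` literal below is a stand-in, not the dataset attribute): `dict.get(key, [])` returns the default list for missing keys, never None, so the `is not None` guard always passes.

```python
# dict.get with a list default never returns None unless None was stored,
# so an `is not None` check on it is always true; membership testing
# expresses the intended "is this element mapped?" check directly.
element_to_frames = {1: [0, 2]}

always_true = element_to_frames.get(99, []) is not None  # True even for a missing key
present = 99 in element_to_frames                        # False: the meaningful check
```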

Comment on lines +95 to +103
for miss in missing_element:
for i in datasets:
if i.element_to_frames.get(miss, []) is not None:
frame_indices = i.element_to_frames.get(miss, [])
frame_data = i.__getitem__(frame_indices[0])
break
else:
pass
sys_stat_new = {}

⚠️ Potential issue

Potential index out-of-range error.
If frame_indices is empty, calling frame_data = i.__getitem__(frame_indices[0]) at line 99 will raise an error. Consider a safe check for an empty list before accessing [0].

 for miss in missing_element:
     for i in datasets:
         if i.element_to_frames.get(miss, []) is not None:
             frame_indices = i.element_to_frames.get(miss, [])
-            frame_data = i.__getitem__(frame_indices[0])
+            if frame_indices:
+                frame_data = i.__getitem__(frame_indices[0])
+            else:
+                continue
             break
         else:
             pass
