Fix issue #772, Incorrect zeros included in emsd average when particles are dropped #773
I've tested this code with imsd() and emsd() on test data, and it works. On actual movie data it is more accurate, but the difference is smaller than I feared. So issue #772 isn't so bad unless there are a lot of gaps in the data.
Pros of accepting my pull-request: Slightly more accurate emsd data! If there are a lot of gaps, then it's MUCH MUCH more accurate.
Cons: It takes slightly more time to run. With timeit, I see 30.6 ms ± 1.55 ms per loop before my edits versus 32.1 ms ± 1.43 ms per loop after my edits.
Issue #772 occurs when there are gaps in the data, so the problem shows up in _msd_gaps().
Here are the edits:
I updated _msd_gaps() to pass skipna=False to the sum() call. This way, values computed from no data show up as NaN (instead of showing up as zero).
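A minimal sketch of the difference, assuming a toy DataFrame whose column names and values are made up for illustration (this is not the actual code inside _msd_gaps()):

```python
import numpy as np
import pandas as pd

# Hypothetical per-lag-time squared displacements. The row for lag time 2
# has no data at all, as happens when the particle is missing in those frames.
sq_disp = pd.DataFrame(
    {"<x^2>": [1.0, np.nan], "<y^2>": [2.0, np.nan]},
    index=[1, 2],
)

# Default (skipna=True): the all-NaN row silently sums to 0.0.
msd_default = sq_disp.sum(axis=1)

# skipna=False: the all-NaN row stays NaN, so the missing data is visible.
msd_strict = sq_disp.sum(axis=1, skipna=False)

print(msd_default)
print(msd_strict)
```

With the default, a lag time with no data looks like a real msd of zero, which is the spurious zero described in issue #772.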
I also added some code to reset the estimated number of datapoints N to be 0 if there is no data.
result['N'] = np.where(result['msd'].isna(), 0, result['N'])
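To show what that line does, here is a small sketch on a made-up `result` frame (the values are assumptions for illustration; only the `np.where` line comes from the edit itself):

```python
import numpy as np
import pandas as pd

# Hypothetical result frame: lag time 2 has no msd data, but N was
# estimated as 2 from the trajectory length alone.
result = pd.DataFrame({"msd": [3.0, np.nan], "N": [5.0, 2.0]}, index=[1, 2])

# Wherever msd is NaN, set the number of datapoints N to 0; keep N otherwise.
result["N"] = np.where(result["msd"].isna(), 0, result["N"])

print(result)
```

Rows with real data keep their N, and rows without data get a weight of zero.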
I was hoping this would be enough, but it created another problem that I had to take care of. When the emsd is calculated, we are averaging in a NaN (albeit weighted as N=0), and it turns out NaN * 0 = NaN. I wanted it to come out to zero. So I had to reset the NaN to zero again. Awkward, but it works. Then the weighted term is 0 * 0 = 0, so it contributes nothing to the average, which is what I wanted, and the emsd is calculated correctly.
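The NaN * 0 behavior can be seen with plain NumPy arithmetic. This is a sketch with made-up values for two particles at a single lag time, not the actual emsd() code:

```python
import numpy as np

# Hypothetical values: particle A has real data, particle B has none.
msd = np.array([4.0, np.nan])
N = np.array([2.0, 0.0])

# NaN * 0 is NaN, so the zero weight does not remove the missing value.
naive = np.sum(msd * N) / np.sum(N)

# Resetting NaN to zero first makes the term 0 * 0 = 0, so it drops out.
fixed = np.sum(np.nan_to_num(msd) * N) / np.sum(N)

print(naive, fixed)
```

Because N is already 0 for the missing entry, replacing its NaN with zero changes nothing except letting the weighted sum go through.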
So this update does provide a more accurate emsd() calculation, and it removes the weird DROPS that show up in the imsd data (see Walkthrough output 31).
I plan to use this version going forward because I can't stand inaccuracies!