Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined values in bins occupied by insertion in membrane. #45

Open
ojeda-e opened this issue Jul 9, 2021 · 7 comments
Open

Undefined values in bins occupied by insertion in membrane. #45

ojeda-e opened this issue Jul 9, 2021 · 7 comments
Assignees
Labels
refactoring code rewrites without functional changes

Comments

@ojeda-e
Copy link
Member

ojeda-e commented Jul 9, 2021

When calculating get_z_surface undefined values in bins inside the embedded element (i.e. protein) may arise since no z coordinates populate bins in the grid. Such undefined values spread in the array during the calculation of curvature.

One possible solution is to interpolate the bins occupied by the protein. I would avoid using pandas although that approach seems the most straightforward. Check np.interpolate.

@ojeda-e ojeda-e added the refactoring code rewrites without functional changes label Jul 9, 2021
@ojeda-e ojeda-e self-assigned this Jul 9, 2021
@orbeckst
Copy link
Member

pandas is slow, use numpy/scipy!

In GridDataFormats we have the option to interpolate on a grid https://github.com/MDAnalysis/GridDataFormats/blob/be6132ac13041390a880061e4e873044b6c29573/gridData/core.py#L311 ; this is not the cleanest code but perhaps useful to initially look at.

@lilyminium
Copy link
Member

Commenting here instead of the PR so that discussion is in one place: I'm not sure that this is a good idea. There is no membrane where the protein is, so there shouldn't be any membrane curvature there either.

@ojeda-e
Copy link
Member Author

ojeda-e commented Jul 22, 2021

Thanks @lilyminium .

I was considering having the interpolation optional and adding a warning message when np.nans are simply too much. I didn't add anything in base.py yet, this is the function only. Here are three reasons to consider.

  1. If there are np.nans in the region of the protein, the adjacent values will be also np.nans. It means we may lose information around the protein. Yes, in the region of the protein we already lost data. But keeping the nans means the resulting array will add np.nans to all the adjacent cells. To make this statement visual, here is an example of what happens when that region is not interpolated, and how the calculation of the gradients spread nans. See for instance how a single undefined value ends up in an extended region of np.nan in the bottom left corner. Same happens in all the cells that delineates the protein.

image

  1. With no interpolation, contours won't happen. plt.contours doesn't work if the data has np.nans. It will limit the user to use plt.imshow.

  2. Even in the case when we say, ok, fine, let's go with plt.imshow instead, the np.nans remain problematic unless you opt for a None interpolation. Interpolations are not only more pleasant to read, but also make results clearer. Almost all the interpolation methods are based on gradients. And again, gradient of np.nan is more np.nan (annoying np.nans). To visualize this limitation, let's see the plt.imshow plot of Gaussian curvature looks like using with different types of interpolation:

image
Even with the best-behaved interpolations (marked with cyan rectangles), there is a potential loss of information, more or less significant, depending on the interpolation method implemented by plt.imshow.

In summary, untreated undefined values may result in extended regions of np.nans. Which ultimately results in a considerable loss of information.

The user will have an option to interpolate the np.nans. A warning of " You have too many nans and your results will be driven by undefined values, check your input" kind of thing can pop when:

  1. the user overestimates the number of n_x_bins, n_y_bins in the grid,
    or
  2. the protein (or any object inserted) occupies too many n_x_bins, n_y_bins in the cell (because either the section of the protein in the membrane is huge, or either because the user underestimated the number of n_x_bins, n_y_bins in the grid).

But in any case, and in the way I see the solution to calculate curvature, the option of interpolation has to be provided. I was considering an option to mask the protein, but I won't elaborate on it (unless I am asked to) and will leave it as an enahcement.

@lilyminium
Copy link
Member

Thank you for your detailed and clear explanation, @ojeda-e. However, I think I must be missing something, because I don't understand how the protein causes that nan value on the edge. How big is the membrane relative to grid you've illustrated?

I would think that the more flexible solution is to return all the data and let users make their own choices about how they want to interpolate and visualize the grid.

In summary, untreated undefined values may result in extended regions of np.nans. Which ultimately results in a considerable loss of information.

Is this still true with your current code that uses nanmean to calculate the mean curvatures?

@ojeda-e
Copy link
Member Author

ojeda-e commented Jul 22, 2021

Sorry @lilyminium, my bad for not clarifying.

My intention with the illustration was to show the same effect (spread of undefined values) due to separate effects:

  1. The effect of an np.nan when it is at at the edge (the ones in the squares), and
  2. the effect of a group of np.nans (coming from the embedded protein).

In the first case is easier to identify how a single np.nan spreads over the calculation of curvature. But 1 and 2 are not correlated in any fashion with those because of the insertion in the membrane (protein). :)

Is this still true with your current code that uses nanmean to calculate the mean curvatures?

You mean, if even by using nanmean we still may have a considerable loss of information?

My answer is: I think yes, it can happen, but it depends on several factors. For example, in systems with extremely short simulation time (low n_frames == bad idea); (and/or) if the user overestimates bins and during the n_frames not even one lipid populates. It can even depend on the lipid composition. For example. If I have a membrane with (let's be extra) 90:10 CHOL:PC. We would have only 1/10 of the total number of lipids in the system to use as a reference to derive the surface. Then it will be hard to fill all the cells in the grid (but again, it also depends on how big the unit cells are).
In these cases, there may be space for a case in which a full array is nans. It's unlikely but I think it can happen.

Edit: Maybe we can have some opinions from @MDAnalysis/gsoc-mentors ?

@hmacdope
Copy link
Member

I'm very impressed by your detailed analysis @ojeda-e! In general my approach is that a np.nan is a np.nan and so be it. You can give the user the option of using some kind of method to get rid of them with something like a new_data = old_data.interpolate(kernel='gaussian'). I (try) to view np.nan as a friend rather than an enemy as it stops you getting absolute rubbish and not realising it as you might under some other circumstances.

@ojeda-e
Copy link
Member Author

ojeda-e commented Jul 23, 2021

Thanks, @hmacdope.
I agree with the general approach of np.nan is np.nan. However, in the calculation of curvature, there are two serious limitations, that when combined, the problem gets more complicated.
One is sampling. We know that. We know that we need i) a high amount of points to derive a surface, and ii) a high number of frames.
The second is the definition of curvature itself. Curvature is defined by second partial derivatives. As it is in curvature.py.
The equations are

Now, the fact that we calculate second partial derivatives, means that we calculate gradient twice. In this example, we can see the effect of only one np.nan in the calculation of first and second gradients.

image

These are some of the reasons why calculating the curvature of biological membranes can get extremely hard. The formalism asks to have second partial derivatives. Gradients are calculated as differences between adjacent values. Now, add to this rigid formalism that we depend on the lipid composition of the membrane. So, if have a bilayer with 1:1:1 phospholipid_1:phospholipid_2:other_lipid, where other_lipid is any type of lipid that has a flip-flop rate high enough that we have several flip-fllop events in the interval we are averaging, well, we lose 1/3 of the elements used to derive the surface. Now, imagine that, on top of all the above you had a simulation setup small enough that your protein covers lots of unit cells.
You'll get np.nans. If you don't treat them, the calculated curvature, according to the equations shown, will be meaningless.

In this context, and how I see the problem, having the option to interpolate is necessary. The possibility of abusers of interpolation methods, that's another story. That the user is calculating curvature in a simulation setup that is not suitable to calculate curvature, that's another story too. The user will be more than welcome to turn interpolation off. But it should be offered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring code rewrites without functional changes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants