-
-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize categorizes (keep state, etc.) #687
Comments
Here it would also be nice if one could count the number of categories a single atom is existing in. |
Here are some more thoughts on how to generalize the categories. However,
Hmm, I might come up with some more, but here some discussion points, @pfebrer @ialcon comments? |
Hi @zerothi, I like very much the overall approach, and to me it looks quite logical and organized. I also like both the
In this example what it is not fully clear to me is to what Finally, returning to the iteration of results, which type of info would provide something like for result in results.B :
print(result.n_neighbors) # number of neighbors
print(result.neighbors) # indices of neighbors
... And, super finally, wouldn't make more sense than this for result in results :
print(result.B.n_neighbors) # number of neighbors
print(result.B.neighbors) # indices of neighbors
print(result.B.distances) # distance to each neighbor
... Of course I guess this also depends on how simple/complex you want each |
2/3. I don't understand the results you are proposing. Why do we need a 2D array of results? To me the result of categorizing should be an atom-wise property. I.e. an array of length
I think a clean approach would be to first define atom-wise properties, maybe under an
These properties, when combined with a geometry, should return an array containing the property value for each atom. Then, from them, using comparison operators, one could create a
These categories would be combined into composite categories as the current categories are now. Then one could also use these properties to create
The idea is that these atom properties I think could be much more versatile and clean, as they would be defined as atom-wise arrays of data. So their shape is more predictable. And they can be reused for other things. E.g. to color or scale atoms in plots 😄 I would advocate for their returns being simple arrays, but maybe I think the ring identification could be better off in a separate functionality unless we can frame it as an atom-wise problem. Otherwise it makes the concept of categories much more complicated. |
Indeed, that was the idea.
Yeah, here I am a bit more undecided. Since you are already in the
The main problem of the way things currently work is exactly that it is a category per atom.
Another motivation of mine, is the re, it works quite similarly. You query some quantities, and loop the results. EDIT, sorry didn't do an actual example. My motivation for this would be that one wanted to search for neighbors, but only with
The problem with properties, is that one might be querying properties with manual tweaking values, and then it gets fairly complicated to have something defined generically, it is also rather hard to make more complex searches.
I think that this would just be difficult to extend, whereas my approach seems a bit more generic, and would allow one to search also based on other criteria. For instance, a ring with 5 Carbon and 1 Nitrogen.
That was why I suggested to make it based on the full geometry, and not necessarily an My main worry is that limiting its functionality basically makes it useless. Using a @pfebrer see my edit! |
Isn't the problem that the category is computed atom by atom? I mean if you give the whole geometry to the property calculator and then it calculates an atom-wise property, that is not a limiting factor, no?
I understand this, but maybe a cleaner solution would be to cache the results of categorization / property calculation, so that a second query is basically for free?
No, why? You can get at least to the same level of complexity. E.g. imagine that the NNeighbor(R=3) # Gives you the number of neighbors up to a certain R.
NNeighbor(R=3, neighbor=Z > 1) # Gives you the number of neighbors without counting hydrogens. The only difference is that you specify the properties, not the categories, and then you create "categories" by using the comparison operators. I'd say that if something you have more freedom to get as complex as you want easily. You can for example create "categories" by comparing two properties: x > y # This would be a mask that selects atoms whose x coordinate is bigger than the y coordinate. And actually I have thought of something that could work quite good with the ring finder. The returned property could be of shape [
[0, 0, 0], # Atom 0 belongs to [ring_0, ring_1, ring_2] ?
[0, 1, 1], # Atom 1 belongs to [ring_0, ring_1, ring_2] ?
[1, 0, 0], # Atom 2 belongs to [ring_0, ring_1, ring_2] ?
...
] Which could be a sparse matrix, of course. But the point is that this would still be an atom-wise property. And doing things like: rings_property = RingFinder()
rings_count = np.sum(rings_property, axis=-1) Would give you a property that computes the number of rings to which each atom belongs. Again you would then be free to create a category however complex you want using this property. You could do many other cool things. E.g.: # Property that returns a (na, nZ) array specifying how many neighbors of each type an atom has.
neighs = NNeighbors(dim1=Z)
# Category that selects atoms that have 1 hydrogen neighbor and 2 carbon neighbors.
my_complex_category = neighs[:, 1] == 1 & neighs[:, 6] == 2 Wouldn't it be nice? 🥲 |
By the way all this lazy arithmetic (and the caching) could be supported by the |
I kind of also understand the point of @pfebrer and, somehow, agree with it. At the same time, I also understand that, as you say @zerothi, limiting to atom-wise categorization limits a lot what one could do (specially in terms of complexity and versatility). Would it be possible to have So, following the initial example by @zerothi, instead of doing: cat = CategoryRing(...) & CategoryZ(Z=6) * 5 You could have: get_rings = Ring([(AtomZ(6), 5), (AtomZ(7), 1)])
rings = Ring.classify(geometry)
for ring in rings :
print(ring.na)
print(ring.id_ring)
print(ring.idx)
print(ring.dihedral)
.... I am not sure if this makes sense to you, but to me is sort of trying to have some sort of hierarchy, so that we can keep separate the complexity of more versatile objects from the simplier atom-wise categories. |
Yes, the problem is that it is an
Yeah, this might be overdoing it, perhaps we shouldn't worry here. ;)
But what is the difference between this and a
I guess categories could do the same, no ?
This just looks horrendous? Wouldn't these groupings be simpler to loop in the proposed result method? Kind of like
I somewhat agree, i just always find API's that have some kind of flexibility very nice to work with, it takes a bit of time to work it out. But... That is why this discussion is important, what can be done, and how would it work ;) |
The way I see it, in the rings example you are creating multiple categories on the fly, one for each ring that you find: # This would give you the number of rings each atom belongs to
np.sum(rings_property, axis=-1)
# This would give you some measure of "connectivity" (through rings) between atoms
rings_property @ rings_property.T
# This would give you the atoms that belong to ring 2
np.where(rings_property[:, 2])
# Etc...
The convenience of the shape of the result depends on what you want to achieve afterwards. It is not obvious to me that with the one that you propose it is simpler to achieve @ialcon's original goal (separating the two ribbons of an NPG).
Yes, but the point is that if you allow the user to build their own conditions you don't have to implement all the possible conditions as categories.
The main point that I want to make is that a category (as it is now) is the combination of a property + a condition. Internally inside categories we are implementing complex properties like number of neighbors, but we are not exposing that for users to use however they want. Also, we are choosing which conditions to implement, which limits the functionality. Another thing is that what is now called Regarding the points of @ialcon, atom-wise properties could also have metadata, so I would say this is not an argument in favor of categories vs properties. |
I think that this discussion is, somehow, touching on a deeper discussion about what should be the limits of So, summing up, I kind of agree now that the approach of @pfebrer might be the best to go as of now. We know that these atom-wise |
I'm having a hard time understanding what you mean by categories that are not atom wise. What does "categorizing" mean for you? For me it means assigning a category to each entity. In the case of the geometry the most obvious entity is the atom. I understand that you might create higher level entities like fragments, and then you might categorize those fragments. But the algorithm that creates those fragments is a grouping operation and in my view it can be understood as categorizing atoms on the category: |
My proposal can search for any rings using a simple deterministic approach that defines how a ring should be percieved.
I just think that what you are doing is a quite drastic change, for instance for geometries with 100,000 atoms, you should pre-define the positions of the rings? How would you do that? and then you would get a list of 100,000 x rings? Ok sparse matrices, but... From what you propose you have to pre-define the rings, before you can make any meaningful sense of the return values, no? What does I am not saying your method can be useful for few, and well defined properties, such as
That was never the intent, the intent was still to do as we have it now, something that defines some basic categories, then combinations of these categories via boolean operations should behave as though they were a single category.
Not really, if you want users to extend the properties. Consider a user who wants to search for a defect in a system. Whether it is your or my approach, they would have to subclass a property/category anyways, no?
I agree that the way the categories are handled, it is a little restrictive when everything returned is a
So my question, is how do you define the
The basic I think that most users don't really think that the # properties (pfebrer, you might have thought more carefully about this, how would this look?
prop = geom.properties([Z == 6, x > 5, x < 10])
idx = np.sum(prop, axis=1).nonzero()[0]
# categories
cat = AtomZ(6) & AtomXYZ.x < 10 & AtomXYZ.x > 5
idx = cat.categorize(geom) I think the latter is more intuitive. While the
The real problem really is that we want to do I am not saying my |
But I propose to do exactly the same as you say, just shaping the result differently to fit in a more general framework (atom-wise properties).
No, it would work the same as your proposal. You are right that getting a certain ring index on advance doesn't make sense. You can only do it when you have the results.
No, it would be a sparse matrix, so basically it uses the same memory as the return that you propose.
If the user has to implement a new property, yes. If it's just a combination of existing properties with some comparison operators, then no.
I don't know, the same way that you define the category.
No, my idea is that properties can be used in a very similar way as categories. In this example: from sisl.atom_properties import Z, x
# Create a mask by creating a boolean atom property
mask = (Z == 6) & (5 < x < 10)
# Use it (e.g. in methods that have an atoms argument)
geom.sub(mask) But you can also use non-boolean properties (e.g. geom.plot(atoms_style={"color": Z, "size": x}) There can also be properties that are themselves boolean, without the need for a comparison. This would be similar to the current categories. E.g. an from sisl.atom_properties import is_sp3
# We can use this property directly as a mask, because it contains boolean values
geom.sub(is_sp3) There could also be properties that compute a string for each atom. E.g. a from sisl.atom_properties import hybridization
# A compute (or get) method could be exposed to trigger the computation of the property
hybridization.compute(geom) # ["sp", "sp2", "sp", ...] |
Ok, I kind of get where you are at. :) But, I guess all this could be resolved by exposing the functional equivalents? However, it poses some problems. We could expose properties on the geometries But from what I read they don't need to be exclusive, they could live together, properties and categories. It actually seems like a
This would be equivalent in I can sort of see that properties can be powerful in the sense that they compute stuff based on the input geometry. I think, since you are thinking of properties that can return different things, probably we should think about a streamlined return object? Instead of bool/str etc. Also, I don't think we should do: from sisl.atom_properties import Z, x it can cause name-clashes, what about |
What I like about the properties is the same thing that I like about the categories. You can build them when you don't have a geometry, and then use them for several geometries. So in the example of the geometry plot, if you have passed
Yes, that was my first thought, but then I didn't see the point in having a special class for boolean properties. I guess a
There is no difference, that's the point. But now the computation of the properties is exposed. Well there is a difference: with the current categories you can only do
Could it be a subclass of
I guess it would be up to the user to do something like from sisl import atom_properties
atom_properties.Z or like from sisl.atom_properties import Z but we could also just use the terminology that we use currently for categories. I.e. |
Yeah, we might do something like that. But... ;) |
Which methods would you overwrite? What I mean is that this could be the result returned from the property, not the |
Alternatively we could return a simple |
What benefit is there using xarray? We should strive for the simplest solution IMHO. That will probably be a ndarray, I.e. Does |
That we can store arbitrary attributes without subclassing and there is a built-in way of storing information about coordinates. Also it is easy to merge dataarrays of different dtype into a dataset.
Yes, these kind of things work with dataarrays. But yeah I'm also not so sure if it's a good idea keeping in mind that most people is not familiar with them. |
Yeah, I think the familiarity with numpy is good! So let's do that! |
Well, it seems that you converged... Sorry for not participating more pro-actively, but this got quite technical for me to add anything useful honestly... :S Just for me to understand the final approach, could you quickly remind me what is the difference between the current implemented |
I don't know how would the rings thing turn out in the end, but the differences between the current categories and (possible future) atom properties are:
|
No worries Isaac! Your comments are extremely valuable, you'll be the target user ;)! :) Adding to @pfebrer, it seems to me that the boolean operations done on a property would implicitly return a Category. We might even have the category work equivalently, |
What methods would an atom category have that an atom property doesn't? Also, your plans are still to put the rings functionality into a category, right? Could you clarify what is the reasoning behind it? Would it be an For example, something that I would understand it would make sense that it was a category/property is the number of rings each atom belongs to. |
Yes, I agree that it does not seem like the more natural thing to do... It may be that for |
Describe the feature
Currently the categories are quite strict in the sense that they provide the functionality to check individual atoms.
However, there are cases where it could be valuable to retain state information in the form of
A primary concern when trying to solve #202 is to reduce the overhead of doing the neighbor estimation twice, which is necessary when doing the categories.
Instead we might propose something like this:
a category is a base class that is only used to check stuff
when categorizing a
Category
, it should return aCategoryResult
class specifying whether it succeeded (if res
) and some information. I imagine something like:This approach is not too different from
scipy
optimizers. My main worry is the optional variables associated with the categories.Are there other ways to do this?
The text was updated successfully, but these errors were encountered: