Feature vector - "feature names" #68

Open
materialsguy opened this issue Aug 13, 2021 · 2 comments

materialsguy commented Aug 13, 2021

Hello,

I'm currently analysing somebody else's machine learning model, which was trained on SOAP feature vectors.
The code that generates the feature vectors looks something like this:

from dscribe.descriptors import SOAP

soap = SOAP(species=species, periodic=True, rcut=2.5, nmax=8, lmax=8, average="inner", sparse=False)
feature_vectors = soap.create(atoms, n_jobs=1)

Here, species is a set holding the element symbols, and atoms is a list of Atoms objects such as: Atoms(symbols='O18Al12', pbc=True, cell=[[4.76, 0.0, 0.0], [-2.379999999999999, 4.122280922013928, 0.0], [0.0, 0.0, 12.993]], spacegroup_kinds=...). The feature vectors are then converted into a rather large pandas DataFrame with 1109304 columns.
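As a sanity check on the column count, here is a minimal sketch that relates the number of columns to the number of species. It assumes the usual SOAP power-spectrum layout with all species cross terms, where the vector length is N(N+1)/2 * (lmax+1) with N = n_species * nmax. That formula is not stated in this thread, and the function names are made up for illustration.

```python
def soap_feature_count(n_species, nmax, lmax):
    """Length of a SOAP power spectrum under the assumed layout."""
    n_basis = n_species * nmax  # radial basis functions over all species
    return n_basis * (n_basis + 1) // 2 * (lmax + 1)


def species_count_from_columns(n_columns, nmax, lmax, max_species=118):
    """Find the species count that reproduces n_columns, or None."""
    for n_species in range(1, max_species + 1):
        if soap_feature_count(n_species, nmax, lmax) == n_columns:
            return n_species
    return None


print(species_count_from_columns(1109304, nmax=8, lmax=8))  # 62
```

Under that assumption, 1109304 columns with nmax=8 and lmax=8 correspond to 62 species, so most of the columns would be cross-species terms.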

Is there a way to find out the feature names (the physical meaning) of the individual values in a feature vector? For me it is currently "just" a row in a DataFrame, without any column descriptions, on which the model is based. For my analysis it would be useful to know what each column represents physically, since the analysis produces a kind of feature importance for each column.

Thank you very much.

Best regards,

Claus

@lauri-codes
Contributor

Hi @materialsguy!

This is an excellent topic. Some time ago I saw something similar in matminer, where you can call feature_labels() to get some information about the features. I do have this as one of the TODOs on our kanban board, but as of now it is not directly possible.

In practice, implementing it should be fairly straightforward, but I cannot give a timeline for this. It is possible to reverse-engineer some of the label information with the get_location() method, which returns the slice for a given species pair. However, it does not currently support getting the location of specific (l, n) values.
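The get_location() approach described above could be sketched roughly as follows. This is only an illustration: pair_labels, FAKE_LAYOUT and fake_get_location are hypothetical names, the stand-in layout is made up so the example runs without the library, and passing the species pair as a tuple is an assumption about the call shape.

```python
from itertools import combinations_with_replacement


def pair_labels(get_location, species, n_features):
    """Label every column with the species pair whose slice contains it.

    get_location is any callable that takes a (species_a, species_b)
    tuple and returns a slice into the flattened feature vector.
    """
    labels = [None] * n_features
    for a, b in combinations_with_replacement(sorted(species), 2):
        block = get_location((a, b))
        for i in range(*block.indices(n_features)):
            labels[i] = f"{a}-{b}"
    return labels


# Hypothetical stand-in for the real method: a made-up layout for two
# species, chosen only so that the example runs without the library.
FAKE_LAYOUT = {
    ("Al", "Al"): slice(0, 6),
    ("Al", "O"): slice(6, 14),
    ("O", "O"): slice(14, 20),
}


def fake_get_location(pair):
    return FAKE_LAYOUT[tuple(sorted(pair))]


labels = pair_labels(fake_get_location, {"O", "Al"}, 20)
print(labels[0], labels[6], labels[19])  # Al-Al Al-O O-O
```

With a real descriptor one would presumably pass soap.get_location and the number of DataFrame columns instead of the stand-in. The resulting labels only resolve the species pair, not the individual (l, n) values within each block.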

Author

materialsguy commented Aug 13, 2021

Thank you for the quick reply. I also think such an implementation would really help from a machine learning feature engineering and feature analysis perspective, especially when the analysis is done by someone who does not have full knowledge of the feature vectors from a physical point of view. Please let me know when you have implemented it.

I will have a look at the get_location() method.

Thanks.
