Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MD return values -- what should they be #588

Open
1 of 2 tasks
zerothi opened this issue Jun 20, 2023 · 13 comments
Open
1 of 2 tasks

MD return values -- what should they be #588

zerothi opened this issue Jun 20, 2023 · 13 comments
Milestone

Comments

@zerothi
Copy link
Owner

zerothi commented Jun 20, 2023

The problem is that we need to converge the returned values in some way.

Currently some methods returns

  • lists
  • collection
  • geometrycollection

Now what should we do about multiple returns.
See below for two ideas, which would we prefer.

Things to consider:

  • easy to work with

  • get performant code (ideally they could be (md, ...) arrays)

  • consistency across the outputs

            I've addressed all comments except this one as I am not sure we want/need to unpack. For instance, the current behaviour seems logical to me:
    
traj = stdoutSileVASP(f).read_trajectory[:3]()
for i in range(3):
    print(traj[i].force)
# or
for step in traj:
    print(step.force)

You would prefer this?

traj = stdoutSileVASP(f).read_trajectory[:3]()
for i in range(3):
    print(traj.force[i])
# or
for F in traj.force:
    print(F)

Originally posted by @tfrederiksen in #573 (comment)

List of actions:

@zerothi
Copy link
Owner Author

zerothi commented Jun 20, 2023

@pfebrer your inputs here would be valuable, I know you have commented on something like this before, here a more targeted issue.

@zerothi zerothi mentioned this issue Jun 20, 2023
4 tasks
@zerothi zerothi added this to the v0.14 milestone Jun 20, 2023
@tfrederiksen
Copy link
Contributor

I think lists are easy and intuitive for multiple read_* calls. As long as the individual functions are clear what a single call produces (Geometry, PropertyDict, ndarray, etc) I would let the user do the postprocessing (or provide the postprocess function). Are there any particular concerns with this?

@zerothi
Copy link
Owner Author

zerothi commented Jun 20, 2023

I think the main idea about a container would be that it can allow some arbitrary functions.

For instance one can do:

gcoll = something.read_geometry[:]()
sub_gcoll = gcol.applymap(Geometry.sub, atoms=[1, 2, 3])

to extract some atoms for all geometries in a new collection.

The Collection was added in #546, should we revert it and simply return the list?

@tfrederiksen
Copy link
Contributor

I guess I do not see a clear advantage, rather some additional complexity without a clear use case.

With lists one can achieve the mapping with

[g.sub([1, 2, 3]) for g in gcoll]

@pfebrer
Copy link
Contributor

pfebrer commented Jun 20, 2023

For trajectories I think the best would be to introduce an extra dimension (MD step) in arrays like the coordinates. It would make everything simpler and more efficient. But as you said perhaps sisl shouldn't focus on supporting trajectories and it should be managed externally.

For general collections of geometries, I don't know, because I have no use case in mind. Unless you have a clear use case, I believe lists are fine for now. Whatever Collection class that is implemented in the future will probably have all the functionality that a list has (i.e. it will be iterable, you will be able to retreive items using indices...) so the probability of backward compatibility problems is very low.

I agree that if the only purpose of this collection is to map functions it is probably not worth it. I would wait to have some more interesting usage in mind before converting this list to a custom class. The list can always be upgraded in the future, instead needing to "downgrade" from custom class to list will most likely cause problems.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 20, 2023

So, in the first message of this issue, for trajectories I would prefer something similar to the second snippet. But perhaps that should come as an output of a method called read_trajectory, e.g. read_trajectory[:](), not just the general read_geometry[:]().

Every time I deal with trajectories I like to organize my data in xarray.Datasets because they allow me to do lots of operations easily. So traj in the snippet could be a Dataset, then the amount of things you need to implement in sisl gets very reduced. For example, with a Dataset you could also easily loop over all the data by MD step as in the first snippet if you want.

@zerothi
Copy link
Owner Author

zerothi commented Jun 21, 2023

Ok, I see.
Lets delete the Collection class.

I can see the benefit of xarray, perhaps now is the time to step into that regime.
The only problem with the dataset is when requesting data from MD + SCF, then there are variable SCF steps for each MD, and hence it might be a bit problematic in that case? There some other solution is necessary. Perhaps a list [MD] of datasets [SCF].

My proposal would then be to:

  • read_geometry returns a list
  • read_* returns a dataset where applicable, otherwise a list of dataset, datasets should ideally be complete with attributes describing origin, full details etc.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 21, 2023

I didn't know, but xarray can have sparse arrays as variables.

See the example here: https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html#integrating-with-duck-arrays

And in principle one can generate the sparse dataset from a dataframe: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.from_dataframe.html#xarray.Dataset.from_dataframe

So that would be suitable for the SCF data.

When I have time I can try to play with it in outSileSiesta.read_scf, because I don't know how it would actually work.

@zerothi
Copy link
Owner Author

zerothi commented Jun 21, 2023

I am not sure that would be great going forward. That would mean that all scf data is contained in a sparse array, which isn't particularly fast.

I think it would be wiser to do a list of Dataset, for easier interaction. I might be wrong, but it just feels weird to use a sparse array to have consecutive data that is truncated at different levels...

@pfebrer
Copy link
Contributor

pfebrer commented Jun 21, 2023

Hmm I don't know, because with a list of Datasets it is not easy to do operations across MD steps.

I don't know if the sparse dataset is a good choice, but a pandas dataframe for example seems better than a list of datasets.

@zerothi
Copy link
Owner Author

zerothi commented Jun 21, 2023

Perhaps one should just not be able to extract MD + SCF at the same time, either MD or SCF.

@pfebrer
Copy link
Contributor

pfebrer commented Jun 21, 2023

Hmm I don't know, I have used the outSileSiesta.read_scf(as_dataframe=True) and it worked great for me.

@zerothi
Copy link
Owner Author

zerothi commented Jun 21, 2023

Hmm.. I don't know, yeah probably fine with that then. :)
We should just have some pretty simple rules so users don't need to read documentation all the time.

@zerothi zerothi modified the milestones: v0.14, v0.15 Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants