{Edge,Node}Population should have a .to_pandas method. #140

Open
matz-e opened this issue Apr 7, 2021 · 9 comments

Comments

@matz-e
Member

matz-e commented Apr 7, 2021

See title. For better usability, SONATA™ should offer a way to return a subset of a population as a Pandas dataframe for easier manipulation. Ideal usage from my side:

import libsonata as so
pop = so.EdgeStorage("foo.h").open_population("bar")
df = pop.to_pandas(so.Selection([(123, 666)]))
stuff = df[(df.source_node_id > 313) & (df.axonal_delay < 3)]

(paraphrasing a bit)

@mgeplf
Contributor

mgeplf commented Apr 7, 2021

This sort of functionality is part of SNAP. I'd prefer to avoid having pandas as a requirement of python libsonata, because it's a heavy dependency.

@matz-e
Member Author

matz-e commented Apr 7, 2021

Sure it is a heavy dependency, but we already depend on numpy, which itself is heavy:

Input spec
--------------------------------
 -   py-pandas

Concretized
--------------------------------
[+]  [email protected]%[email protected] arch=linux-rhel7-x86_64
[^]      ^[email protected]%[email protected] arch=linux-rhel7-x86_64
[^]          ^[email protected]%[email protected]+blas+lapack arch=linux-rhel7-x86_64
[^]              ^[email protected]%[email protected]~ilp64+shared threads=none arch=linux-rhel7-x86_64
[^]              ^[email protected]%[email protected] arch=linux-rhel7-x86_64
[^]                  ^[email protected]%[email protected] arch=linux-rhel7-x86_64
[^]                      ^[email protected]%[email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87 arch=linux-rhel7-x86_64
[^]      ^[email protected]%[email protected] arch=linux-rhel7-x86_64
[^]      ^[email protected]%[email protected] arch=linux-rhel7-x86_64
[^]          ^[email protected]%[email protected]~toml arch=linux-rhel7-x86_64
[^]          ^[email protected]%[email protected] arch=linux-rhel7-x86_64
[^]      ^[email protected]%[email protected] arch=linux-rhel7-x86_64

not that much more in the dependency tree that isn't numpy

Seems like the counter-argument is to depend on something that is heavier and pulls in a bunch of morphology dependencies. To me, it seems more like the API augmentations of snap should be migrated here…

@mgeplf
Contributor

mgeplf commented Apr 7, 2021

numexpr/dateutil/pytz/etc are quite a bit more than just numpy (spack concretization is deceptive: pip install numpy only installs numpy; pandas installs more).

libsonata is supposed to be very low-level, with very few dependencies; the productivity stuff goes in SNAP.
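
One way to check the pip-level claim above without relying on a spack concretization is to ask the installed package metadata what each distribution declares. This is only a sketch: `declared_requirements` is a hypothetical helper name, it reports what is installed in the current environment, and it returns an empty list when the package is absent.

```python
from importlib import metadata


def declared_requirements(package: str) -> list:
    """Return the requirement strings a distribution declares, or [] if unknown."""
    try:
        # metadata.requires() returns None when nothing is declared.
        return list(metadata.requires(package) or [])
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []


for name in ("numpy", "pandas"):
    print(name, declared_requirements(name))
```

Entries carrying an `extra == ...` marker are optional extras and are not installed by a plain `pip install`.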

@matz-e
Member Author

matz-e commented Apr 8, 2021

I disagree: compared to numpy, these additional dependencies don't seem all that heavy. Having to work with SNAP instead seems a little like saying we should use Qt for comfortable XML reading in C++.

@mgeplf
Contributor

mgeplf commented Apr 8, 2021

Put another way, numpy is a required dependency in that it's the compact way to return numeric data in python. It would be hard/impossible to not use numpy, which is why it fits with the minimalist purpose of the library. The improvement you're describing is an ergonomic/convenience one, which should be handled by higher level libraries (ie: SNAP).

The idea is that this is safe to use by anything (e.g. neurodamus-py), with the minimal set of requirements.

What is your use case?

@matz-e
Member Author

matz-e commented Apr 9, 2021

My use case is to bulk load SONATA into Pandas to pass through to Spark. If I look into a file manually, I would also use this to compare between SONATA, Parquet, and binary data… so having some .to_df that returns something with columns ['source_node_id', 'target_node_id', 'delay', 'conductivity'…] would be very nice and still pretty basic.
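
A minimal sketch of what such a helper could look like if it lived outside libsonata, so that libsonata itself gains no pandas dependency. It assumes the population object exposes `attribute_names`, `get_attribute(name, selection)`, `source_nodes(selection)` and `target_nodes(selection)`. The names `edge_columns` and `FakeEdgePopulation` are hypothetical, and the fake class is only an in-memory stand-in (with a plain list of indices in place of a real selection) so the example runs without a SONATA file.

```python
from typing import Any, Dict, Sequence


def edge_columns(population: Any, selection: Any) -> Dict[str, Sequence]:
    """Collect the selected edges as a column-name -> values mapping."""
    columns = {
        "source_node_id": population.source_nodes(selection),
        "target_node_id": population.target_nodes(selection),
    }
    # Sort attribute names so the column order is deterministic.
    for name in sorted(population.attribute_names):
        columns[name] = population.get_attribute(name, selection)
    return columns


class FakeEdgePopulation:
    """Hypothetical stand-in for an edge population; not part of libsonata."""

    attribute_names = {"delay", "conductivity"}

    def __init__(self):
        self._source = [10, 11, 12, 13]
        self._target = [20, 21, 22, 23]
        self._attrs = {
            "delay": [0.5, 1.5, 2.5, 3.5],
            "conductivity": [0.1, 0.2, 0.3, 0.4],
        }

    def source_nodes(self, selection):
        return [self._source[i] for i in selection]

    def target_nodes(self, selection):
        return [self._target[i] for i in selection]

    def get_attribute(self, name, selection):
        return [self._attrs[name][i] for i in selection]


# Select the edges at positions 1 and 3 of the fake population.
columns = edge_columns(FakeEdgePopulation(), [1, 3])
print(columns)
```

A caller that already has pandas installed could then build the frame with `pandas.DataFrame(columns)`, keeping the pandas import on the caller's side.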

@mgeplf
Contributor

mgeplf commented Apr 9, 2021

Since you have to implement it for your use case, we should be able to take a look at it, and then make a decision.

@alkino
Member

alkino commented Nov 26, 2021

For example, the DataFrame in report_reader.hpp is ready to load into pandas. Is that a solution for you? @matz-e

There is no dependency on pandas inside libsonata, but the output data is pandas-oriented.

@matz-e
Member Author

matz-e commented Nov 26, 2021

Can I add columns to it?
