Let's investigate the SPARC ecosystem #4

Closed
elvijs opened this issue Mar 18, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

Collaborator

elvijs commented Mar 18, 2024

There may be some helpful tools in the SPARC ecosystem. Let's spend 1 day taking a look and checking whether they simplify anything.

nih-sparc/sparc.client#22 (comment)

@elvijs elvijs added the documentation Improvements or additions to documentation label Mar 18, 2024
Collaborator Author

elvijs commented Mar 18, 2024

cc @Olivier-tl

Collaborator Author

elvijs commented Mar 18, 2024

Noting this puppy as well: https://github.com/Pennsieve/pennsieve-agent-python/tree/main

Owner

Olivier-tl commented Apr 23, 2024

We can access the data files of any published dataset through the "Discover Service" of the Pennsieve API. The only limitation is that each requested file must be below 5 GB. No authentication is required. It's simple:

  1. Discover all files and directories in a dataset using this endpoint.
  2. Get a download link to a file using this endpoint.
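
A minimal sketch of step 1 (discovering files), under stated assumptions. The linked endpoints are not spelled out in this thread, so the base URL, the path layout in `browse_url`, and the response fields (`files`, `type`, `path`, `size`) are all assumptions to be checked against the Pennsieve API documentation. Step 2 (requesting the download link) is left out for the same reason.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: illustrative base URL and path layout, not confirmed in this thread.
DISCOVER_BASE = "https://api.pennsieve.io/discover"
# ASSUMPTION: the "5GB" per-file limit is read as decimal gigabytes.
MAX_DIRECT_BYTES = 5 * 10**9


def browse_url(dataset_id, version, path="", base=DISCOVER_BASE):
    """Build the (assumed) URL that lists one directory of a published dataset."""
    query = urllib.parse.urlencode({"path": path})
    return f"{base}/datasets/{dataset_id}/versions/{version}/files/browse?{query}"


def fetch_json(url):
    """GET a URL without authentication and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def list_files(dataset_id, version, path="", fetch=fetch_json):
    """Recursively walk directories and yield file entries.

    ASSUMPTION: each listing looks like {"files": [{"type", "path", "size"}, ...]}
    and directories are marked with type == "Directory". Pagination is ignored.
    """
    listing = fetch(browse_url(dataset_id, version, path))
    for entry in listing.get("files", []):
        if entry.get("type") == "Directory":
            yield from list_files(dataset_id, version, entry["path"], fetch)
        else:
            yield entry


def is_downloadable(entry):
    """True if the file is under the per-file size limit mentioned above."""
    return entry.get("size", 0) < MAX_DIRECT_BYTES
```

Passing `fetch` as a parameter keeps the walker testable without network access; in a real client one would call `list_files(dataset_id, version)` with the published dataset's ID and version and filter the result with `is_downloadable`.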

🕵🏻‍♂️ To investigate further: It looks like the Pennsieve platform can automatically process EDF files (see processor-EDF) and has an API for timeseries data. However, an API key is required to access that API.

pennsieve-python does not support the "Discover Service" of the Pennsieve API.

--- Outdated comment below ---

How can we leverage the SPARC ecosystem for the REVEAL data client?

TL;DR: Data files can be requested through the Pennsieve API. Users need an API key, which they can generate with a free account. The pennsieve-python package facilitates interfacing with the Pennsieve API.

❓ Open question: Although oSPARC uses the Pennsieve API through a user-generated API key to expose SPARC datasets, its user interface only allows connecting one file at a time to a service node. This is impractical, as the REVEAL dataset will have thousands of files and will be regularly updated. Can oSPARC provide the Pennsieve API key to our REVEAL App service as an environment variable?

⚠️ EDIT: As a test, I created a new Pennsieve account. Unless I am (manually) added to an organization, I don't have access to any datasets.

⚠️ Edit 2: There is a 15 GB limit for data download, even when logged in on Pennsieve.

pennsieve.io vs sparc.science

Published datasets end up on sparc.science, where they are openly available to the public. On the other hand, pennsieve.io requires an account and only contains a third of the published datasets (65/220). sparc.science offers free direct download for datasets of 5 GB or smaller; larger datasets need to be downloaded through AWS S3, and the requester pays. Having researchers who want to work with the REVEAL dataset create an AWS account and pay for data download is not ideal. Pennsieve offers presigned S3 URLs without a limit on the dataset size through its API. The only thing needed is a free Pennsieve account from which an API key can be generated.

Pennsieve-API

SPARC Repositories

sparc-curation

sparc-curation uses the pennsieve-python package (outdated, now pennsieve-agent-python?) to connect to the Pennsieve API.

pennsieve-python

  • It looks like pennsieve-python can provide a presigned URL for files (see here).

  • TimeSeriesAPI with support for annotations?! (pennsieve.api.timeseries)

Collaborator Author

elvijs commented Apr 26, 2024

The Pennsieve API looks pretty good!

If I've understood correctly, it enables listing and downloading files from any public dataset without any auth (i.e. it will work in any oSPARC container). This already enables a bunch of functionality for a typical app:

  • list all subjects (get all dataset files and do some regex magic to pull the subject IDs or ask Justin to provide a summary file)
  • visualise surfaces (ask Justin to add a response feature CSV, we just load it and plot)
  • visualise raw data (via the file download API; downside: likely to be slow)
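
The "regex magic" for listing subjects could look roughly like the sketch below. It assumes that subject folders are named `sub-<id>` somewhere in each file path, which is a guess about how the REVEAL dataset will be laid out; `subject_ids` and `SUBJECT_DIR` are hypothetical names used only for illustration.

```python
import re

# ASSUMPTION: subjects live in directories named "sub-<id>" (letters, digits, "_" or "-").
SUBJECT_DIR = re.compile(r"(?:^|/)(sub-[A-Za-z0-9_-]+)/")


def subject_ids(paths):
    """Return the sorted, de-duplicated subject IDs found in a list of file paths."""
    found = set()
    for path in paths:
        match = SUBJECT_DIR.search(path)
        if match:
            found.add(match.group(1))
    return sorted(found)
```

Only directory components are matched, so a top-level file such as `sub-01.json` is not mistaken for a subject. A summary file provided by the dataset owner would make this step unnecessary.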

@Olivier-tl
Owner

Exactly!

Collaborator Author

elvijs commented Apr 26, 2024

If we find ourselves dying to visualise timeseries efficiently, then we can also chat with Joost about indexing our timeseries into their backend and exposing API keys in the oSPARC containers for access.
