
Use cases for development team

Ramona Walls edited this page Oct 11, 2017 · 30 revisions

These will allow us to complete step 6 in the [SOP for contributor use cases](https://github.com/identifier-services/IDservices_assessment/wiki/SOP-for-analyzing-contributor-use-cases).

Note that numbers of files, specimens, etc. in these scenarios are for illustration only, not actual numbers. Scenarios are generic, not meant to support only a single use case.

User scenarios

Scenario 1 - Complete

At least 90% of this is in place, so the first part should be possible by July 15 (everything up to DOI request) and the full scenario by August 1.

  • user creates a project
  • user registers a fastq file located on an Agave system
  • user adds metadata about the fastq file based on a template, plus two fields that they add themselves.
  • system prompts user to create specimen
  • user creates specimen, adds metadata
  • user creates a process, specifies fastq file as input, provides metadata about process
  • user registers 3 more fastq files that are on an Agave system, adds metadata
  • user creates 2 new specimens, one of which is associated with two of the fastq files
  • user creates 3 new instances of the same process, using each of the new fastq files as input
  • user registers data files as output of processes
  • user registers SRA identifiers for the 4 existing fastq files
  • user creates a dataset consisting of the 4 fastq files

Everything above should be done by July 15.

  • user requests a DOI
  • system pulls metadata from project, files, and specimens to populate DataCite metadata file, prompts user to check suggestions and supply missing values.
  • user approves metadata and requests DOI
  • system generates DOI (and shadow ARK) and registers those IDs with IDS.
  • dataset becomes public
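
The metadata-gathering step above could be sketched roughly as follows. This is only an illustration: the record shapes, the field names, and the function `suggest_datacite_metadata` are assumptions, not the actual Identifier Services data model. The list of required fields follows the mandatory DataCite properties other than the identifier itself (creator, title, publisher, publication year, resource type).

```python
# Hypothetical sketch: field names and record shapes are assumptions,
# not the actual Identifier Services data model.
REQUIRED_FIELDS = ["creators", "titles", "publisher", "publicationYear", "resourceType"]


def suggest_datacite_metadata(project, files, specimens):
    """Build a draft DataCite-style record and list the fields the user must supply."""
    draft = {
        "creators": project.get("creators", []),
        "titles": [project["title"]] if project.get("title") else [],
        "publisher": project.get("publisher"),
        "publicationYear": project.get("year"),
        "resourceType": "Dataset",
        # Descriptive extras gathered from the registered objects.
        "formats": sorted({f["format"] for f in files if f.get("format")}),
        "subjects": sorted({s["taxon"] for s in specimens if s.get("taxon")}),
    }
    # Anything required but still empty is handed back to the user to fill in.
    missing = [name for name in REQUIRED_FIELDS if not draft.get(name)]
    return draft, missing


project = {"title": "Example project", "creators": ["A. Researcher"], "year": 2017}
files = [{"name": "sample1.fastq", "format": "fastq"}]
specimens = [{"id": "spec-1", "taxon": "Zea mays"}]

draft, missing = suggest_datacite_metadata(project, files, specimens)
print(missing)  # fields the user is prompted to supply
```

Returning the list of missing fields alongside the draft keeps the "prompt user to check suggestions and supply missing values" step separate from the gathering step.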

Scenario 2

This matches requirements for the NEON use case. This use case is on hold.

  • user creates a project
  • user registers 8 specimens (6 raw and 2 pooled) and adds metadata
  • user creates two processes for specimen pooling. Each process has 3 specimens as input and one as output.
  • user creates a process for metagenomic sequencing, specifies a pooled specimen as input, supplies metadata
  • user duplicates the above process for the other pooled specimen
  • user registers 2 new files and specifies that they are output of the metagenomic sequencing processes
  • user requests ARKs for each object in the project
  • system uses supplied metadata to generate files for each object
  • user reviews metadata and requests ARKs.
  • system generates ARKs and registers the IDs with IDS.

Scenario 3

The requirements for this are very similar to scenarios 1 and 2. This should be ready by September 30. The addition to this use case is that it allows users to specify relations between different datasets by linking their PIDs using the partOf relation.

  • user adds 6 specimens, all metadata is the same for them, except one field is unique (local identifier)
  • user registers six sequence files with local identifiers, each associated with one of the specimens
  • system prompts user to create 6 processes
  • user creates six sequencing processes (specimens as input, sequence files as outputs)
  • user registers a variation file
  • system prompts user to create process
  • user creates variant detection process, specifies 6 sequence files as input, variant file as output
  • user registers 6 SRA identifiers for the sequences
  • user creates a dataset that contains the variation file.
  • user uploads/registers a readme file to describe the dataset, location of inputs, specimens, etc.
  • user wants a separate PID for the variant file and one for the whole dataset containing the variant file plus six fastq files. Since the fastq files are already in SRA, and not part of the final dataset, system suggests ARKs for the fastq files and a DOI for the dataset.
  • system suggests metadata for ARKs, based on the data associated with the files and their specimens
  • user edits metadata and requests ARKs
  • system issues ARKs, registers them with IDS
  • user requests a DOI for the dataset (with the variant file and the readme).
  • New: user specifies that 6 fastq files are partOf the dataset that has the DOI, using the partOf relation to link the ARKs to the newly requested DOI.
  • system suggests metadata for DOI, user edits and requests DOI
  • system generates DOI and registers it with IDS
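
One possible way to record the partOf links from this scenario is as simple subject/predicate/object statements. The sketch below is an assumption about representation only; the `Relation` tuple, the helper names, and the identifiers are made-up placeholders, not IDs issued by the system.

```python
from collections import namedtuple

# Hypothetical representation of a link between two PIDs.
Relation = namedtuple("Relation", ["subject", "predicate", "object"])


def link_part_of(part_ids, whole_id):
    """Create one partOf statement per part, pointing at the containing dataset."""
    return [Relation(pid, "partOf", whole_id) for pid in part_ids]


def parts_of(relations, whole_id):
    """Return the PIDs recorded as partOf the given dataset."""
    return [
        r.subject
        for r in relations
        if r.predicate == "partOf" and r.object == whole_id
    ]


# Placeholder identifiers for illustration only.
fastq_arks = [f"ark:/99999/fk4example{i}" for i in range(1, 7)]
dataset_doi = "doi:10.5072/example-dataset"

relations = link_part_of(fastq_arks, dataset_doi)
print(len(parts_of(relations, dataset_doi)))
```

Storing the link on the ARK side (ARK partOf DOI) matches the order of steps above, where the ARKs already exist when the DOI is requested.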

Scenario 4

This matches the requirements of the lung map use case and is intended to deal with high numbers of files. This should be done by ~October 15.

(1) user creates a project

(2) user adds 200 specimens with corresponding metadata:

  • user selects "add batch of specimens"
  • system prompts for upload of a csv file containing one row per specimen, for 200 specimens total. The first column must contain the specimenID.
  • system creates objects for 200 specimens and assigns metadata per row of the spreadsheet
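
A minimal sketch of the batch upload in step 2, assuming the CSV has a header row whose first column is literally named `specimenID` (the header name, the other column names, and the function `load_specimens` are assumptions for illustration):

```python
import csv
import io


def load_specimens(csv_file):
    """Create one specimen record per CSV row, keyed by the first column."""
    reader = csv.reader(csv_file)
    header = next(reader)
    if header[0] != "specimenID":
        raise ValueError("first column must be specimenID")
    specimens = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        specimen_id = row[0].strip()
        if not specimen_id:
            raise ValueError(f"row {line_no}: empty specimenID")
        if specimen_id in specimens:
            raise ValueError(f"row {line_no}: duplicate specimenID {specimen_id!r}")
        # Remaining columns become the metadata for this specimen.
        specimens[specimen_id] = dict(zip(header[1:], row[1:]))
    return specimens


# Hypothetical two-row upload; real files would have one row per specimen.
sample = io.StringIO(
    "specimenID,species,stage\n"
    "S001,Mus musculus,E16.5\n"
    "S002,Mus musculus,P7\n"
)
specimens = load_specimens(sample)
print(sorted(specimens))
```

The same approach would apply to the probe upload in step 3, with `probeID` as the first column.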

(3) user adds 800 probes with corresponding metadata:

  • Probes are another kind of material entity (like specimens). Each probe is associated with a short sequence file (fasta?) and a gene. The same probe can be used on multiple specimens.
  • A single gene can have multiple probes associated with it. This suggests that we may need an object for genes. Genes are used as the basis for datasets (one dataset is all data associated with a gene). However, we may be able to do this with metadata.
  • user selects "add probe"
  • user selects "add batch of probes"
  • system prompts for upload of a csv file containing one row per probe, for 800 probes total. The first column must contain the probeID.
  • system creates objects for 800 probes and assigns metadata per row of the spreadsheet

(4) user registers 1600 image files stored on CyVerse Bisque and associates each one with an in situ hybridization (ISH) process, a specimen, and a probe.

Although CyVerse Bisque files are technically on an Agave system, they are not stored in folders, so the user will have to provide a list with the URL for each image.

  • user selects "add a batch of files"
  • system asks user to select a process type for which the files are an output
  • user selects "in situ hybridization imaging"
  • system asks user to specify the input(s) for the ISH imaging process
  • user selects "specimen" and "probe"
  • system prompts user to enter metadata about the ISH imaging process.
  • user selects "apply same metadata to all ISH imaging processes"
  • user manually enters the metadata for the process (e.g., camera, protocol, etc.)
  • If it is not all the same, it will have to be done in bulk.
  • system prompts user to enter links between the processes and the inputs
  • user uploads a CSV file with 1600 rows and three columns, one for the image file names, one for the specimenIDs (registered in step 2), and one for the probeIDs (registered in step 3).
  • system prompts user to specify the location of the files
  • user selects "register files by URL"
  • user uploads a CSV file that contains a list of the URLs for 1600 images on Bisque
  • How to deal with authentication? Should we require the images to be in a folder on an Agave system? Are James's images already in a folder?
  • Should we ask this first, and if the user chooses Register by URL, then have the URLs be included in the association step? Probably yes.
  • system creates 1600 file objects of type image
  • system creates 1600 process objects of the type "ISH imaging" and applies the same set of metadata to each process.
  • system associates each ISH imaging process object with a single image file object (via the hasOutput relation).
  • system associates each imaging process object with the correct input specimen and probe (via the hasInput relation), based on the uploaded CSV file.
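
The association steps at the end of step 4 could look something like the sketch below. It assumes the three-column CSV has a header row with the names `image`, `specimenID`, and `probeID`, and that process objects are plain dictionaries; these names and the function `build_ish_processes` are hypothetical.

```python
import csv
import io


def build_ish_processes(link_csv, specimen_ids, probe_ids, shared_metadata):
    """Create one ISH imaging process per row, linked to its inputs and output."""
    processes = []
    errors = []
    for row in csv.DictReader(link_csv):
        # Inputs must already be registered (steps 2 and 3).
        if row["specimenID"] not in specimen_ids:
            errors.append(f"{row['image']}: unknown specimen {row['specimenID']}")
            continue
        if row["probeID"] not in probe_ids:
            errors.append(f"{row['image']}: unknown probe {row['probeID']}")
            continue
        processes.append({
            "type": "ISH imaging",
            "metadata": dict(shared_metadata),  # same metadata copied to each process
            "hasInput": [row["specimenID"], row["probeID"]],
            "hasOutput": [row["image"]],
        })
    return processes, errors


# Hypothetical three-row example; the real file would have 1600 rows.
links = io.StringIO(
    "image,specimenID,probeID\n"
    "img_0001.tif,S001,P001\n"
    "img_0002.tif,S001,P002\n"
    "img_0003.tif,S999,P001\n"
)
processes, errors = build_ish_processes(
    links, {"S001", "S002"}, {"P001", "P002"}, {"camera": "example camera"}
)
print(len(processes), errors)
```

Collecting errors instead of failing on the first bad row lets the user fix all unmatched IDs in one pass, which matters with 1600 rows.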

Scenario 5

This is a continuation of Scenario 4. It is still under construction.

  • user creates 250 datasets. Each dataset contains the images from a single specimen.
  • user requests an ARK for each of the 250 datasets
  • user creates 10 datasets of 100 images, each based on a development stage/specimen combination
  • user requests 10 DOIs for the above datasets.
  • system gathers metadata for user to review
  • system generates DOIs and registers them with IDS

Use Case 1

User wants to register a project on Identifier Services

Selects the investigation type – Genomics investigation

At any point the user can enter specimen information or post a link to the metadata for the specimen if it is already described in an authoritative source (for future reference, we need to think about whether that metadata can be exported through Agave). If the user has not entered that information by the time they request a DOI, the system will ask them to do so before providing a DOI.

The user indicates the location of each repository that holds the files and enters the authentication credentials (username and password) for it. The user then enters the paths to the directories that contain the files to register (just like you saw for the maize data on Corral and on SRA) and selects the entity to which the files belong.

A user may have files corresponding to the same entity in more than one repository, either as the same data in each location or in a slightly different format. The system has to clearly indicate that these files or groups of files belong to the same entity, that they are comparable, and that they are located in different places (the user has to enter their location, or a URI or DOI if some of them are already public and available without restriction, so that Agave can grab them).

System indicates that repository/ies can be accessed and displays the files/directory contents that the user has pointed to per location. Files on SRA xxx Files on Corral xxx (for this demo we can just get to the files on Corral).

Then the user has to select the files that they want to register, which may be in different repositories. So per location the user will select one or more files and mark them as any of the data model options that we provide. Here we say that the 5 fastq files on Corral correspond to the sequencing grouping. (It should be clear to the system, though, that while they belong to the same project, the files on Corral are located in a different place from other similar files, like the fasta files on SRA, as they will be compared in the future and may not be assigned the same DOI.)

Then the user wants to complete an action such as verifying that the files are in their location or calculating a checksum for each file. Results need to be displayed; checksums, for example, need to be displayed and stored in our Agave metadata store as the first verification of authenticity. In the future we can compare the checksums obtained for the same files, or for the supposedly identical files that are in different repositories. So the actions can be: verify file location, calculate checksum (for the first time), and check checksum. We can also have actions to verify file size. In reality we need to keep that metadata, but I am afraid that it may not mean much by itself, especially for files like the ones Weijia is looking at, which are the “same” but have different checksums and different sizes.
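
The three actions could be sketched as below. This is a local-filesystem illustration only: it assumes the file is reachable as a local path and uses SHA-256, whereas the real files would sit on Corral, SRA, or another Agave-accessible system, and the checksum algorithm is not specified here. The function names are hypothetical.

```python
import hashlib
import os
import tempfile


def file_exists(path):
    """Action 1: verify that the file is at its registered location."""
    return os.path.isfile(path)


def calculate_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Action 2: compute a checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_checksum(path, stored_checksum):
    """Action 3: compare the current checksum with the one stored earlier."""
    return calculate_checksum(path) == stored_checksum


# Demonstration on a temporary file standing in for a registered fastq file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".fastq") as tmp:
    tmp.write(b"@read1\nACGT\n+\nIIII\n")
path = tmp.name

stored = calculate_checksum(path)  # would be saved in the metadata store
print(file_exists(path), check_checksum(path, stored))
os.remove(path)
```

Two copies that differ by even one byte (for example, a different line ending or compression setting) will produce different checksums, which is consistent with the concern above about files that are the “same” but do not match.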

Extra text for later

  • system asks user to specify the input(s) for the processes
  • user selects "specimens"
  • system prompts user to enter metadata about the sequencing process
  • user selects "apply same metadata to all sequencing processes"
  • user manually enters the metadata for the sequencing process
  • system prompts user to enter links between the processes and the inputs
  • user uploads a CSV file with two columns, one for the sequence file names and one for the specimenIDs (registered in step 2)
  • system prompts user to specify the location of the files
  • user selects "files are on a registered Agave system"