
Adding a dataset

sdgamboa edited this page Mar 19, 2024 · 2 revisions

For Curators and Developers: Adding a New Dataset to bugphyzz

March 19, 2024

1. Identify the “Attribute type” of the data

The first task is to determine the appropriate data type for the new dataset. There are four data types: binary, multistate-union, multistate-intersect, and numeric. The data type influences the output of several functions in the bugphyzz workflow, and each type is described in detail in the data schema. Once the appropriate data type has been determined, it must be recorded in the dataset as a variable (next step).

2. Create a Google spreadsheet

Create a CSV file, containing the attribute data, to upload to Google Drive. This CSV file will ultimately be imported into R through the bugphyzz::importBugphyzz function as a tidy data.frame, i.e., each column will be a variable and each row will be an observation/annotation. The tidy data.frame will follow the data schema.
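As a rough illustration of the tidy layout described above, the shell sketch below writes a tiny CSV with one annotation per row and checks that every row has the same number of fields as the header. The column names and values are placeholders chosen for illustration only; the required columns are the ones defined in the data schema.

```shell
# Write a small example CSV: one variable per column, one annotation per row.
# Column names and values are illustrative placeholders, not the official schema.
csv_file="$(mktemp)"
cat > "$csv_file" <<'EOF'
Taxon_name,Attribute,Attribute_value,Attribute_source
Escherichia coli,example attribute,TRUE,example source
Bacillus subtilis,example attribute,FALSE,example source
EOF

# Count data rows whose number of comma-separated fields differs from the header.
# (Simple check: assumes no quoted fields that contain commas.)
bad_rows="$(awk -F',' 'NR == 1 { n = NF } NF != n { c++ } END { print c + 0 }' "$csv_file")"
n_rows="$(awk 'END { print NR - 1 }' "$csv_file")"

echo "annotations: $n_rows, malformed rows: $bad_rows"
```

A check like this only confirms that the file is rectangular; whether the columns match the data schema still has to be verified separately.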

Some examples:

The data requires manual curation. In some cases a computer-readable format doesn’t exist or isn’t readily available for direct download. For example, the information might have to be captured from a figure. In this case, the data should be entered directly into a Google spreadsheet in the Google Drive.

The data comes from a direct download. Sometimes the data is in a computer-readable format and can be downloaded directly. However, some data manipulation might be required, such as cleaning the data or standardizing attributes. In these cases, a script should be used for the download and manipulation, and the script used to manipulate the data should be added to the bugphyzzWrangling repo.

The data must be downloaded through an API. If the data comes from an API, the best option would be to create an R client to access the data and then save it as a CSV document.

3. Generate a CSV file in the right format that can be uploaded to the Google Drive {LINK}

* Manual curation. Create a spreadsheet directly on Google Drive (e.g., sphingolipid).
* Bulk download. Create a script and save it in bugphyzzWrangling. Upload the output to the drive (e.g., Madin et al.).
* API. The best option would be to create an R package client to download and format the data. Upload the output to the drive (e.g., BacDive with R).
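For the bulk-download case, the cleaning step might look something like the sketch below. It is written in shell only to keep the example short; scripts stored in bugphyzzWrangling may well use a different language and different rules. The input rows, the column names, and the two cleaning rules (trim surrounding whitespace, lowercase the attribute values) are all assumptions made for illustration.

```shell
# Stand-in for a downloaded file; in practice this would come from the
# source's download link. The contents are made up for illustration.
raw_file="$(mktemp)"
clean_file="$(mktemp)"
cat > "$raw_file" <<'EOF'
Taxon_name,Attribute_value
 Escherichia coli , Example Value
Bacillus subtilis,EXAMPLE VALUE
EOF

# Keep the header as-is. For data rows, trim surrounding whitespace in each
# field and lowercase the second column so equivalent values match.
awk -F',' -v OFS=',' '
  NR == 1 { print; next }
  {
    for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
    $2 = tolower($2)
    print
  }
' "$raw_file" > "$clean_file"

cat "$clean_file"
```

Keeping the cleaning rules in a script rather than applying them by hand means the output uploaded to the drive can be regenerated when the source data changes.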

4. Add Information about Attribute and source

* Attribute
* Source
* Thresholds.

5. Add/Modify code in packages as necessary to import the new data in the right format through the physiologies function.

* This could include the link to the spreadsheet.
* Create helper functions to import a spreadsheet in a different format and shape it like the data model.
* Run Sysdata to generate updated NCBI IDs

6. Run 10-fold validation

7. Run bugphyzzExports.

Curated spreadsheets

  1. Add a new curated spreadsheet to drive and publish to the web as a csv file. If this data was parsed with code, store the script in waldronlab/bugphyzzWrangling/scripts. Otherwise, write a description in a README.txt file.
  2. Download the GitHub repo waldronlab/bugphyzz.
  3. Create and checkout a new branch with username and an identifier (e.g., sdgamboa/new-physiology). Command example: git checkout -b sdgamboa/new-physiology.
  4. Add the links of the published csv and the spreadsheet source to the links.csv file.
  5. Run the sysdata.R script to add full taxonomy, including parent ranks, to each taxon with taxid.
  6. Add and commit changes.
  7. Create a pull request and ask for a review –>> name here who could be those reviewers <<–.

The changes will be checked with the unit tests. Changes might be requested. For example, adding values to the attributes.tsv file if not present.
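The branch, edit, and commit steps above can be sketched in shell as follows. To keep the sketch runnable offline, a scratch repository stands in for a clone of waldronlab/bugphyzz. The location and column layout of links.csv, the row contents, and the commit identity are placeholders, not the repository's actual conventions.

```shell
# Hypothetical branch name; replace with your own username and identifier.
branch="sdgamboa/new-physiology"

# A scratch repository stands in for a clone of waldronlab/bugphyzz so that
# the sketch runs offline. In practice, start from:
#   git clone https://github.com/waldronlab/bugphyzz.git && cd bugphyzz
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
# Placeholder links.csv; the real file's location and columns may differ.
printf 'physiology,published_csv,source\n' > links.csv
git add links.csv
git -c user.name="Example" -c user.email="example@example.org" \
  commit -q -m "Initial commit"

# Step 3: create and check out a new branch.
git checkout -q -b "$branch"

# Step 4: add the published CSV link and the spreadsheet source link.
# Both URLs are placeholders.
printf '%s,%s,%s\n' "new physiology" "<published-csv-url>" "<spreadsheet-url>" >> links.csv

# Step 5 would run the sysdata.R script here (for example with Rscript);
# its path inside the repository is not given above, so it is omitted.

# Step 6: add and commit the changes.
git add links.csv
git -c user.name="Example" -c user.email="example@example.org" \
  commit -q -m "Add links for new physiology"

# Step 7: push the branch and open a pull request on GitHub, e.g.
#   git push -u origin "$branch"
git log --oneline
```

In a real clone the `-c user.name` and `-c user.email` options are unnecessary if Git is already configured with your identity.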

Through an API

This is the ideal way of adding data, because it’s easier to update. It’s also the most complicated way of contributing because it requires creating new functions or modifying existing functions in bugphyzz.

  1. Write code to parse and use the API.
  2. Download the GitHub repo waldronlab/bugphyzz.
  3. Create and checkout a new branch with username and an identifier (e.g., sdgamboa/new-physiology). Command example: git checkout -b sdgamboa/new-physiology.
  4. Create functions for using the data from the API and integrating the data into current bugphyzz physiologies (or add new physiologies).
  5. Add and commit.
  6. Create a pull request, asking for a review.