Data Sources Use Cases

Ramona Walls edited this page Oct 9, 2017 · 8 revisions

Use this page to provide a brief description of datasets that may be used to test development of this project.

Rice variation data

Creator: Jorge Duitama

Description: Genomic variation obtained from analysis of Illumina high-throughput sequencing (HTS) data for 94 varieties of Oryza sativa and 10 wild relatives from O. rufipogon and O. nivara. The 94 O. sativa varieties comprise 21 Latin American elite cultivars from the CIAT rice breeding program, 33 North American elite cultivars from USDA, and 40 landraces. The landraces and the 10 wild relatives were sequenced and analyzed previously (doi: 10.1038/nbt.2050). Raw sequencing data were analyzed using the NGSEP pipeline (http://sourceforge.net/projects/ngsep/). Details of the analysis and the main research outcomes obtained from these data are described in doi: 10.1371/journal.pone.0124617. SNPs and small indels for the 104 samples are provided as a single VCF file, whereas predictions of copy number variation (CNVs), large indels and inversions are provided as one file per sample in GFF format.

Subject/Keywords: rice, genomic variants, high throughput sequencing

Status

  • Study is complete and published.
  • Data are published in CyVerse and Dryad and have two DOIs.
  • Mapped to data model.
  • This dataset was used to test the identity algorithms.
  • Once the portal is ready for user testing, we can ask Jorge to register his files.

NEON microbial data

Creator: Lee Stanish

Description: Soil microbial diversity and gene composition data from the National Ecological Observatory Network (NEON) were obtained using high-throughput Illumina sequencing. Marker gene sequences from the 16S ribosomal RNA gene produced relative abundance data of bacteria and archaea. Shotgun metagenomic sequencing was used to determine gene composition of soil microbial assemblages.

16S gene sequence data in fastq format were processed using the QIIME pipeline (v1.8). Briefly, sequence files are demultiplexed to separate sequences by sample. Demultiplexed sequences are passed through the QIIME pipeline with closed-reference OTU clustering based on 97% sequence similarity using uclust, and taxonomic identification carried out using the Greengenes database.
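The idea behind closed-reference clustering can be illustrated with a toy sketch. This is not uclust and not the QIIME implementation: it assumes equal-length sequences and uses simple position-by-position identity, and all sequence names are made up. It only shows the rule that a read is kept when it matches a reference sequence at 97% identity or better and is discarded otherwise.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_assign(reads, references, threshold=0.97):
    """Assign each read to its best-matching reference if identity >= threshold.

    Reads with no reference at or above the threshold are discarded,
    which is what makes the clustering "closed-reference".
    """
    otu_table = {}   # reference id -> list of read ids
    discarded = []   # read ids that matched no reference well enough
    for read_id, seq in reads.items():
        best_ref, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_ref, best_score = ref_id, score
        if best_ref is not None and best_score >= threshold:
            otu_table.setdefault(best_ref, []).append(read_id)
        else:
            discarded.append(read_id)
    return otu_table, discarded


# Hypothetical toy data: two references and three reads of length 100.
references = {"ref1": "A" * 100, "ref2": "C" * 100}
reads = {
    "r1": "A" * 98 + "GG",      # 98% identical to ref1
    "r2": "A" * 95 + "G" * 5,   # 95% identical to ref1, below threshold
    "r3": "C" * 100,            # identical to ref2
}
otus, dropped = closed_reference_assign(reads, references)
```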

Demultiplexed metagenomic sequences in fastq format are processed using the default settings in MG-RAST. Taxonomic identifications are made using the RefSeq database and functional subsystems are generated using SEED.

Currently, all sequence data are on a secure FTP site with restricted access, as well as on a network drive at NEON. The metagenomic data are also located on MG-RAST and are accessioned.

A suite of environmental metadata also exists at NEON on networked drives. The publicly available sequences at MG-RAST are linked to NEON samples via sampleIDs. However, they are not linked with the metadata housed at NEON.

There are approximately 140 metagenomes and 1,500 marker gene samples.
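A minimal sketch of the missing link, assuming both the MG-RAST sequence records and the NEON metadata can be exported as records keyed by sampleID. The sample IDs, accessions and field names below are made up for illustration; they are not real NEON or MG-RAST values.

```python
def link_metadata(sequence_records, sample_metadata):
    """Join sequence records to environmental metadata on sampleID.

    Returns the merged records plus the sampleIDs that had no metadata,
    so that gaps in the linkage are visible rather than silently dropped.
    """
    linked, unmatched = [], []
    for rec in sequence_records:
        meta = sample_metadata.get(rec["sampleID"])
        if meta is None:
            unmatched.append(rec["sampleID"])
        else:
            linked.append({**rec, **meta})
    return linked, unmatched


# Hypothetical records: the IDs and fields are placeholders.
sequence_records = [
    {"sampleID": "SITE1_001", "accession": "mgm-example-1"},
    {"sampleID": "SITE2_007", "accession": "mgm-example-2"},
]
sample_metadata = {
    "SITE1_001": {"soil_pH": 6.1, "soil_temp_C": 14.2},
}
linked, unmatched = link_metadata(sequence_records, sample_metadata)
```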

Status

  • Data generation is ongoing, but a public set of data is ready for testing.
  • Mapped to data model.
  • Need to assess user requirements.

2 Maize Projects

Creator: Nathan Springer, Jawon Song

Description:

My suggestion is that we start with two specific projects. One of these projects is somewhat mature and has recently been published. I would like to get identifiers for some of the final files used in this project and find ways to deposit/create access to these files. The other project is in progress and will have some distinct files created. For each project I will describe the data and some of our goals.

Project 1. WGBS (Whole genome bisulfite sequencing) for 5 diverse maize lines. In this project (published under doi: 10.1104/pp.15.00052) we performed whole genome bisulfite sequencing for five maize genotypes. This resulted in an SRA submission of five fastq files. However, the majority of the manuscript is focused on analyses of 100bp tile files. These files report the outcome of the alignment and analysis of methylation. In each file we list the coordinates of the tile, the coverage and the percent methylation in each of three sequence contexts (CG, CHG and CHH). These are the files that Matt and Jawon plan to use for ZED and would also be the files I would share with outside groups such as MaizeGDB. Ideally, I'd like to have identifiers for these 100bp tile files that would allow them to be associated with the SRA and with a metadata description of how they were created. This is particularly important as the same underlying data (the SRA) may be reused in the future to create a new 100bp tile file with altered algorithms or based on alignment to an updated reference sequence. The manuscript can provide a good description of this dataset and both Qing and Jawon should know where the files for this project are.

Project 2. WGBS for genotype PH207. PH207 is a distinct maize genotype that is currently being used to create a de novo assembly. In this case we have performed a WGBS and have a fastq file that will be deposited at SRA. We will be making 100bp tile files two different ways. In one case we will be aligning the reads to the B73 reference genome to make the data comparable with other genotypes. However, we will also be aligning the reads to the PH207 reference. These will create distinct outputs but both should be linked to the same underlying data and it would be nice to have identifiers for both. This data will likely be part of a publication describing this de novo assembly.
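The requirement in both projects, distinct identifiers for derived tile files that still point back to the same raw reads, could be sketched as follows. This is only an illustration under assumptions: the record fields, the hash-based identifier scheme and the placeholder accession are all hypothetical and are not part of the project's data model.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class TileFileRecord:
    """Hypothetical provenance record for one derived 100bp tile file."""
    source_accessions: tuple   # accessions of the raw reads (e.g. SRA)
    reference_genome: str      # genome the reads were aligned to
    tile_size_bp: int
    method: str                # free-text label for the algorithm/version

    def identifier(self) -> str:
        # Derive the identifier from everything that defines the file,
        # so a new reference or algorithm yields a new identifier.
        key = "|".join([
            ",".join(sorted(self.source_accessions)),
            self.reference_genome,
            str(self.tile_size_bp),
            self.method,
        ])
        return "tile-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Placeholder accession: the PH207 reads had not yet been deposited.
reads = ("PH207_WGBS_READS",)
vs_b73 = TileFileRecord(reads, "B73", 100, "method-v1")
vs_ph207 = TileFileRecord(reads, "PH207", 100, "method-v1")
```

Both records share `source_accessions`, so the link to the underlying data is preserved, while `vs_b73.identifier()` and `vs_ph207.identifier()` differ because the reference genome differs.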

I hope that helps get started and both Qing and Jawon should be able to assist with details. Please let me know if we still should try to have a group call on Friday or not. I will see if I can be available if we think that a call would be important.

Details and collection access:

Here are the SRA numbers for the 5 .sra files. Jawon will be able to help you find the corresponding .fastq files.

  • B73: SRR850328
  • Mo17: SRR850332
  • Oh43: SRX731433
  • CML322: SRX731432
  • Tx303: SRX731434

The fastq files are in the following directory: /corral-tacc/tacc/iplant/vaughn/springer_vaughn/eichten/5genos/

There are two paired-end fastq files for each genotype: [genotype]_all3_R1_val_1.fq and [genotype]_all3_R2_val_2.fq
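As a small sketch, the expected file paths can be generated from the naming pattern above. It assumes that the genotype names appear in the file names exactly as they are written on this page, which has not been checked against the directory listing.

```python
from pathlib import PurePosixPath

# Directory and genotype names as given on this page.
FASTQ_DIR = PurePosixPath(
    "/corral-tacc/tacc/iplant/vaughn/springer_vaughn/eichten/5genos"
)
GENOTYPES = ["B73", "Mo17", "Oh43", "CML322", "Tx303"]


def expected_fastq_paths(genotype, base=FASTQ_DIR):
    """Return the (R1, R2) paths implied by the naming pattern."""
    return (
        base / f"{genotype}_all3_R1_val_1.fq",
        base / f"{genotype}_all3_R2_val_2.fq",
    )


# Map each genotype to its pair of expected fastq paths.
all_paths = {g: expected_fastq_paths(g) for g in GENOTYPES}
```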

To convert .sra to .fastq (same as .fq), use fastq-dump from sratoolkit. We have sratoolkit installed on our system, and instructions on how to use it can be found at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
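A minimal sketch of scripting the conversion, assuming fastq-dump is on the PATH. The `--split-files` option writes the two reads of a pair to separate files and `--outdir` sets the output directory; the function names and the choice of options are illustrative and may differ from how the original files were produced.

```python
import subprocess


def fastq_dump_command(accession, outdir="."):
    """Build the fastq-dump command line for one run accession or .sra file."""
    return ["fastq-dump", "--split-files", "--outdir", outdir, accession]


def convert(accession, outdir="."):
    """Run fastq-dump; raises CalledProcessError if the tool fails."""
    subprocess.run(fastq_dump_command(accession, outdir), check=True)


# Building the command does not require sratoolkit to be installed.
cmd = fastq_dump_command("SRR850328", outdir="fastq")
```

Calling `convert("SRR850328", outdir="fastq")` would then run the tool for the B73 accession listed on this page.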

Status

  • Study is complete and published.
  • Variation data housed on Corral, but does not yet have a DOI. Sequence files on SRA.
  • Mapped to data model.
  • User requirements assessed.

Lungmap images and RNA sequences

Creator: James Carson

The test case that we discussed that appeared to be most appropriate is part of the NHLBI LungMAP initiative (http://grants.nih.gov/grants/guide/rfa-files/RFA-HL-14-008.html). TACC is a participant in one of the LungMAP research centers (the one led by Pacific Northwest National Laboratory (PNNL)) – the Center for Lung Imaging and Omics. The data I work with here at TACC as part of this Center is the imaging data.

LungMAP is focused on the development of the lung around the time of alveolarization – that is, just before and after birth. Understanding the development of the lung at this time point is important for many reasons, but especially for developing better treatments and procedures for working with babies born prematurely. LungMAP data is collected from mouse and human.

Imaging data for this center is collected at Baylor College of Medicine (BCM) and PNNL. Both locations upload their images to an iPlant account which is linked to iPlant’s Bisque online image viewer. I upload metadata to go with the images. Once image data has been inspected, cleaned, and approved by me, I send it to a LungMAP data coordinating center (DCC) which then makes it publicly available at www.lungmap.net.

BCM is collecting High-Throughput mRNA in situ Hybridization (HT-ISH) images. Basically, a thin slice of lung tissue is marked for a specific gene, and cells that contain that gene transcript show up with a dark purple marker. The gene probe has a specific RNA sequence that needs to be tracked. A gene could potentially have multiple different probes designed for it, each performing differently. Each image produced needs to be tracked along with its metadata, and there may be multiple versions of an image if any editing was performed (contrast enhancement, cropping). The tissue specimen needs to be tracked along with its metadata.

Images from PNNL are a type of mass-spec imaging. This method essentially produces hundreds of images from a single tissue section. As with HT-ISH, the specimen information needs to be tracked, and the images need to be tracked along with their metadata.
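The tracking requirements described above (specimen, probe, image and edited image versions) can be sketched as a small set of linked records. The class and field names are hypothetical and are not taken from LungMAP, Bisque or the project's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Specimen:
    specimen_id: str
    species: str                      # e.g. "mouse" or "human"
    metadata: dict = field(default_factory=dict)


@dataclass
class Probe:
    probe_id: str
    gene: str                         # a gene may have several probes
    rna_sequence: str


@dataclass
class Image:
    image_id: str
    specimen: Specimen
    probe: Optional[Probe] = None     # None for mass-spec images
    derived_from: Optional["Image"] = None  # set on edited versions
    edit: Optional[str] = None        # e.g. "cropping"

    def original(self) -> "Image":
        """Follow the derived_from chain back to the unedited image."""
        img = self
        while img.derived_from is not None:
            img = img.derived_from
        return img


# Hypothetical example: one HT-ISH image and a cropped version of it.
specimen = Specimen("spec-1", "mouse")
probe = Probe("probe-1", "GeneX", "ACGU")
raw = Image("img-1", specimen, probe)
cropped = Image("img-1-crop", specimen, probe, derived_from=raw, edit="cropping")
```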

Status

  • Study is near completion.
  • Image data housed on Corral, to be published via CyVerse Bisque.
  • Mapped to data model.
  • Initial user requirements assessed.

1KP data

Creator: 1KP consortium

Contact: Jim Leebens-Mack, University of Georgia

Description: 1000 plant genomes, currently stored in multiple locations. Use case for data identity checks.
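One simple form of identity check for copies stored in multiple locations is to compare content checksums. The sketch below assumes that byte-for-byte equality is the notion of identity wanted; it is not the project's identity algorithm, which is not described on this page.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Group file paths by checksum; each group holds identical copies."""
    groups = {}
    for path in paths:
        groups.setdefault(sha256_of(path), []).append(path)
    return groups
```

Copies of the same file at different locations end up in the same group, and a copy that was truncated or modified in transfer ends up in a group of its own.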

Status

  • Multiple publications have already come out of this project.

i5K data (pending - never used this case)

Creator: Various at USDA

Contact: Cyndy Parr USDA NAL, Monica Poelchau USDA-ARS

Description: 5000 insect genomes in various states of completion. Will need to store sequence data in USDA library, in NCBI, maybe in iPlant.

Status

This use case is in reserve, in case we need more use cases.