tomatula

Tomatula is an application that uses Apache Hadoop and Apache Spark to store and query big Variant Calling datasets. It introduces a columnar storage format, in place of the flat VCF format, for storing Variant Calling data across the Hadoop Distributed File System (HDFS).

Parquet-based data storage format for VCF

This repository contains the scripts to convert a VCF file to the Apache Parquet format. The conversion consists of two steps:

In the first step, the VCF file is parsed and stored as a JSON object. The field names are used as the keys of the JSON dictionary, and the INFO and per-individual fields are further split into embedded dictionaries. The FORMAT field defines the keys of each individual's embedded dictionary, so it is not stored as a separate value in the JSON and Parquet formats.
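
The actual parsing lives in tomatula/vcfTOjson.py; the following is only a minimal sketch of the idea, with an invented helper name and a made-up input line (the sample name LYC2740 is taken from the schema below):

import json

# Minimal sketch of step 1 (illustrative; not the repository's exact code):
# one VCF data line becomes a nested dictionary, with INFO and each
# individual's field split into embedded dictionaries keyed by FORMAT.
def vcf_line_to_dict(line, sample_names):
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    record = {
        "CHROM": int(chrom) if chrom.isdigit() else chrom,
        "POS": int(pos),
        "ID": vid, "REF": ref, "ALT": alt, "QUAL": qual, "FILTER": flt,
        # INFO entries such as "DP=14;AF=0.5" become an embedded dictionary
        "INFO": dict(kv.split("=", 1) if "=" in kv else (kv, "")
                     for kv in info.split(";")),
    }
    # FORMAT (e.g. "GT:DP") supplies the keys of each individual's embedded
    # dictionary, so it is not kept as a separate field.
    keys = fmt.split(":")
    for name, sample in zip(sample_names, fields[9:]):
        record[name] = dict(zip(keys, sample.split(":")))
    return record

print(json.dumps(vcf_line_to_dict(
    "6\t40002123\t.\tA\tG\t1234.5\tPASS\tDP=14;AF=0.5\tGT:DP\t0/1:14",
    ["LYC2740"])))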

In the second step, the JSON object is used to create a new Spark DataFrame and store the data in the new columnar Parquet format. Apache Spark includes a built-in library for loading a JSON object and automatically inferring its schema. The keys of the JSON dictionary become the field names, and the embedded dictionaries are stored as nested columns (groups of Parquet pages) under each field.
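
In Spark 2.x terms, this second step amounts to a few lines of PySpark; this is a sketch under the assumption that the JSON records from step 1 are stored one per line (paths are the ones used in the examples below):

from pyspark.sql import SparkSession

# Sketch of step 2: Spark infers the schema from the JSON records and
# writes them out in the columnar Parquet format.
spark = SparkSession.builder.appName("jsonTOparquet").getOrCreate()

df = spark.read.json("outputs/example.json")   # schema inferred automatically
df.printSchema()                               # prints a schema like the one below
df.write.parquet("outputs/example.parquet")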

The following is part of the schema used to store the variant calling data (only one individual's sub-schema is shown here).

|-- CHROM: long (nullable = true)
|-- POS: long (nullable = true)
|-- ID: string (nullable = true)
|-- REF: string (nullable = true)
|-- ALT: string (nullable = true) 
|-- QUAL: string (nullable = true)
|-- FILTER: string (nullable = true)
|-- INFO: struct (nullable = true)
|    |-- AB: string (nullable = true)
|    |-- ABP: string (nullable = true)
|    |-- AC: string (nullable = true)
|    |-- AF: string (nullable = true)
|    |-- AN: string (nullable = true)
|    |-- ANN: string (nullable = true)
|    |-- AO: string (nullable = true)
|    |-- CIGAR: string (nullable = true)
|    |-- DP: string (nullable = true)
|    |-- DPB: string (nullable = true)
|    |-- DPRA: string (nullable = true)
|    |-- EPP: string (nullable = true)
|    |-- EPPR: string (nullable = true)
|    |-- GTI: string (nullable = true)
|    |-- LEN: string (nullable = true)
|    |-- LOF: string (nullable = true)
|    |-- MEANALT: string (nullable = true)
|    |-- MQM: string (nullable = true)
|    |-- MQMR: string (nullable = true)
|    |-- NMD: string (nullable = true)
|    |-- NS: string (nullable = true)
|    |-- NUMALT: string (nullable = true)
|    |-- ODDS: string (nullable = true)
|    |-- PAIRED: string (nullable = true)
|    |-- PAIREDR: string (nullable = true)
|-- LYC2740: struct (nullable = true)
|    |-- AO: string (nullable = true)
|    |-- DP: string (nullable = true)
|    |-- GL: string (nullable = true)
|    |-- GT: string (nullable = true)
|    |-- QA: string (nullable = true)
|    |-- QR: string (nullable = true)
|    |-- RO: string (nullable = true)

The repository contains an example dataset of variant calling data from different types of the tomato clade, in both VCF and Parquet format. The example spans positions 40000000 to 40004000 of chromosome 6 of the tomato genome (from Aflitos et al. 2014).

Installation

To run the examples you need to have Apache Spark installed. Instructions for downloading and building Spark are available at the official Apache Spark website.

Example queries

Assuming that you have installed Spark and $SPARK_HOME is set in your environment, you can clone this repository with

git clone https://github.com/bigwaff/tomatula.git

Two example queries for finding the allele frequencies within a selected range on the genome are included, one for each type of input file.

$SPARK_HOME/bin/spark-submit tomatula/AlleleFrequencyVCF.py --input examples/example.vcf --output outputs/outputFile1.txt --chr 6 --from 40002000 --to 40003000
$SPARK_HOME/bin/spark-submit tomatula/AlleleFrequencyParquet.py --input examples/example.parquet --output outputs/outputFile2.txt --chr 6 --from 40002000 --to 40003000
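
The scripts themselves are the authoritative implementation; as a rough sketch of what such a query looks like on the Parquet side (the use of the INFO.AC and INFO.AN fields is an assumption here, taken from the schema above), one could write:

from pyspark.sql import SparkSession, functions as F

# Rough sketch of an allele-frequency query over the Parquet data (a guess
# at the script's logic, not its exact code): filter the requested region,
# then derive the frequency from INFO.AC (alternate allele count) and
# INFO.AN (total allele number). For multi-allelic sites AC is
# comma-separated and would need splitting first.
spark = SparkSession.builder.appName("AlleleFrequencyParquet").getOrCreate()

df = spark.read.parquet("examples/example.parquet")
region = df.filter((F.col("CHROM") == 6) &
                   F.col("POS").between(40002000, 40003000))

region.select(
    "CHROM", "POS", "REF", "ALT",
    (F.col("INFO.AC").cast("double") /
     F.col("INFO.AN").cast("double")).alias("allele_frequency"),
).show()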

For the conversion from VCF to parquet via JSON try the following code:

$SPARK_HOME/bin/spark-submit tomatula/vcfTOjson.py --input examples/example.vcf --output outputs/example.json

$SPARK_HOME/bin/spark-submit tomatula/jsonTOparquet.py --input outputs/example.json --output outputs/example.parquet

Note 1
In case you are using a remote cluster, depending on the configuration, input and output files may need to be given as absolute paths, as in:

$SPARK_HOME/bin/spark-submit tomatula/AlleleFrequencyVCF.py --input file://$(pwd)/examples/example.vcf --output file://$(pwd)/outputs/outputFile1.txt --chr 6 --from 40002000 --to 40003000

Note 2
You can specify the number of executors to be used by the application by adding the --num-executors option to spark-submit, e.g. --num-executors 4.

Note 3
To allow saving output in CSV format on Spark versions earlier than 2.0, spark-submit may need the optional --packages argument, e.g. --packages com.databricks:spark-csv_2.10:1.4.0, as in:

$SPARK_HOME/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 tomatula/AlleleFrequencyVCF.py --input file://$(pwd)/examples/example.vcf --output file://$(pwd)/outputs/outputFile1.txt --chr 6 --from 40002000 --to 40003000

Attribution and feedback

Please send your feedback to A. Boufea or I.N. Athanasiadis.

Cite this work as:

A. Boufea, R. Finkers, M. van Kaauwen, M. Kramer and I.N. Athanasiadis, Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet, Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT '17). ACM, 2017, p. 219-226, doi:10.1145/3148055.3148060

Experimental data have been deposited on Zenodo.
