ABySS Blackbox Optimization

Nivretta Thatra October 15, 2016

Abstract

  • Parameter optimization is often a manual and tedious process: submitting jobs to a scheduler, rerunning failed jobs, inspecting outputs, tweaking parameters, and repeating. Genome sequence assembly, for example, involves a variety of parameters related to the expected coverage of the reads and to heuristics that remove read errors and collapse heterozygous variation.

  • Black-box parameter optimization tools do exist, but their usability and speed need to be compared and evaluated.

  • Approaches taken by optimization tools:

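    • Exploitation Approach: Start at the midpoint of the range, try a point to the left and a point to the right, move to the better one, and iterate
    • Exploration Approach: Try a set of randomly chosen points across the parameter range
    • Grid Search Approach: Try every point on a regularly spaced grid over the parameter range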
  • Here we evaluate three optimization tools (Opal, SpearMint, and ParOpt) on a human bacterial artificial chromosome (BAC) dataset, using the assembler ABySS.

  • We find that
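
As a rough illustration of the three approaches listed above, the following Python sketch applies each one to a single integer parameter k. The objective `toy_ng50` is a hypothetical stand-in for an actual assembly run (it simply peaks at k = 48), and the function names, ranges, and step sizes are assumptions made for this example. None of this reflects how Opal, SpearMint, or ParOpt work internally.

```python
import random


def toy_ng50(k):
    """Hypothetical stand-in for an assembly run; peaks at k = 48."""
    return 10_000 - 25 * (k - 48) ** 2


def grid_search(objective, lo, hi, step):
    """Grid search: evaluate every point on a regular grid, keep the best."""
    return max(range(lo, hi + 1, step), key=objective)


def random_search(objective, lo, hi, trials, seed=0):
    """Exploration: evaluate randomly chosen points, keep the best."""
    rng = random.Random(seed)
    candidates = [rng.randint(lo, hi) for _ in range(trials)]
    return max(candidates, key=objective)


def local_search(objective, lo, hi, step):
    """Exploitation: start halfway, try left and right, move to the better
    neighbour, and stop when neither neighbour improves the objective."""
    current = (lo + hi) // 2
    while True:
        neighbours = [k for k in (current - step, current + step)
                      if lo <= k <= hi]
        best = max(neighbours, key=objective, default=current)
        if objective(best) <= objective(current):
            return current
        current = best


if __name__ == "__main__":
    print("grid  :", grid_search(toy_ng50, 20, 96, 4))
    print("random:", random_search(toy_ng50, 20, 96, trials=10))
    print("local :", local_search(toy_ng50, 20, 96, 4))
```

On a smooth, single-peaked objective like this toy one, the local search needs only a handful of evaluations, whereas the grid search evaluates every grid point. On a real assembler the objective may have several local optima, which is where the exploration-style approaches tend to help.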

Goals/Methods

Evaluate 3 parameter optimization tools for usability and speed

  • Divvy up 3 tools for testing

  • Download ABySS

  • Access dataset from ORCA computing machine

  • Backend

    • Identify the types of input parameters (discrete with large or small ranges, continuous, binary)
    • Make it portable to other command-line tools, so that the optimizer can be told how to launch the tool
  • Results output

    • Generate a plot of the target metric against the parameters
    • Generate the Pareto frontier of the target metric and a second metric of interest (contiguity and correctness), likely in R using ggplot
    • Generate a report of the results of the optimization
  • Write a short report of our experience

  • Post on GitHub pages

  • Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)

  • Possibly submit for peer review, such as F1000Research Hackathons
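
The Pareto frontier mentioned under "Results output" can be sketched as follows. This is a minimal Python illustration of the computation only, not the R/ggplot plotting the plan refers to. It assumes each optimization run is summarized as a hypothetical pair `(contiguity, errors)`, where contiguity (for example NG50) is to be maximized and errors (for example breakpoints) minimized; the sample numbers are made up.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    Each point is (contiguity, errors): higher contiguity is better,
    fewer errors is better.
    """
    # Visit points from highest contiguity to lowest; among equal
    # contiguity, the one with fewest errors comes first.
    ordered = sorted(points, key=lambda p: (-p[0], p[1]))
    frontier = []
    best_errors = float("inf")
    for contiguity, errors in ordered:
        # A point is kept only if it has strictly fewer errors than every
        # point seen so far with equal or higher contiguity.
        if errors < best_errors:
            frontier.append((contiguity, errors))
            best_errors = errors
    return frontier


if __name__ == "__main__":
    # Made-up (contiguity, errors) pairs, one per hypothetical run.
    runs = [(100, 5), (120, 8), (90, 2), (110, 9), (120, 10), (80, 2)]
    for point in pareto_frontier(runs):
        print(point)
```

The points returned are the trade-off candidates: moving along the frontier, any gain in contiguity costs additional errors.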

Dataset(s) and Optimizers

  • Dataset

  • A human bacterial artificial chromosome (BAC), assembled with ABySS

  • Metrics

  • The key metrics are contiguity (1) and correctness (2 through 4).

    1. contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
    2. number of breakpoints when aligned to the reference as a proxy for misassemblies
    3. number of mismatched nucleotides when aligned to the reference, expressed as a Phred-style quality score, Q = -10 * log10(mismatches / total_aligned)
    4. completeness, number of reference bases covered by aligned contigs divided by number of reference bases
  • We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics.

  • The primary parameter we'll be optimizing is k (a fundamental parameter of nearly all de Bruijn graph assemblers), and there are a number of other parameters that we can adjust (typically thresholds related to expected coverage).
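
To make the metrics above concrete, here is a small Python sketch of how N50, NG50, the quality score Q, and completeness could be computed from summary numbers. It assumes the logarithm in Q is base 10, as in Phred quality scores, and the function names are hypothetical. In practice these values would normally come from an assembly evaluation tool rather than hand-written code.

```python
import math


def nx50(contig_lengths, total=None):
    """Length L such that contigs of length >= L cover at least half of
    `total`. With total=None, `total` is the assembly size (N50); passing
    the genome size instead gives NG50. For NA50/NGA50, pass the lengths
    of aligned blocks rather than contigs."""
    if total is None:
        total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs cover less than half of `total`


def q_score(mismatches, total_aligned):
    """Q = -10 * log10(mismatches / total_aligned); base 10 is assumed."""
    if mismatches == 0:
        return math.inf
    return -10 * math.log10(mismatches / total_aligned)


def completeness(covered_bases, reference_bases):
    """Fraction of reference bases covered by aligned contigs."""
    return covered_bases / reference_bases


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # made-up contig lengths
    print("N50 :", nx50(contigs))
    print("NG50:", nx50(contigs, total=400))  # made-up genome size
    print("Q   :", q_score(10, 100_000))
    print("completeness:", completeness(380, 400))
```

Because NG50 is measured against the genome size rather than the assembly size, it lets assemblies of different total length be compared on the same scale, which is why it is a convenient target for the optimizer.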

  • Optimizers being evaluated

Results