Define a new format for the biobox.yaml input data #207

michaelbarton · 2017-02-17T17:48:06Z

The existing version has shortcomings for more complex bioinformatics tasks
such as read profiling. I've created this issue to start the discussion for a
new version of the biobox.yaml format.

@pbelmann, before we discuss what the implementation of the new format could
look like, I think it would be good to first describe the current problems we
are having with the existing one. This will help us determine what problems the
new format should solve. Could you describe the following as a starting point?

What are the problems the existing format creates?
What does the existing format prevent the bioboxes/users from doing?

pbelmann · 2017-02-20T12:00:21Z

> What are the problems the existing format creates?

I have been thinking about our discussion on Friday. I agree that it makes sense to have the arguments in a specific order, for example when a tool uses multiple FASTA files in two different contexts.

Regarding this I will update the profiling format to:

version: 1.0.0
arguments:
  - reads:
    - type: fastq
      value: /path/to/fastq
  - databases:
    - value: /path/to/ncbi_dump
      type: bioboxes.org:/taxonomy_ncbi_dumps

There is still one thing that I think is quite important:
The optional "cache" tag must be supported by the bioboxes-py library, because many profiling and binning tools download or build their own database, which cannot be standardized. With a cache directory, a tool could download the database into it on the first run and reuse it on later runs.

We could extend the bioboxes-py library/ interface like this:

version: 1.0.0
arguments:
  [...]
  - cache:
    - value: /path/to/cache
      type: cache

or this:

version: 1.0.0
arguments:
  [...]
  cache:  /path/to/cache

I prefer the second, since there will be just one cache directory anyway.
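
To make the trade-off concrete, here is a minimal sketch of how a biobox could look up the cache directory under either proposal. It works on already-parsed YAML (plain Python dicts and lists) so it needs no YAML library. `find_cache_dir` is a hypothetical helper, not part of bioboxes-py, and the parsed shape of the second form (a list entry mapping `cache` straight to a path) is an assumption, since the snippet above elides the surrounding entries.

```python
from typing import Optional


def find_cache_dir(biobox: dict) -> Optional[str]:
    """Return the cache directory from a parsed biobox.yaml, or None if absent."""
    for entry in biobox.get("arguments", []):
        if "cache" not in entry:
            continue
        cache = entry["cache"]
        # Second proposal (assumed parsed shape): the key maps straight to a path.
        if isinstance(cache, str):
            return cache
        # First proposal: a list holding one {value, type} mapping.
        if isinstance(cache, list) and cache:
            return cache[0].get("value")
    return None


first_form = {
    "version": "1.0.0",
    "arguments": [{"cache": [{"value": "/path/to/cache", "type": "cache"}]}],
}
second_form = {
    "version": "1.0.0",
    "arguments": [{"cache": "/path/to/cache"}],
}

print(find_cache_dir(first_form))
print(find_cache_dir(second_form))
```

Supporting both shapes costs one extra branch, so the choice between them is mostly about what is easier to write by hand and to validate.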

michaelbarton · 2017-02-27T22:08:52Z

I'll make the following three observations, which I see as problems for the
current biobox.yaml format:

1. As a developer, it is difficult to understand what the schema file is and
   how to integrate it when creating a biobox. We (the biobox project creators)
   know that this file is used to ensure that the input parameters and data are
   valid. When I introduce outside developers to creating bioboxes, they find
   this schema file difficult to read, and its use is not clear to them without
   further explanation.
2. As a developer, it is difficult to understand how to extract the required
   fields from the biobox.yaml and integrate them into the command line
   arguments for software. This requires additional explanation of how the
   format is defined, and the developer must create some kind of automated
   script to parse the required fields out of it.
3. As a user, it's often difficult to create a biobox.yaml file for the same
   reasons. The format is obscure when you're not as familiar with it as we
   both are.

I raise this because I think the format of the cache keyword is perhaps a
symptom of a larger problem with deciding the format. I have some initial
suggestions of how we can improve this in the longer term to make developing
new bioboxes easier for others outside of the core developers. I would first
like to hear your thoughts on what you've seen as the problems when trying to
get others to write and create bioboxes.

pbelmann · 2017-02-28T08:32:36Z

> As a developer, it is difficult to understand what the schema file is and
> how to integrate it when creating a biobox. We (the biobox project creators)
> know that this file is used to ensure that the input parameters and data are
> valid. When I introduce outside developers to creating bioboxes, they find
> this schema file difficult to read, and its use is not clear to them without
> further explanation.

Yes, I also think it is difficult to read, but since we can use the command line tool, I don't think it is a problem anymore.

> As a developer, it is difficult to understand how to easily extract the
> required fields from the biobox.yaml to integrate them into the command
> line arguments for software. This requires additional explanation of how
> the format is defined, and that the developer must create some kind of
> automated script to parse the required fields out of it.

Yes, it is difficult for someone who is not familiar with creating bioboxes.
We could try to build a tool that makes it easier to extract the arguments.
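
As a rough illustration of what such an extraction tool might do, the sketch below pulls the `value` fields for one argument name out of a parsed biobox.yaml and turns them into command line flags. The input mirrors the profiling example earlier in this thread. `values_for`, the `profiler` executable, and the `--reads` and `--db` flags are all hypothetical names chosen for illustration.

```python
from typing import List


def values_for(biobox: dict, name: str) -> List[str]:
    """Collect the 'value' fields stored under one argument name, in file order."""
    values = []
    for entry in biobox.get("arguments", []):
        for item in entry.get(name, []):
            values.append(item["value"])
    return values


# Parsed form of the profiling example from earlier in this thread.
biobox = {
    "version": "1.0.0",
    "arguments": [
        {"reads": [{"type": "fastq", "value": "/path/to/fastq"}]},
        {"databases": [{"value": "/path/to/ncbi_dump",
                        "type": "bioboxes.org:/taxonomy_ncbi_dumps"}]},
    ],
}

# 'profiler', '--reads' and '--db' are made-up names for this sketch.
command = ["profiler"]
for path in values_for(biobox, "reads"):
    command += ["--reads", path]
for path in values_for(biobox, "databases"):
    command += ["--db", path]

print(" ".join(command))
```

A helper like this would let a biobox author map argument names to flags without having to study the layout of the file first.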

> As a user, it's often difficult to create a biobox.yaml file for the same
> above reasons. The format is obscure when you're not as familiar with it as
> we both are.
>
> I raise this because I think the format of the cache keyword is perhaps a
> symptom of a larger problem with deciding the format.

The cache keyword is, in my opinion, the only solution for tools that use their own custom database, and a lot of binning and profiling tools do this. The other point is that tools should not save intermediate files (which can be large) into the container. As a user, I would like to provide a path to a different, maybe bigger, volume where intermediate files can be stored. I think this is mandatory for many tools used in metagenomics.
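
The reuse pattern I have in mind could look roughly like the sketch below: build or download the database into the cache directory only if it is not already there. `ensure_database` and `fake_build` are hypothetical names, and the build step is a stand-in that writes a small file instead of downloading anything.

```python
import os
import tempfile
from typing import Callable


def ensure_database(cache_dir: str, name: str, build: Callable[[str], None]) -> str:
    """Return the path to a database in cache_dir, building it only if missing."""
    os.makedirs(cache_dir, exist_ok=True)
    db_path = os.path.join(cache_dir, name)
    if not os.path.exists(db_path):
        # Build under a temporary name first so an interrupted run does not
        # leave a half-written database that a later run would trust.
        tmp_path = db_path + ".partial"
        build(tmp_path)
        os.replace(tmp_path, db_path)
    return db_path


calls = []


def fake_build(path: str) -> None:
    """Stand-in for a real download or index step."""
    calls.append(path)
    with open(path, "w") as handle:
        handle.write("database contents\n")


with tempfile.TemporaryDirectory() as cache_dir:
    first = ensure_database(cache_dir, "taxonomy.db", fake_build)
    second = ensure_database(cache_dir, "taxonomy.db", fake_build)
    built_once = len(calls) == 1 and first == second

print(built_once)
```

In a real biobox the cache directory would be the mounted volume named in the biobox.yaml, so the second container run would find the database from the first.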

> I have some initial
> suggestions of how we can improve this in the longer term to make developing
> new bioboxes easier for others outside of the core developers. I would first
> like to hear your thoughts on what you've seen as the problems when trying to
> get others to write and create bioboxes.

Every time I ask a developer to build a biobox, I provide an example first.
However, this requires the developer to investigate the code (for example, how to extract the fields), and in some cases that is quite difficult.

michaelbarton · 2017-03-02T01:44:07Z

Thanks for your feedback Peter. I suggest the following solutions to these
issues.

Create a tool that takes a biobox signature string and generates the validation
YAML file. I think this would solve the first problem, where the purpose of the
validation file is not clear for people new to creating bioboxes. For example
the tool could take the string "[FASTQ] -> FASTA" and do all the additional
input validation steps based on the signature. This would also have the
advantage of 'closing the loop' between defining a biobox signature and what
the arguments should be, and how the RFCs should be written. Specifically we
could discuss each biobox in terms of the signature and that would define
everything else.
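
To give a feel for the first step of such a tool, here is a minimal sketch that parses a signature string into a small data structure. It is not the prototype mentioned later in this comment. `parse_term` and `parse_signature` are hypothetical names, the lower-casing of type names is an assumption, and the sketch stops before generating any validation YAML. Splitting inputs on `+` follows the signature notation used further down in this comment.

```python
from typing import Dict


def parse_term(term: str) -> Dict[str, object]:
    """Turn 'FASTQ' or '[FASTQ]' into a small description of one argument."""
    term = term.strip()
    is_list = term.startswith("[") and term.endswith("]")
    name = term[1:-1].strip() if is_list else term
    if not name:
        raise ValueError(f"empty term in signature: {term!r}")
    return {"type": name.lower(), "list": is_list}


def parse_signature(signature: str) -> Dict[str, object]:
    """Split 'inputs -> output' and describe each side."""
    left, arrow, right = signature.partition("->")
    if not arrow:
        raise ValueError("signature needs '->' between inputs and output")
    inputs = [parse_term(part) for part in left.split("+")]
    return {"inputs": inputs, "output": parse_term(right)}


parsed = parse_signature("[FASTQ] -> FASTA")
print(parsed)
```

From a structure like this, a generator could decide for each input whether to expect a single entry or a list in the biobox.yaml.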

> The cache keyword is in my opinion the only solution to tools that use their
> own custom database. And there are a lot of binning and profiling tools that
> do this. The other point is that tools should not save intermediate files
> (that could be large) into the container. As a user I would like to provide a
> path to a different, maybe bigger volume where intermediate files could be
> stored. I think this is mandatory for many tools used in metagenomics.

I agree. When I referred to this as a symptom I was speaking to the point of
how to define this in the argument list, rather than whether it should be
included at all. It is definitely undesirable to store large files in Docker
images. I think the kind of metagenomics tool you're referring to would have a
signature something like:

[fastq] + Maybe cache -> ...

Where the Maybe keyword would mean that the signature validation tool would
allow the cache entry to not appear in the biobox.yaml, but if it did appear
then it should be validated. Therefore I think that all biobox.yaml entries for
every tool type should have the format:

- name:
    value: ...
    type: ...

Or for lists:

- name:
  - value: ...
    type: ...
  - value: ...
    type: ...

So for example with cache:

- cache:
    value: /path/to/dir
    type: directory

Or for lists such as [fastq]:

- fastq:
  - value: /path/to/file
    type: fastq
  - value: /path/to/file
    type: fastq
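
A validator for this layout could be sketched as follows. It checks only the shape described above (a single `value`/`type` item, or a list of them) and treats `Maybe` arguments as optional. It does not check that the `type` strings themselves are correct. The `spec` dictionary is written by hand here and stands in for whatever a signature tool would produce; `validate`, `is_item`, and the `list` and `maybe` keys are hypothetical names.

```python
from typing import Dict, List


def is_item(obj: object) -> bool:
    """True for a mapping that carries both 'value' and 'type'."""
    return isinstance(obj, dict) and "value" in obj and "type" in obj


def validate(arguments: List[dict], spec: Dict[str, dict]) -> List[str]:
    """Return a list of problems; an empty list means the arguments are valid."""
    problems = []
    seen = set()
    for entry in arguments:
        for name, body in entry.items():
            seen.add(name)
            if name not in spec:
                problems.append(f"unknown argument: {name}")
            elif spec[name]["list"]:
                if not (isinstance(body, list) and body and all(map(is_item, body))):
                    problems.append(f"{name}: expected a list of value/type items")
            elif not is_item(body):
                problems.append(f"{name}: expected a single value/type item")
    for name, rule in spec.items():
        # 'Maybe' arguments may be left out; everything else is required.
        if not rule["maybe"] and name not in seen:
            problems.append(f"missing argument: {name}")
    return problems


# Hand-written spec for the signature '[fastq] + Maybe cache -> ...'.
spec = {
    "fastq": {"list": True, "maybe": False},
    "cache": {"list": False, "maybe": True},
}

with_cache = [
    {"fastq": [{"value": "/path/to/file", "type": "fastq"}]},
    {"cache": {"value": "/path/to/dir", "type": "directory"}},
]
without_cache = [{"fastq": [{"value": "/path/to/file", "type": "fastq"}]}]
bad_cache = without_cache + [{"cache": "/path/to/dir"}]

print(validate(with_cache, spec))
print(validate(bad_cache, spec))
```

Because every entry has the same `name` to `value`/`type` shape, one generic check covers all tool types, and the per-tool part reduces to the spec.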

In the short term, a tool to do this doesn't exist in the format we would
immediately need. However, I have built a prototype that is quite close. I
therefore suggest we continue writing the biobox.yaml validation files in the
short term, and develop the validation tool in the medium term. I hope this
also answers your specific question on my opinion about the cache keyword.

> Yes it is difficult for someone who is not familiar with creating bioboxes.
> We could try to build a tool that makes it easier extracting the arguments.

I agree. I think a tool for extracting the inputs from the biobox.yaml would
make it easier for developers to get the arguments they need. I also think a
tool to create the biobox.yaml would be useful for anyone who doesn't want to
use the command line tool.
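
A tool that creates the file could be as small as the sketch below, which renders the list layout used in the profiling example. `render_biobox` is a hypothetical name. Values are quoted with `json.dumps`, relying on JSON strings being valid YAML double-quoted scalars, so the output is more heavily quoted than a hand-written file would be.

```python
import json
from typing import List, Tuple

# Each argument is a name plus a list of (value, type) pairs.
Argument = Tuple[str, List[Tuple[str, str]]]


def render_biobox(version: str, arguments: List[Argument]) -> str:
    """Render biobox.yaml text for a list of named arguments, in order."""
    lines = [f"version: {json.dumps(version)}", "arguments:"]
    for name, items in arguments:
        lines.append(f"  - {name}:")
        for value, type_ in items:
            lines.append(f"    - value: {json.dumps(value)}")
            lines.append(f"      type: {json.dumps(type_)}")
    return "\n".join(lines) + "\n"


text = render_biobox("1.0.0", [("reads", [("/path/to/fastq", "fastq")])])
print(text)
```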

pbelmann · 2017-03-02T14:41:11Z

@michaelbarton I updated the profiling interface in PR #210 according to our discussion. Please merge if you agree.

michaelbarton · 2017-03-03T01:07:20Z

Thanks Peter. I've merged this. It might be worth discussing what we should do with the ID field and how useful it still is. I don't see any tools using it so far.

pbelmann · 2017-03-03T08:13:44Z

> Thanks Peter. I've merged this. It might be worth discussing what we should do with the ID field and how useful it still is. I don't see any tools using it so far.

I agree. I think we introduced the id field for the fragment size parameters.

michaelbarton assigned pbelmann and michaelbarton Feb 17, 2017

pbelmann added this to the 3./10 March Meeting milestone Feb 20, 2017

This was referenced Feb 20, 2017

Command Line Interface should have a run command for Profiling tools #208

Open

Tutorial for using profiling Bioboxes. #209

Open

pbelmann added a commit to pbelmann/rfc that referenced this issue Mar 2, 2017

fix(profiling): use yaml format aggreed on issue bioboxes#207

dcc1621

pbelmann added a commit to pbelmann/rfc that referenced this issue Mar 2, 2017

feature(profiling): updated schema issue bioboxes#207

5a4b261

pbelmann mentioned this issue Mar 2, 2017

updated profiling interface see #210

Merged

pbelmann mentioned this issue Mar 10, 2017

feature(container): allow key values in biobox file and configurable version number bioboxes/bioboxes-py#34

Merged
