Define a new format for the biobox.yaml input data #207

Open
michaelbarton opened this issue Feb 17, 2017 · 7 comments

@michaelbarton
Contributor

The existing version has shortcomings for more complex bioinformatics tasks
such as read profiling. I've created this issue to start the discussion for a
new version of the biobox.yaml format.

@pbelmann, before we discuss what the implementation of the new format could
look like, I think it would be good to first describe the current problems we
are having with the existing one. This will help us determine what problems the
new format should solve. Could you describe the following as a starting point?

  • What are the problems the existing format creates?

  • What does the existing format prevent the bioboxes/users from doing?

@pbelmann
Member

> What are the problems the existing format creates?

I have been thinking about our discussion on Friday. I agree that it makes sense to have the arguments in a specific order, for example when a tool uses multiple FASTA files in two different contexts.

Regarding this I will update the profiling format to:

version: 1.0.0
arguments:
  - reads:
    - type: fastq
      value: /path/to/fastq
  - databases:
    - value: /path/to/ncbi_dump
      type: bioboxes.org:/taxonomy_ncbi_dumps  

There is still one thing that I think is quite important:
The optional "cache" tag must be supported by the bioboxes-py library, because many profiling and binning tools download or build their own database, which cannot be standardized. With a cache directory, a tool could download the database into it and reuse it on a second run.

We could extend the bioboxes-py library/interface like this:

version: 1.0.0
arguments:
  [...]
  - cache:
    - value: /path/to/cache
      type: cache

or this:

version: 1.0.0
arguments:
  [...]
  cache:  /path/to/cache  

I prefer the second, since there will be just one cache directory anyway.
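
As a rough sketch of the reuse described above, a tool could check the cache directory before building its database. The function name `ensure_database`, the `build` callable, and the file names `database` and `database.complete` are all hypothetical and not part of bioboxes-py:

```python
import tempfile
from pathlib import Path


def ensure_database(cache_dir, build):
    """Return the database path inside cache_dir, building it only once.

    `build` is a hypothetical callable that writes the database to the
    path it is given (for example by downloading it).
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    database = cache / "database"
    marker = cache / "database.complete"
    if not marker.exists():
        build(database)
        # The marker is written last, so an interrupted build is not reused.
        marker.touch()
    return database
```

On a second run with the same cache directory the marker file already exists, so `build` is skipped and the existing database path is returned.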

@michaelbarton
Contributor Author

I'll make the following three observations, which I see as problems for the
current biobox.yaml format:

  • As a developer, it is hard to understand what the schema file is and how
    to integrate it when creating a biobox. We (the biobox project creators)
    know that this file is used to ensure that the input parameters and data
    are valid. When I introduce outside developers to creating bioboxes, this
    schema file is difficult to read, and its use is not clear without
    further explanation.

  • As a developer, it is difficult to understand how to easily extract the
    required fields from the biobox.yaml to integrate them into the command
    line arguments for software. This requires additional explanation of how
    the format is defined, and that the developer must create some kind of
    automated script to parse the required fields out of it.

  • As a user, it's often difficult to create a biobox.yaml file for the
    reasons above. The format is obscure when you're not as familiar with it as
    we both are.

I raise this because I think the format of the cache keyword is perhaps a
symptom of a larger problem with deciding the format. I have some initial
suggestions of how we can improve this in the longer term to make developing
new bioboxes easier for others outside of the core developers. I would first
like to hear your thoughts on what you've seen as the problems when trying to
get others to write and create bioboxes.

@pbelmann
Member

> As a developer, it is hard to understand what the schema file is and how
> to integrate it when creating a biobox. We (the biobox project creators)
> know that this file is used to ensure that the input parameters and data
> are valid. When I introduce outside developers to creating bioboxes, this
> schema file is difficult to read, and its use is not clear without
> further explanation.

Yes, I also think it is difficult to read, but since we can use the command line tool, I don't think it is a problem anymore.

> As a developer, it is difficult to understand how to easily extract the
> required fields from the biobox.yaml to integrate them into the command
> line arguments for software. This requires additional explanation of how
> the format is defined, and that the developer must create some kind of
> automated script to parse the required fields out of it.

Yes, it is difficult for someone who is not familiar with creating bioboxes.
We could try to build a tool that makes it easier to extract the arguments.
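
A minimal sketch of what such an extraction helper might look like, assuming the biobox.yaml has already been parsed into Python dicts and lists (for example with PyYAML's `yaml.safe_load`) and follows the profiling format proposed earlier in this thread. The helper name `extract_values` and the `profiler` command are hypothetical:

```python
def extract_values(biobox, name, type_=None):
    """Return the `value` fields listed under `name` in the arguments list.

    If `type_` is given, only entries with that `type` are returned.
    """
    values = []
    for argument in biobox["arguments"]:
        for entry in argument.get(name, []):
            if type_ is None or entry.get("type") == type_:
                values.append(entry["value"])
    return values


# The structure a YAML parser would return for the profiling example.
biobox = {
    "version": "1.0.0",
    "arguments": [
        {"reads": [{"type": "fastq", "value": "/path/to/fastq"}]},
        {"databases": [{"value": "/path/to/ncbi_dump",
                        "type": "bioboxes.org:/taxonomy_ncbi_dumps"}]},
    ],
}

# Build a command line for a hypothetical tool called `profiler`.
command = ["profiler", "--reads", *extract_values(biobox, "reads", "fastq")]
```

Entries that are absent, such as an optional cache, simply yield an empty list, so the caller can decide whether that is an error.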

> As a user, it's often difficult to create a biobox.yaml file for the
> reasons above. The format is obscure when you're not as familiar with it as
> we both are.
>
> I raise this because I think the format of the cache keyword is perhaps a
> symptom of a larger problem with deciding the format.

The cache keyword is in my opinion the only solution for tools that use their own custom database, and there are a lot of binning and profiling tools that do this. The other point is that tools should not save intermediate files (that could be large) into the container. As a user I would like to provide a path to a different, maybe bigger volume where intermediate files could be stored. I think this is mandatory for many tools used in metagenomics.

> I have some initial
> suggestions of how we can improve this in the longer term to make developing
> new bioboxes easier for others outside of the core developers. I would first
> like to hear your thoughts on what you've seen as the problems when trying to
> get others to write and create bioboxes.

Every time I ask a developer to build a biobox, I provide an example first.
However, this requires the developer to investigate the code (for example how to extract the fields), and in some cases that is quite difficult.

@michaelbarton
Contributor Author

Thanks for your feedback Peter. I suggest the following solutions to these
issues.

Create a tool that takes a biobox signature string and generates the validation
YAML file. I think this would solve the first problem, where the purpose of the
validation file is not clear for people new to creating bioboxes. For example
the tool could take the string "[FASTQ] -> FASTA" and do all the additional
input validation steps based on the signature. This would also have the
advantage of 'closing the loop' between defining a biobox signature and what
the arguments should be, and how the RFCs should be written. Specifically we
could discuss each biobox in terms of the signature and that would define
everything else.
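
To illustrate the idea, here is a minimal sketch of how a signature string could be parsed into a description of the expected inputs and output. No such tool or grammar is defined in this thread, so the function name `parse_signature`, the returned structure, and the use of `+` to separate multiple inputs are assumptions:

```python
def parse_signature(signature):
    """Split a signature such as "[FASTQ] -> FASTA" into inputs and output.

    Square brackets mark an input that is a list of files.
    """
    left, arrow, right = signature.partition("->")
    if not arrow:
        raise ValueError("signature must contain '->'")
    inputs = []
    for term in left.split("+"):
        term = term.strip()
        is_list = term.startswith("[") and term.endswith("]")
        name = term[1:-1].strip() if is_list else term
        inputs.append({"type": name.lower(), "list": is_list})
    return {"inputs": inputs, "output": right.strip().lower()}
```

A validation file generator could then walk the `inputs` list and emit one schema entry per input.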

> The cache keyword is in my opinion the only solution for tools that use their
> own custom database, and there are a lot of binning and profiling tools that
> do this. The other point is that tools should not save intermediate files
> (that could be large) into the container. As a user I would like to provide a
> path to a different, maybe bigger volume where intermediate files could be
> stored. I think this is mandatory for many tools used in metagenomics.

I agree. When I referred to this as a symptom I was speaking to the point of
how to define this in the argument list, rather than whether it should be
included at all. It is definitely undesirable to store large files in Docker
images. I think that a metagenomics tool of the kind you're referring to would
have a signature of something like:

[fastq] + Maybe cache -> ...

Where the Maybe keyword would mean that the signature validation tool would
allow the cache entry to not appear in the biobox.yaml, but if it did appear
then it should be validated. Therefore I think that all biobox.yaml entries for
every tool type should have the format:

- name:
    value: ...
    type: ...

Or for lists:

- name:
  - value: ...
    type: ...
  - value: ...
    type: ...

So for example with cache:

- cache:
    value: /path/to/dir
    type: directory

Or for lists such as [fastq]:

- fastq:
  - value: /path/to/file
    type: fastq
  - value: /path/to/file
    type: fastq
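
A minimal sketch of how entries in this uniform format could be checked, including the optional `Maybe` behaviour: a missing optional entry is accepted, but a present one is still validated. The `validate` function and the `SPEC` table are hypothetical, and the arguments are assumed to be already parsed from YAML into Python dicts and lists:

```python
def validate(arguments, spec):
    """Return a list of error strings; an empty list means the input is valid.

    `spec` maps each entry name to a tuple (type, is_list, is_optional).
    """
    errors = []
    given = {}
    for argument in arguments:
        given.update(argument)
    for name, (type_, is_list, optional) in spec.items():
        if name not in given:
            if not optional:
                errors.append(f"missing required entry: {name}")
            continue
        entries = given[name]
        if is_list != isinstance(entries, list):
            expected = "a list" if is_list else "a single entry"
            errors.append(f"{name}: expected {expected}")
            continue
        for entry in entries if is_list else [entries]:
            if not isinstance(entry, dict) or entry.get("type") != type_:
                errors.append(f"{name}: expected type {type_}")
    return errors


# Hypothetical spec corresponding to "[fastq] + Maybe cache -> ..."
SPEC = {
    "fastq": ("fastq", True, False),
    "cache": ("directory", False, True),
}
```

Returning a list of messages rather than raising on the first problem lets a user see every issue with their biobox.yaml in one pass.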

In the short term, a tool to do this doesn't exist in the format we would
immediately need. However, I have built a prototype that is quite close. I
therefore suggest we continue writing the biobox.yaml validation files in the
short term, and develop the validation tool in the medium term. I hope this
also answers your specific question on my opinion about the cache keyword.

> Yes, it is difficult for someone who is not familiar with creating bioboxes.
> We could try to build a tool that makes it easier to extract the arguments.

I agree. I think a tool for extracting the inputs from the biobox.yaml would
make it easier for developers to get the arguments they need. I think a tool
to create the biobox.yaml would also be useful for anyone who doesn't want to
use the command line tool.

@pbelmann
Member

pbelmann commented Mar 2, 2017

@michaelbarton I updated the profiling interface in PR #210 according to our discussion. Please merge if you agree.

@michaelbarton
Contributor Author

Thanks Peter. I've merged this. It might be worth discussing what we should do with the ID field and how useful it still is. I don't see any tools using it so far.

@pbelmann
Member

pbelmann commented Mar 3, 2017

> Thanks Peter. I've merged this. It might be worth discussing what we should do with the ID field and how useful it still is. I don't see any tools using it so far.

I agree. I think we introduced the id field for the fragment size parameters.
