Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for custom attributes and read type description of fastq #102

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

cgirardot
Copy link

Implements #86

Allows to add "schema_attribute[tag]" (e.g. sample_attribute[treatment]) in the input schema tables (tsv only) where the schema ('sample', 'run', 'experiment', 'study') e.g. a new sample_attribute[treatment] column in the ena_sample.tsv. These extra headers are injected in the XML generation stream, and injected in the generated XML as a ATTRIBUTE sequence (templates where modified accordingly). For samples, only the default ERC000011 was modified to support these additional attributes. Unit is not yet supported.

Additionally, support for read_type and read_label (as new headers in the ena_run.tsv) is added to the run XML for files of type fastq to support single cell situations where more than 2 fastq files are available (ENA then requires to have read_type described). Multiple values can be passed using CSV format eg paired,cell_barcode

Limitations: read_label is not fully supported as it would require to support SpotDescriptorType in the run XML but it is unclear how this information could be passed. Basic support for SPOT_DECODE_SPEC with a READ_SPEC using BASE_COORD (see SRA.common.xsd) could be provided with:

  • headers like READ_SPEC 1...READ_SPEC n where the header number is the READ_SPEC's READ_INDEX
  • value would be formatted like READ_LABEL:READ_CLASS:READ_TYPE:BASE_COORD:SPOT_LENGTH. For example: UMI1:Application Read:Other:1:8.

…e[tag]' and inject each of these tags as a schema_attribute XML sequence. Additionally, support for read_type and read_label (not fully supported yet though) is added to the run XML (when present in the run ena table)
@bedroesb
Copy link
Collaborator

@cgirardot Thanks a lot for starting this. This is a long wanted feature. I was wondering if you could also test/make it work for xlsx files or ISA-JSON (which should be easy since they all, including the tsv's, get parsed to a pandas dataframe). The controlled vocabulary you added, could you add this to the xml updater script in var ? I don't hardcode these things but pull them straight from ENA and create the xml's with variable vocabularies on the fly using jinja templates. WHich make me wonder where you are getting the ENA_template_FASTQ and ENA_template_READ_TYPE from? The SpotDescriptorType question I will look into another day.

@cgirardot
Copy link
Author

@bedroesb thank you for your quick feedback.

@cgirardot Thanks a lot for starting this. This is a long wanted feature. I was wondering if you could also test/make it work for xlsx files or ISA-JSON (which should be easy since they all, including the tsv's, get parsed to a pandas dataframe).

I dont know when I can have a look at this. My data management system exports data using the tsv tables format so it was easy to get in my code to have the additional information exported with different studies available in our DM (single cell , not single cell ...). I am not sure how to get started with the other formats and test.

The controlled vocabulary you added, could you add this to the xml updater script in var ? I don't hardcode these things but pull them straight from ENA and create the xml's with variable vocabularies on the fly using jinja templates. Which make me wonder where you are getting the ENA_template_FASTQ and ENA_template_READ_TYPE from? The SpotDescriptorType question I will look into another day.

I am really not familiar with these templating frameworks (took me quite some time to get in...) but can take a look. The values are coming from:

  • ENA_template_READ_TYPE is coming from SRA.run.xsd ; see line 85 the READ_TYPE element
  • commit of ENA_template_FASTQ is a mistake, this was a first attempt and I forgot to remove it (it is not used) i.e. the fastq case is now managed in the ENA_template_runs.xml directly ; see line 26 and below.

Will try to find some time in the next days. Happy holidays!

@cgirardot
Copy link
Author

@bedroesb I forgot to mention about the pipeline failing; this is not on me as far as I can see, the command line is simply missing a value for the --center option.

@cgirardot
Copy link
Author

@bedroesb I managed the second point ie updating the jinja templates so all XML templates are automatically generated.

I also looked into the first point but I dont see how to approach to ISA-JSON easily, this feels out of scope of my contribution and requires much more knowledge into ISA-JSON than I have. One needs to catch which attributes need to be exported as schema_attribute[tag]. Also I dont see where the READ_TYPE lives in the ISA-JSON.

Regarding the xlxs support, this seems it would just work out if the box when adding schema_attribute[tag] columns; not tested tho.

@bedroesb
Copy link
Collaborator

@bedroesb I forgot to mention about the pipeline failing; this is not on me as far as I can see, the command line is simply missing a value for the --center option.

Intresting, 2 weeks ago this worked and I didn't change anything to it. Let me double check the secrets.

@bedroesb I managed the second point ie updating the jinja templates so all XML templates are automatically generated.

I also looked into the first point but I dont see how to approach to ISA-JSON easily, this feels out of scope of my contribution and requires much more knowledge into ISA-JSON than I have. One needs to catch which attributes need to be exported as schema_attribute[tag]. Also I dont see where the READ_TYPE lives in the ISA-JSON.

Regarding the xlxs support, this seems it would just work out if the box when adding schema_attribute[tag] columns; not tested tho.

That is really great! And looks good at first sight. Related to the ISA-JSON, you are right, I will do a test with the test-isa-json and text-xlsx to see if everything still behaves as it should. Which reminds me that I should improve the testing to try out those + do some test submissions, only problem there is that the files, aliases one submits should be unique.

@bedroesb
Copy link
Collaborator

bedroesb commented Dec 26, 2024

To Do list for myself:

  • Fix workflow checks
  • Test with ISA-JSON and xlsx files as input
  • DO a real submission with extra attributes
  • Test submission read-type.

@bedroesb
Copy link
Collaborator

I am very grateful for you contributions btw, you seemed to have quickly find your way through the (not so well documented) code of templates creating templates :) It's one way of having them automatically up to date + client side validation using XSD's.

@bedroesb
Copy link
Collaborator

bedroesb commented Jan 3, 2025

@cgirardot I just tried out your changes and added a new column to the xlsx /tsv files.

test title
test description
test
test

But they do not end up in the submitted XMLs. Do I understand your feature wrong? That this would allow the 'uptake' of any custom header that is submitted?

@bedroesb
Copy link
Collaborator

bedroesb commented Jan 3, 2025

Could you add a text scenario for your changes? Like a second experiment.tsv file or something.

@cgirardot
Copy link
Author

@cgirardot I just tried out your changes and added a new column to the xlsx /tsv files.

test title
test description
test
test
But they do not end up in the submitted XMLs. Do I understand your feature wrong? That this would allow the 'uptake' of any custom header that is submitted?

Hi, I will. But shortly, as explained above the header must follow the convention:

schema_attribute[tag]

For example,in the sample table, you could add a column for the "age" attribute with the header sample_attribute[age]

@bedroesb
Copy link
Collaborator

bedroesb commented Jan 3, 2025

Oh This worked indeed. We will heave to document this in the README. When testing I noticed that this is not working for the runs table, since you don't have the row variable available there. I think conceptionally this is not gonna work there since there can be multiple rows/files per experiment block. This means that if you have for 2 files (2 rows), PAIRED, a different value in the run_attribute[tag] heading, you will not know which one to take.

@cgirardot
Copy link
Author

Oh This worked indeed. We will heave to document this in the README. When testing I noticed that this is not working for the runs table, since you don't have the row variable available there. I think conceptionally this is not gonna work there since there can be multiple rows/files per experiment block. This means that if you have for 2 files (2 rows), PAIRED, a different value in the run_attribute[tag] heading, you will not know which one to take.

Indeed, I think this is why I did not add it for the run. On the other hand, people should consistently annotate their data and understand what they do, so we could document that in such case the behavior is random.

@bedroesb
Copy link
Collaborator

bedroesb commented Jan 3, 2025

But it is giving an error at the moment and I don't think the random behavior is desired. I would just keep it out of the runs table, easier to explain too. Except if you can come up with a runs related attribute that should be stored on ENA right away.

@cgirardot
Copy link
Author

@bedroesb I pushed basic doc and an example tsv file. Is this what you meant?

@bedroesb
Copy link
Collaborator

@cgirardot This was exactly what I meant! Thank you.

Copy link
Collaborator

@bedroesb bedroesb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for me now! @bgruening What do you think?

@bedroesb bedroesb requested a review from bgruening January 14, 2025 14:54
@bedroesb
Copy link
Collaborator

I just realize that I did not test

Additionally, support for read_type and read_label (as new headers in the ena_run.tsv) is added to the run XML for files of type fastq to support single cell situations where more than 2 fastq files are available (ENA then requires to have read_type described). Multiple values can be passed using CSV format eg paired,cell_barcode

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this looks very good to me. Judging the Python code, not the template updates :)

@@ -36,7 +36,7 @@ jobs:
echo "password: ${{ secrets.ENA_PASSWORD }}" >> .secrets.yml
- name: Test submission in --draft mode
run: |
ena-upload-cli --action add --draft --dev --center ${{ secrets.ENA_CENTER }} --data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
ena-upload-cli --action add --draft --dev --center TEST --data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended?

@bgruening
Copy link
Member

@bedroesb if you have tested feel free to merge.

@bedroesb
Copy link
Collaborator

@cgirardot

I found an example for spot descriptor here, with READ_TYPE part of it.

<SPOT_DESCRIPTOR>
  <SPOT_DECODE_SPEC>
    <READ_SPEC>
      <READ_INDEX>0</READ_INDEX>
      <READ_LABEL>F1</READ_LABEL>
      <READ_CLASS>Application Read</READ_CLASS>
      <READ_TYPE>Forward</READ_TYPE>
      <BASE_COORD>1</BASE_COORD>
    </READ_SPEC>
    <READ_SPEC>
      <READ_INDEX>1</READ_INDEX>
      <READ_LABEL>R2</READ_LABEL>
      <READ_CLASS>Application Read</READ_CLASS>
      <READ_TYPE>Reverse</READ_TYPE>
      <RELATIVE_ORDER follows_read_index="0"/>
    </READ_SPEC>
  </SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>

The only problem I see here is that again, similar to the run attributes, the table is composed on a 1 file / row concept. So I don't know how to add to the table info / run. Maybe we can assume if the runs have the same RUN alias, that they can be collapsed and be part of the same run 🤔

@cgirardot
Copy link
Author

@cgirardot

I found an example for spot descriptor here, with READ_TYPE part of it.

<SPOT_DESCRIPTOR>
  <SPOT_DECODE_SPEC>
    <READ_SPEC>
      <READ_INDEX>0</READ_INDEX>
      <READ_LABEL>F1</READ_LABEL>
      <READ_CLASS>Application Read</READ_CLASS>
      <READ_TYPE>Forward</READ_TYPE>
      <BASE_COORD>1</BASE_COORD>
    </READ_SPEC>
    <READ_SPEC>
      <READ_INDEX>1</READ_INDEX>
      <READ_LABEL>R2</READ_LABEL>
      <READ_CLASS>Application Read</READ_CLASS>
      <READ_TYPE>Reverse</READ_TYPE>
      <RELATIVE_ORDER follows_read_index="0"/>
    </READ_SPEC>
  </SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>

The only problem I see here is that again, similar to the run attributes, the table is composed on a 1 file / row concept. So I don't know how to add to the table info / run. Maybe we can assume if the runs have the same RUN alias, that they can be collapsed and be part of the same run 🤔

I also found an example from the EGA XML submission help page where the SPOT descriptor is in the experiment.xml. So we could grab the spot information from the experiment; but this sounds like a MR in its own. I am working on EGA submission where such spot info is needed, I'll see if I can find a solution in a new MR (no promise).

I suggest we keep this MR as is, merge and close :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants