-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for custom attributes and read type description of fastq #102
base: master
Are you sure you want to change the base?
Allow for custom attributes and read type description of fastq #102
Conversation
…e[tag]' and inject each of these tags as a schema_attribute XML sequence. Additionally, support for read_type and read_label (not fully supported yet though) is added to the run XML (when present in the run ena table)
@cgirardot Thanks a lot for starting this. This is a long wanted feature. I was wondering if you could also test/make it work for xlsx files or ISA-JSON (which should be easy since they all, including the tsv's, get parsed to a pandas dataframe). The controlled vocabulary you added, could you add this to the xml updater script in |
@bedroesb thank you for your quick feedback.
I dont know when I can have a look at this. My data management system exports data using the tsv tables format so it was easy to get in my code to have the additional information exported with different studies available in our DM (single cell , not single cell ...). I am not sure how to get started with the other formats and test.
I am really not familiar with these templating frameworks (took me quite some time to get in...) but can take a look. The values are coming from:
Will try to find some time in the next days. Happy holidays! |
@bedroesb I forgot to mention about the pipeline failing; this is not on me as far as I can see, the command line is simply missing a value for the |
…ng without overriding the templates
@bedroesb I managed the second point ie updating the jinja templates so all XML templates are automatically generated. I also looked into the first point but I dont see how to approach to ISA-JSON easily, this feels out of scope of my contribution and requires much more knowledge into ISA-JSON than I have. One needs to catch which attributes need to be exported as Regarding the xlxs support, this seems it would just work out if the box when adding |
Intresting, 2 weeks ago this worked and I didn't change anything to it. Let me double check the secrets.
That is really great! And looks good at first sight. Related to the ISA-JSON, you are right, I will do a test with the test-isa-json and text-xlsx to see if everything still behaves as it should. Which reminds me that I should improve the testing to try out those + do some test submissions, only problem there is that the files, aliases one submits should be unique. |
To Do list for myself:
|
I am very grateful for you contributions btw, you seemed to have quickly find your way through the (not so well documented) code of templates creating templates :) It's one way of having them automatically up to date + client side validation using XSD's. |
@cgirardot I just tried out your changes and added a new column to the xlsx /tsv files.
But they do not end up in the submitted XMLs. Do I understand your feature wrong? That this would allow the 'uptake' of any custom header that is submitted? |
Could you add a text scenario for your changes? Like a second experiment.tsv file or something. |
Hi, I will. But shortly, as explained above the header must follow the convention: schema_attribute[tag] For example,in the sample table, you could add a column for the "age" attribute with the header sample_attribute[age] |
Oh This worked indeed. We will heave to document this in the README. When testing I noticed that this is not working for the runs table, since you don't have the row variable available there. I think conceptionally this is not gonna work there since there can be multiple rows/files per experiment block. This means that if you have for 2 files (2 rows), PAIRED, a different value in the run_attribute[tag] heading, you will not know which one to take. |
Indeed, I think this is why I did not add it for the run. On the other hand, people should consistently annotate their data and understand what they do, so we could document that in such case the behavior is random. |
But it is giving an error at the moment and I don't think the random behavior is desired. I would just keep it out of the runs table, easier to explain too. Except if you can come up with a runs related attribute that should be stored on ENA right away. |
@bedroesb I pushed basic doc and an example tsv file. Is this what you meant? |
…upload-cli into pr/cgirardot/102
@cgirardot This was exactly what I meant! Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for me now! @bgruening What do you think?
I just realize that I did not test
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this looks very good to me. Judging the Python code, not the template updates :)
@@ -36,7 +36,7 @@ jobs: | |||
echo "password: ${{ secrets.ENA_PASSWORD }}" >> .secrets.yml | |||
- name: Test submission in --draft mode | |||
run: | | |||
ena-upload-cli --action add --draft --dev --center ${{ secrets.ENA_CENTER }} --data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx | |||
ena-upload-cli --action add --draft --dev --center TEST --data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this intended?
@bedroesb if you have tested feel free to merge. |
I found an example for spot descriptor here, with <SPOT_DESCRIPTOR>
<SPOT_DECODE_SPEC>
<READ_SPEC>
<READ_INDEX>0</READ_INDEX>
<READ_LABEL>F1</READ_LABEL>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Forward</READ_TYPE>
<BASE_COORD>1</BASE_COORD>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>1</READ_INDEX>
<READ_LABEL>R2</READ_LABEL>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Reverse</READ_TYPE>
<RELATIVE_ORDER follows_read_index="0"/>
</READ_SPEC>
</SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR> The only problem I see here is that again, similar to the run attributes, the table is composed on a 1 file / row concept. So I don't know how to add to the table info / run. Maybe we can assume if the runs have the same RUN alias, that they can be collapsed and be part of the same run 🤔 |
I also found an example from the EGA XML submission help page where the SPOT descriptor is in the experiment.xml. So we could grab the spot information from the experiment; but this sounds like a MR in its own. I am working on EGA submission where such spot info is needed, I'll see if I can find a solution in a new MR (no promise). I suggest we keep this MR as is, merge and close :-) |
Implements #86
Allows to add "schema_attribute[tag]" (e.g. sample_attribute[treatment]) in the input schema tables (tsv only) where the schema ('sample', 'run', 'experiment', 'study') e.g. a new
sample_attribute[treatment]
column in theena_sample.tsv
. These extra headers are injected in the XML generation stream, and injected in the generated XML as aATTRIBUTE
sequence (templates where modified accordingly). For samples, only the default ERC000011 was modified to support these additional attributes. Unit is not yet supported.Additionally, support for
read_type
andread_label
(as new headers in the ena_run.tsv) is added to the run XML for files of typefastq
to support single cell situations where more than 2 fastq files are available (ENA then requires to have read_type described). Multiple values can be passed using CSV format egpaired,cell_barcode
Limitations:
read_label
is not fully supported as it would require to supportSpotDescriptorType
in the run XML but it is unclear how this information could be passed. Basic support forSPOT_DECODE_SPEC
with aREAD_SPEC
usingBASE_COORD
(see SRA.common.xsd) could be provided with:READ_SPEC 1
...READ_SPEC n
where the header number is theREAD_SPEC's READ_INDEX
READ_LABEL:READ_CLASS:READ_TYPE:BASE_COORD:SPOT_LENGTH
. For example:UMI1:Application Read:Other:1:8
.