When handling a report, MegaQC loops over each data value and checks whether a matching `SampleDataType` already exists. However, it matches only on `data_id` and ignores `data_section`. As a result, if multiple report types (data sections) reuse the same `data_id`, the existing `SampleDataType` is reused even when its `data_section` is wrong for the incoming report.
This becomes a problem when you want to query historic results by `data_section`.
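The lookup described above can be sketched as follows. This is a minimal illustration, not MegaQC's actual code: the `SampleDataType` dataclass and the `get_or_create_by_data_id` helper are hypothetical stand-ins, and the types are held in a plain list rather than a database.

```python
from dataclasses import dataclass


@dataclass
class SampleDataType:
    type_id: int
    data_id: str
    data_section: str
    data_key: str


def get_or_create_by_data_id(types, section, d_key):
    """Hypothetical stand-in: reuse a type if data_id matches, ignoring the section."""
    for t in types:
        if t.data_id == d_key:  # data_section is never compared
            return t
    new = SampleDataType(len(types), d_key, section, "{}__{}".format(section, d_key))
    types.append(new)
    return new


types = []
incoming = [
    ("Pipeline_A_Result-plot", "patient_id"),
    ("Pipeline_A_Result-plot", "variant_count"),
    ("Pipeline_B_Result-plot", "patient_id"),
    ("Pipeline_B_Result-plot", "pvalue"),
]
resolved = [get_or_create_by_data_id(types, s, k) for s, k in incoming]

# Show which stored type each incoming (section, data_id) pair was attached to.
for (section, d_key), t in zip(incoming, resolved):
    print(section, d_key, "->", t.type_id, t.data_section)
```

In this sketch the `patient_id` value from `Pipeline_B_Result-plot` resolves to the type that was created for `Pipeline_A_Result-plot`, so only three types exist after four distinct (section, `data_id`) pairs.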
MegaQC erroneously records `patient_id` as coming only from `Pipeline_A_Result`, even though in one case it comes from `Pipeline_B_Result`.
Specifically, the `sample_data_type` and `sample_data` tables will look like this:
sample_data_type

| sample_data_type_id | data_id | data_section | data_key | schema |
| --- | --- | --- | --- | --- |
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

sample_data

| sample_data_id | report_id | sample_data_type_id | sample_id | value |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |

* NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
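
To make the effect on section-based queries concrete, here is a small sketch that loads the two tables above into an in-memory SQLite database and filters by `data_section`. The table and column names are taken from the tables in this report (the `schema` column is omitted because it is null throughout); the `values_for_section` helper and the use of SQLite are assumptions for illustration, not MegaQC's real schema or query API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT, data_section TEXT, data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER, sample_data_type_id INTEGER,
        sample_id INTEGER, value TEXT
    );
    """
)
# Rows copied from the tables in this report (current, incorrect state).
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),  # came from Pipeline B, but points at type 0
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(section):
    """Hypothetical helper: all (data_id, value) pairs recorded under a data_section."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON t.sample_data_type_id = d.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section("Pipeline_A_Result-plot")
section_b = values_for_section("Pipeline_B_Result-plot")
print(section_a)
print(section_b)
```

With these rows, the query for `Pipeline_B_Result-plot` returns only the `pvalue` entry, while `patient_2` shows up under `Pipeline_A_Result-plot`.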
Expected behavior
`data_id='patient_id'` should appear in two separate `sample_data_type` rows, once with `data_section='Pipeline_A_Result-plot'` and once with `data_section='Pipeline_B_Result-plot'`:

| sample_data_type_id | data_id | data_section | data_key | schema |
| --- | --- | --- | --- | --- |
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
| 3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

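
One way to arrive at these four rows is to key the lookup on the combined `data_key` (section plus `data_id`) instead of on `data_id` alone. The sketch below assumes that approach; the `get_or_create_by_data_key` helper and the in-memory dict are hypothetical and are not MegaQC's API.

```python
def get_or_create_by_data_key(types, section, d_key):
    """Hypothetical lookup keyed on section and data_id together."""
    data_key = "{}__{}".format(section, d_key)
    if data_key not in types:
        types[data_key] = {
            "sample_data_type_id": len(types),
            "data_id": d_key,
            "data_section": section,
            "data_key": data_key,
        }
    return types[data_key]


types = {}
incoming = [
    ("Pipeline_A_Result-plot", "patient_id"),
    ("Pipeline_A_Result-plot", "variant_count"),
    ("Pipeline_B_Result-plot", "patient_id"),
    ("Pipeline_B_Result-plot", "pvalue"),
]
rows = [get_or_create_by_data_key(types, s, k) for s, k in incoming]

# Each distinct (section, data_id) pair now gets its own type row.
for row in rows:
    print(row["sample_data_type_id"], row["data_id"], row["data_section"], row["data_key"])
```

Because the key includes the section, `patient_id` gets one row per data section, and a repeated report from the same pipeline would still reuse its existing row.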
System
MegaQC: 0.3.0
Describe the bug
The incorrect reuse is due to this code, which:
1. checks `sample_data_type` to see if the field's name has been seen before
2. sets `data_key = "{}__{}".format(section, d_key)`.

But in step (1) it will reuse any key matching `d_key`, even if `section` does not match.
To Reproduce
Here is a barebones multiqc_config and set of report files that can reveal the issue.
multiqc_config.yaml
A_report.csv (generated by Pipeline A)
B_report.csv (generated by Pipeline B)
Steps: