Bug report - sample_data sometimes references incorrect sample_data_type entries #530

Open
djwooten opened this issue Jul 29, 2024 · 0 comments · May be fixed by #531
djwooten commented Jul 29, 2024

Describe the bug

When handling a report, MegaQC loops over each data value and checks whether a matching SampleDataType already exists. However, the check uses only data_id and ignores data_section. As a result, if multiple report types (data sections) reuse the same data_id, MegaQC reuses the existing SampleDataType even when its data_section is wrong for the incoming report.

This becomes problematic if you want to query for historic results based on data_section.

This is due to this code, which

  1. Checks sample_data_type to see if the field's name has been seen before
  2. If it has NOT been seen before, creates a new entry with data_key = "{}__{}".format(section, d_key).

But in step (1) it will reuse any key matching d_key, even if section does not match.
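
The two steps above can be sketched in a few lines of Python. This is a minimal in-memory model, not the MegaQC code itself: the `SampleDataType` dataclass, the `ingest` function, and the `match_on_section` flag are hypothetical names introduced only for illustration. With `match_on_section=False` the lookup mimics the behaviour described in this report (match on data_id only); with `match_on_section=True` it also requires data_section to match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleDataType:
    """Hypothetical stand-in for one sample_data_type row."""
    type_id: int
    data_id: str
    data_section: str
    data_key: str


def ingest(reports, match_on_section):
    """Ingest reports in order and return (types, sample_data).

    reports: list of (section, {data_id: value}) pairs.
    match_on_section: False looks up existing types by data_id only
    (the reported behaviour); True also compares data_section.
    """
    types = []
    sample_data = []  # tuples of (report_id, type_id, value)
    for report_id, (section, values) in enumerate(reports):
        for d_key, value in values.items():
            # Step 1: has this field been seen before?
            existing = next(
                (
                    t for t in types
                    if t.data_id == d_key
                    and (not match_on_section or t.data_section == section)
                ),
                None,
            )
            # Step 2: if not, create a new type keyed by section and field.
            if existing is None:
                existing = SampleDataType(
                    type_id=len(types),
                    data_id=d_key,
                    data_section=section,
                    data_key="{}__{}".format(section, d_key),
                )
                types.append(existing)
            sample_data.append((report_id, existing.type_id, value))
    return types, sample_data


reports = [
    ("Pipeline_A_Result-plot", {"patient_id": "patient_1", "variant_count": 10}),
    ("Pipeline_B_Result-plot", {"patient_id": "patient_2", "pvalue": 0.0001}),
]

buggy_types, buggy_data = ingest(reports, match_on_section=False)
fixed_types, fixed_data = ingest(reports, match_on_section=True)

print(len(buggy_types), "types when matching on data_id only")
print(len(fixed_types), "types when matching on data_id and data_section")
```

In this sketch the data_id-only lookup attaches patient_2 to the type created for Pipeline_A_Result-plot, while the section-aware lookup creates a separate Pipeline_B_Result-plot__patient_id entry.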

To Reproduce

Here is a barebones multiqc_config and set of report files that can reveal the issue.

multiqc_config.yaml

custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"

A_report.csv (generated by Pipeline A)

sample_id,patient_id,variant_count
sample_1,patient_1,10

B_report.csv (generated by Pipeline B)

sample_id,patient_id,pvalue
sample_2,patient_2,0.0001

Steps:

  1. Run pipeline A and submit its data to MegaQC.
  2. Run pipeline B and submit its data to MegaQC.

MegaQC erroneously records every patient_id value as coming from Pipeline_A_Result, even though one of them comes from Pipeline_B_Result.

Specifically, the sample_data and sample_data_type tables will look like

sample_data_type

| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

sample_data

| sample_data_id | report_id | sample_data_type_id | sample_id | value |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |

* NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
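
To show why this matters for queries by data_section, here is a small sketch that loads the rows from the two tables above into an in-memory SQLite database and filters on data_section. The table and column names follow the tables in this report, but the schema is simplified (the schema column is omitted, values are stored as text) and the `values_for_section` helper is hypothetical; this is not MegaQC's actual database layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT,
        data_section TEXT,
        data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER,
        sample_data_type_id INTEGER,
        sample_id INTEGER,
        value TEXT
    );
    """
)

# Rows as they appear after submitting Pipeline A, then Pipeline B.
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),  # points at the Pipeline A type
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(section):
    """Return (data_id, value) pairs whose type belongs to the given section."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON t.sample_data_type_id = d.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


print(values_for_section("Pipeline_B_Result-plot"))
print(values_for_section("Pipeline_A_Result-plot"))
```

With these rows, the Pipeline_B_Result-plot query returns only the pvalue entry, and patient_2 shows up under Pipeline_A_Result-plot instead.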

Expected behavior

data_id='patient_id' should appear in two separate sample_data_type rows: once with data_section='Pipeline_A_Result-plot' and once with data_section='Pipeline_B_Result-plot'.

| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
| 3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

System

  • MegaQC: 0.3.0
@djwooten djwooten added the bug label Jul 29, 2024
@djwooten djwooten linked a pull request Jul 29, 2024 that will close this issue
@djwooten djwooten changed the title Bug report Bug report - sample_data sometimes references incorrect sample_data_type entries Jul 30, 2024