questions about the output #140

Open
ZYongQi opened this issue Jan 25, 2024 · 7 comments

Comments

@ZYongQi

ZYongQi commented Jan 25, 2024

Hi, this is ZY. I used FREEC to call CNVs in the genome successfully, but I still have two questions:

  1. The output looks like this:
    ID=gene-POFUT2 1 1255210 1274664 1264440 1302816 0 loss
    ID=gene-DYRK1A 1 7417053 7572985 7516776 7556136 8 gain
    ID=gene-TTC3 1 7693717 7749619 7699800 7826736 10 gain
    ID=gene-LOC117795648 1 7741425 7741527 7699800 7826736 10 gain
    ID=gene-LOC117801378 1 7751983 7752392 7699800 7826736 10 gain
    ID=gene-LOC100480655 1 7791807 7792944 7699800 7826736 10 gain
    ID=gene-LOC117801382 1 7795812 7796932 7699800 7826736 10 gain
    ID=gene-LOC117801055 1 7806381 7811440 7699800 7826736 10 gain
    ID=gene-LOC117801383 1 7820518 7824867 7699800 7826736 10 gain
    ID=gene-HLCS 1 7834905 8035784 7925136 7958592 0 loss
    ID=gene-LOC117801776 1 44252319 44863795 44619480 44641128 0 loss

It contains the predicted copy number. What does it mean when this value equals 0?

  2. A CNV is a region of the genome whose size ranges from approximately 1 kb to 3 Mb. How can I get gene copy numbers from CNVs?

Thank you for any advice. Best wishes!

@valeu
Contributor

valeu commented Jan 25, 2024

Hello,

  1. 0 means that zero copies of DNA are predicted in this region.
  2. I guess you need to look at the value just before 'gain' and 'loss'. Also visualize the normalized ratios in ratio.txt to make sure that the prediction is correct.

@ZYongQi
Author

ZYongQi commented Jan 26, 2024

Thank you for your reply. I'll visualize the normalized ratios in ratio.txt right away. For now, let me briefly describe my "config.txt"; I am also confused about the "CNVs file".
This is part of my config file:

ploidy = 2
breakPointThreshold = 0.8
maxThreads = 16
minExpectedGC = 0.35
maxExpectedGC = 0.55
telocentromeric = 0
coefficientOfVariation = 0.062
degree = 3

  1. I set coefficientOfVariation rather than a fixed window size, so that FREEC can choose an optimal window size for each sample. Will different window sizes affect the analysis if I combine the CNV output of different samples? Or would you suggest a fixed window size such as 100 bp? By the way, the value 0.062 comes from a similar study.

  2. I tried to assign the CNVs to genes like this:

GENE_ID CHROMOSOME GENE_START GENE_STOP CNV_START CNV_STOP CN TYPE
ID=gene-POFUT2 1 1255210 1274664 1264440 1302816 0 loss
ID=gene-DYRK1A 1 7417053 7572985 7516776 7556136 8 gain
ID=gene-TTC3 1 7693717 7749619 7699800 7826736 10 gain
ID=gene-LOC117795648 1 7741425 7741527 7699800 7826736 10 gain
ID=gene-LOC117801378 1 7751983 7752392 7699800 7826736 10 gain
ID=gene-LOC100480655 1 7791807 7792944 7699800 7826736 10 gain
ID=gene-LOC117801382 1 7795812 7796932 7699800 7826736 10 gain
ID=gene-LOC117801055 1 7806381 7811440 7699800 7826736 10 gain
ID=gene-LOC117801383 1 7820518 7824867 7699800 7826736 10 gain

I am wondering about the relationship between CN and the gene location (start and stop). A CN of 10 means that 10 copies of DNA are predicted in the region. Does that mean one CNV repeated 10 times, or 10 different CNVs? If I want to count the number of gains and losses, do I need to multiply by 10?
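The counting question is not settled in this thread. One way to look at the table, offered here only as a reading of the data, is that several gene rows share the same CNV_START and CNV_STOP, so they describe a single predicted segment whose CN value applies to the whole segment. The Python sketch below groups the rows by CNV interval; the function name `group_by_cnv` is hypothetical and not part of FREEC.

```python
from collections import defaultdict

# Rows copied from the table above:
# GENE_ID CHROMOSOME GENE_START GENE_STOP CNV_START CNV_STOP CN TYPE
ROWS = """\
ID=gene-POFUT2 1 1255210 1274664 1264440 1302816 0 loss
ID=gene-DYRK1A 1 7417053 7572985 7516776 7556136 8 gain
ID=gene-TTC3 1 7693717 7749619 7699800 7826736 10 gain
ID=gene-LOC117795648 1 7741425 7741527 7699800 7826736 10 gain
ID=gene-LOC117801378 1 7751983 7752392 7699800 7826736 10 gain
ID=gene-LOC100480655 1 7791807 7792944 7699800 7826736 10 gain
ID=gene-LOC117801382 1 7795812 7796932 7699800 7826736 10 gain
ID=gene-LOC117801055 1 7806381 7811440 7699800 7826736 10 gain
ID=gene-LOC117801383 1 7820518 7824867 7699800 7826736 10 gain
"""


def group_by_cnv(text):
    """Map each distinct CNV (chrom, start, stop, CN, type) to its gene IDs."""
    cnvs = defaultdict(list)
    for line in text.strip().splitlines():
        gene, chrom, _g_start, _g_stop, c_start, c_stop, cn, kind = line.split()
        cnvs[(chrom, int(c_start), int(c_stop), int(cn), kind)].append(gene)
    return dict(cnvs)


cnvs = group_by_cnv(ROWS)
n_gain = sum(1 for key in cnvs if key[4] == "gain")
n_loss = sum(1 for key in cnvs if key[4] == "loss")

for (chrom, start, stop, cn, kind), genes in cnvs.items():
    print(chrom, start, stop, f"CN={cn}", kind, f"{len(genes)} gene(s)")
print("distinct gains:", n_gain, "distinct losses:", n_loss)
```

Under this reading, the nine gene rows correspond to three distinct CNV segments, and counting segments is separate from the CN value each segment carries.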

@valeu
Contributor

valeu commented Jan 26, 2024

coefficientOfVariation = 0.062 will give you a reasonable window size that does not produce too much noise or too many false predictions. If the window size calculated by FREEC is close to 100, just use window=100, which will override coefficientOfVariation. You can also use a rule of thumb: about 400 reads per window gives low noise and good predictions.
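The rule of thumb can be turned into a rough back-of-the-envelope estimate. The sketch below assumes reads are spread evenly over the genome, which real data only approximate; the function name and the example genome size and read count are hypothetical illustrations, not values from this thread.

```python
def estimate_window(genome_length, n_reads, reads_per_window=400):
    """Rough window size (bp) so that an average window holds reads_per_window reads.

    Assumes uniform coverage; 400 is the rule-of-thumb value mentioned above.
    """
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    # Ceiling division keeps the estimate a whole number of base pairs.
    return -(-reads_per_window * genome_length // n_reads)


# Hypothetical example: 2.4 Gb genome, 600 million mapped reads.
example_window = estimate_window(2_400_000_000, 600_000_000)
print(example_window)
```

If the estimate comes out similar across samples, a single fixed window value for all of them keeps the window boundaries identical between samples.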

@valeu
Contributor

valeu commented Jan 26, 2024

Regarding the annotation of genes - I don't think that there is an official FREEC script to do so. How do you get this file with gene IDs?

@ZYongQi
Author

ZYongQi commented Jan 26, 2024

I made the annotation myself with a Perl script, based on the positions of the predicted CNVs in the FREEC output file.

Specifically, I first took the position (start and end) of each gene from the .gff file from NCBI. Second, I looked for genes that overlap CNV regions using the criterion **cnv_start <= gene_stop && cnv_stop >= gene_start**. This gives a list of genes whose positions overlap CNVs. Finally, I merged the two files. Are there any problems with this approach?

By the way, I now have ratio.txt, but I wonder how the ratio value is calculated. Should I filter out ratio values that don't meet a certain threshold? And why is the copy number in ratio.txt always 2?

I would appreciate any advice. Best wishes!
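The overlap criterion described above can be sketched in Python as follows. This is not the Perl script from the comment. The sketch assumes closed (inclusive) coordinates on both sides, adds a chromosome check, and reports what fraction of each gene is covered, since the criterion also accepts partial overlaps. All function names are hypothetical.

```python
def overlaps(gene_start, gene_stop, cnv_start, cnv_stop):
    """Overlap test from the comment: cnv_start <= gene_stop and cnv_stop >= gene_start."""
    return cnv_start <= gene_stop and cnv_stop >= gene_start


def gene_fraction_covered(gene_start, gene_stop, cnv_start, cnv_stop):
    """Fraction of the gene covered by the CNV, assuming inclusive coordinates."""
    shared = min(gene_stop, cnv_stop) - max(gene_start, cnv_start) + 1
    return max(shared, 0) / (gene_stop - gene_start + 1)


def genes_in_cnvs(genes, cnvs):
    """genes: (gene_id, chrom, start, stop); cnvs: (chrom, start, stop, cn, kind)."""
    hits = []
    for gene_id, g_chrom, g_start, g_stop in genes:
        for c_chrom, c_start, c_stop, cn, kind in cnvs:
            # Coordinates are only comparable on the same chromosome.
            if g_chrom == c_chrom and overlaps(g_start, g_stop, c_start, c_stop):
                covered = gene_fraction_covered(g_start, g_stop, c_start, c_stop)
                hits.append((gene_id, g_chrom, g_start, g_stop,
                             c_start, c_stop, cn, kind, covered))
    return hits


# Two genes and one CNV taken from the table earlier in the thread.
genes = [("ID=gene-POFUT2", "1", 1255210, 1274664),
         ("ID=gene-HLCS", "1", 7834905, 8035784)]
cnvs = [("1", 1264440, 1302816, 0, "loss")]
hits = genes_in_cnvs(genes, cnvs)
for hit in hits:
    print(hit)
```

In this example the loss covers only part of POFUT2, so reporting the covered fraction alongside the gene ID makes it possible to tell a whole-gene change from a partial one.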

@valeu
Contributor

valeu commented Feb 1, 2024

The copy number in ratio.txt should be 2 for the control sample if you use a control. For the donor sample, it can be 2 almost everywhere if it is not a cancer sample. In any case, I suggest visualizing the output (ratio.txt) to make sure you can trust the predictions of FREEC (using, for example, the R script included in the package).
The ratios are normalized read count values. 1 means no change; -1 means data not available.
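The meaning of the ratio values can be illustrated with a small Python sketch. It assumes that the expected copy number is approximately the ratio multiplied by the ploidy (2 in the config shown earlier), which is a simplification: FREEC's own calls are made on segmented data, not window by window. The function name is hypothetical.

```python
def approx_copy_number(ratio, ploidy=2):
    """Approximate copy number for one window from its normalized ratio.

    Returns None for -1, which marks windows where data are not available.
    Assumes copy number ~ ratio * ploidy (a simplification of FREEC's calling).
    """
    if ratio == -1:
        return None
    return round(ratio * ploidy)


for ratio in (1.0, 0.0, 1.5, 5.0, -1):
    print(ratio, "->", approx_copy_number(ratio))
```

When plotting or summarizing ratio.txt, windows with a ratio of -1 are best skipped and not treated as losses.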

@ZYongQi
Author

ZYongQi commented Jul 10, 2024

Hi, this is ZY. We summarized the number and distribution of CNVs and CNV regions, and I took your advice to visualize the ratio.txt file. But I still have some doubts.

R script: FREEC_ratio2Absolute.R. One of the outputs shows:

Chromosome Start End Num_Probes Segment_Mean
NC_048218.1 1 1264440 1285 -0.0513244
NC_048218.1 1264441 1302816 39 -3.715107
NC_048218.1 1302817 3479424 2212 -0.05671026
NC_048218.1 3479425 3504024 25 -4.576851
NC_048218.1 3504025 3536496 33 0.01631089

What criteria should we use to filter the results: the number of probes, or a specific Segment_Mean?
By the way, why are some of the Segment_Mean values equal to -Inf?
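This question is not answered in the thread. One possible explanation, stated here purely as an assumption, is that Segment_Mean is a log2-transformed ratio: a segment whose ratio is exactly 0 would then map to -Inf, because the logarithm of zero is negative infinity. The Python sketch below illustrates only that arithmetic. It is not taken from FREEC_ratio2Absolute.R, and the function name is hypothetical.

```python
import math


def log2_ratio(ratio):
    """log2 of a normalized ratio; assumes Segment_Mean is on a log2 scale.

    Returns None for negative inputs (e.g. -1, data not available) and
    -inf for a ratio of exactly 0, mirroring how R evaluates log2(0).
    """
    if ratio < 0:
        return None
    if ratio == 0:
        return float("-inf")
    return math.log2(ratio)


print(log2_ratio(1.0))   # ratio 1 (no change) is 0 on the log2 scale
print(log2_ratio(0.0))   # a ratio of 0 gives -inf
print(2 ** -3.715107)    # back-transforming a Segment_Mean from the table above
```

If the assumption holds, the segment with Segment_Mean -3.715107 corresponds to a ratio of roughly 0.08, which would be consistent with the CN of 0 reported for the region 1264440-1302816 earlier in the thread.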
