
Remove Duplicates in mirtop.tsv removes more lines than expected #174

Open · DwinGrashof opened this issue Jul 22, 2022 · 4 comments
Labels: bug, help wanted, question
DwinGrashof commented Jul 22, 2022

Description of the bug

There is a difference between the mirtop.tsv and mirna.tsv count files. The difference comes from the R script collapse_mirtop.r, where the counts for isomiRs per miRNA are summed after removing UID duplicates with the following code:

counts = as.data.table(df[!duplicated(df[["UID"]]),c(3, 13:ncol(df))])
mirna = counts[, lapply(.SD, sum), by = miRNA]

However, some of the duplicated UIDs are removed, and I don't see why you would choose to do so. For example, for miRNA hsa-miR-7-5p I have a duplicated UID with the following stats and counts:

| UID | read | miRNA | Variant | S1 | S2 | S3 | S4 | S5 | S6 |
|---|---|---|---|---|---|---|---|---|---|
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 35385 | 0 | 0 | 0 | 0 | 0 |
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 0 | 22370 | 42567 | 24653 | 0 | 57649 |
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 0 | 0 | 0 | 0 | 62297 | 0 |

There are more examples like this for hsa-miR-7-5p, where duplicated UIDs with identical variants are removed even though they carry substantial counts. This reduces the totals from roughly 150,000 counts in the mirtop file to roughly 70,000 in the mirna file.
Is the removal of these duplicated UIDs intentional, or is it a bug, in which case the above UID should presumably be summed into:

| UID | read | miRNA | Variant | S1 | S2 | S3 | S4 | S5 | S6 |
|---|---|---|---|---|---|---|---|---|---|
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 35385 | 22370 | 42567 | 24653 | 62297 | 57649 |
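A minimal sketch of the behaviour described above, using made-up rows with just two of the sample columns (S1, S2): dropping duplicated UIDs before summing discards counts, while summing by UID first preserves them.

```r
library(data.table)

# toy reproduction of the duplicated-UID rows above (values are illustrative)
dt <- data.table(
  UID   = rep("iso-24-9BOX7EZWKE", 3),
  miRNA = rep("hsa-miR-7-5p", 3),
  S1    = c(35385, 0, 0),
  S2    = c(0, 22370, 0)
)

# current behaviour: drop duplicated UIDs first, then sum -> S2 counts are lost
current <- dt[!duplicated(UID)][, lapply(.SD, sum), by = miRNA, .SDcols = c("S1", "S2")]
current  # S1 = 35385, S2 = 0

# summing by UID and miRNA first keeps all counts
fixed <- dt[, lapply(.SD, sum), by = .(UID, miRNA)][
  , lapply(.SD, sum), by = miRNA, .SDcols = c("S1", "S2")]
fixed    # S1 = 35385, S2 = 22370
```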

Command used and terminal output

N/A

Relevant files

Attached are two files:
mirtop.tsv <- the results from mirtop
mirna.tsv <- the deduplicated results after collapse_mirtop.r
mirna_mirtop_files.zip

System information

No response

@DwinGrashof DwinGrashof added the bug Something isn't working label Jul 22, 2022
@lpantano lpantano self-assigned this Jul 22, 2022
@apeltzer apeltzer added this to the Release 2.1.0 milestone Jul 22, 2022
@lpantano (Contributor) commented:

Hi @DwinGrashof,

Can you check this code and look at the table it generates, to see whether the result seems better:

library(data.table)

input <- "~/Downloads/mirna_mirtop_files/mirtop.tsv"
df <- read.delim(input, sep = "\t")

dt <- as.data.table(df)
# summarize counts by UID and miRNA name
tmp <- dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA),
          .SDcols = colnames(dt)[13:ncol(dt)]]
# remove duplicated UIDs, keeping the first occurrence
tmp <- as.data.frame(tmp)
counts <- as.data.table(tmp[!duplicated(tmp[["UID"]]), 2:ncol(tmp)])
mirna <- counts[, lapply(.SD, sum), by = miRNA]

I can push this into dev if it addresses the issue without messing up other things :)

@DwinGrashof (Author) commented:

Hi @lpantano ,

Sorry for my late response, I've been busy during these holiday times.

I've looked through the code, and in most cases it does work better than the current mirna code.
I do notice, though, that some UIDs (for example iso-22-U4K7BB8Z and iso-22-FPJYUP6XO in my dataset) still behave unexpectedly:

hsa-miR-365a-3p and hsa-miR-365b-3p have the same counts and share all of their UIDs in my dataset.
Because they have only common UIDs, all of their duplicated UIDs are removed, resulting in 0 counts for hsa-miR-365a-3p and all counts being assigned to hsa-miR-365b-3p.

For UID iso-22-FPJYUP6XO, counts are found for both hsa-miR-107 and hsa-miR-103a-3p. Since these miRNAs have both shared and unique UIDs, both are reported, but the counts for hsa-miR-107 are much lower than if all of its UIDs were summed.

These "problems" are, however, very much debatable: retaining counts from UIDs that are in essence ambiguous means we do not know which miRNA is actually expressed, if not both. There are of course multiple ways to handle these ambiguities. In the new R code, the first instance of a duplicated UID is retained and the other instances are removed, which is a rather arbitrary way of handling them. Another option would be to remove all non-unique UIDs after the `tmp = dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA), .SDcols = colnames(dt)[13:ncol(dt)]]` line, since that line nicely resolves the issue described in my initial post.
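As a sketch, the alternative described above (dropping every UID that appears more than once after the UID/miRNA summarization) could look like this, where `tmp` is the data.table produced by the summarization line; whether discarding the ambiguous UIDs entirely is desirable is exactly the open question:

```r
# drop every UID that appears more than once (i.e. maps to several miRNAs)
unique_only <- tmp[!(duplicated(UID) | duplicated(UID, fromLast = TRUE))]
# then sum the remaining, unambiguous UIDs per miRNA
mirna <- unique_only[, lapply(.SD, sum), by = miRNA, .SDcols = 3:ncol(unique_only)]
```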

This is just something to think about and I completely understand if you would rather keep it as coded in your previous post :)

Best,
Dwin

@apeltzer apeltzer modified the milestones: Release 2.1.0, Release 2.1.1 Patch Oct 11, 2022
@apeltzer apeltzer added help wanted Extra attention is needed question Further information is requested labels Oct 11, 2022
@apeltzer apeltzer removed this from the Release 2.2.0 milestone Sep 1, 2023
@apeltzer apeltzer added this to the 2.4.0 milestone Aug 8, 2024
@atrigila commented:

Hi @DwinGrashof and @lpantano,
I think the key issue here is the following: "In the new R code, the first instance of a duplicated UID is retained and the other duplicated instances are removed. This is a rather arbitrary way of handling these ambiguities."

This is due to how R's duplicated() function works on vectors (in our case, the UIDs): the first occurrence of each element is marked FALSE, so it is the only one that is retained.

miRNA_names <- c("hsa-let-7a", "hsa-miR-21", "hsa-miR-155",
                 "hsa-let-7a", "hsa-miR-21", "hsa-miR-223", "hsa-miR-155")
dup_flags <- duplicated(miRNA_names)
dup_flags
#> [1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

Building on the updated code from Lorena and the suggestions from Dwin, I propose replacing the deduplication line with filtered_tmp <- tmp[!duplicated(UID) & !duplicated(UID, fromLast = TRUE)], which retains only the UIDs that occur exactly once (rather than the first occurrence of each duplicated UID). The final code would look like this:

library(data.table)

input <- "~/R/mirtop.tsv"
df <- read.delim(input, sep = "\t")
dt <- as.data.table(df)
# Summarize the data by UID and miRNA
tmp <- dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA),
          .SDcols = colnames(dt)[13:ncol(dt)]]
# Identify and filter out non-unique UIDs
filtered_tmp <- tmp[!duplicated(UID) & !duplicated(UID, fromLast = TRUE)]
# Summarize the unique data by miRNA
mirna <- filtered_tmp[, lapply(.SD, sum, na.rm = TRUE), by = miRNA,
                      .SDcols = 3:ncol(filtered_tmp)]
# Write mirna table
write.table(mirna, file.path(dirname(input), "mirna.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)

Let me know what you think.

@atrigila atrigila self-assigned this Aug 19, 2024
@lpantano (Contributor) commented:

We would remove a lot of good data if we did that. Many sequences map to the same mature miRNA but come from different precursors; we shouldn't remove those lines.

For the miR-365 family I wouldn't be worried: since the sequence is exactly the same, it would be nice to flag that the miRNA could be one or the other.

For the second case, I totally understand.

It is a good point. I don't think we should keep only unique UIDs, because that would remove a lot of data and mislead the results. We need to think of a better way to solve this, without removing duplicates, while being fair to the expression of the different possible miRNAs.

I would suggest leaving this for the Barcelona Summit. I can take charge of solving this issue; I have done some other strategies that work better than this and I could adapt them to the pipeline.

Projects: On Hold · No branches or pull requests · 4 participants