
Remove Duplicates in mirtop.tsv removes more lines than expected #174

Open · DwinGrashof opened this issue Jul 22, 2022 · 4 comments
Labels: bug, help wanted, question
DwinGrashof commented Jul 22, 2022

Description of the bug

There is a difference between the mirtop.tsv and mirna.tsv count files. The difference comes from the R script collapse_mirtop.r, where the counts for isomiRs per miRNA are summed after removing UID duplicates with the following code:

counts = as.data.table(df[!duplicated(df[["UID"]]),c(3, 13:ncol(df))])
mirna = counts[, lapply(.SD, sum), by = miRNA]

However, some of the duplicated UIDs are removed, and I don't see why you would choose to do so. For example, for miRNA hsa-miR-7-5p I have a duplicated UID with the following stats and counts:

| UID | read | miRNA | Variant | S1 | S2 | S3 | S4 | S5 | S6 |
|---|---|---|---|---|---|---|---|---|---|
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 35385 | 0 | 0 | 0 | 0 | 0 |
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 0 | 22370 | 42567 | 24653 | 0 | 57649 |
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 0 | 0 | 0 | 0 | 62297 | 0 |

There are more examples like this for hsa-miR-7-5p, where duplicated UIDs with identical variants are removed even though they carry substantial counts. This reduces the totals from roughly 150,000 counts in the mirtop file to roughly 70,000 in the mirna file.
Is the removal of these duplicated UIDs intentional, or is it a bug, in which case the above UID should presumably be summed into:

| UID | read | miRNA | Variant | S1 | S2 | S3 | S4 | S5 | S6 |
|---|---|---|---|---|---|---|---|---|---|
| iso-24-9BOX7EZWKE | TGGAAGACTAGTGATTTTGTTGTT | hsa-miR-7-5p | NA | 35385 | 22370 | 42567 | 24653 | 62297 | 57649 |
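A minimal sketch of the behaviour described above, using made-up rows with just two of the sample columns (S1, S2): dropping duplicated UIDs before summing discards counts, while summing by UID first preserves them.

```r
library(data.table)

# toy reproduction of the duplicated-UID rows above (values are illustrative)
dt <- data.table(
  UID   = rep("iso-24-9BOX7EZWKE", 3),
  miRNA = rep("hsa-miR-7-5p", 3),
  S1    = c(35385, 0, 0),
  S2    = c(0, 22370, 0)
)

# current behaviour: drop duplicated UIDs first, then sum -> S2 counts are lost
current <- dt[!duplicated(UID)][, lapply(.SD, sum), by = miRNA, .SDcols = c("S1", "S2")]
current  # S1 = 35385, S2 = 0

# summing by UID and miRNA first keeps all counts
fixed <- dt[, lapply(.SD, sum), by = .(UID, miRNA)][
  , lapply(.SD, sum), by = miRNA, .SDcols = c("S1", "S2")]
fixed    # S1 = 35385, S2 = 22370
```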

Command used and terminal output

N/A

Relevant files

Attached are two files:
mirtop.tsv <- the results from mirtop
mirna.tsv <- the deduplicated results after collapse_mirtop.r
mirna_mirtop_files.zip

System information

No response

@DwinGrashof DwinGrashof added the bug Something isn't working label Jul 22, 2022
@lpantano lpantano self-assigned this Jul 22, 2022
@apeltzer apeltzer added this to the Release 2.1.0 milestone Jul 22, 2022
@lpantano (Contributor) commented:

Hi @DwinGrashof,

Can you check this code and look at the table it generates, to see whether the result seems better:

library(data.table)

input <- "~/Downloads/mirna_mirtop_files/mirtop.tsv"
df <- read.delim(input, sep = "\t")

dt <- as.data.table(df)
# summarize counts by UID and miRNA name
tmp <- dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA),
          .SDcols = colnames(dt)[13:ncol(dt)]]
# remove duplicated UIDs, keeping the first occurrence
tmp <- as.data.frame(tmp)
counts <- as.data.table(tmp[!duplicated(tmp[["UID"]]), 2:ncol(tmp)])
mirna <- counts[, lapply(.SD, sum), by = miRNA]

I can push this into dev if it addresses the issue without messing up other things :)

@DwinGrashof (Author) commented:

Hi @lpantano ,

Sorry for my late response, I've been busy during these holiday times.

I've looked through the code, and in most cases it does work better than the current mirna code.
I do notice, though, that some UIDs (for example iso-22-U4K7BB8Z and iso-22-FPJYUP6XO in my dataset) still behave unexpectedly:

hsa-miR-365a-3p and hsa-miR-365b-3p have the same counts and share all of their UIDs in my dataset.
Because they have only common UIDs, all of their duplicated UIDs are removed, resulting in 0 counts for hsa-miR-365a-3p and all counts being assigned to hsa-miR-365b-3p.

For UID iso-22-FPJYUP6XO, counts are found for both hsa-miR-107 and hsa-miR-103a-3p. Since these miRNAs have both shared and unique UIDs, both are reported, but the counts for hsa-miR-107 are much lower than if all of its UIDs were summed.

These "problems" are, however, very much debatable: retaining counts from UIDs that are in essence ambiguous means we do not know which miRNA is actually expressed, if not both. There are of course multiple ways to handle these ambiguities. In the new R code, the first instance of a duplicated UID is retained and the other instances are removed, which is a rather arbitrary way of handling them. Another option would be to remove all non-unique UIDs after the `tmp = dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA), .SDcols = colnames(dt)[13:ncol(dt)]]` line, since that line nicely resolves the issue described in my initial post.
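As a sketch, the alternative described above (dropping every UID that appears more than once after the UID/miRNA summarization) could look like this, where `tmp` is the data.table produced by the summarization line; whether discarding the ambiguous UIDs entirely is desirable is exactly the open question:

```r
# drop every UID that appears more than once (i.e. maps to several miRNAs)
unique_only <- tmp[!(duplicated(UID) | duplicated(UID, fromLast = TRUE))]
# then sum the remaining, unambiguous UIDs per miRNA
mirna <- unique_only[, lapply(.SD, sum), by = miRNA, .SDcols = 3:ncol(unique_only)]
```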

This is just something to think about and I completely understand if you would rather keep it as coded in your previous post :)

Best,
Dwin

@apeltzer apeltzer modified the milestones: Release 2.1.0, Release 2.1.1 Patch Oct 11, 2022
@apeltzer apeltzer added help wanted Extra attention is needed question Further information is requested labels Oct 11, 2022
@apeltzer apeltzer removed this from the Release 2.2.0 milestone Sep 1, 2023
@apeltzer apeltzer added this to the 2.4.0 milestone Aug 8, 2024
@atrigila commented:

Hi @DwinGrashof and @lpantano,
I think the key issue here is the following: "In the new R code, the first instance of a duplicated UID is retained and the other duplicated instances are removed. This is a rather arbitrary way of handling these ambiguities."

This is due to how R's duplicated() function works on vectors (in our case, the UIDs): the first occurrence of each element is marked FALSE, so it is the only one that is retained.

miRNA_names <- c("hsa-let-7a", "hsa-miR-21", "hsa-miR-155",
                 "hsa-let-7a", "hsa-miR-21", "hsa-miR-223", "hsa-miR-155")
dup_flags <- duplicated(miRNA_names)
dup_flags
#> [1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

Building on the updated code from Lorena and the suggestions from Dwin, I propose replacing the deduplication line with filtered_tmp <- tmp[!duplicated(UID) & !duplicated(UID, fromLast = TRUE)], which retains only the UIDs that occur exactly once (rather than the first occurrence of each duplicated UID). The final code would look like this:

library(data.table)

input <- "~/R/mirtop.tsv"
df <- read.delim(input, sep = "\t")
dt <- as.data.table(df)
# Summarize the data by UID and miRNA
tmp <- dt[, lapply(.SD, sum, na.rm = TRUE), by = .(UID, miRNA),
          .SDcols = colnames(dt)[13:ncol(dt)]]
# Identify and filter out non-unique UIDs
filtered_tmp <- tmp[!duplicated(UID) & !duplicated(UID, fromLast = TRUE)]
# Summarize the unique data by miRNA
mirna <- filtered_tmp[, lapply(.SD, sum, na.rm = TRUE), by = miRNA,
                      .SDcols = 3:ncol(filtered_tmp)]
# Write mirna table
write.table(mirna, file.path(dirname(input), "mirna.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)

Let me know what you think.

@atrigila atrigila self-assigned this Aug 19, 2024
@lpantano (Contributor) commented:

We would remove a lot of good data if we did that. Many sequences map to the same mature miRNA but come from different precursors; we shouldn't remove those lines.

For the miR-365 family I wouldn't be worried: since the sequence is exactly the same, it would be nice to flag that the miRNA could be one or the other.

For the second case, I totally understand.

It is a good point. I don't think we should keep only unique UIDs, because that would remove a lot of data and mislead the results. We need to think of a better way to solve this, without removing duplicates, while being fair to the expression of the different possible miRNAs.

I would suggest leaving this for the Barcelona Summit. I can take charge of solving this issue; I have done some other strategies that work better than this and I could adapt them to the pipeline.

Projects: On Hold · No branches or pull requests · 4 participants