Consider support of IUPAC codes in matrixOfRandomBarcodes or convenience function to derive possible barcodes #4

Open
j-andrews7 opened this issue Jul 9, 2023 · 3 comments


@j-andrews7

Rather than truly random synthesis, there are semi-random barcode libraries that adhere to a given structure (e.g. clonTracer). In such cases, the ability to provide constrained barcode templates could be convenient.

Of course, users could just generate the barcodes themselves and pass them as the choices argument of the appropriate function, with something like:

library(Biostrings)

iupac_codes <- list(
  A = "A", C = "C", G = "G", T = "T", 
  R = c("A", "G"), Y = c("C", "T"), S = c("G", "C"), 
  W = c("A", "T"), K = c("G", "T"), M = c("A", "C"), 
  B = c("C", "G", "T"), D = c("A", "G", "T"), H = c("A", "C", "T"), 
  V = c("A", "C", "G"), N = c("A", "C", "G", "T")
)

generate_possible_barcodes <- function(dna_string) {
  
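  # Look up the bases a single IUPAC code stands for; fail loudly on
  # characters that are not in the table instead of silently returning NULL.
  expand_char <- function(char) {
    bases <- iupac_codes[[toupper(char)]]
    if (is.null(bases)) stop("unsupported IUPAC code: ", char)
    bases
  }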

  chars <- strsplit(dna_string, split = "")[[1]]
  possible_sequences_list <- lapply(chars, expand_char)
  sequences <- expand.grid(possible_sequences_list, stringsAsFactors = FALSE)

  sequences_str <- do.call(paste0, sequences)

  return(sequences_str)
}

sequences <- generate_possible_barcodes("ACTGWSWSWSAA")
print(sequences)
 [1] "ACTGAGAGAGAA" "ACTGTGAGAGAA" "ACTGACAGAGAA" "ACTGTCAGAGAA" "ACTGAGTGAGAA" "ACTGTGTGAGAA" "ACTGACTGAGAA" "ACTGTCTGAGAA" "ACTGAGACAGAA"
[10] "ACTGTGACAGAA" "ACTGACACAGAA" "ACTGTCACAGAA" "ACTGAGTCAGAA" "ACTGTGTCAGAA" "ACTGACTCAGAA" "ACTGTCTCAGAA" "ACTGAGAGTGAA" "ACTGTGAGTGAA"
[19] "ACTGACAGTGAA" "ACTGTCAGTGAA" "ACTGAGTGTGAA" "ACTGTGTGTGAA" "ACTGACTGTGAA" "ACTGTCTGTGAA" "ACTGAGACTGAA" "ACTGTGACTGAA" "ACTGACACTGAA"
[28] "ACTGTCACTGAA" "ACTGAGTCTGAA" "ACTGTGTCTGAA" "ACTGACTCTGAA" "ACTGTCTCTGAA" "ACTGAGAGACAA" "ACTGTGAGACAA" "ACTGACAGACAA" "ACTGTCAGACAA"
[37] "ACTGAGTGACAA" "ACTGTGTGACAA" "ACTGACTGACAA" "ACTGTCTGACAA" "ACTGAGACACAA" "ACTGTGACACAA" "ACTGACACACAA" "ACTGTCACACAA" "ACTGAGTCACAA"
[46] "ACTGTGTCACAA" "ACTGACTCACAA" "ACTGTCTCACAA" "ACTGAGAGTCAA" "ACTGTGAGTCAA" "ACTGACAGTCAA" "ACTGTCAGTCAA" "ACTGAGTGTCAA" "ACTGTGTGTCAA"
[55] "ACTGACTGTCAA" "ACTGTCTGTCAA" "ACTGAGACTCAA" "ACTGTGACTCAA" "ACTGACACTCAA" "ACTGTCACTCAA" "ACTGAGTCTCAA" "ACTGTGTCTCAA" "ACTGACTCTCAA"
[64] "ACTGTCTCTCAA"

But what's the fun in that, really? There are probably simpler ways to do it; I am not great with Biostrings.
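As a sanity check on the output above, the number of expanded sequences is the product of the number of bases each position allows, so a template with six two-base codes (W or S) yields 2^6 = 64 barcodes. A minimal base R sketch of that count follows; count_possible_barcodes is a hypothetical helper name, not part of screenCounter or Biostrings.

```r
# Hypothetical helper: number of concrete sequences encoded by an IUPAC
# template, computed without enumerating them.
count_possible_barcodes <- function(dna_string) {
  # How many bases each IUPAC code can stand for.
  n_bases <- c(A = 1, C = 1, G = 1, T = 1,
               R = 2, Y = 2, S = 2, W = 2, K = 2, M = 2,
               B = 3, D = 3, H = 3, V = 3, N = 4)
  chars <- strsplit(toupper(dna_string), split = "")[[1]]
  stopifnot(all(chars %in% names(n_bases)))
  prod(n_bases[chars])
}

count_possible_barcodes("ACTGWSWSWSAA")  # 2^6 = 64
count_possible_barcodes("NNNNNNNNNN")    # 4^10 = 1048576
```

Checking the count first is useful because full enumeration grows quickly: ten N positions already expand to over a million sequences.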

@LTLA
Collaborator

LTLA commented Jul 10, 2023

Hm. Well, for your immediate problem, matrixOfRandomBarcodes will just report what it finds, so you might as well resolve ambiguous codes outside of the function. Helper functions notwithstanding, I'd be doing the same.

In the more general case, the various functions that use a known barcode pool could be modified to accept IUPAC codes in the barcodes. It's not too hard, but it's not a simple change, either, so I don't have a time frame for that.
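To illustrate what accepting IUPAC codes in a known pool means, here is a minimal base R sketch that turns each barcode into a regular expression and matches observed sequences against it. This is only an illustration of the idea, not how screenCounter or kaori implement it; iupac_to_regex and match_iupac are hypothetical names, and the sketch handles exact matches only (no mismatches).

```r
# Hypothetical helper: translate an IUPAC barcode into an anchored regular
# expression with one character class per position.
iupac_to_regex <- function(barcode) {
  classes <- c(
    A = "A", C = "C", G = "G", T = "T",
    R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
    K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
    H = "[ACT]", V = "[ACG]", N = "[ACGT]"
  )
  chars <- strsplit(toupper(barcode), split = "")[[1]]
  unknown <- setdiff(chars, names(classes))
  if (length(unknown) > 0) {
    stop("unsupported code(s): ", paste(unknown, collapse = ", "))
  }
  paste0("^", paste(classes[chars], collapse = ""), "$")
}

# Hypothetical helper: for each observed sequence, return the index of the
# first barcode in the pool that it matches, or NA if none match.
match_iupac <- function(observed, pool) {
  patterns <- vapply(pool, iupac_to_regex, character(1))
  vapply(observed, function(s) {
    hits <- which(vapply(patterns, grepl, logical(1), x = s))
    if (length(hits) > 0) hits[[1]] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

match_iupac(c("AAAGAA", "CCCTCC", "GGGGGG"), c("AAARAA", "CCCYCC"))
## [1]  1  2 NA
```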

@LTLA
Collaborator

LTLA commented Aug 11, 2023

Making some progress with crisprVerse/kaori#10.

@LTLA
Collaborator

LTLA commented Aug 14, 2023

All of the functions should now support IUPAC codes (except U and gaps) in the known barcode sequences:

# Creating an example single barcode sequencing experiment.
known.pool <- c("AAARRRAAA", "AAAYYYAAA")

actual <- c("AAAAAAAAA", "AAACCCAAA", "AAAGGGAAA", "AAATTTAAA")
barcodes <- sprintf("CAGCTACGTACG%sCCAGCTCGATCG", actual)
names(barcodes) <- seq_along(barcodes)

library(Biostrings)
tmp <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(barcodes), filepath=tmp, format="fastq")

# Two of each:
countSingleBarcodes(tmp, choices=known.pool,
    template="CAGCTACGTACGNNNNNNNNNCCAGCTCGATCG")
## DataFrame with 2 rows and 2 columns
##       choices    counts
##   <character> <integer>
## 1   AAARRRAAA         2
## 2   AAAYYYAAA         2

In addition, we have a matchBarcodes() function to match observed sequences to a known pool:

choices <- c("AAARAA", "CCCYCC", "GGGMGG", "TTTSTT")
matchBarcodes(c("AAAAAA", "AAAGAA"), choices)
## DataFrame with 2 rows and 2 columns
##       index mismatches
##   <integer>  <integer>
## 1         1          0
## 2         1          0

This should be available in 1.1.4.
