Final polishing updates

jlmaier12 · Dec 15, 2024 · a75b768 · a75b768
1 parent 070e86c
commit a75b768
Show file tree

Hide file tree

Showing 5 changed files with 76 additions and 63 deletions.
diff --git a/R/TrIdentClassifier.R b/R/TrIdentClassifier.R
@@ -15,7 +15,7 @@
 #' @param windowSize The number of basepairs to average read coverage values
 #'   over. Options are 100, 200, 500, 1000 ONLY.  Default is 1000.
 #' @param minBlockSize The minimum size (in bp) of the Prophage-like block
-#'   pattern. Default is 10000.
+#'   pattern. Default is 10000. Must be at least 1000.
 #' @param maxBlockSize The maximum size (in bp) of the Prophage-like block
 #'   pattern. Default is NA (no maximum).
 #' @param minContigLength The minimum contig size (in bp) to perform
@@ -52,7 +52,7 @@ TrIdentClassifier <- function(VLPpileup,
     if (!(windowSize %in% list(100, 200, 500, 1000))) {
         stop("windowSize must be either 100, 200, 500, or 1000 bp!")
     }
-    if (minContigLength < 25000) {
+    if (minContigLength <= 25000) {
         stop("minContigLength must be at least 25,000 bp for pattern-matching!")
     }
     if (minBlockSize <= 1000) {

diff --git a/R/specializedTransductionID.R b/R/specializedTransductionID.R
@@ -11,9 +11,11 @@
 #' @param VLPpileup VLP-fraction pileup file.
 #' @param TrIdentResults Output from `TrIdentClassifier()`
 #' @param noReadCov Number of basepairs of zero read coverage encountered before
-#'   specialized transduction searching stops. Default is 500.
+#'   specialized transduction searching stops. Default is 500. Must be at least
+#'   100.
 #' @param specTransLength Number of basepairs of non-zero read coverage needed
-#'   for specialized transduction to be considered. Default is 2000.
+#'   for specialized transduction to be considered. Default is 2000. Must be
+#'   at least 100.
 #' @param specificContig Optional, Search a specific contig classified as
 #'   Prophage-like ("NODE_1").
 #' @param matchScoreFilter Optional, Filter plots using the normalized pattern
@@ -48,6 +50,13 @@ specializedTransductionID <- function(VLPpileup,
                                       matchScoreFilter,
                                       logScale = FALSE,
                                       SaveFilesTo) {
+  ## error catching
+  if (noReadCov < 100) {
+    stop("noReadCov must be greater than or equal to 100!")
+  }
+  if (specTransLength < 100) {
+    stop("specTransLength must be greater than or equal to 100!")
+  }
     specTransInfo <- NULL
     TrIdentResultPatterns <- TrIdentResults[[3]]
     TrIdentResultSumm <- TrIdentResults[[1]]

diff --git a/man/TrIdentClassifier.Rd b/man/TrIdentClassifier.Rd
diff --git a/man/specializedTransductionID.Rd b/man/specializedTransductionID.Rd
diff --git a/vignettes/TrIdent-vignette.Rmd b/vignettes/TrIdent-vignette.Rmd
@@ -105,19 +105,19 @@ Transductomics allows for the identification of bacterial DNA being actively
 carried or transduced by VLPs. A transductomics dataset consists of two parts- 
 metagenomes from the whole-community and VLP fractions of a sample. The 
 whole-community fraction is generated by extracting and sequencing DNA from the 
-whole sample. The VLP-fraction is generated by DNA extraction and sequencing of 
+whole sample. The VLP-fraction is generated by extraction and sequencing DNA of 
 the ultra-purified VLPs in the sample. VLP ultra-purification is generally done 
 using CsCl density-gradient ultracentrifugation. Additionally, it is very 
 important that the VLP-fraction is treated with DNase to remove free DNA! After 
 sequencing, reads from the whole-community fraction are assembled and both the 
 whole-community and VLP-fraction reads are mapped to the assembly. **Read** 
 **mapping should be performed using a high minimum identity (0.97 or higher)** 
 **and random mapping of ambiguous reads.** The pileup files needed for TrIdent 
-are generated using .bam files produced during read mapping. 
+are generated using the .bam files produced during read mapping. 
 
 **Deep sequencing of the whole-community and VLP-fractions is needed for** 
-**transductomics! Sample preparation, sequencing procedures, and analysis** 
-**methods are detailed in** 
+**transductomics! Sample preparation, sequencing procedures, and** 
+**bioinformatics methods are detailed in** 
 **[Kleiner et al.(2020)](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00935-5).**
 
 ## Pileup files
@@ -142,10 +142,9 @@ parameter/argument. **Otherwise, BBMap's**
 TrIdent requires two pileup files from a transductomics dataset as input:
 
 * A VLP-fraction pileup: Sequencing reads from a sample's ultra-purified 
-VLP-fraction mapped to the whole-community metagenome assembly from the same 
-sample.
+VLP-fraction mapped to the sample's whole-community metagenome assembly.
 * A whole-community pileup: Sequencing reads from a sample's whole-community 
-mapped to the whole-community metagenome from the same sample. 
+mapped to the sample's whole-community metagenome assembly. 
 
 **Remember- The data used for each pileup file must originate from the same** 
 **sample. Pileup files must use a 100 bp window/bin size for the rolling** 
@@ -173,7 +172,7 @@ kable(head(VLPFractionSamplePileup), row.names = FALSE) %>%
 
 `TrIdentClassifier()` is the main function in TrIdent. This function filters 
 contigs based on length and read coverage, performs pattern-matching to 
-classify contigs, identifies active/highly abundant and heterogenously 
+classify contigs, identifies highly active/abundant and heterogenously 
 integrated Prophage-like elements, determines which contigs have high 
 VLP-fraction:whole-community read coverage ratios, identifies start and stop 
 positions and sizes of pattern-matches, and calculates slopes for Sloping 
@@ -233,27 +232,28 @@ associated with the best match-score is used for contig classification. Contigs
 are classified as ‘Prophage-like’, ‘Sloping’, or 'NoPattern' during 
 pattern-matching. 
 
-#### Patterns
+#### **Patterns**
 
-##### **Sloping:**
+##### Sloping:
 There are four sloping pattern variations in the sloping pattern class. The 
 sloping patterns are representative of large DNA transfers that take place 
-during generalized, lateral and GTA transduction due to the decreasing 
-frequency of DNA packaging moving away from the packaging initiation sites. 
-During pattern-matching, the slope values of the sloping patterns are decreased 
-until a minimum slope of 0.001 (change of 10x read coverage over 10,000 bp) is 
-reached. The minimum slope value can be changed with the `minSlope` parameter. 
-Generalized, lateral and GTA transduction events can span tens to hundreds of 
-kilobasepairs of DNA and a single contig typically does not capture an entire 
-event. Depending on which part of the transducing event is captured by the 
-contig, the sloping can be very severe or almost 0. Patterns 1 and 2 represent 
-contigs that capture a Sloping transducing event somewhere in the middle of the 
-DNA transfer. Patterns 2 and 4 represent contigs that capture the jump of read 
-coverage associated with packaging initiation of a Sloping transducing event. 
-Patterns 2 and 4 are translated across the contig in addition to having the 
-slopes changed while only the slopes are changed on patterns 1 and 2.
-
-```{r echo=FALSE, fig.width=4,fig.height=5}
+during generalized, lateral and gene transfer agent (GTA) transduction due to 
+the decreasing frequency of DNA packaging moving away from the packaging 
+initiation sites. During pattern-matching, the slope values of the sloping 
+patterns are decreased until a minimum slope of 0.001 (change of 10x read 
+coverage over 10,000 bp) is reached. The minimum slope value can be changed 
+with the `minSlope` parameter. Generalized, lateral and GTA transduction events 
+can span tens to hundreds of kilobasepairs of DNA and a single contig typically 
+does not capture an entire event. Depending on which part of the transducing 
+event is captured by the contig, the sloping can be very severe or almost 0. 
+Patterns 1 and 2 represent contigs that capture a Sloping transducing event 
+somewhere in the middle of the DNA transfer. Patterns 2 and 4 represent contigs 
+that capture the jump of read coverage associated with packaging initiation 
+site of a Sloping transducing event. Patterns 2 and 4 are translated across the 
+contig in addition to having the slopes changed while only the slopes are 
+changed on patterns 1 and 2.
+
+```{r echo=FALSE, fig.width=4,fig.height=4}
 dataframe <- cbind.data.frame(c(1:100), c(seq(1, 100, 1)))
 colnames(dataframe) <- c("mockpos", "mockcov")
 plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
@@ -317,7 +317,7 @@ plot4 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
 plot1 + plot2 + plot3 + plot4
 ```
 
-##### **Prophage-like:**
+##### Prophage-like:
 There are three block patterns in the Prophage-like pattern class. The block 
 patterns are representative of integrated genetic elements that can be excised 
 from the host chromosome and mobilized. The blocks of read coverage that define 
@@ -330,7 +330,7 @@ block patterns are altered and all pattern variations are translated across the
 contig. The block pattern widths never get smaller than 10,000 bp by default, 
 however this can be changed with the `minBlockSize` parameter. Pattern 1 
 represents a Prophage-like element that is entirely on the contig while 
-patterns 2 and 3 repesent Prophage-like elements that trail off the right or 
+patterns 2 and 3 represent Prophage-like elements that trail off the right or 
 left side of the contig, respectively. While a Prophage-like classification is 
 not an example of transduction by itself, there may be transduction associated 
 with Prophage-like classifications. The improper excision of Prophage-like 
@@ -389,21 +389,21 @@ plot3 + plot1 + plot2
 ```
 
 
-##### **noPattern:**
+##### noPattern:
 Since the best pattern-match for each contig is determined by comparing 
 match-scores amongst all pattern-variations from all pattern classes, we needed 
 a ‘negative control’ pattern to compare against. The 'NoPattern' 'pattern' 
 serves as a negative control by matching to contigs with no read coverage 
 patterns. We made two NoPattern patterns which consist of a horizontal line the 
 same length as the contig being assessed at either the average or median read 
-coverage for a contig. This pattern is not altered or translated in any way. 
+coverage for a contig. This pattern is not re-scaled or translated in any way. 
 Note that read coverage patterns are heavily dependent on the depth of read 
 coverage achieved during sequencing and therefore very rare transduction events 
 may not achieve sufficient read coverage for detection with read coverage 
 pattern-matching. Rather than label contigs with no read coverage pattern as 
 having ‘no transduction’, we instead label them as having 'no pattern'.
 
-```{r echo=FALSE, fig.width=2,fig.height=2}
+```{r echo=FALSE, fig.width=4,fig.height=4}
 dataframe <- cbind.data.frame(c(1:100), rep(10, 100))
 colnames(dataframe) <- c("mockpos", "mockcov")
 plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
@@ -416,15 +416,14 @@ plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
         axis.ticks.x = element_blank(),
         axis.text.y = element_blank(),
         axis.ticks.y = element_blank(),
-        plot.title = element_text(size = 6),
+        plot.title = element_text(size = 10),
         panel.border = element_rect(colour = "black", fill = NA, linewidth = 2)
     )
 plot1
 ```
 
 
-### Highly active/abundant and heterogenously integrated/present Prophage-like 
-elements
+### Highly active/abundant and heterogenously integrated/present Prophage-like elements
 
 Prophage-like elements that are actively replicating or are highly abundant 
 will typically generate more sequencing reads than the rest of their host 
@@ -433,18 +432,17 @@ the element's insertion site in the whole-community fraction read coverage.
 Conversely, if a Prophage-like element is integrated into only a portion of the 
 host bacterial population, there may be a dip or depression in read coverage at 
 the integration site in the whole-community read coverage. In order to 
-determine the whole-community read coverage elevation or depression of a 
-Prophage-like element, one must know its genomic location. While some tools 
-rely on annotation information to identify Prophage-like elements in 
-whole-community metagenomes, `TrIdentClassifier()` uses the VLP-fraction read 
-coverage patterns. The locations of Prophage-like pattern-matches are used to 
-calculate the Prophage-like:non-Prophage-like whole-community read coverage 
-ratio. Prophage-like patterns with whole-community read coverage ratios greater 
-than 1.15 are labeled as 'elevated' while ratios less then 0.75 are labeled as 
-'depressed'. 
-
-### NoPattern classifications with high VLP-fraction:whole-community read 
-coverage ratios
+determine if the whole-community read coverage is elevated or depressed at the 
+site of a Prophage-like element, one must know the Prophage-like element's 
+genomic location. While some tools rely on annotation information to identify 
+Prophage-like elements in whole-community metagenomes, `TrIdentClassifier()` 
+uses the VLP-fraction read coverage patterns. The locations of Prophage-like 
+pattern-matches are used to calculate the Prophage-like:non-Prophage-like 
+whole-community read coverage ratio. Prophage-like patterns with 
+whole-community read coverage ratios greater than 1.15 are labeled as 'elevated' 
+while ratios less then 0.75 are labeled as 'depressed'. 
+
+### NoPattern classifications with high VLP-fraction:whole-community read coverage ratios
 
 If a contig receives a noPattern classification, it proceeds to an additional 
 classification step which either leaves the classification as is or 
@@ -465,7 +463,7 @@ transformation) produces relatively even read coverage patterns when purified
 MV sequencing reads are mapped back to their bacterial chromosome of origin 
 (Faddetta et al., 2022). Lastly, phage genomes that have assembled into contigs 
 in the whole-community fraction may generate high levels of even read coverage 
-as the phage reads in the VLP-fraction map back to its own genome sequence. 
+as the phage reads in the VLP-fraction map back to their own genome sequences. 
 Contigs with median VLP-fraction:Whole-community read coverage ratios greater 
 than 2, in other words contigs where the median VLP-fraction read coverage 
 value is 2x the whole-community median read coverage value, are re-classified 
@@ -499,7 +497,7 @@ TrIdentClassifier(VLPpileup, WCpileup,
 * `windowSize`: The number of basepairs to average read coverage values over. 
 Options are 100, 200, 500, 1000 ONLY.  Default is 1000.
 * `minBlockSize`: The minimum size (in bp) of the Prophage-like block pattern. 
-Default is 10000.
+Default is 10000. Must be greater than 1000.
 * `maxBlockSize`: The maximum size (in bp) of the Prophage-like block pattern. 
 Default is NA (no maximum).
 * `minContigLength`: The minimum contig size (in bp) to perform 
@@ -531,8 +529,7 @@ as Prophage-like, Sloping and HighCovNoPattern.
 3.  PatternMatchInfo: A list of pattern-match information that is used by other 
 functions in TrIdent.
 4.  FilteredOutContigTable: A table containing contigs that were filtered out 
-and the reason why (low read coverage or 
-too short).
+and the reason why (low read coverage or too short).
 5.  windowSize: The `windowSize` used.
 
 Save the desired list-item to a new variable using its associated name.
@@ -551,16 +548,15 @@ kable(TrIdentSummaryTable) %>%
 
 `plotTrIdentResults()` allows users to visualize both the whole-community and 
 VLP-fraction read coverage and the pattern-match associated with each contig 
-classified as Prophge-like, Sloping and HighCovNoPattern.  
+classified as Prophage-like, Sloping and HighCovNoPattern.  
 
 ## Function components
 
 ### Re-building pattern-matches
 The `TrIdentClassifier()` output contains information needed to re-build each 
 pattern-match used for contig classification. To re-build a complete 
 pattern-match for visualization, `plotTrIdentResults()` uses the 
-pattern-match's minimum and maximum values, start and stop positions, and for 
-Sloping classifications, the 'slope' value. 
+pattern-match's minimum and maximum values and the start and stop positions.
 
 ### Plotting read coverage and associated pattern-matches
 The whole-community and VLP-fraction read coverage are plotted for each contig 
@@ -595,7 +591,7 @@ plotTrIdentResults(
 match-scores. A suggested filtering threshold is provided by 
 `TrIdentClassifier()` if `suggFiltThresh=TRUE`.
 * `saveFilesTo`: Optional, Provide a path to the directory you wish to save 
-output to. A folder will be made within  the provided directory to store 
+output to. A folder will be made within the provided directory to store 
 results.
 
 ## Output
@@ -604,6 +600,11 @@ The output of `plotTrIdentResults()` is a list containing ggplot objects. The
 list contains all read coverage plots for contigs classified as Sloping, 
 Prophage-like, or HighCovNoPattern and their respective pattern-matches.
 
+**By default, the plots are displayed with raw read coverage values. We 
+recommend that users also view plots using `logScale=TRUE` as some specialized 
+transduction patterns occur at such low frequencies they can only be visualized
+using log scaled read coverage values.** 
+
 View all plots:
 ```{r eval=FALSE}
 TrIdentPlots
@@ -719,9 +720,10 @@ specializedTransductionID(VLPpileup, TrIdentResults,
 * `VLPpileup`: VLP-fraction pileup file.
 * `TrIdentResults`: The output from `TrIdentClassifier()`.
 * `noReadCov`:  Number of basepairs of zero read coverage encountered before 
-specialized transduction searching stops. Default is 500.
+specialized transduction searching stops. Default is 500. Must be at least 100.
 * `specTransLength`: Number of basepairs of non-zero read coverage needed for 
-specialized transduction to be considered. Default is 2000.
+specialized transduction to be considered. Default is 2000. Must be at least 
+100.
 * `logScale`: TRUE or FALSE, display VLP-fraction read coverage in log10 scale. 
 Default is FALSE.
 * `matchScoreFilter`: Optional, Filter plots using the normalized pattern