Skip to content

Commit

Permalink
Final polishing updates
Browse files Browse the repository at this point in the history
  • Loading branch information
jlmaier12 committed Dec 15, 2024
1 parent 070e86c commit a75b768
Show file tree
Hide file tree
Showing 5 changed files with 76 additions and 63 deletions.
4 changes: 2 additions & 2 deletions R/TrIdentClassifier.R
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
#' @param windowSize The number of basepairs to average read coverage values
#' over. Options are 100, 200, 500, 1000 ONLY. Default is 1000.
#' @param minBlockSize The minimum size (in bp) of the Prophage-like block
#' pattern. Default is 10000.
#' pattern. Default is 10000. Must be at least 1000.
#' @param maxBlockSize The maximum size (in bp) of the Prophage-like block
#' pattern. Default is NA (no maximum).
#' @param minContigLength The minimum contig size (in bp) to perform
Expand Down Expand Up @@ -52,7 +52,7 @@ TrIdentClassifier <- function(VLPpileup,
if (!(windowSize %in% list(100, 200, 500, 1000))) {
stop("windowSize must be either 100, 200, 500, or 1000 bp!")
}
if (minContigLength < 25000) {
if (minContigLength <= 25000) {
stop("minContigLength must be at least 25,000 bp for pattern-matching!")
}
if (minBlockSize <= 1000) {
Expand Down
13 changes: 11 additions & 2 deletions R/specializedTransductionID.R
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,11 @@
#' @param VLPpileup VLP-fraction pileup file.
#' @param TrIdentResults Output from `TrIdentClassifier()`
#' @param noReadCov Number of basepairs of zero read coverage encountered before
#' specialized transduction searching stops. Default is 500.
#' specialized transduction searching stops. Default is 500. Must be at least
#' 100.
#' @param specTransLength Number of basepairs of non-zero read coverage needed
#' for specialized transduction to be considered. Default is 2000.
#' for specialized transduction to be considered. Default is 2000. Must be
#' at least 100.
#' @param specificContig Optional, Search a specific contig classified as
#' Prophage-like ("NODE_1").
#' @param matchScoreFilter Optional, Filter plots using the normalized pattern
Expand Down Expand Up @@ -48,6 +50,13 @@ specializedTransductionID <- function(VLPpileup,
matchScoreFilter,
logScale = FALSE,
SaveFilesTo) {
## error catching
if (noReadCov < 100) {
stop("noReadCov must be greater than or equal to 100!")
}
if (specTransLength < 100) {
stop("specTransLength must be greater than or equal to 100!")
}
specTransInfo <- NULL
TrIdentResultPatterns <- TrIdentResults[[3]]
TrIdentResultSumm <- TrIdentResults[[1]]
Expand Down
2 changes: 1 addition & 1 deletion man/TrIdentClassifier.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 4 additions & 2 deletions man/specializedTransductionID.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

114 changes: 58 additions & 56 deletions vignettes/TrIdent-vignette.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -105,19 +105,19 @@ Transductomics allows for the identification of bacterial DNA being actively
carried or transduced by VLPs. A transductomics dataset consists of two parts-
metagenomes from the whole-community and VLP fractions of a sample. The
whole-community fraction is generated by extracting and sequencing DNA from the
whole sample. The VLP-fraction is generated by DNA extraction and sequencing of
whole sample. The VLP-fraction is generated by extraction and sequencing DNA of
the ultra-purified VLPs in the sample. VLP ultra-purification is generally done
using CsCl density-gradient ultracentrifugation. Additionally, it is very
important that the VLP-fraction is treated with DNase to remove free DNA! After
sequencing, reads from the whole-community fraction are assembled and both the
whole-community and VLP-fraction reads are mapped to the assembly. **Read**
**mapping should be performed using a high minimum identity (0.97 or higher)**
**and random mapping of ambiguous reads.** The pileup files needed for TrIdent
are generated using .bam files produced during read mapping.
are generated using the .bam files produced during read mapping.

**Deep sequencing of the whole-community and VLP-fractions is needed for**
**transductomics! Sample preparation, sequencing procedures, and analysis**
**methods are detailed in**
**transductomics! Sample preparation, sequencing procedures, and**
**bioinformatics methods are detailed in**
**[Kleiner et al.(2020)](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00935-5).**

## Pileup files
Expand All @@ -142,10 +142,9 @@ parameter/argument. **Otherwise, BBMap's**
TrIdent requires two pileup files from a transductomics dataset as input:

* A VLP-fraction pileup: Sequencing reads from a sample's ultra-purified
VLP-fraction mapped to the whole-community metagenome assembly from the same
sample.
VLP-fraction mapped to the sample's whole-community metagenome assembly.
* A whole-community pileup: Sequencing reads from a sample's whole-community
mapped to the whole-community metagenome from the same sample.
mapped to the sample's whole-community metagenome assembly.

**Remember- The data used for each pileup file must originate from the same**
**sample. Pileup files must use a 100 bp window/bin size for the rolling**
Expand Down Expand Up @@ -173,7 +172,7 @@ kable(head(VLPFractionSamplePileup), row.names = FALSE) %>%

`TrIdentClassifier()` is the main function in TrIdent. This function filters
contigs based on length and read coverage, performs pattern-matching to
classify contigs, identifies active/highly abundant and heterogenously
classify contigs, identifies highly active/abundant and heterogenously
integrated Prophage-like elements, determines which contigs have high
VLP-fraction:whole-community read coverage ratios, identifies start and stop
positions and sizes of pattern-matches, and calculates slopes for Sloping
Expand Down Expand Up @@ -233,27 +232,28 @@ associated with the best match-score is used for contig classification. Contigs
are classified as ‘Prophage-like’, ‘Sloping’, or 'NoPattern' during
pattern-matching.

#### Patterns
#### **Patterns**

##### **Sloping:**
##### Sloping:
There are four sloping pattern variations in the sloping pattern class. The
sloping patterns are representative of large DNA transfers that take place
during generalized, lateral and GTA transduction due to the decreasing
frequency of DNA packaging moving away from the packaging initiation sites.
During pattern-matching, the slope values of the sloping patterns are decreased
until a minimum slope of 0.001 (change of 10x read coverage over 10,000 bp) is
reached. The minimum slope value can be changed with the `minSlope` parameter.
Generalized, lateral and GTA transduction events can span tens to hundreds of
kilobasepairs of DNA and a single contig typically does not capture an entire
event. Depending on which part of the transducing event is captured by the
contig, the sloping can be very severe or almost 0. Patterns 1 and 2 represent
contigs that capture a Sloping transducing event somewhere in the middle of the
DNA transfer. Patterns 2 and 4 represent contigs that capture the jump of read
coverage associated with packaging initiation of a Sloping transducing event.
Patterns 2 and 4 are translated across the contig in addition to having the
slopes changed while only the slopes are changed on patterns 1 and 2.

```{r echo=FALSE, fig.width=4,fig.height=5}
during generalized, lateral and gene transfer agent (GTA) transduction due to
the decreasing frequency of DNA packaging moving away from the packaging
initiation sites. During pattern-matching, the slope values of the sloping
patterns are decreased until a minimum slope of 0.001 (change of 10x read
coverage over 10,000 bp) is reached. The minimum slope value can be changed
with the `minSlope` parameter. Generalized, lateral and GTA transduction events
can span tens to hundreds of kilobasepairs of DNA and a single contig typically
does not capture an entire event. Depending on which part of the transducing
event is captured by the contig, the sloping can be very severe or almost 0.
Patterns 1 and 2 represent contigs that capture a Sloping transducing event
somewhere in the middle of the DNA transfer. Patterns 2 and 4 represent contigs
that capture the jump of read coverage associated with packaging initiation
site of a Sloping transducing event. Patterns 2 and 4 are translated across the
contig in addition to having the slopes changed while only the slopes are
changed on patterns 1 and 2.

```{r echo=FALSE, fig.width=4,fig.height=4}
dataframe <- cbind.data.frame(c(1:100), c(seq(1, 100, 1)))
colnames(dataframe) <- c("mockpos", "mockcov")
plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
Expand Down Expand Up @@ -317,7 +317,7 @@ plot4 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
plot1 + plot2 + plot3 + plot4
```

##### **Prophage-like:**
##### Prophage-like:
There are three block patterns in the Prophage-like pattern class. The block
patterns are representative of integrated genetic elements that can be excised
from the host chromosome and mobilized. The blocks of read coverage that define
Expand All @@ -330,7 +330,7 @@ block patterns are altered and all pattern variations are translated across the
contig. The block pattern widths never get smaller than 10,000 bp by default,
however this can be changed with the `minBlockSize` parameter. Pattern 1
represents a Prophage-like element that is entirely on the contig while
patterns 2 and 3 repesent Prophage-like elements that trail off the right or
patterns 2 and 3 represent Prophage-like elements that trail off the right or
left side of the contig, respectively. While a Prophage-like classification is
not an example of transduction by itself, there may be transduction associated
with Prophage-like classifications. The improper excision of Prophage-like
Expand Down Expand Up @@ -389,21 +389,21 @@ plot3 + plot1 + plot2
```


##### **noPattern:**
##### noPattern:
Since the best pattern-match for each contig is determined by comparing
match-scores amongst all pattern-variations from all pattern classes, we needed
a ‘negative control’ pattern to compare against. The 'NoPattern' 'pattern'
serves as a negative control by matching to contigs with no read coverage
patterns. We made two NoPattern patterns which consist of a horizontal line the
same length as the contig being assessed at either the average or median read
coverage for a contig. This pattern is not altered or translated in any way.
coverage for a contig. This pattern is not re-scaled or translated in any way.
Note that read coverage patterns are heavily dependent on the depth of read
coverage achieved during sequencing and therefore very rare transduction events
may not achieve sufficient read coverage for detection with read coverage
pattern-matching. Rather than label contigs with no read coverage pattern as
having ‘no transduction’, we instead label them as having 'no pattern'.

```{r echo=FALSE, fig.width=2,fig.height=2}
```{r echo=FALSE, fig.width=4,fig.height=4}
dataframe <- cbind.data.frame(c(1:100), rep(10, 100))
colnames(dataframe) <- c("mockpos", "mockcov")
plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
Expand All @@ -416,15 +416,14 @@ plot1 <- ggplot(dataframe, aes(x = mockpos, y = mockcov)) +
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(size = 6),
plot.title = element_text(size = 10),
panel.border = element_rect(colour = "black", fill = NA, linewidth = 2)
)
plot1
```


### Highly active/abundant and heterogenously integrated/present Prophage-like
elements
### Highly active/abundant and heterogenously integrated/present Prophage-like elements

Prophage-like elements that are actively replicating or are highly abundant
will typically generate more sequencing reads than the rest of their host
Expand All @@ -433,18 +432,17 @@ the element's insertion site in the whole-community fraction read coverage.
Conversely, if a Prophage-like element is integrated into only a portion of the
host bacterial population, there may be a dip or depression in read coverage at
the integration site in the whole-community read coverage. In order to
determine the whole-community read coverage elevation or depression of a
Prophage-like element, one must know its genomic location. While some tools
rely on annotation information to identify Prophage-like elements in
whole-community metagenomes, `TrIdentClassifier()` uses the VLP-fraction read
coverage patterns. The locations of Prophage-like pattern-matches are used to
calculate the Prophage-like:non-Prophage-like whole-community read coverage
ratio. Prophage-like patterns with whole-community read coverage ratios greater
than 1.15 are labeled as 'elevated' while ratios less then 0.75 are labeled as
'depressed'.

### NoPattern classifications with high VLP-fraction:whole-community read
coverage ratios
determine if the whole-community read coverage is elevated or depressed at the
site of a Prophage-like element, one must know the Prophage-like element's
genomic location. While some tools rely on annotation information to identify
Prophage-like elements in whole-community metagenomes, `TrIdentClassifier()`
uses the VLP-fraction read coverage patterns. The locations of Prophage-like
pattern-matches are used to calculate the Prophage-like:non-Prophage-like
whole-community read coverage ratio. Prophage-like patterns with
whole-community read coverage ratios greater than 1.15 are labeled as 'elevated'
while ratios less then 0.75 are labeled as 'depressed'.

### NoPattern classifications with high VLP-fraction:whole-community read coverage ratios

If a contig receives a noPattern classification, it proceeds to an additional
classification step which either leaves the classification as is or
Expand All @@ -465,7 +463,7 @@ transformation) produces relatively even read coverage patterns when purified
MV sequencing reads are mapped back to their bacterial chromosome of origin
(Faddetta et al., 2022). Lastly, phage genomes that have assembled into contigs
in the whole-community fraction may generate high levels of even read coverage
as the phage reads in the VLP-fraction map back to its own genome sequence.
as the phage reads in the VLP-fraction map back to their own genome sequences.
Contigs with median VLP-fraction:Whole-community read coverage ratios greater
than 2, in other words contigs where the median VLP-fraction read coverage
value is 2x the whole-community median read coverage value, are re-classified
Expand Down Expand Up @@ -499,7 +497,7 @@ TrIdentClassifier(VLPpileup, WCpileup,
* `windowSize`: The number of basepairs to average read coverage values over.
Options are 100, 200, 500, 1000 ONLY. Default is 1000.
* `minBlockSize`: The minimum size (in bp) of the Prophage-like block pattern.
Default is 10000.
Default is 10000. Must be greater than 1000.
* `maxBlockSize`: The maximum size (in bp) of the Prophage-like block pattern.
Default is NA (no maximum).
* `minContigLength`: The minimum contig size (in bp) to perform
Expand Down Expand Up @@ -531,8 +529,7 @@ as Prophage-like, Sloping and HighCovNoPattern.
3. PatternMatchInfo: A list of pattern-match information that is used by other
functions in TrIdent.
4. FilteredOutContigTable: A table containing contigs that were filtered out
and the reason why (low read coverage or
too short).
and the reason why (low read coverage or too short).
5. windowSize: The `windowSize` used.

Save the desired list-item to a new variable using its associated name.
Expand All @@ -551,16 +548,15 @@ kable(TrIdentSummaryTable) %>%

`plotTrIdentResults()` allows users to visualize both the whole-community and
VLP-fraction read coverage and the pattern-match associated with each contig
classified as Prophge-like, Sloping and HighCovNoPattern.
classified as Prophage-like, Sloping and HighCovNoPattern.

## Function components

### Re-building pattern-matches
The `TrIdentClassifier()` output contains information needed to re-build each
pattern-match used for contig classification. To re-build a complete
pattern-match for visualization, `plotTrIdentResults()` uses the
pattern-match's minimum and maximum values, start and stop positions, and for
Sloping classifications, the 'slope' value.
pattern-match's minimum and maximum values and the start and stop positions.

### Plotting read coverage and associated pattern-matches
The whole-community and VLP-fraction read coverage are plotted for each contig
Expand Down Expand Up @@ -595,7 +591,7 @@ plotTrIdentResults(
match-scores. A suggested filtering threshold is provided by
`TrIdentClassifier()` if `suggFiltThresh=TRUE`.
* `saveFilesTo`: Optional, Provide a path to the directory you wish to save
output to. A folder will be made within the provided directory to store
output to. A folder will be made within the provided directory to store
results.

## Output
Expand All @@ -604,6 +600,11 @@ The output of `plotTrIdentResults()` is a list containing ggplot objects. The
list contains all read coverage plots for contigs classified as Sloping,
Prophage-like, or HighCovNoPattern and their respective pattern-matches.

**By default, the plots are displayed with raw read coverage values. We
recommend that users also view plots using `logScale=TRUE` as some specialized
transduction patterns occur at such low frequencies they can only be visualized
using log scaled read coverage values.**

View all plots:
```{r eval=FALSE}
TrIdentPlots
Expand Down Expand Up @@ -719,9 +720,10 @@ specializedTransductionID(VLPpileup, TrIdentResults,
* `VLPpileup`: VLP-fraction pileup file.
* `TrIdentResults`: The output from `TrIdentClassifier()`.
* `noReadCov`: Number of basepairs of zero read coverage encountered before
specialized transduction searching stops. Default is 500.
specialized transduction searching stops. Default is 500. Must be at least 100.
* `specTransLength`: Number of basepairs of non-zero read coverage needed for
specialized transduction to be considered. Default is 2000.
specialized transduction to be considered. Default is 2000. Must be at least
100.
* `logScale`: TRUE or FALSE, display VLP-fraction read coverage in log10 scale.
Default is FALSE.
* `matchScoreFilter`: Optional, Filter plots using the normalized pattern
Expand Down

0 comments on commit a75b768

Please sign in to comment.