hg38 support #6

Draft
wants to merge 81 commits into
base: master
Changes from 5 commits
Commits
81 commits
37fcb5a
Made ChrConverter more flexible (second hard-coded reference) and cen…
vinjana Dec 13, 2023
703ef68
Build script for sophiaMref.
vinjana Dec 13, 2023
769cce8
Made README more descriptive with texts from Umut's and the SophiaWor…
vinjana Dec 13, 2023
0a9e8f2
More code layout changes. Removed temporary file.
vinjana Dec 13, 2023
a5ebdb6
Removed using namespace std. More code layout changes.
vinjana Dec 13, 2023
57f8be4
Documentation for hg37 chromosome name parser.
vinjana Dec 13, 2023
f4f4dce
Replace obscure -2 into NA variable to get more self-descriptive code.
vinjana Dec 13, 2023
cbcf2dd
Major refactoring to have a map-based hg38 index/converter.
vinjana Dec 15, 2023
fee16d2
Fixed some compile instructions for static compile.
vinjana Dec 15, 2023
5904ab0
Compile fix for static libraries with Makefile.
vinjana Dec 15, 2023
9a894b6
Fixed many errors, but not all yet.
vinjana Dec 20, 2023
fec30e9
Fixed all apparent compile errors. Some semantic errors as warning re…
vinjana Dec 20, 2023
e7100ed
Made compressedMrefIndexToIndex() return an optional to account for f…
vinjana Dec 20, 2023
e12fe90
Makefile simplified. Static compile still doesn't work, though.
vinjana Dec 20, 2023
944fe11
Fully implemented dynamic and static compilation with Makefile. Releas…
vinjana Dec 21, 2023
31ca645
Merge pull request #7 from DKFZ-ODCF/static-compile
vinjana Dec 21, 2023
aca3c6a
Encapsulate implementation detail into ChrConverter.isIgnoredChromoso…
vinjana Dec 22, 2023
f9419f8
Removed unnecessary imports. Wrap main functions into try-catch to ca…
vinjana Dec 22, 2023
e1a0649
Fixes after coderabbit review.
vinjana Dec 22, 2023
dfbacc2
Switched to parse chromosome name function taking start and end itera…
vinjana Dec 22, 2023
6428131
Fixed input check and improved error message. Changed `make STATIC=tr…
vinjana Dec 22, 2023
ee9d46a
Fixed cornercase with supplementary alignment tag (SA:Z:) containing …
vinjana Jan 8, 2024
2073adf
Minor.
vinjana Jan 8, 2024
3476b32
Added comment.
vinjana Jan 8, 2024
4e017b8
Improve chrName parser for breakpoint files, to be able to deal with …
vinjana Jan 11, 2024
c2d56dc
Added googletest/gtest-based unit tests for the chromosome name parse…
vinjana Jan 12, 2024
dc6acf0
Fixed Makefile to build static binaries, except the testRunner (which…
vinjana Jan 12, 2024
bf4d4d6
Fixed wrong initialization order of GlobalAppConfig.
vinjana Jan 12, 2024
78213ee
Bugs fixed, but parser of breakpoints still broken.
vinjana Jan 15, 2024
92d012e
Added boost::stacktrace for better error reporting.
vinjana Jan 15, 2024
55ba762
Documented output BED file and code.
vinjana Jan 16, 2024
2817ae2
Added failing Breakpoint parser tests.
vinjana Jan 16, 2024
adeace4
Hack to prevent failure in tests due to reinitialized singleton.
vinjana Jan 16, 2024
d14705a
Added documentation for sophiaMref output file. Added `binaries` targ…
vinjana Jan 17, 2024
cd83acf
Added tests. Removed TODO.
vinjana Jan 17, 2024
b9e69e6
Added contig classes, also to hg38 (not working yet). Range approach.
vinjana Jan 17, 2024
39e46c0
Migrated Hg38ChrConverter to one that reads configuration file. Yet h…
vinjana Jan 23, 2024
4190cfc
Transformed the Hg38ChrConverter into a simple GenericChrConverter th…
vinjana Jan 23, 2024
b66c253
Changed from hg37 to classic_hg37 as default Hg37ChrConverter.
vinjana Jan 23, 2024
da6fc36
Removed obsolete IndexRange. Fixed is$category methods for classic_hg…
vinjana Jan 24, 2024
604ca78
Much improved error messages for parsing.
vinjana Jan 24, 2024
3917745
Some minor exception tweaks.
vinjana Jan 24, 2024
fabd143
Fixed a bug I introduced. Noted that SuppAlignmentAnno::SuppAlignment…
vinjana Jan 24, 2024
b5ba8ed
Comments and removed strange 'm' printed by `sophiaAnnotate`.
vinjana Jan 24, 2024
c0158d7
Added CONTRIBUTORS.md.
vinjana Jan 24, 2024
e027b40
README update. Added hs37d5+phix.tsv to resources.
vinjana Jan 24, 2024
47eac31
Fix chromosome label.
vinjana Jan 25, 2024
cb551f3
Fixed some chromosome index mappings that could have been entirely av…
vinjana Jan 29, 2024
5aac8c8
Major refactorings. Have unsigned and signed int for ChrIndex and Com…
vinjana Jan 31, 2024
da58046
Commandline fixes created by refactoring.
vinjana Feb 1, 2024
7b7d7ec
Switched types for ChrIndex and CompressedMrefIndex and fixed errors.…
vinjana Feb 1, 2024
aabceca
Comments and layout changes.
vinjana Feb 1, 2024
355f377
Reviewed my own code, to find any bugs.
vinjana Feb 5, 2024
6c58604
Added assertions (can be turned off with -DNDEBUG for production) to …
vinjana Feb 6, 2024
e55b4a5
Refined tests for valid chromosome indices (classic_hg37), at constru…
vinjana Feb 6, 2024
3d54031
Fixed check.
vinjana Feb 6, 2024
8c86dd8
Small refactorings
vinjana Feb 6, 2024
7c07a52
Refactored ChrCategory, wrote some tests, and fixed some issues.
vinjana Feb 6, 2024
7ba7053
Changed all index and position types back to int-length. MrefEntry an…
vinjana Feb 6, 2024
2b59b12
Changed MrefEntry::validity into signed char, to reduce space require…
vinjana Feb 7, 2024
50860d0
Little refactoring for code readability.
vinjana Feb 7, 2024
9c0d5df
Fixed the memory issue.
vinjana Feb 7, 2024
b9926db
Removed all assertions, to see whether that fixes the memory issue.
vinjana Feb 13, 2024
78fda5e
Small changes.
vinjana Feb 14, 2024
e230e17
Added test for internal static constructor function of Hg37ChrConverter.
vinjana Feb 14, 2024
5022eda
Removed assertValid calls completely. Instead now use IndexRange class.
vinjana Feb 14, 2024
9e013ce
Minor.
vinjana Feb 20, 2024
d2bb840
Preallocate full memory for MasterMrefProcessor (a lot), rather than …
vinjana Mar 11, 2024
450ccc7
Some small refactorings, edits, and comments.
vinjana Mar 13, 2024
dc7f525
Added tests for SuppAlignmentAnno.
vinjana Mar 14, 2024
600ffdd
Some debugging output to catch a specific difference
vinjana Mar 15, 2024
b673eca
Fixed some incorrectly translated conditions concerning decoys.
vinjana Mar 15, 2024
4721b35
Split gonosomes into X and Y classes. One condition (CompressedMrefIn…
vinjana Mar 18, 2024
88d4715
Made assemblyName a non-static, const value in ChrConverter.
vinjana Mar 18, 2024
7d2e571
Calculate memory allocated by MasterRefProcessor and report in GB.
vinjana Mar 18, 2024
9a00c6b
Minor.
vinjana Mar 18, 2024
f1c8209
Answers to CodeRabbit.
vinjana Mar 18, 2024
e65cfbf
Fixed incorrect index returned from parse function.
vinjana Mar 18, 2024
95089fe
Merge pull request #22 from DKFZ-ODCF/review-my-code
vinjana Mar 18, 2024
a8ac0ae
Adapted SvEvent conditions again to make them similar to dealing…
vinjana Mar 18, 2024
3c14f17
Removed all using namespace std.
vinjana Mar 18, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -3,4 +3,5 @@
*.o
Release_sophia/sophia
Release_sophiaAnnotate/sophiaAnnotate
Release_sophiaMref/sophiaMref
include/strtk.hpp
41 changes: 40 additions & 1 deletion README.md
@@ -1,5 +1,30 @@
# SOPHIA Tool for Structural Variation Calling

SOPHIA is a Structural Variant (SV) detection algorithm based on the supplementary alignment (SA) concept of the aligner BWA-MEM, combined with filters based on expert knowledge to increase specificity.

Currently, SOPHIA is optimized only for the hg37 assembly of the human genome.
It uses a large panel of normals for filtering artifacts (most often due to mapping difficulties) and common SVs in the germline.
The parameters for filtering results are hand-tuned against the clinical gold standard FISH of V(D)J rearrangements.
Results from the hand-tuned parameter set were tested against hallmark findings from disease datasets where hallmark SVs were known (CDKN2A in various TCGA datasets, EGFR in TCGA-GBM, GFI1B, MYCN and PRDM6 in ICGC-PEDBRAIN-MB etc.)

For a detailed description of the algorithm, please refer to Umut Toprak's dissertation at https://doi.org/10.11588/heidok.00027429, in particular chapter 2. Section 2.2.1 describes the method in more detail.

SOPHIA is a very fast and resource-light algorithm. It uses 2 GB of RAM and 2 CPU cores, runs in ~3.5 hours for 50x-coverage WGS, and detects variants in a single pass over the input BAM files. No local assembly is done.

> This is a fork of the original [SOPHIA](https://bitbucket.org/utoprak/sophia/src/master/) bitbucket repository.

Sophia is included in the [SophiaWorkflow](https://github.com/DKFZ-ODCF/SophiaWorkflow) that uses the [Roddy Workflow Management System](https://github.com/TheRoddyWMS/Roddy).


### Citing

You can cite Sophia as follows:

Integrative Analysis of Omics Datasets.
Doctoral dissertation, German Cancer Research Center (DKFZ), Heidelberg.
Umut Toprak (2019).
DOI 10.11588/heidok.00027429

## Runtime Dependencies

The only dependency is Boost 1.70.0 (currently). E.g. you can do
@@ -41,7 +66,7 @@ Note that the build-scripts are for when you manage your dependencies with Conda

### Static Build

If you want to compile statically you need to install glibc and boost static libraries (not possible with Conda, in the moment) and do
If you want to compile statically you need to install glibc and boost static libraries (currently, not possible with Conda) and do

```bash
source activate sophia
@@ -52,3 +77,17 @@ STATIC=true build-sophia.sh
cd ../Release_sophiaAnnotate
STATIC=true build-sophiaAnnotate.sh
```

## Changes

* 35.1.0 (upcoming)
  * Minor: Nominally added support for hg38 (hg37 support remains)
  * Minor: Added `--assemblyname` option, defaulting to "hg37" when omitted (old behaviour)
    > WARNING: hg38 support has not been extensively tested. In particular, parameters that are still hardcoded may have to be adjusted.
  * Minor: Build script for `sophiaMref`
  * Patch: Code readability improvements, `.editorconfig` file, and `clang-format` configuration
  * Patch: Improved compilation instructions
  * Patch: Use `namespace::std` to get rid of `std::` noise in the code

* 9e3b6ed
  * Last version in [bitbucket](https://bitbucket.org/compbio_charite/sophia/src/master/)
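The change list above states that `--assemblyname` defaults to "hg37" when omitted. A minimal sketch of how a wrapper script might apply the same default before calling the binaries follows. The helper `resolve_assembly_name` is hypothetical and not part of this PR, and the accepted values `hg37` and `hg38` are an assumption based only on the assemblies named in the README.

```bash
#!/bin/bash
set -ue

# Hypothetical helper (not in the PR): return the assembly name to pass on,
# falling back to "hg37" when the argument is missing or empty.
# Assumption: only "hg37" and "hg38" are accepted values.
resolve_assembly_name() {
    local requested="${1:-hg37}"
    case "$requested" in
        hg37|hg38)
            echo "$requested"
            ;;
        *)
            echo "Unknown assembly name: $requested" >&2
            return 1
            ;;
    esac
}

# With no argument the old behaviour (hg37) is kept.
echo "--assemblyname $(resolve_assembly_name)"
```

Run as is, the last line prints `--assemblyname hg37`. Rejecting unknown names in the wrapper is a design choice of this sketch; the binaries may well perform their own validation.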
24 changes: 20 additions & 4 deletions Release_sophia/build-sophia.sh
@@ -14,7 +14,7 @@ install_strtk
CPP=x86_64-conda_cos6-linux-gnu-g++
INCLUDES="-I../include -I$CONDA_PREFIX/include"

CPP_OPTS="-L$CONDA_PREFIX/lib -std=c++1z $INCLUDES -O3 -Wall -Wextra -static -static-libgcc -static-libstdc++ -flto -c -fmessage-length=0 -Wno-attributes"
CPP_OPTS="-L$CONDA_PREFIX/lib -std=c++17 $INCLUDES -O3 -Wall -Wextra -static -static-libgcc -static-libstdc++ -flto -c -fmessage-length=0 -Wno-attributes"

if [[ "${STATIC:-false}" == "true" ]]; then
    CPP_OPTS="-static -static-libgcc -static-libstdc++ $CPP_OPTS"
@@ -23,11 +23,27 @@ fi
$CPP $CPP_OPTS -o "Alignment.o" "../src/Alignment.cpp"
$CPP $CPP_OPTS -o "Breakpoint.o" "../src/Breakpoint.cpp"
$CPP $CPP_OPTS -o "ChosenBp.o" "../src/ChosenBp.cpp"
$CPP $CPP_OPTS -o "GlobalAppConfig.o" "../src/GlobalAppConfig.cpp"
$CPP $CPP_OPTS -o "ChrConverter.o" "../src/ChrConverter.cpp"
$CPP $CPP_OPTS -o "Hg37ChrConverter.o" "../src/Hg37ChrConverter.cpp"
$CPP $CPP_OPTS -o "Hg38ChrConverter.o" "../src/Hg38ChrConverter.cpp"
$CPP $CPP_OPTS -o "SamSegmentMapper.o" "../src/SamSegmentMapper.cpp"
$CPP $CPP_OPTS -o "Sdust.o" "../src/Sdust.cpp"
$CPP $CPP_OPTS -o "SuppAlignment.o" "../src/SuppAlignment.cpp"
$CPP $CPP_OPTS -o "HelperFunctions.o" "../src/HelperFunctions.cpp"
$CPP $CPP_OPTS -o "sophia.o" "../sophia.cpp"

$CPP -L$CONDA_PREFIX/lib -flto -o "sophia" Alignment.o Breakpoint.o ChosenBp.o ChrConverter.o SamSegmentMapper.o Sdust.o SuppAlignment.o HelperFunctions.o sophia.o -lboost_program_options
$CPP $CPP_OPTS -o "sophia.o" "../src/sophia.cpp"

$CPP -L$CONDA_PREFIX/lib -flto -o "sophia" \
    Alignment.o \
    Breakpoint.o \
    ChosenBp.o \
    ChrConverter.o \
    Hg37ChrConverter.o \
    Hg38ChrConverter.o \
    SamSegmentMapper.o \
    Sdust.o \
    SuppAlignment.o \
    HelperFunctions.o \
    GlobalAppConfig.o \
    sophia.o \
    -lboost_program_options
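The compile steps in this script repeat the same `$CPP $CPP_OPTS -o "X.o" "../src/X.cpp"` pattern once per source file. As a sketch only, the pattern could be generated from a list of sources. The helpers `object_name` and `print_compile_commands` are hypothetical names introduced here, not part of the PR, and the example prints the commands (a dry run) instead of invoking a compiler.

```bash
#!/bin/bash
set -ue

# Hypothetical helper (not in the PR): map "../src/Foo.cpp" to "Foo.o".
object_name() {
    local source_file="$1"
    local base_name="${source_file##*/}"   # strip the directory part
    echo "${base_name%.cpp}.o"             # swap the .cpp suffix for .o
}

# Hypothetical helper: print one compile command per source file.
# Compiler and options are passed in instead of being hard-coded.
print_compile_commands() {
    local compiler="$1"
    local options="$2"
    shift 2
    local source_file
    for source_file in "$@"; do
        echo "$compiler $options -o \"$(object_name "$source_file")\" \"$source_file\""
    done
}

# Dry run for three of the sources listed in the script above.
print_compile_commands 'g++' '-std=c++17 -c' \
    ../src/Alignment.cpp \
    ../src/Breakpoint.cpp \
    ../src/sophia.cpp
```

Keeping the source list in one place would also let the link step derive its object list from the same data, so the two lists cannot drift apart.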
29 changes: 25 additions & 4 deletions Release_sophiaAnnotate/build-sophiaAnnotate.sh
@@ -13,7 +13,7 @@ install_strtk

CPP=x86_64-conda_cos6-linux-gnu-g++
INCLUDES="-I../include -I$CONDA_PREFIX/include"
CPP_OPTS="-L$CONDA_PREFIX/lib -std=c++1z $INCLUDES -O3 -Wall -Wextra -static -static-libgcc -static-libstdc++ -flto -c -fmessage-length=0 -Wno-attributes"
CPP_OPTS="-L$CONDA_PREFIX/lib -std=c++17 $INCLUDES -O3 -Wall -Wextra -static -static-libgcc -static-libstdc++ -flto -c -fmessage-length=0 -Wno-attributes"

if [[ "${STATIC:-false}" == "true" ]]; then
    CPP_OPTS="-static -static-libgcc -static-libstdc++ $CPP_OPTS"
@@ -22,7 +22,10 @@ fi
$CPP $CPP_OPTS -o "AnnotationProcessor.o" "../src/AnnotationProcessor.cpp"
$CPP $CPP_OPTS -o "Breakpoint.o" "../src/Breakpoint.cpp"
$CPP $CPP_OPTS -o "BreakpointReduced.o" "../src/BreakpointReduced.cpp"
$CPP $CPP_OPTS -o "GlobalAppConfig.o" "../src/GlobalAppConfig.cpp"
$CPP $CPP_OPTS -o "ChrConverter.o" "../src/ChrConverter.cpp"
$CPP $CPP_OPTS -o "Hg37ChrConverter.o" "../src/Hg37ChrConverter.cpp"
$CPP $CPP_OPTS -o "Hg38ChrConverter.o" "../src/Hg38ChrConverter.cpp"
$CPP $CPP_OPTS -o "DeFuzzier.o" "../src/DeFuzzier.cpp"
$CPP $CPP_OPTS -o "GermlineMatch.o" "../src/GermlineMatch.cpp"
$CPP $CPP_OPTS -o "MrefEntry.o" "../src/MrefEntry.cpp"
@@ -32,6 +35,24 @@ $CPP $CPP_OPTS -o "SuppAlignment.o" "../src/SuppAlignment.cpp"
$CPP $CPP_OPTS -o "SuppAlignmentAnno.o" "../src/SuppAlignmentAnno.cpp"
$CPP $CPP_OPTS -o "SvEvent.o" "../src/SvEvent.cpp"
$CPP $CPP_OPTS -o "HelperFunctions.o" "../src/HelperFunctions.cpp"
$CPP $CPP_OPTS -o "sophiaAnnotate.o" "../sophiaAnnotate.cpp"

$CPP -L$CONDA_PREFIX/lib -flto -o "sophiaAnnotate" AnnotationProcessor.o Breakpoint.o BreakpointReduced.o ChrConverter.o DeFuzzier.o GermlineMatch.o MrefEntry.o MrefEntryAnno.o MrefMatch.o SuppAlignment.o SuppAlignmentAnno.o SvEvent.o HelperFunctions.o sophiaAnnotate.o -lz -lboost_system -lboost_iostreams
$CPP $CPP_OPTS -o "sophiaAnnotate.o" "../src/sophiaAnnotate.cpp"

$CPP -L$CONDA_PREFIX/lib -flto -o "sophiaAnnotate" \
    AnnotationProcessor.o \
    Breakpoint.o \
    BreakpointReduced.o \
    ChrConverter.o \
    Hg37ChrConverter.o \
    Hg38ChrConverter.o \
    DeFuzzier.o \
    GermlineMatch.o \
    MrefEntry.o \
    MrefEntryAnno.o \
    MrefMatch.o \
    SuppAlignment.o \
    SuppAlignmentAnno.o \
    SvEvent.o \
    HelperFunctions.o \
    GlobalAppConfig.o \
    sophiaAnnotate.o \
    -lz -lboost_system -lboost_iostreams
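Each build script maintains two lists by hand: the objects it compiles and the objects it links. A small consistency check, sketched below, could flag an object that is compiled but missing from the link line. The helper `missing_objects` is hypothetical and not part of the PR, and the two example lists are shortened for illustration.

```bash
#!/bin/bash
set -ue

# Hypothetical helper (not in the PR): print every object that appears in the
# first list but not in the second. Both lists are space-separated strings.
missing_objects() {
    local compiled="$1"
    local linked="$2"
    local object
    # $compiled is intentionally unquoted so that it splits into words.
    for object in $compiled; do
        case " $linked " in
            *" $object "*) ;;          # object is linked, nothing to report
            *) echo "$object" ;;       # compiled but never linked
        esac
    done
}

# Shortened example lists: GlobalAppConfig.o is compiled but not linked.
compiled="ChrConverter.o Hg37ChrConverter.o Hg38ChrConverter.o GlobalAppConfig.o"
linked="ChrConverter.o Hg37ChrConverter.o Hg38ChrConverter.o"

missing_objects "$compiled" "$linked"
```

For the example lists the function reports `GlobalAppConfig.o`. An empty result means every compiled object also appears in the link list.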
56 changes: 56 additions & 0 deletions Release_sophiaMref/build-sophiaMref.sh
@@ -0,0 +1,56 @@
#!/bin/bash

set -uex
trap 'echo "Compilation failed with an error" >> /dev/stderr' ERR

CONDA_PREFIX="${CONDA_PREFIX:?No CONDA_PREFIX -- no active Conda environment}"

install_strtk() {
    wget -c https://github.com/ArashPartow/strtk/raw/master/strtk.hpp -O ../include/strtk.hpp
}

install_strtk

CPP=x86_64-conda_cos6-linux-gnu-g++
INCLUDES="-I../include -I$CONDA_PREFIX/include"
CPP_OPTS="-L$CONDA_PREFIX/lib -std=c++17 $INCLUDES -O3 -Wall -Wextra -static -static-libgcc -static-libstdc++ -flto -c -fmessage-length=0 -Wno-attributes"

if [[ "${STATIC:-false}" == "true" ]]; then
    CPP_OPTS="-static -static-libgcc -static-libstdc++ $CPP_OPTS"
fi

$CPP $CPP_OPTS -o "GlobalAppConfig.o" "../src/GlobalAppConfig.cpp"
$CPP $CPP_OPTS -o "ChrConverter.o" "../src/ChrConverter.cpp"
$CPP $CPP_OPTS -o "Hg37ChrConverter.o" "../src/Hg37ChrConverter.cpp"
$CPP $CPP_OPTS -o "Hg38ChrConverter.o" "../src/Hg38ChrConverter.cpp"
$CPP $CPP_OPTS -o "HelperFunctions.o" "../src/HelperFunctions.cpp"
$CPP $CPP_OPTS -o "sophiaMref.o" "../src/sophiaMref.cpp"
$CPP $CPP_OPTS -o "SuppAlignment.o" "../src/SuppAlignment.cpp"
$CPP $CPP_OPTS -o "SuppAlignmentAnno.o" "../src/SuppAlignmentAnno.cpp"
$CPP $CPP_OPTS -o "MrefEntry.o" "../src/MrefEntry.cpp"
$CPP $CPP_OPTS -o "MrefEntryAnno.o" "../src/MrefEntryAnno.cpp"
$CPP $CPP_OPTS -o "MrefMatch.o" "../src/MrefMatch.cpp"
$CPP $CPP_OPTS -o "MasterRefProcessor.o" "../src/MasterRefProcessor.cpp"
$CPP $CPP_OPTS -o "Breakpoint.o" "../src/Breakpoint.cpp"
$CPP $CPP_OPTS -o "BreakpointReduced.o" "../src/BreakpointReduced.cpp"
$CPP $CPP_OPTS -o "GermlineMatch.o" "../src/GermlineMatch.cpp"
$CPP $CPP_OPTS -o "DeFuzzier.o" "../src/DeFuzzier.cpp"

$CPP -L$CONDA_PREFIX/lib -flto -o "sophiaMref" \
    GlobalAppConfig.o \
    ChrConverter.o \
    Hg37ChrConverter.o \
    Hg38ChrConverter.o \
    HelperFunctions.o \
    SuppAlignment.o \
    SuppAlignmentAnno.o \
    MrefEntry.o \
    MrefEntryAnno.o \
    MrefMatch.o \
    MasterRefProcessor.o \
    Breakpoint.o \
    BreakpointReduced.o \
    GermlineMatch.o \
    DeFuzzier.o \
    sophiaMref.o \
    -lz -lboost_system -lboost_iostreams -lboost_program_options
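The `STATIC` switch in this script prepends the static flags to `CPP_OPTS` when `STATIC=true`. The sketch below isolates that toggle in a function so it can be checked on its own. The helper `compose_cpp_opts` is hypothetical and not part of the PR. Note one deliberate difference: the base options in this sketch omit the static flags, whereas the script above already carries them in the base `CPP_OPTS`, so this illustrates the toggle under that assumption and is not a copy of the script's behaviour.

```bash
#!/bin/bash
set -ue

# Hypothetical helper (not in the PR): build the compiler option string.
# The static flags are prepended only when the first argument is "true".
compose_cpp_opts() {
    local static_build="${1:-false}"
    local opts="-std=c++17 -O3 -Wall -Wextra -flto -c"
    if [[ "$static_build" == "true" ]]; then
        opts="-static -static-libgcc -static-libstdc++ $opts"
    fi
    echo "$opts"
}

# Same default as the build scripts: dynamic unless STATIC=true is set.
compose_cpp_opts "${STATIC:-false}"
```

Calling `compose_cpp_opts true` yields the static flags followed by the base options, while calling it with no argument or with `false` yields the base options alone.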