Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some kind of breakend enhancement #64

Open
julesjacobsen opened this issue Feb 1, 2022 · 3 comments
Open

Some kind of breakend enhancement #64

julesjacobsen opened this issue Feb 1, 2022 · 3 comments

Comments

@julesjacobsen
Copy link
Contributor

Found some TODOs but can't quite fully grok them right now.

    @ParameterizedTest
    @CsvSource({
            //#CHROM  POS ID   REF  ALT         MATEID EVENTID
            "2,  321681, bndW, G, G]17:198982], bndY, event1",
            "2,  321682, bndV, T, ]13:123456]T, bndU, event2",
            "13, 123456, bndU, C, C[2:321682[,  bndV, event2",
            "13, 123457, bndX, A, [17:198983[A, bndZ, event3",
            "17, 198982, bndY, A, A]2:321681],  bndW, event1",
            "17, 198983, bndZ, C, [13:123457[C, bndX, event3",
            "17, 198983, bndZ, C, C., '', event4",
            "17, 198983, bndZ, C, .C, '', event5",
    })
    public void roundTrip(String chr, int pos, String id, String ref, String alt, String mateId, String eventId) {
        GenomicBreakendVariant variant = vcfConverter.convertBreakend(vcfConverter.parseContig(chr), id, pos, ConfidenceInterval.precise(), ref, alt, ConfidenceInterval.precise(), mateId, eventId);
        // TODO: Want to preserve input CHROM, POS, REF, ALT
        assertThat(variant.coordinateSystem(), equalTo(CoordinateSystem.oneBased()));
        assertThat(variant.strand(), equalTo(Strand.POSITIVE));
        assertThat(variant.start(), equalTo(pos));
        assertThat(variant.end(), equalTo(ref));
        assertThat(variant.ref(), equalTo(ref));
        assertThat(variant.alt(), equalTo(""));
        assertThat(variant.changeLength(), equalTo(0));
        assertThat(variant.isBreakend(), equalTo(true));

        // TODO: BUT when cast to a BreakendVariant it will allow access to breakendRef(), breakendAlt(), left(), right() 
        GenomicBreakendVariant breakendVariant = (GenomicBreakendVariant) variant;
        assertThat(breakendVariant.coordinateSystem(), equalTo(ref));
        assertThat(breakendVariant.strand(), equalTo(ref));
        assertThat(breakendVariant.start(), equalTo(ref));
        assertThat(breakendVariant.end(), equalTo(ref));
        assertThat(breakendVariant.ref(), equalTo(ref));
        assertThat(breakendVariant.alt(), equalTo(ref));
        assertThat(breakendVariant.breakendRef(), equalTo(ref));
        assertThat(breakendVariant.breakendAlt(), equalTo(alt));
        assertThat(breakendVariant.isBreakend(), equalTo(true));
        assertThat(breakendVariant.changeLength(), equalTo(0));
        assertThat(breakendVariant.left(), equalTo(left));
        assertThat(breakendVariant.right(), equalTo(right));

        assertThat(VcfBreakendFormatter.makeAltVcfField(variant), equalTo(alt));
    }
julesjacobsen added a commit that referenced this issue Mar 28, 2023
…so include breakend alt alleles.

Added new GenomicVariant.Builder class for more convenient builder-based construction
Added new GenomicVariant.mateId and GenomicVariant.eventId variables
Added new simpler GenomicVariant static constructors not requiring an id field.
@julesjacobsen
Copy link
Contributor Author

Relaxed the requirements for creating breakends in the same way as other symbolic alleles. If the GenomicBreakend API is required, then the symbolic version can be converted relatively easily:

// Breakends can be treated as a symbolic variant
GenomicVariant bnd = vcfConverter.convertSymbolic(vcfConverter.parseContig("1"), "bnd_U", 12345, 12345, "C", "C[2:321682[", 0, "bnd_V", "tra2");
assertInstanceOf(GenomicVariant.class, bnd);
// or they can be converted to a specialised breakend variant
GenomicVariant bndb = vcfConverter.convertBreakend(vcfConverter.parseContig("1"), "bnd_U", 12345, ConfidenceInterval.precise(), "C", "C[2:321682[", ConfidenceInterval.precise(), "bnd_V", "tra2");
assertInstanceOf(GenomicBreakendVariant.class, bndb);

VcfBreakendResolver vcfBreakendResolver = new VcfBreakendResolver(b37);
GenomicBreakendVariant breakendVariant = vcfBreakendResolver.resolveBreakend(bnd);
assertThat(breakendVariant, equalTo(bndb));
assertThat(bnd.contig(), equalTo(breakendVariant.contig()));
assertThat(bnd.length(), equalTo(breakendVariant.length()));
assertThat(bnd.ref(), equalTo(breakendVariant.ref()));
// C[2:321682[
String breakendAltValue = VcfBreakendFormatter.makeAltVcfField(breakendVariant);
assertThat(bnd.alt(), equalTo(breakendAltValue));
assertThat(bnd.mateId(), equalTo(breakendVariant.mateId()));
assertThat(bnd.eventId(), equalTo(breakendVariant.eventId()));

@julesjacobsen
Copy link
Contributor Author

julesjacobsen commented Jun 24, 2023

Breakend and Symbolic versions of breakend are not fully symmetrical.

i.e. a GenomicBreakendVariant alt allele is always empty but the GenomicVariant version contains the VCF ALT allele representation, e.g.

// Breakends can be treated as a symbolic variant
GenomicVariant bnd = vcfConverter.convertSymbolic(vcfConverter.parseContig("1"), "bnd_U", 12345, 12345, "C", "C[2:321682[", 0, "bnd_V", "tra2");
assertInstanceOf(GenomicVariant.class, bnd);
// or they can be converted to a specialised breakend variant
GenomicVariant bndb = vcfConverter.convertBreakend(vcfConverter.parseContig("1"), "bnd_U", 12345, ConfidenceInterval.precise(), "C", "C[2:321682[", ConfidenceInterval.precise(), "bnd_V", "tra2");
assertInstanceOf(GenomicBreakendVariant.class, bndb);

// C[2:321682[
String breakendAltValue = VcfBreakendFormatter.makeAltVcfField(breakendVariant);
assertThat(bnd.alt(), equalTo(breakendAltValue));
assertThat(bnd.alt(), equalTo("C[2:321682["))
assertThat(bndb.alt(), equalTo("C[2:321682[")) // FAIL!! bndb.alt() is empty, in this case - high WTF score here.

Having GenomicBreakendVariant.alt() always return empty means the VCF ALT information is lost when doing a naive copy of a GenomicBreakendVariant to a GenomicVariant (GenomicBreakendVariant is a GenomicVariant) and then wanting to revert back to a GenomicBreakendVariant.

A solution to this might be to store the sequence of ref and alt in a new GenomicBreakend.seq field which will be the sequence native to that strand and keep the VCF alt allele untouched on the GenomicBreakendVariant. I think this will give us the flexibility of easily inter-converting from a symbolic to breakend variant and back again without losing the alt allele info during this process and thus reduce the WTF value of the codebase. GenomicBreakendVariants are special, but I think we might have made them a bit too special the first time around.

@julesjacobsen
Copy link
Contributor Author

@ielis How does this sound as a solution to maximise ease of use and hopefully reduce the WTF factor:

// consider the following breakend variant in VCF:
// 1	12345	bnd_U	C	C[2:321682[	6	PASS	SVTYPE=BND;MATEID=bnd_V;EVENT=tra2
// Breakends can be treated as a symbolic variant
GenomicVariant bnd = vcfConverter.convertSymbolic(vcfConverter.parseContig("1"), "bnd_U", 12345, 12345, "C", "C[2:321682[", 0, "bnd_V", "tra2");
// or they can be converted to a specialised breakend variant
GenomicBreakendVariant bndb = vcfConverter.convertBreakend(vcfConverter.parseContig("1"), "bnd_U", 12345, "C", "C[2:321682[", "bnd_V", "tra2");

// note that the input values are the SAME as the returned values. No WTF here
assertThat(bnd.isSymbolic(), equalTo(true));
assertThat(bnd.isBreakend(), equalTo(true));
assertThat(bnd.contig().name(), equalTo("1"));
assertThat(bnd.coordinates(), equalTo(Coordinates.oneBased(12345, 12345)));
assertThat(bnd.ref(), equalTo("C"));
assertThat(bnd.alt(), equalTo("C[2:321682["));

// note that the input values are often DIFFERENT to the returned values. High WTF here.
assertThat(bndb.isSymbolic(), equalTo(true));
assertThat(bndb.isBreakend(), equalTo(true));
assertThat(bndb.contig().name(), equalTo("1"));
assertThat(bndb.coordinates(), equalTo(Coordinates.oneBased(12346, 12345))); // would expect Coordinates.oneBased(12345, 12345)
assertThat(bndb.ref(), equalTo("C")); // if this was on the - strand a G would be returned here.
assertThat(bndb.alt(), equalTo("")); // would expect "C[2:321682["
// To get all these values the VcfBreakendFormatter needs to be used
// C[2:321682[
String breakendAltValue = VcfBreakendFormatter.makeAltVcfField(bndb);
// However the GenomicBreakendVariant can be converted back to a symbolic genomic variant again (see commit [36dc08d844b37e7343a60d08640a82f952fc9bbf](https://github.com/exomiser/svart/commit/36dc08d844b37e7343a60d08640a82f952fc9bbf))
// this uses the VcfBreakendFormatter under the hood
GenomicVariant symbolicBreakend = breakendVariant.toSymbolicGenomicVariant();
assertThat(bnd, equalTo(symbolicBreakend));

Proposal

Retain the current methods and constructor signatures but:

  1. Change the argument names (not types) in GenomicBreakendVariant
  2. Change the return values for current GenomicBreakendVariant coordinates(), start(), ref() and alt() methods to return the VCF / symbolic GenomicVariant equivalents.
  3. Add two new methods to GenomicBreakendVariant - breakendRef() and insertSeq()
  4. Remove the now redundant VcfBreakendFormatter

In detail this would mean:

// Current
/**
 * @param eventId The VCF event ID
 * @param left  The left {@link GenomicBreakend}
 * @param right The right {@link GenomicBreakend}
 * @param ref The REF allele <b>AS IT WOULD BE REPRESENTED ON THE LEFT BREAKEND STRAND</b>
 * @param alt Any inserted bases <b>AS IT WOULD BE REPRESENTED ON THE RIGHT BREAKEND STRAND</b> if any, or empty.
 * @return a {@link GenomicBreakendVariant} composed of the input values.
 */
static GenomicBreakendVariant of(String eventId, GenomicBreakend left, GenomicBreakend right, String ref, String alt) {
    return DefaultGenomicBreakendVariant.of(eventId, left, right, ref, alt);
}

// Proposed
/**
 * @param eventId The VCF event ID
 * @param left  The left {@link GenomicBreakend}
 * @param right The right {@link GenomicBreakend}
 * @param breakendRef The REF allele <b>AS IT WOULD BE REPRESENTED ON THE LEFT BREAKEND STRAND</b>
 * @param insertSeq Any inserted bases <b>AS IT WOULD BE REPRESENTED ON THE RIGHT BREAKEND STRAND</b> if any, or empty.
 * @return a {@link GenomicBreakendVariant} composed of the input values.
 */
static GenomicBreakendVariant of(String eventId, GenomicBreakend left, GenomicBreakend right, String breakendRef, String insertSeq) {
    return DefaultGenomicBreakendVariant.of(eventId, left, right, breakendRef, insertSeq);
}
public interface GenomicBreakendVariant extends GenomicVariant {
// added fields
    /**
     * The value of the ref allele as found on the left breakend strand.
     */
    String breakendRef();

    /**
     * The inserted sequence (if any) between the mated breakends.
     */
    String insertSeq();
}

The values of strand(), coordinates(), ref() and alt() will the calculated from the left(), right(), breakendRef() and insertSeq() to return the symbolic GenomicVariant equivalents.

All of which which means this, in the end:

// consider the following breakend variant in VCF:
// 1	12345	bnd_U	C	C[2:321682[	6	PASS	SVTYPE=BND;MATEID=bnd_V;EVENT=tra2
// Breakends can be treated as a symbolic variant
GenomicVariant bnd = vcfConverter.convertSymbolic(vcfConverter.parseContig("1"), "bnd_U", 12345, 12345, "C", "C[2:321682[", 0, "bnd_V", "tra2");
// or they can be converted to a specialised breakend variant
GenomicBreakendVariant bndb = vcfConverter.convertBreakend(vcfConverter.parseContig("1"), "bnd_U", 12345, "C", "C[2:321682[", "bnd_V", "tra2");

// note that the input values are the SAME as the returned values. No WTF here
assertThat(bnd.isSymbolic(), equalTo(true));
assertThat(bnd.isBreakend(), equalTo(true));
assertThat(bnd.contig().name(), equalTo("1"));
assertThat(bnd.strand(), equalTo(Strand.POSITIVE));
assertThat(bnd.coordinates(), equalTo(Coordinates.oneBased(12345, 12345)));
assertThat(bnd.ref(), equalTo("C"));
assertThat(bnd.alt(), equalTo("C[2:321682["));


// note that the input values are the SAME as the returned values. No WTF here either
assertThat(bndb.isSymbolic(), equalTo(true));
assertThat(bndb.isBreakend(), equalTo(true));
assertThat(bndb.contig().name(), equalTo("1"));
assertThat(bndb.strand(), equalTo(Strand.POSITIVE)); // this could be NEGATIVE 
assertThat(bndb.coordinates(), equalTo(Coordinates.oneBased(12345, 12345))); // same as symbolic version
assertThat(bndb.ref(), equalTo("C")); // same as symbolic version
assertThat(bndb.alt(), equalTo("C[2:321682[")); // same as symbolic version
assertThat(bndb.breakendRef(), equalTo("C")); // if this was on the - strand a G would be returned here.
assertThat(bndb.insertSeq(), equalTo(""));

I think that this will really minimise the differences between symbolic and breakend variants yet still allow applications such as Svanna to utilise the GenomicBreakendVariant in an accurate manner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant