Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incircle performance may be able to be faster... #20

Closed
dgleich opened this issue Jul 30, 2024 · 13 comments · Fixed by #21
Closed

incircle performance may be able to be faster... #20

dgleich opened this issue Jul 30, 2024 · 13 comments · Fixed by #21

Comments

@dgleich
Copy link
Contributor

dgleich commented Jul 30, 2024

Just a quick note that the Delaunator port of the incircle routine seems to be consistently faster. It's using ntuples vs. the cached values whenever the size is <= 32, which may have some speed advantage (e.g. it can use vector registers... more efficiently?)

On my machine, I get:

julia> @btime AdaptivePredicates.incircle((1.0+eps(1.0),eps(1.0)),(eps(1.0)*1.33,eps(1.0)*-1.33),(eps(1.0),1.0-eps(1.0)),(1.0,0))
  1.266 μs (0 allocations: 0 bytes)

julia> @btime Delaunator.incircle((1.0+eps(1.0),eps(1.0)),(eps(1.0)*1.33,eps(1.0)*-1.33),(eps(1.0),1.0-eps(1.0)),(1.0,0))
  706.415 ns (0 allocations: 0 bytes)

I don't think this is due to the difference in the cache as this case doesn't hit the cache ... just wanted to mention it in case you wanted to move any of that implementation back here :)

@DanielVandH
Copy link
Member

DanielVandH commented Jul 30, 2024

Is it easy for you to see what happens to the timings if Delaunator.jl uses your implementation vs AdaptivePredicates.jl's? It might be easier to see what the difference is with a larger set of calls to incircle that way. Comparing point sets in general position vs. with lots of collinearity would be of interest too, since perhaps retrying the same computation becomes inefficient in the latter case.

Most expansions have less than 32 coefficients (from what I can recall, I think most get to around 10..), so I suspect that your approach is worth the effort anyway. But seeing it in action would be nice too.

@dgleich
Copy link
Contributor Author

dgleich commented Jul 30, 2024

My default test case didn't hit the adaptive parts of the test, so it wouldn't show any difference in this portion. Will try and see if any of the other ones do.

@dgleich
Copy link
Contributor Author

dgleich commented Jul 31, 2024

Okay, on the two robustness tests in the Delaunator library, there is a noticeable difference:

julia> @btime triangulate($robustness2);
  627.470 μs (32 allocations: 167.91 KiB)

julia> @btime triangulate($robustness2); # with AdaptivePredicates.incircle 
  781.256 μs (32 allocations: 167.91 KiB)

julia> @btime triangulate($robustness1);
  151.950 μs (29 allocations: 44.83 KiB)

julia> @btime triangulate($robustness1);  # with AdaptivePredicates.incircle 
  220.399 μs (29 allocations: 44.83 KiB)

@DanielVandH
Copy link
Member

Interesting. Thanks for that. When I have time I will try and implement your approach

@vchuravy
Copy link
Collaborator

vchuravy commented Aug 2, 2024

To give a bit of context.

Originally I implemented this using StaticArrays MVector. MVector closely matches what C does originally with the condition that "all uses of the MVector must be non-escaping e.g. inlined". As far as I know we really want these arrays to be stack allocated and so the caching approach seems backwards to me.

In #1 @dgleich removed the StaticArrays dependency replacing MVector with Tuple and using the copying setindex version. Generally this is fine as long as the tuples remain smallish.
Given https://github.com/JuliaGeometry/Delaunator.jl/blob/f1ba75a2a0686a4ce2d48d876675819032ece1b0/src/robust.jl#L514-L516

I do wonder if the MArray version would have been able to scale beyond 32 elements. MArrays require a bit of care while coding, but since this code is in the hot loop that care seems warranted.

Lastly the TLS caching approach seems to complicated for it's own good. It is probably better and faster to just hit the allocator like @dgleich code does. Especially if you are doing an undef alloc and not a zeroed out alloc.

@DanielVandH
Copy link
Member

DanielVandH commented Aug 2, 2024

I do wonder if the MArray version would have been able to scale beyond 32 elements.

Can the code handle the creation of an MArray with thousands of elements? I've never used an MVector so I'm not familiar.

"all uses of the MVector must be non-escaping e.g. inlined"

I don't know how this would be guaranteed in general.

Lastly the TLS caching approach seems to complicated for it's own good. It is probably better and faster to just hit the allocator like @dgleich code does. Especially if you are doing an undef alloc and not a zeroed out alloc.

I'm happy to look into it again, but when I benchmarked it while implementing it caching was fine.

@vchuravy
Copy link
Collaborator

vchuravy commented Aug 2, 2024

I don't know how this would be guaranteed in general.

The simple pattern I use is: Allocate within a function and if I return a MVector return an SArray of it instead.

@DanielVandH
Copy link
Member

You mean e.g. the functions where a cached h is provided, provide a Val(N) (or maybe just an N), do h = MVector{...}(...) and then at the end return SVector(h) and make use of that?

@DanielVandH
Copy link
Member

I think either way caching is fine from what I have benchmarked, I don't see the complications. It's not really relevant to the idea of trying to do the most with 32-length Tuples before trying to go further, which is all I intend to do for now. A separate issue for that can be made if desired.

@vchuravy
Copy link
Collaborator

vchuravy commented Aug 2, 2024

I don't see the complications.

Looking at I wonder why we have struct that now contain both NTuples and Vec.
https://github.com/JuliaGeometry/AdaptivePredicates.jl/blob/main/src/caches.jl

The code is less clear and the compiler has to do more work to first get rid of Orient2Cache and then the ntuple.

I think the strategy of going to a Vector when we are past 32 or 64 elements in an ntuple is fine.
But right now we are paying the dict lookup cost on every call to Orient3Cache 12 times + 1 TLS lookup.

So for me the complications are that the code has gotten more complex through the introduction of a cache,
instead of less so. (I tried intentionally to keep it very similar to the C version).

I would simply allocate an array at those sizes and not do any form of caching. It ought to be the "rare" case and the array allocation is fast on Julia 1.11 Vector{Float64}(undef, N) for N <= 223 takes < 14ns; After that the cost grows a bit and levels out around 50ns. A TLS access is ~14ns a dictionary lookup is around the same. So for me the complexity cost is not worth it and I rather keep the code closer to the C version.

@DanielVandH
Copy link
Member

DanielVandH commented Aug 2, 2024

The code is still very similar to the C version. I'll think about your approach again, then, but I don't think the performance will change drastically if at all - the adaptive methods are only even needed rarely.

My plans for now will be to just try a vector for > 32 elements, and later I can consider this and see what happens to the implementation.

If you want to make a PR rather than wait for me to get to it that's fine too. For me it's really only worth considering if it does actually make a difference inside DelaunayTriangulation.jl or Delaunator.jl (or any other algorithms making heavy use of this).

@dgleich
Copy link
Contributor Author

dgleich commented Aug 2, 2024

Since I still had the code up ... using MVectors instead of the any type of caching...

julia> @btime Delaunator.incircle((1.0+eps(1.0),eps(1.0)),(eps(1.0)*1.33,eps(1.0)*-1.33),(eps(1.0),1.0-eps(1.0)),(1.0,0))
  643.245 ns (1 allocation: 9.06 KiB)
-1.9721522630525293e-31

julia> @btime triangulate($robustness1);
  149.919 μs (143 allocations: 1.05 MiB)

julia> @btime triangulate($robustness2);
  620.479 μs (318 allocations: 2.70 MiB)

This uses ntuples for <= 32 and MVector >= 48 ...

Just for completeness, here's the routine.

function _incircleadapt_round1(pa, pb, pc, pd, permanent, cache=nothing)
  @assert(cache === nothing)
  T = Float64
  @inbounds begin
    adx = pa[1] - pd[1]
    bdx = pb[1] - pd[1]
    cdx = pc[1] - pd[1]
    ady = pa[2] - pd[2]
    bdy = pb[2] - pd[2]
    cdy = pc[2] - pd[2]

    bdxcdy1, bdxcdy0 = Two_Product(bdx, cdy)
    cdxbdy1, cdxbdy0 = Two_Product(cdx, bdy)
    bc3, bc2, bc1, bc0 = Two_Two_Diff(bdxcdy1, bdxcdy0, cdxbdy1, cdxbdy0)
    bc = (bc0, bc1, bc2, bc3)
    axbc, axbclen = scale_expansion_zeroelim(4, bc, adx, Val(8))
    axxbc, axxbclen = scale_expansion_zeroelim(axbclen, axbc, adx, Val(16))
    aybc, aybclen = scale_expansion_zeroelim(4, bc, ady, Val(8))
    ayybc, ayybclen = scale_expansion_zeroelim(aybclen, aybc, ady, Val(16))
    adet, alen = fast_expansion_sum_zeroelim(axxbclen, axxbc, ayybclen, ayybc, Val(32))

    cdxady1, cdxady0 = Two_Product(cdx, ady)
    adxcdy1, adxcdy0 = Two_Product(adx, cdy)
    ca3, ca2, ca1, ca0 = Two_Two_Diff(cdxady1, cdxady0, adxcdy1, adxcdy0)
    ca = (ca0, ca1, ca2, ca3)
    bxca, bxcalen = scale_expansion_zeroelim(4, ca, bdx, Val(8))
    bxxca, bxxcalen = scale_expansion_zeroelim(bxcalen, bxca, bdx, Val(16))
    byca, bycalen = scale_expansion_zeroelim(4, ca, bdy, Val(8))
    byyca, byycalen = scale_expansion_zeroelim(bycalen, byca, bdy, Val(16))
    bdet, blen = fast_expansion_sum_zeroelim(bxxcalen, bxxca, byycalen, byyca, Val(32))

    adxbdy1, adxbdy0 = Two_Product(adx, bdy)
    bdxady1, bdxady0 = Two_Product(bdx, ady)
    ab3, ab2, ab1, ab0 = Two_Two_Diff(adxbdy1, adxbdy0, bdxady1, bdxady0)
    ab = (ab0, ab1, ab2, ab3)
    cxab, cxablen = scale_expansion_zeroelim(4, ab, cdx, Val(8))
    cxxab, cxxablen = scale_expansion_zeroelim(cxablen, cxab, cdx, Val(16))
    cyab, cyablen = scale_expansion_zeroelim(4, ab, cdy, Val(8))
    cyyab, cyyablen = scale_expansion_zeroelim(cyablen, cyab, cdy, Val(16))
    cdet, clen = fast_expansion_sum_zeroelim(cxxablen, cxxab, cyyablen, cyyab, Val(32))

    # Julia can handle up to Val(32) without any allocations or weird behavior
    # so let's try and check if we can solve this within Val(32)... 
    # if not, then we will fall back to the full expansion... 
    abdet, ablen = fast_expansion_sum_zeroelim(alen, adet, blen, bdet, MVector{64,Float64}(undef))
    fin1, finlength = fast_expansion_sum_zeroelim(ablen, abdet, clen, cdet, MVector{1152,Float64}(undef))

    # recheck termination condition...
    det = estimate(finlength, fin1)
    errbound = iccerrboundB * permanent
    if (det ≥ errbound) || (-det ≥ errbound)
        return det
    end

    adxtail = Two_Diff_Tail(pa[1], pd[1], adx)
    adytail = Two_Diff_Tail(pa[2], pd[2], ady)
    bdxtail = Two_Diff_Tail(pb[1], pd[1], bdx)
    bdytail = Two_Diff_Tail(pb[2], pd[2], bdy)
    cdxtail = Two_Diff_Tail(pc[1], pd[1], cdx)
    cdytail = Two_Diff_Tail(pc[2], pd[2], cdy)

    if iszero(adxtail) && iszero(bdxtail) && iszero(cdxtail) &&
      iszero(adytail) && iszero(bdytail) && iszero(cdytail)
        return det
    end

    errbound = iccerrboundC * permanent + resulterrbound * Absolute(det)
    detadd = ((adx * adx + ady * ady) * ((bdx * cdytail + cdy * bdxtail) -
                                        (bdy * cdxtail + cdx * bdytail)) +
              2.0 * (adx * adxtail + ady * adytail) * (bdx * cdy - bdy * cdx)) +
            ((bdx * bdx + bdy * bdy) * ((cdx * adytail + ady * cdxtail) -
                                        (cdy * adxtail + adx * cdytail)) +
              2.0 * (bdx * bdxtail + bdy * bdytail) * (cdx * ady - cdy * adx)) +
            ((cdx * cdx + cdy * cdy) * ((adx * bdytail + bdy * adxtail) -
                                        (ady * bdxtail + bdx * adytail)) +
              2.0 * (cdx * cdxtail + cdy * cdytail) * (adx * bdy - ady * bdx))
    det = T(det + detadd) # Had to change this to match how C handles the 2.0 multiplication with Float32
    if (det ≥ errbound) || (-det ≥ errbound)
        return det
    end

    finnow = fin1
    finother = MVector{1152,Float64}(undef)
    h48 = MVector{48,Float64}(undef)
    h64 = MVector{64,Float64}(undef)

    if !iszero(bdxtail) || !iszero(bdytail) || !iszero(cdxtail) || !iszero(cdytail)
        adxadx1, adxadx0 = Square(adx)
        adyady1, adyady0 = Square(ady)
        aa3, aa2, aa1, aa0 = Two_Two_Sum(adxadx1, adxadx0, adyady1, adyady0)
        aa = (aa0, aa1, aa2, aa3)
    end
    if !iszero(cdxtail) || !iszero(cdytail) || !iszero(adxtail) || !iszero(adytail)
        bdxbdx1, bdxbdx0 = Square(bdx)
        bdybdy1, bdybdy0 = Square(bdy)
        bb3, bb2, bb1, bb0 = Two_Two_Sum(bdxbdx1, bdxbdx0, bdybdy1, bdybdy0)
        bb = (bb0, bb1, bb2, bb3)
    end
    if !iszero(adxtail) || !iszero(adytail) || !iszero(bdxtail) || !iszero(bdytail)
        cdxcdx1, cdxcdx0 = Square(cdx)
        cdycdy1, cdycdy0 = Square(cdy)
        cc3, cc2, cc1, cc0 = Two_Two_Sum(cdxcdx1, cdxcdx0, cdycdy1, cdycdy0)
        cc = (cc0, cc1, cc2, cc3)
    end

    if !iszero(adxtail)
        axtbc, axtbclen = scale_expansion_zeroelim(4, bc, adxtail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(axtbclen, axtbc, 2adx, Val(16))

        axtcc, axtcclen = scale_expansion_zeroelim(4, cc, adxtail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(axtcclen, axtcc, bdy, Val(16))

        axtbb, axtbblen = scale_expansion_zeroelim(4, bb, adxtail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(axtbblen, axtbb, -cdy, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(adytail)
        aytbc, aytbclen = scale_expansion_zeroelim(4, bc, adytail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(aytbclen, aytbc, 2 * ady, Val(16))

        aytbb, aytbblen = scale_expansion_zeroelim(4, bb, adytail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(aytbblen, aytbb, cdx, Val(16))

        aytcc, aytcclen = scale_expansion_zeroelim(4, cc, adytail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(aytcclen, aytcc, -bdx, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(bdxtail)
        bxtca, bxtcalen = scale_expansion_zeroelim(4, ca, bdxtail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(bxtcalen, bxtca, 2 * bdx, Val(16))

        bxtaa, bxtaalen = scale_expansion_zeroelim(4, aa, bdxtail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(bxtaalen, bxtaa, cdy, Val(16))

        bxtcc, bxtcclen = scale_expansion_zeroelim(4, cc, bdxtail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(bxtcclen, bxtcc, -ady, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(bdytail)
        bytca, bytcalen = scale_expansion_zeroelim(4, ca, bdytail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(bytcalen, bytca, 2 * bdy, Val(16))

        bytcc, bytcclen = scale_expansion_zeroelim(4, cc, bdytail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(bytcclen, bytcc, adx, Val(16))

        bytaa, bytaalen = scale_expansion_zeroelim(4, aa, bdytail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(bytaalen, bytaa, -cdx, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(cdxtail)
        cxtab, cxtablen = scale_expansion_zeroelim(4, ab, cdxtail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(cxtablen, cxtab, 2 * cdx, Val(16))

        cxtbb, cxtbblen = scale_expansion_zeroelim(4, bb, cdxtail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(cxtbblen, cxtbb, ady, Val(16))

        cxtaa, cxtaalen = scale_expansion_zeroelim(4, aa, cdxtail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(cxtaalen, cxtaa, -bdy, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(cdytail)
        cytab, cytablen = scale_expansion_zeroelim(4, ab, cdytail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(cytablen, cytab, 2 * cdy, Val(16))

        cytaa, cytaalen = scale_expansion_zeroelim(4, aa, cdytail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(cytaalen, cytaa, bdx, Val(16))

        cytbb, cytbblen = scale_expansion_zeroelim(4, bb, cdytail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(cytbblen, cytbb, -adx, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end

    if !iszero(adxtail) || !iszero(adytail)
        if !iszero(bdxtail) || !iszero(bdytail) || !iszero(cdxtail) || !iszero(cdytail)
            ti1, ti0 = Two_Product(bdxtail, cdy)
            tj1, tj0 = Two_Product(bdx, cdytail)
            u3, u2, u1, u0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            u = (u0, u1, u2, u3)
            negate = -bdy
            ti1, ti0 = Two_Product(cdxtail, negate)
            negate = -bdytail
            tj1, tj0 = Two_Product(cdx, negate)
            v3, v2, v1, v0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            v = (v0, v1, v2, v3)
            bct, bctlen = fast_expansion_sum_zeroelim(4, u, 4, v, Val(8))

            ti1, ti0 = Two_Product(bdxtail, cdytail)
            tj1, tj0 = Two_Product(cdxtail, bdytail)
            bctt3, bctt2, bctt1, bctt0 = Two_Two_Diff(ti1, ti0, tj1, tj0)
            bctt = (bctt0, bctt1, bctt2, bctt3)
            bcttlen = 4
        else
            bct = (zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T))
            bctlen = 1
            bctt = (zero(T), zero(T), zero(T), zero(T))
            bcttlen = 1
        end

        if !iszero(adxtail)
            temp16a, temp16alen = scale_expansion_zeroelim(axtbclen, axtbc, adxtail, Val(16))
            axtbct, axtbctlen = scale_expansion_zeroelim(bctlen, bct, adxtail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(axtbctlen, axtbct, 2adx, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow
            if !iszero(bdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, cc, adxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, bdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end
            if !iszero(cdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, bb, -adxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, cdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end

            temp32a, temp32alen = scale_expansion_zeroelim(axtbctlen, axtbct, adxtail, Val(32))
            axtbctt, axtbcttlen = scale_expansion_zeroelim(bcttlen, bctt, adxtail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(axtbcttlen, axtbctt, 2adx, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(axtbcttlen, axtbctt, adxtail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
        if !iszero(adytail)
            temp16a, temp16alen = scale_expansion_zeroelim(aytbclen, aytbc, adytail, Val(16))
            aytbct, aytbctlen = scale_expansion_zeroelim(bctlen, bct, adytail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(aytbctlen, aytbct, 2ady, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow

            temp32a, temp32alen = scale_expansion_zeroelim(aytbctlen, aytbct, adytail, Val(32))
            aytbctt, aytbcttlen = scale_expansion_zeroelim(bcttlen, bctt, adytail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(aytbcttlen, aytbctt, 2ady, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(aytbcttlen, aytbctt, adytail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
    end

    if !iszero(bdxtail) || !iszero(bdytail)
        if !iszero(cdxtail) || !iszero(cdytail) || !iszero(adxtail) || !iszero(adytail)
            ti1, ti0 = Two_Product(cdxtail, ady)
            tj1, tj0 = Two_Product(cdx, adytail)
            u3, u2, u1, u0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            u = (u0, u1, u2, u3)
            negate = -cdy
            ti1, ti0 = Two_Product(adxtail, negate)
            negate = -cdytail
            tj1, tj0 = Two_Product(adx, negate)
            v3, v2, v1, v0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            v = (v0, v1, v2, v3)
            cat, catlen = fast_expansion_sum_zeroelim(4, u, 4, v, Val(8))

            ti1, ti0 = Two_Product(cdxtail, adytail)
            tj1, tj0 = Two_Product(adxtail, cdytail)
            catt3, catt2, catt1, catt0 = Two_Two_Diff(ti1, ti0, tj1, tj0)
            catt = (catt0, catt1, catt2, catt3)
            cattlen = 4
        else
            cat = (zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T))
            catlen = 1
            catt = (zero(T), zero(T), zero(T), zero(T))
            cattlen = 1
        end

        if !iszero(bdxtail)
            temp16a, temp16alen = scale_expansion_zeroelim(bxtcalen, bxtca, bdxtail, Val(16))
            bxtcat, bxtcatlen = scale_expansion_zeroelim(catlen, cat, bdxtail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(bxtcatlen, bxtcat, 2bdx, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow
            if !iszero(cdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, aa, bdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, cdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end
            if !iszero(adytail)
                temp8, temp8len = scale_expansion_zeroelim(4, cc, -bdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, adytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end

            temp32a, temp32alen = scale_expansion_zeroelim(bxtcatlen, bxtcat, bdxtail, Val(32))
            bxtcatt, bxtcattlen = scale_expansion_zeroelim(cattlen, catt, bdxtail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(bxtcattlen, bxtcatt, 2bdx, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(bxtcattlen, bxtcatt, bdxtail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
        if !iszero(bdytail)
            temp16a, temp16alen = scale_expansion_zeroelim(bytcalen, bytca, bdytail, Val(16))
            bytcat, bytcatlen = scale_expansion_zeroelim(catlen, cat, bdytail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(bytcatlen, bytcat, 2bdy, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow

            temp32a, temp32alen = scale_expansion_zeroelim(bytcatlen, bytcat, bdytail, Val(32))
            bytcatt, bytcattlen = scale_expansion_zeroelim(cattlen, catt, bdytail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(bytcattlen, bytcatt, 2bdy, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(bytcattlen, bytcatt, bdytail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
    end

    if !iszero(cdxtail) || !iszero(cdytail)
        if !iszero(adxtail) || !iszero(adytail) || !iszero(bdxtail) || !iszero(bdytail)
            ti1, ti0 = Two_Product(adxtail, bdy)
            tj1, tj0 = Two_Product(adx, bdytail)
            u3, u2, u1, u0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            u = (u0, u1, u2, u3)
            negate = -ady
            ti1, ti0 = Two_Product(bdxtail, negate)
            negate = -adytail
            tj1, tj0 = Two_Product(bdx, negate)
            v3, v2, v1, v0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            v = (v0, v1, v2, v3)
            abt, abtlen = fast_expansion_sum_zeroelim(4, u, 4, v, Val(8))

            ti1, ti0 = Two_Product(adxtail, bdytail)
            tj1, tj0 = Two_Product(bdxtail, adytail)
            abtt3, abtt2, abtt1, abtt0 = Two_Two_Diff(ti1, ti0, tj1, tj0)
            abtt = (abtt0, abtt1, abtt2, abtt3)
            abttlen = 4
        else
            abt = (zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T))
            abtlen = 1
            abtt = (zero(T), zero(T), zero(T), zero(T))
            abttlen = 1
        end

        if !iszero(cdxtail)
            temp16a, temp16alen = scale_expansion_zeroelim(cxtablen, cxtab, cdxtail, Val(16))
            cxtabt, cxtabtlen = scale_expansion_zeroelim(abtlen, abt, cdxtail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(cxtabtlen, cxtabt, 2cdx, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow
            if !iszero(adytail)
                temp8, temp8len = scale_expansion_zeroelim(4, bb, cdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, adytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end
            if !iszero(bdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, aa, -cdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, bdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end

            temp32a, temp32alen = scale_expansion_zeroelim(cxtabtlen, cxtabt, cdxtail, Val(32))
            cxtabtt, cxtabttlen = scale_expansion_zeroelim(abttlen, abtt, cdxtail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(cxtabttlen, cxtabtt, 2cdx, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(cxtabttlen, cxtabtt, cdxtail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
        if !iszero(cdytail)
            temp16a, temp16alen = scale_expansion_zeroelim(cytablen, cytab, cdytail, Val(16))
            cytabt, cytabtlen = scale_expansion_zeroelim(abtlen, abt, cdytail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(cytabtlen, cytabt, 2cdy, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow

            temp32a, temp32alen = scale_expansion_zeroelim(cytabtlen, cytabt, cdytail, Val(32))
            cytabtt, cytabttlen = scale_expansion_zeroelim(abttlen, abtt, cdytail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(cytabttlen, cytabtt, 2cdy, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(cytabttlen, cytabtt, cdytail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
    end
    return finnow[finlength]
  end
end 

@DanielVandH
Copy link
Member

#21 addresses this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants