incircle performance may be able to be faster... #20

dgleich · 2024-07-30T20:24:38Z

Just a quick note that the Delaunator port of the incircle routine seems to be consistently faster. It's using ntuples vs. the cached values whenever the size is <= 32, which may have some speed advantage (e.g. it can use vector registers... more efficiently?)

On my machine, I get:

julia> @btime AdaptivePredicates.incircle((1.0+eps(1.0),eps(1.0)),(eps(1.0)*1.33,eps(1.0)*-1.33),(eps(1.0),1.0-eps(1.0)),(1.0,0))
  1.266 μs (0 allocations: 0 bytes)

julia> @btime Delaunator.incircle((1.0+eps(1.0),eps(1.0)),(eps(1.0)*1.33,eps(1.0)*-1.33),(eps(1.0),1.0-eps(1.0)),(1.0,0))
  706.415 ns (0 allocations: 0 bytes)

I don't think this is due to the difference in the cache as this case doesn't hit the cache ... just wanted to mention it in case you wanted to move any of that implementation back here :)

The text was updated successfully, but these errors were encountered:

DanielVandH · 2024-07-30T20:28:46Z

Is it easy for you to see what happens to the timings if Delaunator.jl uses your implementation vs AdaptivePredicates.jl's? It might be easier to see what the difference is with a larger set of calls to incircle that way. Comparing point sets in general position vs. with lots of collinearity would be of interest too, since perhaps retrying the same computation becomes inefficient in the latter case.

Most expansions have less than 32 coefficients (from what I can recall, I think most get to around 10..), so I suspect that your approach is worth the effort anyway. But seeing it in action would be nice too.

dgleich · 2024-07-30T21:00:26Z

My default test case didn't hit the adaptive parts of the test, so it wouldn't show any difference in this portion. Will try and see if any of the other ones do.

dgleich · 2024-07-31T00:49:56Z

Okay, on the two robustness tests in the Delaunator library, there is a noticeable difference:

julia> @btime triangulate($robustness2);
  627.470 μs (32 allocations: 167.91 KiB)

julia> @btime triangulate($robustness2); # with AdaptivePredicates.incircle 
  781.256 μs (32 allocations: 167.91 KiB)

julia> @btime triangulate($robustness1);
  151.950 μs (29 allocations: 44.83 KiB)

julia> @btime triangulate($robustness1);  # with AdaptivePredicates.incircle 
  220.399 μs (29 allocations: 44.83 KiB)

DanielVandH · 2024-07-31T05:38:22Z

Interesting. Thanks for that. When I have time I will try and implement your approach

vchuravy · 2024-08-02T07:28:58Z

To give a bit of context.

Originally I implemented this using StaticArrays MVector. MVector closely matches what C does originally with the condition that "all uses of the MVector must be non-escaping e.g. inlined". As far as I know we really want these arrays to be stack allocated and so the caching approach seems backwards to me.

In #1 @dgleich removed the StaticArrays dependency replacing MVector with Tuple and using the copying setindex version. Generally this is fine as long as the tuples remain smallish.
Given https://github.com/JuliaGeometry/Delaunator.jl/blob/f1ba75a2a0686a4ce2d48d876675819032ece1b0/src/robust.jl#L514-L516

I do wonder if the MArray version would have been able to scale beyond 32 elements. MArrays require a bit of care while coding, but since this code is in the hot loop that care seems warranted.

Lastly the TLS caching approach seems to complicated for it's own good. It is probably better and faster to just hit the allocator like @dgleich code does. Especially if you are doing an undef alloc and not a zeroed out alloc.

DanielVandH · 2024-08-02T07:34:34Z

I do wonder if the MArray version would have been able to scale beyond 32 elements.

Can the code handle the creation of an MArray with thousands of elements? I've never used an MVector so I'm not familiar.

"all uses of the MVector must be non-escaping e.g. inlined"

I don't know how this would be guaranteed in general.

Lastly the TLS caching approach seems to complicated for it's own good. It is probably better and faster to just hit the allocator like @dgleich code does. Especially if you are doing an undef alloc and not a zeroed out alloc.

I'm happy to look into it again, but when I benchmarked it while implementing it caching was fine.

vchuravy · 2024-08-02T07:47:41Z

I don't know how this would be guaranteed in general.

The simple pattern I use is: Allocate within a function and if I return a MVector return an SArray of it instead.

DanielVandH · 2024-08-02T07:50:38Z

You mean e.g. the functions where a cached h is provided, provide a Val(N) (or maybe just an N), do h = MVector{...}(...) and then at the end return SVector(h) and make use of that?

DanielVandH · 2024-08-02T08:48:42Z

I think either way caching is fine from what I have benchmarked, I don't see the complications. It's not really relevant to the idea of trying to do the most with 32-length Tuples before trying to go further, which is all I intend to do for now. A separate issue for that can be made if desired.

vchuravy · 2024-08-02T10:02:49Z

I don't see the complications.

Looking at I wonder why we have struct that now contain both NTuples and Vec.
https://github.com/JuliaGeometry/AdaptivePredicates.jl/blob/main/src/caches.jl

The code is less clear and the compiler has to do more work to first get rid of Orient2Cache and then the ntuple.

I think the strategy of going to a Vector when we are past 32 or 64 elements in an ntuple is fine.
But right now we are paying the dict lookup cost on every call to Orient3Cache 12 times + 1 TLS lookup.

So for me the complications are that the code has gotten more complex through the introduction of a cache,
instead of less so. (I tried intentionally to keep it very similar to the C version).

I would simply allocate an array at those sizes and not do any form of caching. It ought to be the "rare" case and the array allocation is fast on Julia 1.11 Vector{Float64}(undef, N) for N <= 223 takes < 14ns; After that the cost grows a bit and levels out around 50ns. A TLS access is ~14ns a dictionary lookup is around the same. So for me the complexity cost is not worth it and I rather keep the code closer to the C version.

DanielVandH · 2024-08-02T10:16:44Z

The code is still very similar to the C version. I'll think about your approach again, then, but I don't think the performance will change drastically if at all - the adaptive methods are only even needed rarely.

My plans for now will be to just try a vector for > 32 elements, and later I can consider this and see what happens to the implementation.

If you want to make a PR rather than wait for me to get to it that's fine too. For me it's really only worth considering if it does actually make a difference inside DelaunayTriangulation.jl or Delaunator.jl (or any other algorithms making heavy use of this).

dgleich · 2024-08-02T13:34:10Z

Since I still had the code up ... using MVectors instead of the any type of caching...

julia> @btime Delaunator.incircle((1.0+eps(1.0),eps(1.0)),(eps(1.0)*1.33,eps(1.0)*-1.33),(eps(1.0),1.0-eps(1.0)),(1.0,0))
  643.245 ns (1 allocation: 9.06 KiB)
-1.9721522630525293e-31

julia> @btime triangulate($robustness1);
  149.919 μs (143 allocations: 1.05 MiB)

julia> @btime triangulate($robustness2);
  620.479 μs (318 allocations: 2.70 MiB)

This uses ntuples for <= 32 and MVector >= 48 ...

Just for completeness, here's the routine.

function _incircleadapt_round1(pa, pb, pc, pd, permanent, cache=nothing)
  @assert(cache === nothing)
  T = Float64
  @inbounds begin
    adx = pa[1] - pd[1]
    bdx = pb[1] - pd[1]
    cdx = pc[1] - pd[1]
    ady = pa[2] - pd[2]
    bdy = pb[2] - pd[2]
    cdy = pc[2] - pd[2]

    bdxcdy1, bdxcdy0 = Two_Product(bdx, cdy)
    cdxbdy1, cdxbdy0 = Two_Product(cdx, bdy)
    bc3, bc2, bc1, bc0 = Two_Two_Diff(bdxcdy1, bdxcdy0, cdxbdy1, cdxbdy0)
    bc = (bc0, bc1, bc2, bc3)
    axbc, axbclen = scale_expansion_zeroelim(4, bc, adx, Val(8))
    axxbc, axxbclen = scale_expansion_zeroelim(axbclen, axbc, adx, Val(16))
    aybc, aybclen = scale_expansion_zeroelim(4, bc, ady, Val(8))
    ayybc, ayybclen = scale_expansion_zeroelim(aybclen, aybc, ady, Val(16))
    adet, alen = fast_expansion_sum_zeroelim(axxbclen, axxbc, ayybclen, ayybc, Val(32))

    cdxady1, cdxady0 = Two_Product(cdx, ady)
    adxcdy1, adxcdy0 = Two_Product(adx, cdy)
    ca3, ca2, ca1, ca0 = Two_Two_Diff(cdxady1, cdxady0, adxcdy1, adxcdy0)
    ca = (ca0, ca1, ca2, ca3)
    bxca, bxcalen = scale_expansion_zeroelim(4, ca, bdx, Val(8))
    bxxca, bxxcalen = scale_expansion_zeroelim(bxcalen, bxca, bdx, Val(16))
    byca, bycalen = scale_expansion_zeroelim(4, ca, bdy, Val(8))
    byyca, byycalen = scale_expansion_zeroelim(bycalen, byca, bdy, Val(16))
    bdet, blen = fast_expansion_sum_zeroelim(bxxcalen, bxxca, byycalen, byyca, Val(32))

    adxbdy1, adxbdy0 = Two_Product(adx, bdy)
    bdxady1, bdxady0 = Two_Product(bdx, ady)
    ab3, ab2, ab1, ab0 = Two_Two_Diff(adxbdy1, adxbdy0, bdxady1, bdxady0)
    ab = (ab0, ab1, ab2, ab3)
    cxab, cxablen = scale_expansion_zeroelim(4, ab, cdx, Val(8))
    cxxab, cxxablen = scale_expansion_zeroelim(cxablen, cxab, cdx, Val(16))
    cyab, cyablen = scale_expansion_zeroelim(4, ab, cdy, Val(8))
    cyyab, cyyablen = scale_expansion_zeroelim(cyablen, cyab, cdy, Val(16))
    cdet, clen = fast_expansion_sum_zeroelim(cxxablen, cxxab, cyyablen, cyyab, Val(32))

    # Julia can handle up to Val(32) without any allocations or weird behavior
    # so let's try and check if we can solve this within Val(32)... 
    # if not, then we will fall back to the full expansion... 
    abdet, ablen = fast_expansion_sum_zeroelim(alen, adet, blen, bdet, MVector{64,Float64}(undef))
    fin1, finlength = fast_expansion_sum_zeroelim(ablen, abdet, clen, cdet, MVector{1152,Float64}(undef))

    # recheck termination condition...
    det = estimate(finlength, fin1)
    errbound = iccerrboundB * permanent
    if (det ≥ errbound) || (-det ≥ errbound)
        return det
    end

    adxtail = Two_Diff_Tail(pa[1], pd[1], adx)
    adytail = Two_Diff_Tail(pa[2], pd[2], ady)
    bdxtail = Two_Diff_Tail(pb[1], pd[1], bdx)
    bdytail = Two_Diff_Tail(pb[2], pd[2], bdy)
    cdxtail = Two_Diff_Tail(pc[1], pd[1], cdx)
    cdytail = Two_Diff_Tail(pc[2], pd[2], cdy)

    if iszero(adxtail) && iszero(bdxtail) && iszero(cdxtail) &&
      iszero(adytail) && iszero(bdytail) && iszero(cdytail)
        return det
    end

    errbound = iccerrboundC * permanent + resulterrbound * Absolute(det)
    detadd = ((adx * adx + ady * ady) * ((bdx * cdytail + cdy * bdxtail) -
                                        (bdy * cdxtail + cdx * bdytail)) +
              2.0 * (adx * adxtail + ady * adytail) * (bdx * cdy - bdy * cdx)) +
            ((bdx * bdx + bdy * bdy) * ((cdx * adytail + ady * cdxtail) -
                                        (cdy * adxtail + adx * cdytail)) +
              2.0 * (bdx * bdxtail + bdy * bdytail) * (cdx * ady - cdy * adx)) +
            ((cdx * cdx + cdy * cdy) * ((adx * bdytail + bdy * adxtail) -
                                        (ady * bdxtail + bdx * adytail)) +
              2.0 * (cdx * cdxtail + cdy * cdytail) * (adx * bdy - ady * bdx))
    det = T(det + detadd) # Had to change this to match how C handles the 2.0 multiplication with Float32
    if (det ≥ errbound) || (-det ≥ errbound)
        return det
    end

    finnow = fin1
    finother = MVector{1152,Float64}(undef)
    h48 = MVector{48,Float64}(undef)
    h64 = MVector{64,Float64}(undef)

    if !iszero(bdxtail) || !iszero(bdytail) || !iszero(cdxtail) || !iszero(cdytail)
        adxadx1, adxadx0 = Square(adx)
        adyady1, adyady0 = Square(ady)
        aa3, aa2, aa1, aa0 = Two_Two_Sum(adxadx1, adxadx0, adyady1, adyady0)
        aa = (aa0, aa1, aa2, aa3)
    end
    if !iszero(cdxtail) || !iszero(cdytail) || !iszero(adxtail) || !iszero(adytail)
        bdxbdx1, bdxbdx0 = Square(bdx)
        bdybdy1, bdybdy0 = Square(bdy)
        bb3, bb2, bb1, bb0 = Two_Two_Sum(bdxbdx1, bdxbdx0, bdybdy1, bdybdy0)
        bb = (bb0, bb1, bb2, bb3)
    end
    if !iszero(adxtail) || !iszero(adytail) || !iszero(bdxtail) || !iszero(bdytail)
        cdxcdx1, cdxcdx0 = Square(cdx)
        cdycdy1, cdycdy0 = Square(cdy)
        cc3, cc2, cc1, cc0 = Two_Two_Sum(cdxcdx1, cdxcdx0, cdycdy1, cdycdy0)
        cc = (cc0, cc1, cc2, cc3)
    end

    if !iszero(adxtail)
        axtbc, axtbclen = scale_expansion_zeroelim(4, bc, adxtail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(axtbclen, axtbc, 2adx, Val(16))

        axtcc, axtcclen = scale_expansion_zeroelim(4, cc, adxtail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(axtcclen, axtcc, bdy, Val(16))

        axtbb, axtbblen = scale_expansion_zeroelim(4, bb, adxtail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(axtbblen, axtbb, -cdy, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(adytail)
        aytbc, aytbclen = scale_expansion_zeroelim(4, bc, adytail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(aytbclen, aytbc, 2 * ady, Val(16))

        aytbb, aytbblen = scale_expansion_zeroelim(4, bb, adytail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(aytbblen, aytbb, cdx, Val(16))

        aytcc, aytcclen = scale_expansion_zeroelim(4, cc, adytail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(aytcclen, aytcc, -bdx, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(bdxtail)
        bxtca, bxtcalen = scale_expansion_zeroelim(4, ca, bdxtail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(bxtcalen, bxtca, 2 * bdx, Val(16))

        bxtaa, bxtaalen = scale_expansion_zeroelim(4, aa, bdxtail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(bxtaalen, bxtaa, cdy, Val(16))

        bxtcc, bxtcclen = scale_expansion_zeroelim(4, cc, bdxtail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(bxtcclen, bxtcc, -ady, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(bdytail)
        bytca, bytcalen = scale_expansion_zeroelim(4, ca, bdytail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(bytcalen, bytca, 2 * bdy, Val(16))

        bytcc, bytcclen = scale_expansion_zeroelim(4, cc, bdytail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(bytcclen, bytcc, adx, Val(16))

        bytaa, bytaalen = scale_expansion_zeroelim(4, aa, bdytail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(bytaalen, bytaa, -cdx, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(cdxtail)
        cxtab, cxtablen = scale_expansion_zeroelim(4, ab, cdxtail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(cxtablen, cxtab, 2 * cdx, Val(16))

        cxtbb, cxtbblen = scale_expansion_zeroelim(4, bb, cdxtail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(cxtbblen, cxtbb, ady, Val(16))

        cxtaa, cxtaalen = scale_expansion_zeroelim(4, aa, cdxtail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(cxtaalen, cxtaa, -bdy, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end
    if !iszero(cdytail)
        cytab, cytablen = scale_expansion_zeroelim(4, ab, cdytail, Val(8))
        temp16a, temp16alen = scale_expansion_zeroelim(cytablen, cytab, 2 * cdy, Val(16))

        cytaa, cytaalen = scale_expansion_zeroelim(4, aa, cdytail, Val(8))
        temp16b, temp16blen = scale_expansion_zeroelim(cytaalen, cytaa, bdx, Val(16))

        cytbb, cytbblen = scale_expansion_zeroelim(4, bb, cdytail, Val(8))
        temp16c, temp16clen = scale_expansion_zeroelim(cytbblen, cytbb, -adx, Val(16))

        temp32a, temp32alen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
        temp48, temp48len = fast_expansion_sum_zeroelim(temp16clen, temp16c, temp32alen, temp32a, h48)
        finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
        finnow, finother = finother, finnow
    end

    if !iszero(adxtail) || !iszero(adytail)
        if !iszero(bdxtail) || !iszero(bdytail) || !iszero(cdxtail) || !iszero(cdytail)
            ti1, ti0 = Two_Product(bdxtail, cdy)
            tj1, tj0 = Two_Product(bdx, cdytail)
            u3, u2, u1, u0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            u = (u0, u1, u2, u3)
            negate = -bdy
            ti1, ti0 = Two_Product(cdxtail, negate)
            negate = -bdytail
            tj1, tj0 = Two_Product(cdx, negate)
            v3, v2, v1, v0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            v = (v0, v1, v2, v3)
            bct, bctlen = fast_expansion_sum_zeroelim(4, u, 4, v, Val(8))

            ti1, ti0 = Two_Product(bdxtail, cdytail)
            tj1, tj0 = Two_Product(cdxtail, bdytail)
            bctt3, bctt2, bctt1, bctt0 = Two_Two_Diff(ti1, ti0, tj1, tj0)
            bctt = (bctt0, bctt1, bctt2, bctt3)
            bcttlen = 4
        else
            bct = (zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T))
            bctlen = 1
            bctt = (zero(T), zero(T), zero(T), zero(T))
            bcttlen = 1
        end

        if !iszero(adxtail)
            temp16a, temp16alen = scale_expansion_zeroelim(axtbclen, axtbc, adxtail, Val(16))
            axtbct, axtbctlen = scale_expansion_zeroelim(bctlen, bct, adxtail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(axtbctlen, axtbct, 2adx, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow
            if !iszero(bdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, cc, adxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, bdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end
            if !iszero(cdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, bb, -adxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, cdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end

            temp32a, temp32alen = scale_expansion_zeroelim(axtbctlen, axtbct, adxtail, Val(32))
            axtbctt, axtbcttlen = scale_expansion_zeroelim(bcttlen, bctt, adxtail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(axtbcttlen, axtbctt, 2adx, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(axtbcttlen, axtbctt, adxtail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
        if !iszero(adytail)
            temp16a, temp16alen = scale_expansion_zeroelim(aytbclen, aytbc, adytail, Val(16))
            aytbct, aytbctlen = scale_expansion_zeroelim(bctlen, bct, adytail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(aytbctlen, aytbct, 2ady, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow

            temp32a, temp32alen = scale_expansion_zeroelim(aytbctlen, aytbct, adytail, Val(32))
            aytbctt, aytbcttlen = scale_expansion_zeroelim(bcttlen, bctt, adytail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(aytbcttlen, aytbctt, 2ady, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(aytbcttlen, aytbctt, adytail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
    end

    if !iszero(bdxtail) || !iszero(bdytail)
        if !iszero(cdxtail) || !iszero(cdytail) || !iszero(adxtail) || !iszero(adytail)
            ti1, ti0 = Two_Product(cdxtail, ady)
            tj1, tj0 = Two_Product(cdx, adytail)
            u3, u2, u1, u0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            u = (u0, u1, u2, u3)
            negate = -cdy
            ti1, ti0 = Two_Product(adxtail, negate)
            negate = -cdytail
            tj1, tj0 = Two_Product(adx, negate)
            v3, v2, v1, v0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            v = (v0, v1, v2, v3)
            cat, catlen = fast_expansion_sum_zeroelim(4, u, 4, v, Val(8))

            ti1, ti0 = Two_Product(cdxtail, adytail)
            tj1, tj0 = Two_Product(adxtail, cdytail)
            catt3, catt2, catt1, catt0 = Two_Two_Diff(ti1, ti0, tj1, tj0)
            catt = (catt0, catt1, catt2, catt3)
            cattlen = 4
        else
            cat = (zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T))
            catlen = 1
            catt = (zero(T), zero(T), zero(T), zero(T))
            cattlen = 1
        end

        if !iszero(bdxtail)
            temp16a, temp16alen = scale_expansion_zeroelim(bxtcalen, bxtca, bdxtail, Val(16))
            bxtcat, bxtcatlen = scale_expansion_zeroelim(catlen, cat, bdxtail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(bxtcatlen, bxtcat, 2bdx, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow
            if !iszero(cdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, aa, bdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, cdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end
            if !iszero(adytail)
                temp8, temp8len = scale_expansion_zeroelim(4, cc, -bdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, adytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end

            temp32a, temp32alen = scale_expansion_zeroelim(bxtcatlen, bxtcat, bdxtail, Val(32))
            bxtcatt, bxtcattlen = scale_expansion_zeroelim(cattlen, catt, bdxtail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(bxtcattlen, bxtcatt, 2bdx, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(bxtcattlen, bxtcatt, bdxtail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
        if !iszero(bdytail)
            temp16a, temp16alen = scale_expansion_zeroelim(bytcalen, bytca, bdytail, Val(16))
            bytcat, bytcatlen = scale_expansion_zeroelim(catlen, cat, bdytail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(bytcatlen, bytcat, 2bdy, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow

            temp32a, temp32alen = scale_expansion_zeroelim(bytcatlen, bytcat, bdytail, Val(32))
            bytcatt, bytcattlen = scale_expansion_zeroelim(cattlen, catt, bdytail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(bytcattlen, bytcatt, 2bdy, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(bytcattlen, bytcatt, bdytail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
    end

    if !iszero(cdxtail) || !iszero(cdytail)
        if !iszero(adxtail) || !iszero(adytail) || !iszero(bdxtail) || !iszero(bdytail)
            ti1, ti0 = Two_Product(adxtail, bdy)
            tj1, tj0 = Two_Product(adx, bdytail)
            u3, u2, u1, u0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            u = (u0, u1, u2, u3)
            negate = -ady
            ti1, ti0 = Two_Product(bdxtail, negate)
            negate = -adytail
            tj1, tj0 = Two_Product(bdx, negate)
            v3, v2, v1, v0 = Two_Two_Sum(ti1, ti0, tj1, tj0)
            v = (v0, v1, v2, v3)
            abt, abtlen = fast_expansion_sum_zeroelim(4, u, 4, v, Val(8))

            ti1, ti0 = Two_Product(adxtail, bdytail)
            tj1, tj0 = Two_Product(bdxtail, adytail)
            abtt3, abtt2, abtt1, abtt0 = Two_Two_Diff(ti1, ti0, tj1, tj0)
            abtt = (abtt0, abtt1, abtt2, abtt3)
            abttlen = 4
        else
            abt = (zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T), zero(T))
            abtlen = 1
            abtt = (zero(T), zero(T), zero(T), zero(T))
            abttlen = 1
        end

        if !iszero(cdxtail)
            temp16a, temp16alen = scale_expansion_zeroelim(cxtablen, cxtab, cdxtail, Val(16))
            cxtabt, cxtabtlen = scale_expansion_zeroelim(abtlen, abt, cdxtail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(cxtabtlen, cxtabt, 2cdx, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow
            if !iszero(adytail)
                temp8, temp8len = scale_expansion_zeroelim(4, bb, cdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, adytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end
            if !iszero(bdytail)
                temp8, temp8len = scale_expansion_zeroelim(4, aa, -cdxtail, Val(8))
                temp16a, temp16alen = scale_expansion_zeroelim(temp8len, temp8, bdytail, Val(16))
                finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp16alen, temp16a, finother)
                finnow, finother = finother, finnow
            end

            temp32a, temp32alen = scale_expansion_zeroelim(cxtabtlen, cxtabt, cdxtail, Val(32))
            cxtabtt, cxtabttlen = scale_expansion_zeroelim(abttlen, abtt, cdxtail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(cxtabttlen, cxtabtt, 2cdx, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(cxtabttlen, cxtabtt, cdxtail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
        if !iszero(cdytail)
            temp16a, temp16alen = scale_expansion_zeroelim(cytablen, cytab, cdytail, Val(16))
            cytabt, cytabtlen = scale_expansion_zeroelim(abtlen, abt, cdytail, Val(16))
            temp32a, temp32alen = scale_expansion_zeroelim(cytabtlen, cytabt, 2cdy, Val(32))
            temp48, temp48len = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp32alen, temp32a, h48)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp48len, temp48, finother)
            finnow, finother = finother, finnow

            temp32a, temp32alen = scale_expansion_zeroelim(cytabtlen, cytabt, cdytail, Val(32))
            cytabtt, cytabttlen = scale_expansion_zeroelim(abttlen, abtt, cdytail, Val(8))
            temp16a, temp16alen = scale_expansion_zeroelim(cytabttlen, cytabtt, 2cdy, Val(16))
            temp16b, temp16blen = scale_expansion_zeroelim(cytabttlen, cytabtt, cdytail, Val(16))
            temp32b, temp32blen = fast_expansion_sum_zeroelim(temp16alen, temp16a, temp16blen, temp16b, Val(32))
            temp64, temp64len = fast_expansion_sum_zeroelim(temp32alen, temp32a, temp32blen, temp32b, h64)
            finother, finlength = fast_expansion_sum_zeroelim(finlength, finnow, temp64len, temp64, finother)
            finnow, finother = finother, finnow
        end
    end
    return finnow[finlength]
  end
end

DanielVandH · 2024-09-15T09:59:26Z

#21 addresses this

DanielVandH mentioned this issue Sep 15, 2024

Remove cache structs and use Tuples, allocating only when necessary #21

Merged

DanielVandH linked a pull request Sep 15, 2024 that will close this issue

Remove cache structs and use Tuples, allocating only when necessary #21

Merged

DanielVandH closed this as completed in #21 Sep 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incircle performance may be able to be faster... #20

incircle performance may be able to be faster... #20

dgleich commented Jul 30, 2024

DanielVandH commented Jul 30, 2024 •

edited

Loading

dgleich commented Jul 30, 2024

dgleich commented Jul 31, 2024

DanielVandH commented Jul 31, 2024

vchuravy commented Aug 2, 2024

DanielVandH commented Aug 2, 2024 •

edited

Loading

vchuravy commented Aug 2, 2024

DanielVandH commented Aug 2, 2024

DanielVandH commented Aug 2, 2024

vchuravy commented Aug 2, 2024

DanielVandH commented Aug 2, 2024 •

edited

Loading

dgleich commented Aug 2, 2024

DanielVandH commented Sep 15, 2024

incircle performance may be able to be faster... #20

incircle performance may be able to be faster... #20

Comments

dgleich commented Jul 30, 2024

DanielVandH commented Jul 30, 2024 • edited Loading

dgleich commented Jul 30, 2024

dgleich commented Jul 31, 2024

DanielVandH commented Jul 31, 2024

vchuravy commented Aug 2, 2024

DanielVandH commented Aug 2, 2024 • edited Loading

vchuravy commented Aug 2, 2024

DanielVandH commented Aug 2, 2024

DanielVandH commented Aug 2, 2024

vchuravy commented Aug 2, 2024

DanielVandH commented Aug 2, 2024 • edited Loading

dgleich commented Aug 2, 2024

DanielVandH commented Sep 15, 2024

DanielVandH commented Jul 30, 2024 •

edited

Loading

DanielVandH commented Aug 2, 2024 •

edited

Loading

DanielVandH commented Aug 2, 2024 •

edited

Loading