# Asymptotic expansion for large orders. Besselk(nu, x) #8
Codecov Report

```diff
@@            Coverage Diff             @@
##           master       #8      +/-   ##
==========================================
+ Coverage   95.78%   98.26%   +2.47%
==========================================
  Files          11       11
  Lines         475      575     +100
==========================================
+ Hits          455      565     +110
+ Misses         20       10      -10
```

To simplify this problem, let's look at a function:

```julia
function bk32(v, x::T) where T <: AbstractFloat
    z = x / v
    zs = sqrt(1 + z^2)
    n = zs + log(z / (1 + zs))
    coef = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)
    p = inv(zs)
    u0 = one(x)
    u1 = p / 24 * (3 - 5*p^2) * -1 / v
    u2 = p^2 / 1152 * (81 - 462*p^2 + 385*p^4) / v^2
    u3 = p^3 / 414720 * (30375 - 369603*p^2 + 765765*p^4 - 425425*p^6) * -1 / v^3
    u4 = p^4 / 39813120 * (4465125 - 94121676*p^2 + 349922430*p^4 - 446185740*p^6 + 185910725*p^8) / v^4
    u5 = p^5 / 6688604160 * (-188699385875*p^10 + 566098157625*p^8 - 614135872350*p^6 + 284499769554*p^4 - 49286948607*p^2 + 1519035525) * -1 / v^5
    out = u0 + u1 + u2 + u3 + u4 + u5
    return coef*out
end
```

where we add up the U polynomials in series. This actually provides good enough accuracy in the following example.

## Compare relative error / eps to ArbNumerics.jl
```julia
# using BigFloats as accuracy check
julia> abs(1 - bk32(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
0.93578

# using Float64
julia> abs(1 - bk32(Float64(200), Float64(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
77.7788

# using Float32
julia> abs(1 - bk32(Float32(200), Float32(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float32)
229.8961
```

Though it requires high precision to get accurate results... Here is a version with the polynomial coefficients expanded out:

```julia
function bk32_poly(v, x)
    z = x / v
    zs = sqrt(1 + z^2)
    n = zs + log(z / (1 + zs))
    n = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)
    p = inv(zs)
    poly_coeffs = (
        1,
        -1/(8*v),
        9/(128*v^2),
        (640*v^2 - 225)/(3072*v^3),
        (11025 - 39424*v^2)/(98304*v^4),
        (1168128*v^2 - 297675)/(1310720*v^5),
        (123200*v^2 - 871497)/(368640*v^4),
        (608480847 - 152472320*v^2)/(82575360*v^5),
        144001/(16384*v^4),
        (54454400*v^2 - 2257934679)/(53084160*v^5),
        -7436429/(663552*v^4),
        108313205/(1179648*v^5),
        37182145/(7962624*v^4),
        -5391411025/(63700992*v^5),
        5391411025/(191102976*v^5)
    )
    return evalpoly(p, poly_coeffs) * n
end
```

Though its results seem to suffer even worse and are not even as accurate in high precision... This could be an error from me in setting the problem up... I have to double check. However, it produces the same results for Float32.

## Compare relative error / eps to ArbNumerics.jl
```julia
# using BigFloats as check
julia> abs(1 - bk32_poly(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
8790.07158

# using Float64
julia> abs(1 - bk32_poly(Float64(200), Float64(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float64)
8712.076

# using Float32
julia> abs(1 - bk32_poly(Float32(200), Float32(100.0)) / besselk(200, ArbFloat(100.0, digits=70))) / eps(Float32)
229.8961288
```

@oscardssmith Do you have thoughts on the best way to compute this?

What about

Also, for

Ya, that is really impressive!! I like how we can keep the coefficients all exact. And very fast all things considered, while providing similar errors!

```julia
julia> abs(1 - bk32(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
2.077863330447360112e-16

julia> abs(1 - bk32_new(BigFloat(200), BigFloat(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
2.077760317644734432e-16
```

So we do lose some accuracy when doing it in Float64:

```julia
julia> abs(1 - bk32_new(Float64(200), Float64(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
1.69273029004913010678098354694302671752893517308314639783117236510783504027076e-14
```

And in Float32:

```julia
julia> abs(1 - bk32_new(Float32(200), Float32(100.0)) / besselk(200, ArbFloat(100.0, digits=70)))
2.74075466924646385826117391071280851161159713028415252243504344281333675837797e-5
```

Edit: And that sounds good for your previous comment! I think we can use even fewer U polynomials depending on where we want our expansion to be then.

Also for

Sounds great. I am quickly checking it over, and it looks like the biggest loss is coming in the evalpoly for the higher U polynomials. For example:

```julia
# Float64
julia> v = 100.0; x = 10.0

julia> p2 = v^2/fma(max(v,x), max(v,x), min(v,x)^2) # rounded correctly
0.9900990099009901

julia> u8 = 1 / 22191183337881600 * evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
0.0021583037489126965

# BigFloat
julia> v = BigFloat(100.0); x = BigFloat(10.0)

julia> p2 = v^2/fma(max(v,x), max(v,x), min(v,x)^2) # rounded correctly
0.9900990099009900990099009900990099009900990099009900990099009900990099009901022

julia> u8 = 1 / 22191183337881600 * evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
0.002158303737181252245696702046689380173846888351401502380264769750067125428000861
```

So the Float64 evaluation of u8 only agrees with the BigFloat result to about 8 significant digits. The Float32 implementation is working great. We only need 4 U polynomials to get accurately rounded results. The cutoff of where we want to switch over from forward recurrence can happen as low as. So we will need to use this routine to generate
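For reference, the forward (upward) recurrence mentioned here is the standard three-term relation for the modified Bessel functions, which is stable in the direction of increasing order for K:

$$
K_{\nu+1}(x) = \frac{2\nu}{x}\, K_{\nu}(x) + K_{\nu-1}(x).
$$
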
@oscardssmith I'm just seeing your Discourse post on more accurate evalpoly routines: https://discourse.julialang.org/t/more-accurate-evalpoly/45932. Did you pursue those routines any further? It looks like there is another reference here: https://arxiv.org/pdf/cs/0610122.pdf. Do you think the compensated Horner scheme would be a good candidate here? I'm unsure if there's another way around this.

`Base.Math.exthorner`, as of 1.7, has a partially compensated Horner. I'll write you a fully compensated one tonight. The partially compensated one is about 3x slower than uncompensated, and the fully compensated one is roughly 5x slower. The difference is that the partial compensation is only fully accurate if the higher-order terms converge to 0.

Oh, I already put the fully compensated version in the Discourse thread you linked. Here it is for reference.
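A fully compensated Horner scheme along those lines looks roughly like this (a sketch based on the error-free transformations described in the linked references, not necessarily the exact version from the Discourse thread):

```julia
# Sketch of a fully compensated Horner evaluation: compensate the rounding error
# of every multiply and add with error-free transformations (TwoProd / TwoSum).
function two_sum(a, b)
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)             # exact rounding error of a + b
end

two_prod(a, b) = (p = a * b; (p, fma(a, b, -p)))  # exact rounding error of a * b

function compensated_horner(x, coeffs::Tuple)
    s = float(coeffs[end])
    e = zero(s)
    for i in length(coeffs)-1:-1:1
        ph, pe = two_prod(s, x)
        s, se  = two_sum(ph, float(coeffs[i]))
        e = fma(e, x, pe + se)                    # Horner on the accumulated error terms
    end
    return s + e                                  # compensated result
end
```

Note this only compensates the evaluation itself; if a coefficient is not exactly representable in Float64, the error from converting it remains (see below).
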
It looks like the issue is when `p2 = 0.98`:

```julia
julia> p2 = 0.98

julia> evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
2.05403516225376e14

julia> evalpoly(BigFloat(p2), (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
2.054035159478429426618583143791227160776082859513700290192954935066092917184701e+14

julia> exthorner(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
(2.05403516225376e14, -45981.289898326024)
```

I might not be using it correctly, though. For comparison, with `p2 = 5.0`:

```julia
julia> p2 = 5.0
5.0

julia> evalpoly(BigFloat(p2), (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
6.12479961153197164448576e+25

julia> evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
6.124799611531971e25
```

So the issue is the cancellation when `p2` is close to 1.

Part of the problem is that the coefficients you are using are too large to be losslessly represented by Float64.
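For illustration, the leading coefficient in that tuple already exceeds 2^53, so converting it to Float64 rounds it (a quick spot check):

```julia
julia> 134790179652253125 > 2^53          # larger than the 53-bit significand can hold exactly
true

julia> Int(Float64(134790179652253125))   # conversion to Float64 rounds the coefficient
134790179652253120
```
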
It looks like the largest errors are coming from calculating

```julia
z = x / v
zs = sqrt(1 + z^2)
n = zs + log(z / (1 + zs))
coef = exp(-v * n) * sqrt(pi / (2*v)) / sqrt(zs)
```

In some cases this produces only 12-13 correct digits. When this block is done in BigFloat it looks like we can get the right answer.

Edit: After some more digging, the `exp` calculation is also losing a digit or two:

```julia
julia> exp(28.509909247209745)
2.4082205921371577e12

julia> exp(big"28.5099092472097450000000000000000000000000000000")
2.408220592137158798062714571311086664973698363776289190685153859052305376026371e+12
```

Also, using a hard-coded value for sqrt(pi / 2) is best (computing it on the fly isn't correctly rounded). New code bit here:

```julia
z = x/v
zs = hypot(1, z)
n = zs + log(z) - log1p(zs)
coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
```

If you want, I could update

Ok, so the good news is I'm almost positive the evalpolys are not the main factor. It's the early part calculating the coefficients. If I do those calculations in BigFloat:

```julia
function besselk_large_orders(v, x::T) where T <: AbstractFloat
    x = BigFloat(x); v = BigFloat(v)
    z = x/v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
    p = inv(zs)
    p2 = v^2/fma(max(v,x), max(v,x), min(v,x)^2)
    p2 = Float64(p2); p = Float64(p); coef = Float64(coef); v = Float64(v)
    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725))
    u5 = 1 / 6688604160 * evalpoly(p2, (1519035525, -49286948607, 284499769554, -614135872350, 566098157625, -188699385875))
    u6 = 1 / 4815794995200 * evalpoly(p2, (2757049477875, -127577298354750, 1050760774457901, -3369032068261860, 5104696716244125, -3685299006138750, 1023694168371875))
    u7 = 1 / 115579079884800 * evalpoly(p2, (199689155040375, -12493049053044375, 138799253740521843, -613221795981706275, 1347119637570231525, -1570320948552481125, 931766432052080625, -221849150488590625))
    u8 = 1 / 22191183337881600 * evalpoly(p2, (134790179652253125, -10960565081605263000, 157768535329832893644, -914113758588905038248, 2711772922412520971550, -4513690624987320777000, 4272845805510421639500, -2152114239059719935000, 448357133137441653125))
    return coef*evalpoly(-p/v, (u0, u1, u2, u3, u4, u5, u6, u7, u8))
end
```

So essentially just converting the numbers to BigFloat and then converting back to Float64 gives:

```julia
julia> besselk_large_orders(100.0, 100.0)
7.617129630494086e-25

julia> besselk(100, ArbFloat(100.0))
7.6171296304940854161829684946539327153e-25

# for reference this is it without converting to BigFloats
julia> besselk_large_orders(100.0, 100.0)
7.617129630493957e-25
```

So the main piece to try and compute as exactly as possible is:

```julia
z = x/v
zs = hypot(1, z)
n = zs + log(z) - log1p(zs)
coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
```

I've looked a little bit for better ways to compute this, but I think that you pretty much have to use higher precision. That said, DoubleDouble would be a decent amount faster than BigFloat.

So testing this, the relative errors using pure

Do we think doing the earlier computations in

Theoretically, I would expect

Edit: and for

Hoping there is a much better way to do this because it's a considerable performance hit (128x)... Here are the code snippets I'm testing:

```julia
function d64(v, x::T) where T <: AbstractFloat
    z = Double64(x) / v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
    p = inv(zs)
    p2 = v^2/fma(max(v,x), max(v,x), min(v,x)^2)
    p2_64 = T(p2)
    p64 = T(p)
    coef64 = T(coef)
    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2_64, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2_64, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2_64, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2_64, (4465125, -94121676, 349922430, -446185740, 185910725))
    return coef64*evalpoly(-p64 / v, (u0, u1, u2, u3, u4))
end

function f64(v, x::T) where T <: AbstractFloat
    z = x/v
    zs = hypot(1, z)
    n = zs + log(z) - log1p(zs)
    coef = 1.2533141373155003 * sqrt(inv(v)) * exp(-v*n) / sqrt(zs)
    p = inv(zs)
    p2 = v^2/fma(max(v,x), max(v,x), min(v,x)^2)
    u0 = one(Float64)
    u1 = 1 / 24 * evalpoly(p2, (3, -5))
    u2 = 1 / 1152 * evalpoly(p2, (81, -462, 385))
    u3 = 1 / 414720 * evalpoly(p2, (30375, -369603, 765765, -425425))
    u4 = 1 / 39813120 * evalpoly(p2, (4465125, -94121676, 349922430, -446185740, 185910725))
    return coef*evalpoly(-p/v, (u0, u1, u2, u3, u4))
end
```

And here are the benchmarks:

```julia
julia> @benchmark d64(100, x) setup=(x=rand()*100)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  5.817 μs … 35.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.717 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.761 μs ± 482.594 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       ▂▄▆█▄▃
  ▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▅██████▇▅▄▃▃▂▂▂▃▃▂▃▂▂ ▃
  5.82 μs         Histogram: frequency by time        8.45 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f64(100, x) setup=(x=rand()*100)
BenchmarkTools.Trial: 10000 samples with 982 evaluations.
 Range (min … max):  58.851 ns … 155.974 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.695 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.423 ns ±   3.852 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▇▄ ▃▃▄█▇▂▂▂▂▂▅▃▃▂▁▁▁▂▁▁                                    ▂
  ▄███▇█████████████████████▇▆▆▆▆▆▇▆▆▆▆▇▇▇▆▇▆▅▆▅▅▅▅▆▅▄▄▅▅▅▅▄▅▅ █
  58.9 ns      Histogram: log(frequency) by time       78.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Edit 2: For reference

is about 2.2x faster than your version of the

Looks great!! It is also just as fast for me to just do the whole routine in Double64.

I think I'm not too keen on taking such a large performance hit though for that added digit... considering this routine is such a base routine to generate all of

It appears that all of the time is being spent in the

I'm less convinced an easy speedup exists now.

JuliaMath/DoubleFloats.jl#136 is 10x faster. Not sure of as easy a path to get

That sounds good, and thanks for that PR! I'm thinking I will need to do some more testing for higher precision (either

I've made a few changes after some testing. It appears the U polynomials are great and converge fairly rapidly for large orders. However, they are not easily computed with a limited number of floating-point digits. Thus, they remain a fast way to compute large orders with about 1e-14 relative error, which is consistent with the Fortran routines provided by SpecialFunctions.

Now I will note two concerns where the accuracy may suffer. The first is for large arguments and small x. The second concern is what to do in the middle ranges of

There are two ways going forward that I see. I think getting the algorithms as accurate as we can first, and then I can benchmark different algorithms for accuracy and speed as we go through iterations. Though testing all algorithms is definitely time prohibitive.

Wow, this gets incredibly complex. Have you seen https://fredrikj.net/blog/2021/12/computing-special-functions-using-numerical-integration/ by the way? It suggests that, at least for arbitrary precision, numerical integration can be a fast approach. I'm not sure if it is worth it for the type of precision we are looking at here.

Yes, I think I'm going to make schematics for all these branches to show where we are switching over. I guess they are called special functions for a reason.

Creating SIMDable functions might be tricky, but if you need these for a vector of

But yes, I think for now getting good methods for

That plan sounds good.

This is a rough first draft at implementing a more efficient algorithm for large orders. It uses equation 10.41.4 from https://dlmf.nist.gov/10.41, which is also found in Abramowitz and Stegun's Handbook of Mathematical Functions as equation 9.7.8. This also appears to be the base algorithm for the Amos Fortran routines (https://dl.acm.org/doi/10.1145/7921.214331).
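For reference, that expansion reads (with $z = x/\nu$, matching `n`, `coef`, and `p` in the code above):

$$
K_\nu(\nu z) \sim \sqrt{\frac{\pi}{2\nu}}\, \frac{e^{-\nu \eta}}{(1+z^2)^{1/4}} \sum_{k=0}^{\infty} (-1)^k \frac{U_k(p)}{\nu^k},
\qquad \eta = \sqrt{1+z^2} + \log\frac{z}{1+\sqrt{1+z^2}},
\qquad p = \frac{1}{\sqrt{1+z^2}}.
$$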
This expansion can also be used for besseli, and then downward recurrence can be used for all orders of besseli.

The key trick here is determining at what order it is appropriate to use these: for larger orders you need fewer terms in the sum, but you lose more time in the recurrence. Finding the optimal point will be key.
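As a rough sketch of that idea (a hypothetical helper, not code from this PR): seed two large-order values of besseli from the asymptotic expansion and then recurse downward with the stable three-term relation $I_{\nu-1}(x) = I_{\nu+1}(x) + \frac{2\nu}{x} I_\nu(x)$:

```julia
# Sketch only (hypothetical function): fill in lower orders of besseli by downward
# recurrence from two seed values produced by the large-order expansion.
# Downward recurrence is the numerically stable direction for I.
function besseli_down(x, nu_top, nu_bottom, i_top, i_top_plus1)
    Iplus, Icur = i_top_plus1, i_top                  # I_{v+1}(x), I_v(x) at v = nu_top
    for v in nu_top:-1:(nu_bottom + 1)
        Iplus, Icur = Icur, Iplus + (2v / x) * Icur   # compute I_{v-1}(x)
    end
    return Icur                                       # I_{nu_bottom}(x)
end
```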
On the other hand, calculating the U polynomials is an absolute pain and they can't be found tabulated anywhere. They require a recurrence involving both integration and differentiation. These were mostly hand calculated, and I'm still double-checking them.
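For reference, that recurrence (see DLMF §10.41 / Abramowitz and Stegun §9.3), starting from $u_0(p) = 1$, is

$$
u_{k+1}(p) = \tfrac{1}{2}\, p^2 (1 - p^2)\, u_k'(p) + \tfrac{1}{8} \int_0^{p} (1 - 5t^2)\, u_k(t)\, \mathrm{d}t,
$$

which reproduces, e.g., $u_1(p) = (3p - 5p^3)/24$, matching the `u1` term used earlier in this thread.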
The current implementation is not at all optimized and leads to SEVERE rounding errors. I am having to put BigFloats through them to pass the tests.
Ideally the U polynomials can be rearranged to just get their coefficients, which can then be evaluated with Horner's scheme... I'm still working through that, but we have to be very careful how things are rounded for this to work.
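One possible way to generate those coefficients (a sketch only, assuming the Polynomials.jl package and the recurrence above; not part of this PR) is to carry out the recurrence in exact Rational{BigInt} arithmetic and only read the coefficients off at the end:

```julia
using Polynomials

# Sketch: generate U_k(p) exactly with Rational{BigInt} coefficients via the recurrence
# u_{k+1}(p) = p^2 (1 - p^2) u_k'(p) / 2 + (1/8) * ∫_0^p (1 - 5t^2) u_k(t) dt.
function u_polynomials(n)
    p  = Polynomial(Rational{BigInt}[0, 1])      # the variable p
    us = [Polynomial(Rational{BigInt}[1])]       # u_0(p) = 1
    for _ in 1:n
        uk    = us[end]
        term1 = p^2 * (1 - p^2) * derivative(uk) / 2
        term2 = integrate((1 - 5p^2) * uk) / 8   # antiderivative with zero constant term
        push!(us, term1 + term2)
    end
    return us                                    # us[k + 1] corresponds to U_k(p)
end

# coeffs(u_polynomials(1)[2]) gives the exact coefficients of U_1(p) = (3p - 5p^3)/24,
# ready to be cleared of denominators and fed to evalpoly / Horner's scheme.
```

Keeping everything in Rational{BigInt} means the only rounding happens once, when the final coefficient tuples are written out as literals.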