Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize get_ith #286

Merged
merged 6 commits into from
Feb 23, 2024
Merged

optimize get_ith #286

merged 6 commits into from
Feb 23, 2024

Conversation

aplavin
Copy link
Member

@aplavin aplavin commented Nov 21, 2023

Makes getindex and related functions much faster on arrays with a few tens of components.

Narrow arrays:

julia> a1 = StructArray(x=1:10, y=1:10, z=1:10)
julia> @btime $a1[2]
# before
  1.625 ns (0 allocations: 0 bytes)
# after
  1.666 ns (0 allocations: 0 bytes)

Intermediate arrays:

julia> a2 = StructArray(;[Symbol("x$i") => 1:10 for i in 1:50]...)

julia> @btime $a2[2]
# before
  42.166 μs (854 allocations: 45.86 KiB)
# after
  3.182 μs (56 allocations: 4.91 KiB)

I guess originally there was some reason why the current implementation was chosen instead of a more naive one? The latter turns out to be faster now.

@aplavin
Copy link
Member Author

aplavin commented Nov 21, 2023

Nightly failures unrelated

@N5N3
Copy link
Contributor

N5N3 commented Nov 24, 2023

We just met a Vararg length limitation here. Tuple reducer using Base.tail plays bad for these long inputs as expected.

It should be noticed that the current version gives a stronger promising on inlining. As map has no @inline, and callsite inlining is not supported before 1.8.

There's also some concern on inferability on nested StructArray, though it should be usually ok in this case.

@aplavin
Copy link
Member Author

aplavin commented Nov 24, 2023

some concern on inferability on nested StructArray

Could you please suggest a few @inferred tests to add to the testsuite, so that to check for that? Not just for this PR, but for anything that happens in the future as well. Otherwise it's very easy to forget!

the current version gives a stronger promising on inlining

Are you talking about the @inline annotation? Docs say that it doesn't force nor guarantee anything: it's an "extra nudge" as docs phrase it. Also they say small functions typically don't need this annotation at all...
map on Tuples is very fundamental in Julia, pretty sure its performance is kept in check.

@N5N3
Copy link
Contributor

N5N3 commented Nov 24, 2023

I believe map itself is not designed for handling this kind of long tuples though. It would give up rescusion and fallback to a collect to Vector{Any} algorithm which punishes inference. You need ntuple or generated function to forse unroll in this case.

As for inlining, the current code would has more possibility to be inlined. In fact, getindex is not that cheap, especially for AbstractArray. So I think we'd better remove all inline blocker here.

@aplavin
Copy link
Member Author

aplavin commented Nov 24, 2023

Indeed, sounds like ntuple() should bring the best from both worlds here :) Will update the PR, always forget about that function.

And would be really great if you could suggest more inference or related tests for StructArrays, so that we could catch (some) of these potential issues automatically in the future.

@N5N3
Copy link
Contributor

N5N3 commented Nov 24, 2023

ntuple is also a inline blocker here. I think a StructArray of SMatrix{4,4} (or bigger) is enough to see the difference.

In fact, the limitation comes from Base.argtail. So the simpliest fix here is defining tail ourselves.

As for inference, I believe a nested StructArray with 4 or 5 layers is enough to check stability. Though I wound if this has a realistic usage (just like the big struct in MWE). IIUC
, Array should be preferred for these cases.

@piever
Copy link
Collaborator

piever commented Nov 30, 2023

I also seem to remember we switched to this version as a map-based (or rather ntuple-based) get_ith caused problem in CUDA kernels (see #114 and #87). We should probably add tests that StructArrays work in CUDA kernels, but I'm unsure what is a way to do that without complicating CI (I don't think the standard GitHub CI has access to a GPU).

@aplavin
Copy link
Member Author

aplavin commented Nov 30, 2023

Looks like lots of stuff is interconnected here! Wide, deep, GPU arrays...
Maybe just do @generated get_ith then, should always be no-overhead and work with any GPU type?

Pinging @vchuravy @lcw (those from the PR you linked): do you guys have any suggestions?

@aplavin
Copy link
Member Author

aplavin commented Dec 9, 2023

Pushed a straightforward @generated implementation: seems to solve all mentioned issues!

Don't know how to add something relevant to tests, but here are benchmarks:

A = StructArray([SMatrix{6,6}(rand(6,6)) for _ in 1:10])
@btime $A[5]
# before:
# 6.948 μs (155 allocations: 5.36 KiB)
# after:
# 1.021 μs (39 allocations: 1.45 KiB)

B = StructArray(a=StructArray(b=StructArray(c=StructArray(d=StructArray(e=StructArray(f=1:10))))))
@btime $B[5]
# before:
# 1.500 ns (0 allocations: 0 bytes)
# after:
# 1.500 ns (0 allocations: 0 bytes)

C = StructArray(;[Symbol("x$i") => 1:10 for i in 1:50]...)
@btime $C[5]
# before:
# 42.375 μs (854 allocations: 45.86 KiB)
# after:
# 1.092 μs (3 allocations: 1.31 KiB)

@N5N3
Copy link
Contributor

N5N3 commented Dec 9, 2023

So there are still remaining allocations? Perhaps it's another similar problem caused by splatted Tuple. I guess you need to fix that too to make your case fully optimized.

If we design to optimize this package for eltype with big struct, it would be good to ensure it works in more cases. Just for example, apparently inference would be blocked by map in _getindex below.

@aplavin
Copy link
Member Author

aplavin commented Dec 9, 2023

That's a bit above my head – tried to specialize more functions involved, but it didn't help. Suggestions welcome!

And this PR is a strict improvement anyway, with more than an order of magnitude speedup. This can easily turn the overhead from "dominates the profview" to "barely visible there".

@lcw
Copy link
Contributor

lcw commented Dec 11, 2023

I haven't tested it but the new @generated version of the code looks great. It should work on the GPU as well.

src/structarray.jl Outdated Show resolved Hide resolved
src/structarray.jl Outdated Show resolved Hide resolved
@piever
Copy link
Collaborator

piever commented Dec 14, 2023

I also like the generated function approach, it does seem very simple and elegant. I've left a couple minor suggestions, but overall looks good to me.

aplavin and others added 2 commits December 26, 2023 08:20
@aplavin
Copy link
Member Author

aplavin commented Dec 26, 2023

Followed those comments, should be ready!
Still, not sure if something relevant is possible to put into tests, but that would be great.

@aplavin
Copy link
Member Author

aplavin commented Jan 12, 2024

gentle bump...

@aplavin
Copy link
Member Author

aplavin commented Feb 14, 2024

bump

@aplavin aplavin merged commit 3880403 into JuliaArrays:master Feb 23, 2024
7 checks passed
@aplavin
Copy link
Member Author

aplavin commented Feb 23, 2024

I'll take the liberty to merge this, following an advice on slack #arrays. The PR didn't get a strong "no" from maintainers, changes are nonbreaking, and quite local (ie not an overhaul).
Please let me know if anything is wrong, it's the first time I do this in JuliArrays.

@aplavin aplavin deleted the patch-1 branch June 15, 2024 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants