This project was made to learn about compute shaders and to serve as a reference for similar projects.
It contains several implementations so you can see how they compare; check out the different folders in Assets.
Higher-resolution previews: Swarm - Skull - I Love Unity - Skull Mesh - Upvote - Moving Skull - Close Up Boids
Features
- Flocking behaviour
- Parameters: speed, size, rotation, radius check...
- Skinned Mesh Boid animation data used on GPU
- Vertex frame interpolation
- Affectors with force and distance
- Convert data points to drawing
- Bitonic sorting
How To Use
Open the sample scene AllFlocks and run it.
Try out the different implementations by toggling the different GameObjects. Play with the settings to see what you can do, and move the GameObject around so that your boids follow it.
For custom drawings, use my other project PathToPoints, which converts an SVG file to a set of data points.
Benchmarks
Using a GTX 980 Ti
Implementation | 1000 Boids | 4000 Boids | 32000 Boids |
---|---|---|---|
CPU Flock | 20 FPS | 3 FPS | < 1 FPS |
CPU Draw/GPU Compute | 126 FPS | 14 FPS | < 1 FPS |
GPU Flock | > 1000 FPS | > 1000 FPS | 93 FPS |
GPU Flock multilateration | > 1000 FPS | 400 FPS | 42 FPS |
GPU Flock bitonic sorting | > 1000 FPS | 950 FPS | 20 FPS |
GPU Flock skinned and affectors | > 1000 FPS | > 1000 FPS | 80 FPS |
It seems my attempts to optimize with different implementations failed: a brute-force loop appears to be faster than any other method.
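As a rough CPU-side illustration of what a brute-force neighbor check looks like, here is a minimal Python sketch. It is not the project's shader code: the function name `count_neighbors`, the sample positions, and the radius are all hypothetical, and on the GPU the outer loop would be spread across threads rather than run sequentially.

```python
import math

def count_neighbors(positions, radius):
    """For every boid, count the other boids within `radius` (brute force, O(n^2))."""
    counts = []
    for i, p in enumerate(positions):
        n = 0
        for j, q in enumerate(positions):
            if i == j:
                continue  # a boid is not its own neighbor
            if math.dist(p, q) <= radius:
                n += 1
        counts.append(n)
    return counts

# Hypothetical sample data: two boids close together, two isolated ones.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (10.0, 10.0, 10.0)]
neighbor_counts = count_neighbors(positions, radius=1.5)
```

The inner loop always visits every boid, so the cost per boid is constant and the memory access pattern is the same for all of them.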
In GPU Flock, each boid checks every other boid to see whether it is in range, so in the 32,000-boid case we get a stable 32k-iteration loop every frame. Bitonic sorting, on the other hand, averages around 5k iterations but is still slower. Interestingly, the bitonic sort itself does not seem to be the problem. The problem seems to be that each thread starts reading data at an offset instead of at the beginning of the buffer, which means we get a lot of cache misses on the GPU. Check out Boids_Bitonic.compute for more information; I would be glad to get some feedback on that.
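For readers unfamiliar with bitonic sorting, the following is a minimal Python sketch of the general sorting network, assuming the input length is a power of two. It sorts plain numbers; what the project actually sorts on is defined in Boids_Bitonic.compute and is not assumed here. The function name `bitonic_sort` is hypothetical.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network."""
    a = list(values)
    n = len(a)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:            # distance between the two compared elements
            for i in range(n):   # on a GPU, each i would typically be one thread
                partner = i ^ j
                if partner > i:  # handle each pair only once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

sorted_values = bitonic_sort([7, 3, 5, 1, 8, 2, 6, 4])
```

Every `(k, j)` pass compares independent pairs, which is why the algorithm maps well to one GPU dispatch per pass.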
Compute Shaders
A few tips and notes about compute shaders.
Padding had a big impact on performance: at times it increased my FPS by 10%. Strangely, I read that padding to 16 bytes is what is suggested, but in my experiments I sometimes had to add 4 to 8 additional bytes (see Boid_Simple.compute vs Boid.compute). Can anyone shed light on this?
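To make the 16-byte suggestion concrete, here is a small Python sketch that computes how many padding bytes a struct needs to reach the next multiple of 16. The boid layout used (position, direction, one noise float) is a hypothetical example, not the layout from Boid.compute, and the helper `padded_size` is made up for illustration.

```python
import struct

def padded_size(size, alignment=16):
    """Round `size` up to the next multiple of `alignment`."""
    return (size + alignment - 1) // alignment * alignment

# Hypothetical boid layout: position (3 floats), direction (3 floats), noise (1 float).
boid_format = "<3f3f1f"
raw_size = struct.calcsize(boid_format)        # 7 floats * 4 bytes each
pad_bytes = padded_size(raw_size) - raw_size   # bytes missing to reach a multiple of 16
padded_format = boid_format + f"{pad_bytes}x"  # 'x' is a pad byte in struct formats
```

This only covers the commonly cited rule; it does not explain the cases above where extra bytes beyond the 16-byte multiple helped.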
An array access (like MyStructuredBuffer[instanceId]) is really costly, so when I had to access my buffer more than once I logically cached the element in a variable. Some of the time, though, it was faster to access the buffer again without caching it; this probably depends on the size of your struct and the number of times you access it.
Do not use ComputeBuffer.GetData(); it will tank your performance. Instead, as this project does, pass values around in buffers, and things will become fast as hell. If you really have to read data back, try out the experimental Async GetData().
Future
This GPU flocking system is a great way to learn about compute shaders and is quite inexpensive to run for a few thousand units, since it offloads the work to the GPU and there is no readback to the CPU.
With the arrival of ECS and the Job System in Unity, and the already impressive groundwork laid by the ECS flocking sample, I think both systems are roughly equivalent. The ECS one has the advantage of being easier to extend and debug, which might lead me to port the features of this system to ECS.
Requirements
- Tested on Unity 2017+; should work from Unity 5.6
- Platform that supports compute shaders (PC & Console)
Credits