
Added support for GPU acceleration (CUDA) on recovery file creation. #176

Open
RisaKirisu wants to merge 27 commits into master

Conversation

RisaKirisu

A few months ago I used this software to create my backup, but I felt it was very slow. At the same time, I was learning CUDA. So over the past few months, I learned Reed-Solomon encoding, read the paper that this project is based on, and studied the code in this project. Then I wrote a CUDA compute routine for recovery file creation. In my tests, the CUDA routine on my RTX 2080 Ti is around 4x faster overall than the OpenMP routine on my Ryzen 5800X (8 cores, 16 threads).

Since adding CUDA compilation support to Autotools is a pain, I wrote a new build script using CMake. Users can compile with the CUDA part enabled by passing ENABLE_CUDA=ON to cmake. When the CUDA option is not enabled, the CMake script should produce the same program as the automake script.

I tested the CUDA routine by producing recovery files for random input files with both the CUDA version and the latest release of par2cmdline, and then diffing the recovery files produced by the two. There is no difference.

Lastly, I'm a student and this is the first open source project I have participated in. The code might not be perfect, but I'll try to fix any bugs or coding style issues that come up, and I will try my best to learn.

500M input test (options are c -r30 -b4057): [benchmark screenshot]
20G input test (options are c -q -r30 -b4057): [benchmark screenshot]

RisaKirisu added 27 commits July 1, 2022 00:39
…t hash is computed incorrectly. It works now!
@animetosho
Contributor

This looks quite interesting!

Not commenting on your code or pull request, but as an aside, if you're looking for faster PAR2, it's worth checking out MultiPar and ParPar.
Both these employ SIMD to be several times faster than par2cmdline, and the former includes an OpenCL implementation. I've also got an OpenCL implementation in ParPar, but it isn't enabled (I haven't been able to get great performance out of a GPU unfortunately).

I don't maintain par2cmdline (so again, not a comment on your pull request), but as a general suggestion, you may wish to consult with projects before implementing a large change - it could feel disheartening if it's ultimately not accepted after all. Also consider breaking it into smaller pull requests if possible, to be more palatable for maintainers to review.

Appreciate the contribution nonetheless!

@RisaKirisu
Author

RisaKirisu commented Nov 10, 2022

@animetosho
Thank you for the comment. I checked out the projects you linked to; the benchmark page you created in particular has a lot of helpful information. Indeed, GF multiplication is the fundamental part of the program. The CUDA routine I implemented uses log and antilog tables for multiplication, and I was not aware of the shuffle and XOR algorithms you mentioned. It seems they could bring a significant performance improvement, so I'd like to study them and see if they can be implemented on a GPU. Right now I expect about half of the log/antilog table lookups to hit in the GPU's L1 cache (because the factor doesn't change), but there is still a lot of random access outside of L1, so there are definitely improvements to be made.
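For context, the core of the log/antilog approach is roughly the following (a minimal sketch, not the actual kernel from this PR; the table and kernel names are illustrative):

```cuda
#include <cstdint>

// GF(2^16) log/antilog tables (~128 KB each), precomputed on the host for
// the PAR2 generator polynomial 0x1100B and copied to device memory.
__device__ uint16_t d_log[65536];   // d_log[x] = discrete log of x, x != 0
__device__ uint16_t d_exp[65535];   // d_exp[i] = 2^i

__device__ uint16_t gf16_mul(uint16_t a, uint16_t b) {
    if (a == 0 || b == 0) return 0;
    uint32_t s = (uint32_t)d_log[a] + d_log[b];  // add discrete logs
    if (s >= 65535) s -= 65535;                  // modulo 2^16 - 1
    return d_exp[s];
}

// One 16-bit word per thread: accumulate input * factor into a recovery block.
__global__ void rs_accumulate(const uint16_t* in, uint16_t* acc,
                              uint16_t factor, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) acc[i] ^= gf16_mul(in[i], factor);
}
```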

When I started, I didn't know if I would be able to get this done, so perhaps I didn't have enough confidence and thus didn't consult with the project before I started. I will definitely keep that in mind.

@animetosho
Contributor

animetosho commented Nov 11, 2022

I wrote a summary of techniques for CPU here.

There's a paper for what I call "shuffle" here. This may be workable on a GPU, but I haven't tried, as OpenCL doesn't support warp shuffles (which shouldn't be a limitation in CUDA).
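As a rough illustration of how the shuffle idea might map to CUDA (a hypothetical sketch, not taken from any existing implementation: the four 16-entry nibble tables for a factor live in warp registers, and __shfl_sync stands in for the vector shuffle; 0x1100B is the PAR2 polynomial, everything else is made up):

```cuda
#include <cstdint>

// Bitwise GF(2^16) multiply (PAR2 polynomial 0x1100B), used only to build
// the per-factor nibble tables.
__device__ uint32_t gf16_mul_slow(uint32_t a, uint32_t b) {
    uint32_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10000) a ^= 0x1100B;
    }
    return p;
}

// Each 16-bit word splits into four nibbles; lane l of the warp holds
// T_k[l & 15] = ((l & 15) << 4k) * factor in a register, so __shfl_sync
// acts as a 16-entry table lookup.
__global__ void rs_accumulate_shuffle(const uint16_t* in, uint16_t* acc,
                                      uint16_t factor, size_t n) {
    uint32_t nib = threadIdx.x & 15;
    uint32_t t0 = gf16_mul_slow(nib,       factor);
    uint32_t t1 = gf16_mul_slow(nib << 4,  factor);
    uint32_t t2 = gf16_mul_slow(nib << 8,  factor);
    uint32_t t3 = gf16_mul_slow(nib << 12, factor);

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    uint32_t a = (i < n) ? in[i] : 0;   // keep the full warp converged
    uint32_t p = __shfl_sync(0xffffffffu, t0,  a        & 15)
               ^ __shfl_sync(0xffffffffu, t1, (a >> 4)  & 15)
               ^ __shfl_sync(0xffffffffu, t2, (a >> 8)  & 15)
               ^ __shfl_sync(0xffffffffu, t3, (a >> 12) & 15);
    if (i < n) acc[i] ^= (uint16_t)p;
}
```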

"XOR" is a technique I discovered - it somewhat resembles a Cauchy layout in GF-Complete. I wrote up about it here, but don't think it's workable on a GPU as it requires JIT for each multiplication. Perhaps you could try splitting the multiplication in half, to avoid the need for JIT, but performance might suffer a lot.

GPUs lack instructions for more elaborate techniques that can be done on a CPU (like polynomial-multiply on ARM or GF affine on x86), so you may find that basic algorithms work the best here.

I've played around with the log/exp technique. There are actually a few tweaks that can be done regarding it (see the sketch after this list):

  • avoid exponentiating the factor, i.e. keep it in logarithm form - this saves a log table lookup when doing the multiply
  • when copying the input to local memory, do the logarithm lookup there - this removes the other log lookup during multiplication (which also means the log table doesn't need to consume cache)
  • with the above, the inner loop should mostly be an antilog lookup + xor accumulate
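
A rough sketch of what those tweaks could amount to (assuming the factors are supplied already in log form and are nonzero; this keeps the input word's log in a register rather than local memory, which serves the same purpose for a single pass; all names are illustrative):

```cuda
#include <cstdint>

__device__ uint16_t d_log[65536];   // d_log[x] = discrete log of x, x != 0
__device__ uint16_t d_exp[65535];   // antilog table: d_exp[i] = 2^i

// Each input word feeds many recovery blocks, so its log is taken once and
// reused; the inner loop is then just an antilog lookup + xor accumulate.
__global__ void rs_accumulate_log(const uint16_t* in, uint16_t* const* acc,
                                  const uint16_t* log_factors,  // log form
                                  int num_outputs, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint16_t a = in[i];
    if (a == 0) return;                 // 0 * anything = 0; log(0) undefined
    uint32_t log_a = d_log[a];          // the only log lookup for this word
    for (int r = 0; r < num_outputs; ++r) {
        uint32_t s = log_a + log_factors[r];
        if (s >= 65535) s -= 65535;     // exponent modulo 2^16 - 1
        acc[r][i] ^= d_exp[s];          // antilog lookup + xor
    }
}
```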

The antilog table consumes 128KB, which may be too large for cache. I looked at a split exponentiation technique, which reduces table size to 16.25KB (which should fit in cache), at the expense of more operations. Results seem to be mixed across GPUs I've tested.
Splitting the table works on the notion of 2^(a+b) = 2^a * 2^b and finding a fast way to do * 2^b. Essentially I do a lookup on the top 13 bits (i.e. 2^a), then use a second lookup for multiplying by the bottom 3 bits.
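If I've read that description correctly, the split could be realized along these lines (a sketch; the reduction-table reading of the second lookup is an assumption, though it matches the 16 KB + 0.25 KB total; table names are made up):

```cuda
#include <cstdint>

// Split exponentiation: 2^e = 2^(8*hi) * 2^lo, where e = hi*8 + lo.
__device__ uint16_t d_exp_hi[8192];  // d_exp_hi[h] = 2^(8*h); 16 KB
__device__ uint16_t d_reduce[128];   // d_reduce[h] = (h << 16) mod 0x1100B; 256 B

__device__ uint16_t gf16_exp_split(uint32_t e) {  // e in [0, 65534]
    uint32_t v = d_exp_hi[e >> 3];    // lookup on the top 13 bits
    uint32_t t = v << (e & 7);        // multiply by 2^lo via a shift (lo <= 7)
    return (uint16_t)t ^ d_reduce[t >> 16];  // second lookup folds overflow back
}
```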

I also looked at halving the log table to 64KB, but it doesn't seem to be beneficial. Exponentiation is trivial to split, but it doesn't work as nicely for log.

However, from my testing, it seems the classic low/high split lookup often works best on GPU (sketched below). This is the same algorithm implemented in par2cmdline and in MultiPar's OpenCL code.
A warp shuffle based implementation might be interesting to see however.
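
For reference, the low/high split is roughly the following (a sketch; staging the per-factor tables in shared memory is an illustrative choice, not necessarily how par2cmdline or MultiPar's OpenCL code arranges it):

```cuda
#include <cstdint>

// Bitwise GF(2^16) multiply (PAR2 polynomial 0x1100B), used only for setup.
__device__ uint16_t gf16_mul_slow(uint32_t a, uint32_t b) {
    uint32_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10000) a ^= 0x1100B;
    }
    return (uint16_t)p;
}

// Low/high split: two 256-entry product tables per factor, so each 16-bit
// multiply is two byte-indexed lookups and an xor.
__global__ void rs_accumulate_lohi(const uint16_t* in, uint16_t* acc,
                                   uint16_t factor, size_t n) {
    __shared__ uint16_t lo_tab[256];  // lo_tab[x] = x * factor
    __shared__ uint16_t hi_tab[256];  // hi_tab[x] = (x << 8) * factor
    for (unsigned x = threadIdx.x; x < 256; x += blockDim.x) {
        lo_tab[x] = gf16_mul_slow(x, factor);
        hi_tab[x] = gf16_mul_slow((uint32_t)x << 8, factor);
    }
    __syncthreads();

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        uint32_t a = in[i];
        acc[i] ^= lo_tab[a & 0xff] ^ hi_tab[a >> 8];
    }
}
```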

When I started, I didn't know if I would be able to get this done, so perhaps I didn't have enough confidence and thus didn't consult with the project before I started.

That's fair, though once you've done some work, there's little harm in asking. If you're still unsure, you could take a look at existing issues to get a sense of project activity, the types of changes that get accepted, etc. At the end of the day, if you hope for your changes to be accepted, you'll need to engage with the project maintainers at some point, so it might be better sooner than later.
I mostly point this out because par2cmdline hasn't historically been a performance-oriented implementation. I can't speak for the maintainers here, but it's something worth considering when submitting vendor-specific performance optimisations like this.

Hope that was helpful.

@BlackIkeEagle
Member

Thanks for being patient; I like the idea of CUDA acceleration. I currently lack the hardware to test CUDA and also the time to go through such a huge MR.

Hence I have also updated the README to note that I'm looking for someone to take over the project.

@RisaKirisu
Author

@BlackIkeEagle I'm sorry that I didn't discuss this in an issue before doing the work and making such a large addition. What would be a proper next step for this?

I later thought of many possible optimizations to this initial CUDA implementation, mostly inspired by the earlier replies from animetosho. However, I've been too busy with graduation and job hunting to implement and test most of them. Now that I'm starting to have more free time, I will resume working on this.
