
WIP: Julia Implementation #79

Merged
merged 27 commits into from
Nov 7, 2023

Conversation


@cortner cortner commented Oct 28, 2023

This PR implements a Julia package SpheriCart.jl as discussed in #23. Note this is MIT licensed. In the API I tried to mimic the Python interface but adapt it to Julia style. The low-level interface is more C-like, but also uses some Julia conveniences. I could get rid of that if desired.

Done so far:

  • fast generated kernels for single inputs
  • generic kernels for many inputs, SIMD vectorized
  • basic docs in README.md

TODO to merge PR:

  • gradients
  • add documentation in the main docs pages
  • test that results exactly match the Python / C++ implementation. I would appreciate help with this.
  • spherical harmonics

For the last point, it may be necessary to fix my basis ordering. I haven't yet looked into what ordering you use. But it would be a trivial change. EDIT: it seems the ordering is fine.

TODO FUTURE:

  • careful performance comparison and see where more could be done to optimize the Julia codes. In principle it should be possible to match C++ performance.
  • ChainRules.jl compatibility
  • multi-threaded kernels for larger input batches
  • GPU kernels.
  • better generic kernels for single inputs (to discuss); Generate optimized polynomial recursion for small L and make all codes use that optimized start.

I propose that "TODO" need to be finished to merge this but the "TODO FUTURE" points can be added in future PRs.


cortner commented Oct 28, 2023

@tjjarvinen -- you might be interested in keeping an eye on this and look into translating the GPU kernels once the CPU kernels are merged.


frostedoyster commented Oct 28, 2023

Thanks again for the PR @cortner!
Regarding some of the points above:

  • I don't think there would be any issues with the Julia conveniences

  • Of course we can help with the testing. Another route to test the Julia functions would be to use a native Julia scientific package as the baseline (for example, we test our Python implementations against SciPy)

  • The output array layout we use is [sample, sh_index] where sh_index is a combined (l, m) index with lexicographic ordering: (0, 0), (1, -1), (1, 0), (1, 1), (2, -2), (2, -1), (2, 0),... I had a very quick look and it seems like you're already using the same ordering, but I could be wrong. For the derivatives, we add the extra dimensions in the middle: our gradients are [sample, alpha, sh_index] and our Hessians are [sample, alpha, beta, sh_index], where alpha and beta are x, y, or z.

  • I think the separation between what needs to be done now and in the future is also sensible. Before we merge, I can help with the C++ benchmarks so we have a rough idea of where we are in terms of performance
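The combined (l, m) index in the layout described above has the closed form l² + l + m (0-based; add 1 for Julia's 1-based arrays). A quick sketch of this, where `lm_index` is a name of our choosing and not part of either API:

```julia
# Combined (l, m) -> flat index for the lexicographic ordering
# (0,0), (1,-1), (1,0), (1,1), (2,-2), ...  (1-based for Julia arrays)
lm_index(l::Integer, m::Integer) = l^2 + l + m + 1

# The first few entries reproduce the ordering listed above:
order = [(l, m) for l in 0:2 for m in -l:l]
@assert [lm_index(l, m) for (l, m) in order] == 1:9
```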

I would also be happy to help with the documentation


cortner commented Oct 28, 2023

Another route to test the Julia functions would be to use a native Julia scientific package as the baseline

I've already tested against symbolics up to L = 4 and confirmed that the basis is orthonormal. That implies that it's the same as yours up to a sign.
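As a sketch of that kind of orthonormality test (not the actual SpheriCart.jl test suite): one can check by Monte-Carlo integration over the unit sphere that (4π/N)·ZᵀZ ≈ I. The hardcoded l ≤ 1 real harmonics below are standard and merely stand in for the package's output:

```julia
using LinearAlgebra, Random

rng = MersenneTwister(0)
N = 100_000
X = randn(rng, N, 3)
X ./= map(norm, eachrow(X))        # N uniform samples on the unit sphere

# Real spherical harmonics for l <= 1, in the (0,0), (1,-1), (1,0), (1,1) order
c0, c1 = sqrt(1 / 4π), sqrt(3 / 4π)
Z = hcat(fill(c0, N), c1 .* X[:, 2], c1 .* X[:, 3], c1 .* X[:, 1])

# Monte-Carlo Gram matrix; orthonormality means G ≈ I up to sampling noise
G = (4π / N) * (Z' * Z)
@assert maximum(abs.(G - I)) < 0.05
```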

I had a very quick look and it seems like you're already using the same ordering, but I could be wrong

Yes, I am. Thanks for confirming.


cortner commented Oct 28, 2023

For the derivatives ...

In Julia I would simply return derivatives as SVector, this is the most natural thing to do. It might be equivalent to yours in terms of memory layout, we can look at that.


cortner commented Oct 28, 2023

Before we merge, I can help with the C++ benchmarks so we have a rough idea of where we are in terms of performance

Thank you - that would be greatly appreciated. Even if just to document the differences initially. If the gap is bigger than expected then we can transfer your symbolic optimisations, or get more experienced Julia programmers to look at the reasons for the gap.

@frostedoyster

In Julia I would simply return derivatives as SVector, this is the most natural thing to do. It might be equivalent to yours in terms of memory layout, we can look at that.

If this is an SVector of dynamically allocated vectors, I don't think it would be equivalent (the SVector would probably contain pointers that are themselves contiguous, but which point to different memory blocks on the heap). In any case, we think it's much more important to fit the idiomatic use of the different languages than to keep consistency across the APIs. Let's go with SVectors.


frostedoyster commented Oct 28, 2023

Regarding the benchmarking, here is what worked for me:

(starting from the sphericart root directory)
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/.local -DSPHERICART_OPENMP=OFF -DSPHERICART_BUILD_EXAMPLES=ON
make install
cd benchmarks/
./benchmark_cpp

This builds sphericart with the C++ benchmark program and without multithreading.
The benchmark will output timings for l=10 as well as low l values (1 to 6)
You can change the requested high l from 10 to (for example) 15 by running ./benchmark_cpp -l 15.
Let me know if this works for you

@frostedoyster

I'm getting, for l=6 and 10000 samples,

32 ns/sample (Julia)
33 ns/sample (C++ generic)
30 ns/sample (C++ with hardcoding)

on my laptop with a lot of other stuff going on (so the results are probably not very reliable). In any case, the speed of this implementation looks already very good.


cortner commented Oct 28, 2023

Don't worry, I would never form a vector of dynamically allocated vectors; that's madness. I'll explain the memory layout with SVectors later when I have time.

Thanks also for the timings and the advice on how to run the benchmarks myself. I'll try to reproduce locally and on my group's server.


cortner commented Oct 28, 2023

I'd very much like to understand how you get that sort of timing for a generic implementation that doesn't do static analysis and unrolling. Or does it? Maybe your generic is closer to my generated, and your generated is a bit of extra optimisation of the symbolic expressions on top?

@frostedoyster

Our generic implementation is as you say: it doesn't assume anything. Instead, our hardcoded expressions up to l_max=6 are all templated, so that l_max is static and the loops should therefore be unrolled. If l_max>6, we split all loops into one that goes from 0 to 6 (unrolled) and one that goes from 7 to l_max (not unrolled). Based on this, your implementation might be faster than ours for l_max>6. I'd like to try, but I'm having issues at the moment on our clusters and I don't trust my laptop. It will be interesting to see the results on your clusters!


frostedoyster commented Oct 28, 2023

In other words, we're tying together hardcoding and templating l_max for simplicity. If the Julia implementation is faster than the C++ one for l_max>6, it might be worth templating our l_max>6 spherical harmonics (even though they're not hardcoded) and see what happens


cortner commented Oct 28, 2023

If l_max>6, we split all loops into one that goes from 0 to 6 (unrolled) and one that goes from 7 to l_max (not unrolled).

Very nice idea - I will add this.
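The split described in the quote could be sketched in Julia roughly as follows. All names here (`step!`, `static_levels!`, `compute_levels!`, `LSTATIC`) are placeholders of our choosing, not the actual SpheriCart.jl internals, and the trivial `step!` only exists to make the sketch runnable; it is not the real recursion:

```julia
const LSTATIC = 6

# Placeholder for one level of the spherical-harmonics recursion;
# here it just records x^l so the sketch can be executed.
step!(out, x, l) = (out[l + 1] = x^l; out)

# Unrolled static prefix: the loop over l = 0..L is expanded at compile time.
@generated function static_levels!(out, x, ::Val{L}) where {L}
    calls = [:(step!(out, x, $l)) for l in 0:L]
    return quote
        $(calls...)
        out
    end
end

# Hybrid driver: unrolled part for l <= LSTATIC, plain dynamic loop above that.
function compute_levels!(out, x, lmax::Integer)
    static_levels!(out, x, Val(min(lmax, LSTATIC)))
    for l in (LSTATIC + 1):lmax
        step!(out, x, l)
    end
    return out
end
```

For example, `compute_levels!(zeros(11), 2.0, 10)` fills entry l+1 with 2.0^l, the first seven via the unrolled path and the rest via the dynamic loop.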

@frostedoyster

My laptop timings are completely inconsistent with those in our paper -- I wouldn't trust them. I'll try to get actual timings when they fix my cluster account, hopefully on Monday.


@tjjarvinen tjjarvinen left a comment


These are a couple of things I noticed related to Float32 calculations.

(Review comments on SpheriCart.jl/src/generated_kernels.jl and SpheriCart.jl/LICENSE; outdated and resolved.)
@frostedoyster

I've finally managed to run some benchmarks on our cluster. Here are the results

L = 1
  2.894 ns (0 allocations: 0 bytes)
L = 2
  10.618 ns (0 allocations: 0 bytes)
L = 3
  12.914 ns (0 allocations: 0 bytes)
L = 4
  15.834 ns (0 allocations: 0 bytes)
L = 5
  21.852 ns (0 allocations: 0 bytes)
L = 6
  27.413 ns (0 allocations: 0 bytes)
L = 7
  36.759 ns (0 allocations: 0 bytes)
L = 8
  48.932 ns (0 allocations: 0 bytes)
L = 9
  56.446 ns (0 allocations: 0 bytes)
L = 10
  74.160 ns (0 allocations: 0 bytes)


L=1 (no h-c) values              took 3.57942 ± 0.0426515 ns / sample
L=1 hardcoded values             took 1.59056 ± 0.0712444 ns / sample

L=2 (no h-c) values              took 7.31367 ± 0.0653529 ns / sample
L=2 hardcoded values             took 3.52026 ± 0.0565254 ns / sample

L=3 (no h-c) values              took 12.3802 ± 0.122709 ns / sample
L=3 hardcoded values             took 11.9444 ± 0.0806885 ns / sample

L=4 (no h-c) values              took 18.8659 ± 0.13839 ns / sample
L=4 hardcoded values             took 11.1446 ± 0.116836 ns / sample

L=5 (no h-c) values              took 27.5947 ± 0.204012 ns / sample
L=5 hardcoded values             took 15.8352 ± 0.122492 ns / sample

L=6 (no h-c) values              took 39.2852 ± 0.214211 ns / sample
L=6 hardcoded values             took 23.9707 ± 0.161073 ns / sample


cortner commented Nov 1, 2023

L = 1
  2.894 ns (0 allocations: 0 bytes)
L = 2
  10.618 ns (0 allocations: 0 bytes)
L = 3
  12.914 ns (0 allocations: 0 bytes)

This looks about right. On my machine it's

L = 1
  2.250 ns (0 allocations: 0 bytes)
L = 2
  4.708 ns (0 allocations: 0 bytes)
L = 3
  7.750 ns (0 allocations: 0 bytes)

If I generate optimized polynomial expressions symbolically I think I gain maybe another 30-40% performance. We could do this later, but for now I want to stick with this automatically generated version until it's finished, tested etc.


cortner commented Nov 3, 2023

Gradients for batched evaluation are done. When reading your paper I was very sceptical about your gradient timings; I thought they looked poor. But I didn't want to say anything until I had tried myself. It turns out that while my evaluation is a bit slower than yours, my gradients are actually faster.

[ Info: compute! vs compute_with_gradients! (using 32 inputs)
L = 1
        compute!:   287.483 ns (0 allocations: 0 bytes)
 _with_gradient!:   350.118 ns (0 allocations: 0 bytes)
L = 2
        compute!:   311.536 ns (0 allocations: 0 bytes)
 _with_gradient!:   474.913 ns (0 allocations: 0 bytes)
L = 3
        compute!:   408.750 ns (0 allocations: 0 bytes)
 _with_gradient!:   672.276 ns (0 allocations: 0 bytes)
L = 4
        compute!:   532.723 ns (0 allocations: 0 bytes)
 _with_gradient!:   922.414 ns (0 allocations: 0 bytes)
L = 5
        compute!:   666.669 ns (0 allocations: 0 bytes)
 _with_gradient!:   1.246 μs (0 allocations: 0 bytes)
L = 6
        compute!:   828.526 ns (0 allocations: 0 bytes)
 _with_gradient!:   1.617 μs (0 allocations: 0 bytes)
L = 7
        compute!:   1.029 μs (0 allocations: 0 bytes)
 _with_gradient!:   2.056 μs (0 allocations: 0 bytes)
L = 8
        compute!:   1.250 μs (0 allocations: 0 bytes)
 _with_gradient!:   2.630 μs (0 allocations: 0 bytes)
L = 9
        compute!:   1.492 μs (0 allocations: 0 bytes)
 _with_gradient!:   3.401 μs (0 allocations: 0 bytes)
L = 10
        compute!:   1.725 μs (0 allocations: 0 bytes)
 _with_gradient!:   4.321 μs (0 allocations: 0 bytes)
L = 11
        compute!:   2.083 μs (0 allocations: 0 bytes)
 _with_gradient!:   5.299 μs (0 allocations: 0 bytes)
L = 12
        compute!:   2.370 μs (0 allocations: 0 bytes)
 _with_gradient!:   6.242 μs (0 allocations: 0 bytes)
L = 13
        compute!:   2.722 μs (0 allocations: 0 bytes)
 _with_gradient!:   7.239 μs (0 allocations: 0 bytes)
L = 14
        compute!:   3.125 μs (0 allocations: 0 bytes)
 _with_gradient!:   8.542 μs (0 allocations: 0 bytes)
L = 15
        compute!:   3.599 μs (0 allocations: 0 bytes)
 _with_gradient!:   9.208 μs (0 allocations: 0 bytes)


cortner commented Nov 3, 2023

An interesting follow-up: my code-generated gradients have a similar performance loss (a factor of about 4) to your "generic codes", which I think means my generated codes must be quite similar in spirit to your generic codes.

[ Info: compute vs compute_with_gradients for code-generated basis
L = 1
        compute:   2.166 ns (0 allocations: 0 bytes)
 _with_gradient:   3.167 ns (0 allocations: 0 bytes)
L = 2
        compute:   4.625 ns (0 allocations: 0 bytes)
 _with_gradient:   10.678 ns (0 allocations: 0 bytes)
L = 3
        compute:   7.883 ns (0 allocations: 0 bytes)
 _with_gradient:   20.562 ns (0 allocations: 0 bytes)
L = 4
        compute:   12.012 ns (0 allocations: 0 bytes)
 _with_gradient:   35.030 ns (0 allocations: 0 bytes)
L = 5
        compute:   19.518 ns (0 allocations: 0 bytes)
 _with_gradient:   53.184 ns (0 allocations: 0 bytes)
L = 6
        compute:   26.759 ns (0 allocations: 0 bytes)
 _with_gradient:   78.890 ns (0 allocations: 0 bytes)
L = 7
        compute:   35.954 ns (0 allocations: 0 bytes)
 _with_gradient:   112.945 ns (0 allocations: 0 bytes)
L = 8
        compute:   41.540 ns (0 allocations: 0 bytes)
 _with_gradient:   150.104 ns (0 allocations: 0 bytes)
L = 9
        compute:   53.341 ns (0 allocations: 0 bytes)
 _with_gradient:   195.902 ns (0 allocations: 0 bytes)
L = 10
        compute:   62.628 ns (0 allocations: 0 bytes)
 _with_gradient:   260.484 ns (0 allocations: 0 bytes)


cortner commented Nov 3, 2023

Maybe I should briefly explain Julia SVectors so that the format in which I return gradients is understood.

In Julia, a Vector{T} has a runtime-determined length and is allocated on the heap. Those are slow to allocate. Conversely, an SVector{N, T} has a compile-time-determined length, namely N, and can therefore live on the stack. Constructing an SVector{N, T} and doing arithmetic with SVectors is cheap.

Further, if I generate a vector of SVectors, i.e. an object of type

dZ::Vector{SVector{3, T}}

then this will be allocated on the heap, but because each element dZ[i] is a bits type, it is stored contiguously in memory. In fact, the memory layout of the dZ above is precisely the same as that of a column-major 3 × length matrix or a row-major length × 3 matrix.
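That layout claim can be checked directly with `reinterpret` (assuming StaticArrays.jl, which provides SVector, is available):

```julia
using StaticArrays

# Two gradient entries stored as a Vector of SVector{3, Float64}
dZ = [SVector(1.0, 2.0, 3.0), SVector(4.0, 5.0, 6.0)]

# Zero-copy reinterpretation as a 3 x length(dZ) column-major matrix,
# confirming that the elements are contiguous in memory
M = reshape(reinterpret(Float64, dZ), 3, :)
@assert M == [1.0 4.0; 2.0 5.0; 3.0 6.0]
```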

@frostedoyster

Ok, no problem. In my opinion, as long as the documentation is there in the README, we can take responsibility for porting it to readthedocs in a PR in the near future, ideally with your review to make sure the resulting documentation is correct. @Luthaf, I would really like to hear your opinion here as well.


cortner commented Nov 6, 2023

I also have no problem transferring the README to readthedocs now and adding a link to the documentation in the README. But I don't want it twice.

I don't know how to have the nice automatic doc generation though. Maybe that part could be postponed?


Luthaf commented Nov 6, 2023

Regarding readthedocs, I don't think they have good integration with Documenter.jl, which is a bit sad. We could use GitHub Pages to host the documentation instead, but we would lose documentation previews on PRs.


@Luthaf Luthaf left a comment


One small point we can also resolve later: I don't think having the code in a folder named Sphericart.jl is required for registration. Given we have the Python code in the python directory, I would prefer to have this code in a julia directory at the root.

@@ -0,0 +1,19 @@
name = "SpheriCart"

nitpick: I've always considered the package to be a single word, not a concatenation of two of them

Suggested change
name = "SpheriCart"
name = "Sphericart"

@cortner:

I'll only accept that change under pressure. It is clearly a concatenation of words, and hence the C should be capitalised. It is not a proper name (an "Eigenname").

the published paper and without consulting the reference implementation
provided by the authors.

`SpheriCart.jl` is released under MIT license.

Suggested change
`SpheriCart.jl` is released under MIT license.
`Sphericart.jl` is released under both MIT and Apache license.


cortner commented Nov 6, 2023

This should not be left for later. It would be huge pain to change the registry manually since it would require manual merging from a maintainer.

I don't like this change either - but will accept it under sufficient pressure.


Luthaf commented Nov 6, 2023

I would hope that the name of the source directory in the git repo where a given project can be found is part of the project version in the registry, and can be changed easily from one version to the next, but I'll check this. Do you know if there is an equivalent to test.pypi.org for the General registry by chance?

Re name and folder change, I'll ask for more opinions then. @frostedoyster and @ceriottm: should the package be called Sphericart.jl or SpheriCart.jl? And should the code live in Spheri{c,C}art.jl/ or julia/ directory?


cortner commented Nov 6, 2023

I would hope that the name of the source directory in the git repo where a given project can be found is part of the project version in the registry, and can be changed easily from one version to the next

As I said above, it requires a manual PR and a manual merge from a highly overstretched group of maintainers.


cortner commented Nov 6, 2023

Do you know if there is an equivalent to test.pypi.org for the General registry by chance?

If there is, then I don't know about it. Julia packages tend to just add a little badge on their GitHub front page that reports the CI status (tests passing/failing).


cortner commented Nov 6, 2023

Regarding the naming: the Julia convention is very clear on this:

  • avoid acronyms as much as possible (your package name is borderline here, but I adopted it for compatibility, and that argument wins)
  • capitalize the first letter of each word in a compound name.

Regarding the folder name: I'm not thrilled with the change; it makes me uncomfortable for some reason, but I don't have any strong argument against it.


Luthaf commented Nov 6, 2023

test.pypi.org does not show CI status; it is a secondary instance of the registry at pypi.org, where you can test that your upload procedure works and that you are producing the right artifacts. Any package uploaded there is removed automatically every so often; it is solely intended for testing the release procedure.


cortner commented Nov 6, 2023

How is this different from CI? Julia gets installed, the package gets installed, the dependencies get installed (including artifacts) and then the tests are run.

@frostedoyster

Sphericart.jl and SpheriCart.jl are both fine by me, and I'm also indifferent as to having julia or SpheriCart.jl as the name of the folder: it's true that we have a python/ folder, but it's also true that we have sphericart-torch/ and sphericart-jax/.


Luthaf commented Nov 6, 2023

How is this different from CI? Julia gets installed, the package gets installed, the dependencies get installed (including artifacts) and then the tests are run.

None of this happens on test.pypi.org. You just upload and register your package there, and it runs the same checks on the uploaded distributions as it would on the main registry. (Python distributions work a bit differently from Julia packages; they are closer to jll packages, since they can contain binaries.) It does not run your package tests; it does not even try to install the package or its dependencies. It does statically check that all declared dependencies already exist on the main registry and that all the metadata is set correctly. From a quick search, nothing of the sort seems to exist for Julia, which is fine!

It also seems that the subdirectory is not recorded with the version (cf. https://github.com/JuliaRegistries/General/blob/244be981c9030ee146c070d56811e02721e403ae/C/cuDNN/Package.toml#L4; this is not in Versions.toml), meaning this is something we need to decide on now. The same goes for the package name: we need to decide before registering it.


cortner commented Nov 6, 2023

From a quick search, nothing of the sort seems to exist, which is fine!

This would be the Registrator bot. But it works a bit differently; in fact it is much more stringent, I think. It will only merge the registration request automatically if all tests pass.

https://github.com/JuliaRegistries/Registrator.jl


ceriottm commented Nov 7, 2023

No super-strong feelings here but my preference would be:

  • SpheriCart.jl as the package name (if there is a common capitalization rule in the community, we should follow it to avoid confusing users). It should be noted, though, that the idea for the name was also to play a bit on spheri-cart and spheric-art; but I usually import sphericart as sc, so I guess my subconscious sees it as a compound name
  • julia/ as the folder name, for analogy with python/, as this is a choice that depends on the parent repo. Incidentally, we might also want to rename sphericart-torch and sphericart-jax to torch and jax if it can be done without much pain

@ceriottm ceriottm closed this Nov 7, 2023
@ceriottm ceriottm reopened this Nov 7, 2023
@frostedoyster

Closing the pull request was a mistake; we're not sure how that happened. So, @cortner: we can rename the folder to julia/ (this can be reverted if it leads to any problems in the Julia ecosystem), leave the package name as SpheriCart.jl, and then merge as soon as preliminary README documentation is ready. We will manage the integration with the rest of the documentation later.


cortner commented Nov 7, 2023

No problem. Please take a look at the existing documentation in the README and tell me whether you think it needs to be significantly expanded.

@frostedoyster

I would say that we need at least a mention (or perhaps an example?) of SphericalHarmonics. Aside from that, it should do the job for now.

@frostedoyster

It looks like the renaming of the folder caused some issues with the tests. I'm looking into it; it should be fixed relatively soon.

@frostedoyster

Thank you very much for all this work @cortner! We might need your review later when we merge the documentation with our own

@frostedoyster frostedoyster merged commit b85b098 into lab-cosmo:main Nov 7, 2023
9 checks passed