Talk presented at PyHEP2021
Abstract: Execution speed is critical for code developed for high energy physics (HEP) research. HEP experiments are typically highly demanding in terms on computing power. The LHC experiments uses a computing grid, the Worldwide LHC computing grid, with one million computer cores to process their data. In this talk we will investigate the potential of the Julia programming language for HEP data analysis. Julia is a high-level and high-performance programming language that provides at the same time, ease of code development similar to Python and running performance similar to C, C++, and Fortran. It offers the same level of abstraction as Python, an interpreter-like experience based on a similar technique as the interpreter of ROOT, and a Jupyter notebook kernel. Results of performance measurements specific to HEP applications with a comparison with Python and C++ will also be presented.
The presentation was written as a Jupyter(lab) notebook. For an interactive presentation, click on the binder badge above or here. The static version follows. The presentation notebook (01-Julia-dream-PyHEP2021.ipynb) and the accompagning ones can be downloaded from this repository.
-
Julia is a new programming language
-
I've discovered it while making a survey of programming language for the projet of CUPID neutrinoless double-beta decay experiment, in last April
- It's a new programming language for me
- I've learned it for this talk and it was as fast as learning Python (actually faster because of their similarities)
-
I will present Julia to you in the perspective of using it for HEP applications
- Computing plays a central role in the research done at LHC
- Theoretical predictions: simulation of the proton-proton collisions
- Simulation of the detector response
- Reconstruction of the phyics events
- Analysis of reconstructed events to perform measurements and new physics search.
- Behind an LHC experiment result publication there are tens of thousands of computing jobs that have run on the worldwide computing grid. Both ATLAS and CMS reached their 1000th papers in June 2020.
- C++ is widely used and the performance it offers is essential
- Research code: developped by the main authors of the prepared publication
- Python is attractive becasue of its easy/fast code writing it offers and the library ecosystem that comes with it and we are many to use it. But it does not meet the performance provided by C++
- Used in conjunction of C++
- Huge development under way to leverage the performance of Python in terms of code running speed
- More attracive would be a Programming language that offers at the same time the C++ and Python pros
Fast/easy coding fast | Fast running | |
---|---|---|
Python | ⇔ | C/C++ |
-
J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah tackled the problem in 2009 aiming to design a programing langue that providess both Fast/easy coding AND Fast running
- Birth of Julia, release 0.1 in 2013
- This breakthrough was recognized by awards attributed to the authors
-
In 12 years since its conceptualisation, Julia has been improved from release to release and has aggratated many package developpers
-
Julia for HEP: Performance of Julia for High Energy Physics Analyses, Marcel Stanitzki & Jan Strube
HEP enjoys loop: we loop on physics events to loop on particles/physics objects. We often perform particle matching and clustering and for this we loop on events to loop on objects to loop on objects.
for event in billions_of_lhc_events
for tens_or_hundreds_of_objects in event
for tens_or_hundres_of_objects_to_match in event
...
end
end
end
- The outter loops hides several loops: datasets > files
- This is repeated several times for each analysis.
⇒ Lines of code executed billions of times even for a Kleenex code, written specially for a publication.
- A master rule for high-performance code in Python is to avoid writing loop in Python
- ⇒ push the loop to underlying compiled libraries. Approach of the
numpy
vectorisation.
- ⇒ push the loop to underlying compiled libraries. Approach of the
Let's run a simple loop
Python |
---|
90 ms |
#include <iostream>
#include <sys/time.h>x
int main(){
struct timeval t0, t1;
gettimeofday(&t0, 0);
double a = 0.;
for(unsigned i = 0; i <= 1000000; ++i) a += 1.0/i;
std::cout << "Computation Result: " << a << "\n";
gettimeofday(&t1, 0);
std::cerr << "Duration: " << (t1.tv_sec-t0.tv_sec)
+ 1.e-6*(t1.tv_usec-t0.tv_usec)
<< " seconds\n";
return 0;
}
run(`g++ -Wall -o simple-loop simple-loop.cc`)
run(`./simple-loop`)
;
run(`g++ -O3 -Wall -o simple-loop simple-loop.cc`)
run(`./simple-loop`)
;
Computation Result: 14.3927
Duration: 0.003335 seconds
Computation Result: 14.3927
Duration: 0.001558 seconds
Python | C++ |
---|---|
90 ms | 1.5 ms |
#
# Julia
#
function f()
a = 0.0
for i in 1:1_000_000 # ✨ Note the underscores that improves legibility
a = a + 1.0/i
end
return a
end
f()
@time b = f()
0.001533 seconds (1 allocation: 16 bytes)
14.392726722864989
Python | C++ | Julia |
---|---|---|
90ms | 1.5 ms | 1.6 ms |
- You have already seen in the previous example that the code syntax and grammar is similar to Pythons. No ''std::map<std:string, std::vector>''..., no compilation step.
Internet search engine and stack overflow play is an essential ingredient in nowadays programming workflow.
Julia is already widespread enough, to find all the information on the Internet.
Go to DuckDuckGo or your prefered search engine and make a try.
- Large set of libraries and active developement
- Julia is firstly used by scientific community ⇒ oriented to our needs
- I did the following game during the wokshop: I've looked for a Julia equivalent each time a speaker mention a Python library (apart from HEP specific ones).
- Caveat: I've not checked that it covers all the features of the Python package
- The results of this survey shows the large activity around Julia
cmd (Olivier Mattelaer) ✓
FreeCAD interface (Christophe Delaere) ❌ In discussion
Telegram bot (Matias Senge) ✓ https://github.com/Arkoniak/Telegram.jl
DataFrames (Vincenzo Eduardo Padulano) ✓
Spark (Vicenzo and Andr F.) ✓ https://github.com/dfdx/Spark.jl
Dask (Vincenzo E. P., Graham Markal) ✓
Batch computing (Vincenzo E. P.) ✓ https://docs.julialang.org/en/v1/manual/parallel-computing/, https://github.com/JuliaParallel, https://juliagpu.org/
Apache Parquet (Andre Frankenthal) ✓
Jupyter/Binder/SWAN ✓
Bokeh (Bruno Alves) ✓ https://github.com/samuelcolvin/Bokeh.jl
CUDA (Graham Markall) ✓ https://juliagpu.org/cuda/
Hypothesis (Santam Roy Choudhury, property testing) ❌ (besised an unmaintained QuickCheck projet)
Virtualenv (Henry Schneider, Packaging talk) ✓ built in the std package manager
Unit test tools (Henry Schneider, Packaging talk) ✓ std package and more: Coverage, FactCheck
JIT/Numba (Graham Markal, Henry Schneider) ✓ Intrisic to the language
Machine learning ✓ Flux, JuliaML,
TensorFlow(Matthew Feickert)TensorFlow
GPyTorch/Gaussian Process ML (Irina Espejo Morales talk) GPML
- Non-HEP format
- HDF5 and Parquet are fully supported (also CSV and Excel, less relevant our data size)
- ROOT
- Missing a solid and performant library to read ROOT file and with xroot support.
- Two packages developped by users.
- Emacs and vim support
- Atom and VScode support. Many features. Code can be run and debugged with the IDE, with support for plots.
- Jupyter
- Pluto. A new generation notebook with automatic update of cells.
- Debugger: Debugger, Rebugger, Juno debugger (for Atom IDE)
- Python made it easy with conda and pip. It's even easier in Julia
- A standard library part of the Julia installation
- Give instructions to the user, when he or she try to import a missing package. Try it:
import Blink
ArgumentError: Package Blink not found in current path:
- Run `import Pkg; Pkg.add("Blink")` to install the Blink package.
Stacktrace:
[1] require(into::Module, mod::Symbol)
@ Base ./loading.jl:871
[2] eval
@ ./boot.jl:360 [inlined]
[3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1094
💡 Dedicated command mode for package handling in the REPL (interactive terminal application equivalent ipython):
julia> ]
(@v1.6) pkg> add Blink
(@v1.6) pkg> [Backspace]
julia>
- Python, C, Fortran code: direct call from Julia and Jupyter Julia kernels
- C++ code: call via a wrapper. Lacking a tool for automatic generation of wrapper like swig. Project for direct-call (ala cppyy) on hold and not working for recent versions of Julia.
The other way around
- Python code can call Julia as well
As simple as calling Julia code
# Enable Python call:
using PyCall
# Inport a python module:
math = pyimport("math")
# Use it as a Julia module:
math.sin(math.pi / 4)
0.7071067811865475
path = ccall(:getenv, Cstring, (Cstring,), "SHELL")
unsafe_string(path)
"/bin/bash"
For C, you will typically write a wrapper in ''Julia'' to handle errors, like:
function getenv(var::AbstractString)
val = ccall(:getenv, Cstring, (Cstring,), var)
if val == C_NULL
error("getenv: undefined variable: ", var)
end
return unsafe_string(val)
end
getenv (generic function with 1 method)
println(getenv("USER"))
println(getenv("SMOKE")) # ⇒ will through an exception unless you have SMOKE in your environment
pgras
getenv: undefined variable: SMOKE
Stacktrace:
[1] error(::String, ::String)
@ Base ./error.jl:42
[2] getenv(var::String)
@ Main ./In[13]:4
[3] top-level scope
@ In[14]:2
[4] eval
@ ./boot.jl:360 [inlined]
[5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1094
$ python3 -m pip install julia # install PyJulia
... # you may need `--user` after `install`
$ python3
>>> import julia
>>> julia.install() # install PyCall.jl etc.
>>> from julia import Base # short demo
>>> Base.sind(90)
1.0
⚡ It's an extremely simple analysis, way far from usual LHC analysis
Let's go,
- Julia offers the code running performance of C++, with the same programming language the ease of programming of Python, and a large Ecosystem
- Julia is the ideal language for our usage
- It's now mature enough to be used for HEP
- It's the right time to catch the Julia's train
- Needs developers to provides tools, especially for ROOT I/O and a ROOT binding
- Needs to be advertised to users. Most of the Physicist does not know it.
Thanks to Jim Pivarski for the tricks to turn a notebook into a presentations and the nice tutorial he gave on the subject