
Memory profiling of qml.state() #771

Closed
tomlqc opened this issue Jun 19, 2024 · 1 comment

Comments

@tomlqc
Contributor

tomlqc commented Jun 19, 2024

Important Note

⚠️ This issue is part of an internal assignment and not meant for external contributors.

Context

The lightning.qubit device in PennyLane-Lightning has optimized support for many quantum gates and measurement processes at both the Python and C++ layers. The LightningMeasurements class at lightning_qubit/_measurements.py implements the Python interface for performant C++ measurement routines in MeasurementsLQubit. Among PennyLane's measurement processes, qml.state, which returns the underlying quantum state in the computational basis, is backed by the public methods of StateVectorLQubitManaged.hpp.

Memory management across the Python <> C++ boundary plays an important role in the performance of qml.state, even though returning the underlying state-vector is not computationally intensive. Some preliminary results indicated poor scaling of qml.state in lightning.qubit compared to default.qubit, the default pure-Python PennyLane device.

Requirements

  • Benchmark the following circuit, comparing lightning.qubit vs default.qubit. In this code sample, device_name can be either lightning.qubit or default.qubit and 5 < num_wires < 25. Define some thresholds at which default.qubit is faster than lightning.qubit.
    dev = qml.device(device_name, wires=num_wires)
    @qml.qnode(dev)
    def circuit():
        return qml.state()
  • Profile this circuit. You're free to use any memory profiler as long as a report can be generated. Explain your choice.
  • Analyse the results of the profiler, and report the bottlenecks.
  • Describe the memory allocations and ownerships of the underlying C++ memory buffer for the state-vector.
  • Can we improve the memory management in the Python bindings for qml.state?
  • Propose a solution for one of the bottlenecks.

Please provide your answers as follow-up comments in this GitHub issue. You may use GitHub Gist for larger files.

Feel free to ask any questions or raise any concerns regarding the issue. We'll be happy to discuss with you!

@josephleekl
Contributor

Benchmarking

I ran the benchmark with the following setup:

  • Python version: 3.12
  • PennyLane version (from pip install): 0.36.0
  • PennyLane-Lightning version (from pip install): 0.36.0
  • Hardware: AMD EPYC 7543P

I timed the circuit 50 times (after a warmup). The timings, in seconds, are as follows:

num_wires   default.qubit   lightning.qubit
5           0.02415302      0.01594006
6           0.02579262      0.01603134
7           0.02741707      0.01615754
8           0.02954822      0.01612877
9           0.03091021      0.01626475
10          0.03338226      0.01591212
11          0.03487502      0.01638795
12          0.03691649      0.01649593
13          0.03912803      0.01675916
14          0.04284222      0.01769783
15          0.04721544      0.01946455
16          0.05462405      0.02161267
17          0.06258643      0.02764239
18          0.07848529      0.08313
19          0.2460156       0.12398331
20          0.21531559      0.24791661
21          0.71412075      0.50003818
22          1.45517968      1.02306128
23          2.99782932      2.01809743
24          5.91070788      3.93206317
25          11.6924843      7.68381341

Picture 1

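Across most of the range of num_wires, the lightning.qubit device runs faster than default.qubit. The exceptions in the timings above are num_wires = 18 and 20, where default.qubit is marginally faster; I did not observe a consistent threshold beyond which default.qubit is faster than lightning.qubit.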
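
For reference, a minimal sketch of a timing harness along these lines. The helper names (time_callable, make_state_circuit, benchmark) are hypothetical and not taken from the original script, and reporting the mean per call is an assumption, since the comment does not say whether the figures are means or totals. PennyLane is imported lazily, so the timing helper itself only needs the standard library.

```python
import statistics
import time
from typing import Callable, List


def time_callable(fn: Callable[[], object], repeats: int = 50, warmup: int = 1) -> List[float]:
    """Return the wall-clock duration (seconds) of each of `repeats` calls to `fn`."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def make_state_circuit(device_name: str, num_wires: int) -> Callable[[], object]:
    """Build the QNode from the issue description."""
    # Requires PennyLane (and PennyLane-Lightning for "lightning.qubit").
    import pennylane as qml

    dev = qml.device(device_name, wires=num_wires)

    @qml.qnode(dev)
    def circuit():
        return qml.state()

    return circuit


def benchmark(device_names=("default.qubit", "lightning.qubit"), wire_range=range(5, 26)):
    """Print the mean time per call for each wire count and device (assumed metric)."""
    for num_wires in wire_range:
        means = []
        for name in device_names:
            durations = time_callable(make_state_circuit(name, num_wires))
            means.append(statistics.mean(durations))
        print(num_wires, *(f"{m:.8f}" for m in means))
```

Calling benchmark() on a machine with both devices installed prints one row per num_wires, in the same column order as the timings above.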

Profiling

I chose to use 2 profilers:

  • Linaro MAP: excellent interface and comprehensive metrics. I have used it in the past and found it extremely useful. This tool requires a license, which is available on the system under test. On systems without the license, I have used Intel VTune for profiling.
  • Memray: I recently tried Memray, a Python-specific memory profiler, and found the UI intuitive and quick to produce memory reports. I wanted to use this tool first to identify the main locations of memory allocation.

Here I used a larger num_wires=27 to help identify the memory allocations. We first look at the Memray profile for lightning.qubit, which shows the calls with the largest memory allocations. (I ran the circuit twice, as seen in the diagram, and focus only on the first run.)
Screenshot 2024-06-23 at 18 15 28

Memory allocation

From the call stack we can see three distinct phases of memory allocation.

  1. The initial state-vector is allocated in C++ during dev = qml.device(device_name, wires=num_wires), which uses pybind11 to call the allocation functions. This is not related to the circuit or to qml.state().
  2. Within the circuit, when the qml.state() measurement is performed in state_diagonalizing_gates:
    1. At state_array = self._qubit_state.state in _measurements.py, a new NumPy array is created in memory (np.zeros in _state_vector.py) before the data is copied from the C++ array into the NumPy array (via self._qubit_state.getState(state) in _state_vector.py).
    2. At result = measurementprocess.process_state(state_array, wires) in _measurements.py, PennyLane's process_state in state.py is called. Its final statement, return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j, returns state + 0.0j. This creates an extra (in theory temporary) copy of the state in NumPy.
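
The allocation pattern in step 2 can be reproduced at small scale with NumPy alone. This is only an illustration: backend_buffer is a stand-in for the C++ state-vector (it is not the Lightning binding), and tracemalloc from the standard library is used here instead of Memray so that the snippet has no extra dependencies.

```python
import tracemalloc

import numpy as np

num_amplitudes = 2**16  # small stand-in; 27 wires would need 2**27 amplitudes

# Stand-in for the state-vector owned by the C++ backend (allocated before tracing).
backend_buffer = np.zeros(num_amplitudes, dtype=np.complex128)
backend_buffer[0] = 1.0

tracemalloc.start()

# Step 2.1: allocate a fresh NumPy array, then copy the backend data into it.
state = np.zeros(num_amplitudes, dtype=np.complex128)
state[:] = backend_buffer

# Step 2.2: adding 0.0j allocates yet another array for the result.
result = state + 0.0j

_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("bytes per buffer           :", state.nbytes)
print("traced peak (bytes)        :", peak_bytes)
print("state shares backend memory:", np.shares_memory(state, backend_buffer))
print("result shares state memory :", np.shares_memory(result, state))
```

The traced peak covers the two buffers created after tracing started; together with the backend buffer that makes three buffers of equal size alive at once.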

During the application, the state-vector in the C++ memory buffer is:

The Python binding to these C++ state-vector manipulations is self._qubit_state in this class: https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/lightning_qubit/_state_vector.py#L41 ; it is used to call the methods that create/read/update the state-vector in the C++ memory buffer from Python.

Each copy of the state-vector is about 2 GiB (27 qubits give 2^27 amplitudes, at 16 bytes per complex128 amplitude). The three copies described above explain the peak usage of ~6.51GB.
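
The size estimate can be checked with a few lines of arithmetic. The helper name statevector_bytes is hypothetical; the only assumption is 16 bytes per amplitude (complex128).

```python
def statevector_bytes(num_wires: int, bytes_per_amplitude: int = 16) -> int:
    """Size in bytes of a dense state-vector with 2**num_wires amplitudes."""
    return (2**num_wires) * bytes_per_amplitude


one_copy = statevector_bytes(27)
three_copies = 3 * one_copy

print(f"one copy    : {one_copy / 2**30:.2f} GiB")
print(f"three copies: {three_copies / 2**30:.2f} GiB = {three_copies / 1e9:.2f} GB")
```

Three copies come to 6 GiB, or about 6.44 GB in decimal units, which is close to the peak reported by the profiler.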

Compare this with the memory footprint of the pure-Python default.qubit implementation: in default.qubit there is no extra copy in C++, and no new copy of the array is created at return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j, which results in a much lower memory footprint (~3.28 GB):

Screenshot 2024-06-23 at 20 52 03

Runtime cost

Back to lightning.qubit: for the runtime cost of the memory operations, we can look at the MAP profiler result:
Screenshot 2024-06-23 at 19 01 53

This confirms that:

  • A significant amount of runtime is spent on initializing the device (however this is not relevant here)
  • 23.5% of time is spent on copying the C++ state-vector to Python (self._qubit_state.getState(state), right after the NumPy array is created with np.zeros)
  • 17.6% of time is spent on array_add coming from return qml.math.cast(state, "complex128") if is_tf_interface else state + 0.0j

Bottlenecks

The latter two points above might be improved.

Copying the C++ state-vector into a NumPy array may not be necessary. For this circuit, no further gates are applied before returning the state, so there are no operations before the copy. If Python does not need an explicit copy, we can simply expose the C++ array by creating a view in Python, without copying, as in https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/core/src/simulators/lightning_qubit/bindings/LQubitBindings.hpp#L206 . It might be beneficial to have both a copy and a view method to improve general memory management.

As for the array_add from the last point, it is unclear why state + 0.0j is returned instead of state (assuming the initial state already has the correct complex datatype).
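
To illustrate the difference between a view and a copy, here is a small NumPy sketch. The bytearray named backend_buffer is only a stand-in for memory owned by the C++ side, and np.frombuffer is used as a generic zero-copy wrapper; this is not how the Lightning bindings are implemented.

```python
import struct

import numpy as np

num_amplitudes = 8
# Stand-in for a buffer owned by the backend: 16 bytes per complex128 amplitude.
backend_buffer = bytearray(16 * num_amplitudes)

# View: NumPy wraps the existing memory, no amplitudes are copied.
view = np.frombuffer(backend_buffer, dtype=np.complex128)

# Copy: an independent snapshot with its own allocation.
snapshot = view.copy()

# Mutate the backend buffer directly: set amplitude 0 to 1+0j.
struct.pack_into("dd", backend_buffer, 0, 1.0, 0.0)

print("view[0]     =", view[0])      # follows the backend buffer
print("snapshot[0] =", snapshot[0])  # keeps the value from the time of the copy
```

The view costs no extra memory, but it aliases the backend: any later update to the buffer changes what the caller sees, and the buffer must stay alive for as long as the view is in use. That trade-off is one reason to offer both methods.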

Possible improvement

Returning state instead of state + 0.0j removes the need for a new temporary copy of the state-vector in NumPy. This results in lower memory consumption (~4.36GB). From quick testing this seems to produce identical results, but further, more rigorous testing is needed to show that it is correct.
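
One plausible reason for the + 0.0j is to promote a real-valued state to a complex dtype; that is an assumption on my part, not something confirmed in the PennyLane source. Under that assumption, a guarded variant could skip the addition when the input is already complex. The helper name as_complex is hypothetical.

```python
import numpy as np


def as_complex(state: np.ndarray) -> np.ndarray:
    """Return `state` unchanged if it is already complex, otherwise promote it."""
    if np.iscomplexobj(state):
        return state  # same object, no new allocation
    return state + 0.0j  # real input: allocate a complex result


complex_state = np.zeros(4, dtype=np.complex128)
complex_state[0] = 1.0
real_state = np.array([1.0, 0.0, 0.0, 0.0])

print(as_complex(complex_state) is complex_state)
print(as_complex(real_state).dtype)
```

Note that this returns the caller's array itself for complex input, so callers that relied on getting a fresh array from state + 0.0j would now share memory with the input.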

Screenshot 2024-06-23 at 20 27 36

@tomlqc tomlqc closed this as completed Nov 1, 2024