Hamiltonian observables: Lightning's OMP parallelization vs OpenBLAS #455

Open

chaeyeunpark opened this issue Jun 11, 2023 · 1 comment

chaeyeunpark (Contributor) commented Jun 11, 2023

When computing the gradient of a circuit that returns expectation values of a Hamiltonian object, Lightning uses an OpenMP-parallelized function that distributes the Hamiltonian terms across threads:

#if defined(_OPENMP)
template <class T> struct HamiltonianApplyInPlace<T, true> {
    static void run(const std::vector<T> &coeffs,
                    const std::vector<std::shared_ptr<Observable<T>>> &terms,
                    StateVectorManagedCPU<T> &sv) {
        const size_t length = sv.getLength();
        auto allocator = sv.allocator();
        // Global accumulator for sum_i coeffs[i] * terms[i] |sv>
        std::vector<std::complex<T>, decltype(allocator)> sum(
            length, std::complex<T>{}, allocator);
#pragma omp parallel default(none) firstprivate(length, allocator)             \
    shared(coeffs, terms, sv, sum)
        {
            // Per-thread scratch state vector and local accumulator
            StateVectorManagedCPU<T> tmp(sv.getNumQubits());
            std::vector<std::complex<T>, decltype(allocator)> local_sv(
                length, std::complex<T>{}, allocator);
            // Hamiltonian terms are distributed across threads here
#pragma omp for
            for (size_t term_idx = 0; term_idx < terms.size(); term_idx++) {
                tmp.updateData(sv.getDataVector());
                terms[term_idx]->applyInPlace(tmp);
                // local_sv += coeffs[term_idx] * tmp
                Util::scaleAndAdd(length,
                                  std::complex<T>{coeffs[term_idx], 0.0},
                                  tmp.getData(), local_sv.data());
            }
            // Reduce the per-thread accumulators one thread at a time
#pragma omp critical
            {
                Util::scaleAndAdd(length, std::complex<T>{1.0, 0.0},
                                  local_sv.data(), sum.data());
            }
        }
        sv.updateData(sum);
    }
};
#endif

However, when Lightning is compiled with OpenBLAS (which is the case for the PyPI-provided wheels), the Util::scaleAndAdd function calls OpenBLAS's cblas_caxpy or cblas_zaxpy. Since OpenBLAS parallelizes these functions internally, it may be necessary to turn off OpenBLAS's internal parallelization to prevent thread oversubscription (or, conversely, to disable the OpenMP loop and let OpenBLAS parallelize).
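Below is a minimal sketch of the first option. It assumes the linked OpenBLAS exports the openblas_get_num_threads/openblas_set_num_threads extensions (these are OpenBLAS-specific, not part of the standard CBLAS interface, so whether they are available depends on how the wheel is built):

// Sketch only: cap OpenBLAS's internal threading while the OpenMP
// term loop runs, then restore the previous setting on scope exit.
extern "C" {
int openblas_get_num_threads(void);
void openblas_set_num_threads(int num_threads);
}

struct OpenBLASThreadGuard {
    int saved;
    OpenBLASThreadGuard() : saved{openblas_get_num_threads()} {
        openblas_set_num_threads(1); // serialize cblas_{c,z}axpy calls
    }
    ~OpenBLASThreadGuard() { openblas_set_num_threads(saved); }
};

// Usage: construct a guard in HamiltonianApplyInPlace<T, true>::run
// before entering the `#pragma omp parallel` region:
//     OpenBLASThreadGuard guard;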

Edit: Indeed, it's subtle. Locally, I found that turning off OpenBLAS's parallelism performs better, but I observed the opposite on Perlmutter.
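Since the better setting is machine-dependent, one possibility is a runtime switch. A hypothetical sketch follows; the environment variable name PL_SERIAL_BLAS is invented for illustration and is not an existing Lightning option:

#include <cstdlib> // std::getenv

// Hypothetical opt-in: serialize OpenBLAS only when the user requests
// it, e.g. `PL_SERIAL_BLAS=1 python script.py`. The variable name is
// illustrative only.
inline bool serialBlasRequested() {
    const char *env = std::getenv("PL_SERIAL_BLAS");
    return env != nullptr && env[0] == '1';
}

Note that OpenBLAS already honours the OPENBLAS_NUM_THREADS environment variable, so OPENBLAS_NUM_THREADS=1 is a zero-code workaround users can try today.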

mlxd (Member) commented Jun 12, 2023

Yeah, this has always been a tough problem. It depends on the CPU model, the observable type, and even the circuit type.
We could try updating the OpenMP scheduling in this section and see if it works nicely.

For defaults though, at least until we better understand the path we need, I'd say favouring large-scale CPUs (HPC systems, AWS Braket servers) would be better as the default.
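
For reference, a sketch of what a scheduling change on the term loop could look like. A dynamic schedule is only a guess at what might balance unevenly sized terms, not a measured recommendation:

// Terms can have very different costs (e.g. a single PauliZ vs. a long
// tensor product), so a dynamic schedule may balance load better than
// the default static split. The chunk size of 1 is illustrative.
#pragma omp for schedule(dynamic, 1)
for (size_t term_idx = 0; term_idx < terms.size(); term_idx++) {
    tmp.updateData(sv.getDataVector());
    terms[term_idx]->applyInPlace(tmp);
    Util::scaleAndAdd(length, std::complex<T>{coeffs[term_idx], 0.0},
                      tmp.getData(), local_sv.data());
}

Alternatively, schedule(runtime) would let the OMP_SCHEDULE environment variable pick the policy without recompiling.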
