Hamiltonian observables: Lightning's OMP parallelization vs OpenBLAS #455

Open

chaeyeunpark opened this issue Jun 11, 2023 · 1 comment

chaeyeunpark (Contributor) commented Jun 11, 2023

When computing the gradient of a circuit that returns expectation values of a Hamiltonian object, Lightning uses an OpenMP-parallelized function that distributes the Hamiltonian terms across threads:

#if defined(_OPENMP)
template <class T> struct HamiltonianApplyInPlace<T, true> {
    static void run(const std::vector<T> &coeffs,
                    const std::vector<std::shared_ptr<Observable<T>>> &terms,
                    StateVectorManagedCPU<T> &sv) {
        const size_t length = sv.getLength();
        auto allocator = sv.allocator();
        // Global accumulator for sum_i coeffs[i] * terms[i] |sv>
        std::vector<std::complex<T>, decltype(allocator)> sum(
            length, std::complex<T>{}, allocator);
#pragma omp parallel default(none) firstprivate(length, allocator)             \
    shared(coeffs, terms, sv, sum)
        {
            // Per-thread scratch state vector and local accumulator
            StateVectorManagedCPU<T> tmp(sv.getNumQubits());
            std::vector<std::complex<T>, decltype(allocator)> local_sv(
                length, std::complex<T>{}, allocator);
            // Hamiltonian terms are distributed across threads here
#pragma omp for
            for (size_t term_idx = 0; term_idx < terms.size(); term_idx++) {
                tmp.updateData(sv.getDataVector());
                terms[term_idx]->applyInPlace(tmp);
                // local_sv += coeffs[term_idx] * tmp
                Util::scaleAndAdd(length,
                                  std::complex<T>{coeffs[term_idx], 0.0},
                                  tmp.getData(), local_sv.data());
            }
            // Reduce the per-thread accumulators one thread at a time
#pragma omp critical
            {
                Util::scaleAndAdd(length, std::complex<T>{1.0, 0.0},
                                  local_sv.data(), sum.data());
            }
        }
        sv.updateData(sum);
    }
};
#endif

However, when Lightning is compiled with OpenBLAS (which is the case for the PyPI-provided wheels), the Util::scaleAndAdd function calls OpenBLAS's cblas_caxpy or cblas_zaxpy. Since OpenBLAS parallelizes these functions internally, it may be necessary to turn off OpenBLAS's internal parallelization to prevent thread oversubscription (or, conversely, to disable the OpenMP loop and let OpenBLAS parallelize).
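Below is a minimal sketch of the first option. It assumes the linked OpenBLAS exports the openblas_get_num_threads/openblas_set_num_threads extensions (these are OpenBLAS-specific, not part of the standard CBLAS interface, so whether they are available depends on how the wheel is built):

// Sketch only: cap OpenBLAS's internal threading while the OpenMP
// term loop runs, then restore the previous setting on scope exit.
extern "C" {
int openblas_get_num_threads(void);
void openblas_set_num_threads(int num_threads);
}

struct OpenBLASThreadGuard {
    int saved;
    OpenBLASThreadGuard() : saved{openblas_get_num_threads()} {
        openblas_set_num_threads(1); // serialize cblas_{c,z}axpy calls
    }
    ~OpenBLASThreadGuard() { openblas_set_num_threads(saved); }
};

// Usage: construct a guard in HamiltonianApplyInPlace<T, true>::run
// before entering the `#pragma omp parallel` region:
//     OpenBLASThreadGuard guard;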

Edit: Indeed, it's subtle. Locally, I found that turning off OpenBLAS's parallelism performs better, but I observed the opposite on Perlmutter.
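Since the better setting is machine-dependent, one possibility is a runtime switch. A hypothetical sketch follows; the environment variable name PL_SERIAL_BLAS is invented for illustration and is not an existing Lightning option:

#include <cstdlib> // std::getenv

// Hypothetical opt-in: serialize OpenBLAS only when the user requests
// it, e.g. `PL_SERIAL_BLAS=1 python script.py`. The variable name is
// illustrative only.
inline bool serialBlasRequested() {
    const char *env = std::getenv("PL_SERIAL_BLAS");
    return env != nullptr && env[0] == '1';
}

Note that OpenBLAS already honours the OPENBLAS_NUM_THREADS environment variable, so OPENBLAS_NUM_THREADS=1 is a zero-code workaround users can try today.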

mlxd (Member) commented Jun 12, 2023

Yeah, this has always been a tough problem. It depends on the CPU model, the observable type, and even the circuit type.
We could try updating the OpenMP scheduling in this section and see if it works nicely.

For defaults though, at least until we better understand the path we need, I'd say favouring large-scale CPUs (HPC systems, AWS Braket servers) would be better as the default.
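
For reference, a sketch of what a scheduling change on the term loop could look like. A dynamic schedule is only a guess at what might balance unevenly sized terms, not a measured recommendation:

// Terms can have very different costs (e.g. a single PauliZ vs. a long
// tensor product), so a dynamic schedule may balance load better than
// the default static split. The chunk size of 1 is illustrative.
#pragma omp for schedule(dynamic, 1)
for (size_t term_idx = 0; term_idx < terms.size(); term_idx++) {
    tmp.updateData(sv.getDataVector());
    terms[term_idx]->applyInPlace(tmp);
    Util::scaleAndAdd(length, std::complex<T>{coeffs[term_idx], 0.0},
                      tmp.getData(), local_sv.data());
}

Alternatively, schedule(runtime) would let the OMP_SCHEDULE environment variable pick the policy without recompiling.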
