Primula is a serverless shuffle operator for general-purpose serverless frameworks. It is built upon the principles of scalability and transparency and it is designed for shuffle-like operations on routine data analysis pipelines.
Primula provides several features for automating shuffle-like workloads, mainly:
- Automatic inference of the optimal number of parallel workers for the shuffle operation.
- Load-balancing of workers through data sampling.
- Eager mitigation of straggler functions.
- Asynchronous MapReduce execution for performance.
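To illustrate the idea behind load-balancing through data sampling, the following is a minimal sketch of sample-based range partitioning, a common technique in distributed sort. It is not Primula's actual implementation; the function names `choose_boundaries` and `assign_partition` are hypothetical and exist only for this example.

```python
import bisect
import random


def choose_boundaries(sample, num_workers):
    """Pick num_workers - 1 split points from a sample of keys."""
    ordered = sorted(sample)
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [ordered[(i * len(ordered)) // num_workers]
            for i in range(1, num_workers)]


def assign_partition(key, boundaries):
    """Return the index of the worker whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed keys: most values are small, a few are very large.
    keys = [int(rng.expovariate(1 / 1000)) for _ in range(100000)]
    # Only a small sample is inspected to decide the key ranges.
    sample = rng.sample(keys, 1000)
    boundaries = choose_boundaries(sample, num_workers=4)
    counts = [0] * 4
    for k in keys:
        counts[assign_partition(k, boundaries)] += 1
    print(counts)  # records per worker; roughly equal despite the skew
```

Because the boundaries follow the sampled key distribution rather than fixed-width ranges, each worker receives a similar number of records even when the keys are skewed.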
Primula is based on IBM-PyWren (now Lithops), a serverless framework for massively parallel jobs. It currently supports IBM Cloud Functions as the FaaS platform and IBM Cloud Object Storage (COS) as remote shared storage for data persistence and communication between functions.
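As a rough illustration of how functions can communicate through shared storage during a shuffle, the sketch below uses an in-memory dictionary as a stand-in for an object store. Everything here is an assumption made for the example: the `InMemoryStore` class, the `shuffle/r<reducer>/m<mapper>` key layout, and the hash partitioning are hypothetical and do not describe Primula's internals or the IBM COS API.

```python
from collections import defaultdict


class InMemoryStore:
    """Stand-in for an object store: put/get/list objects by key."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def map_task(store, map_id, records, num_reducers):
    """Partition records by key hash and write one object per reducer."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    for reduce_id in range(num_reducers):
        store.put("shuffle/r{}/m{}".format(reduce_id, map_id),
                  buckets[reduce_id])


def reduce_task(store, reduce_id):
    """Read every mapper's object for this reducer and sum values per key."""
    totals = defaultdict(int)
    for obj_key in store.list_keys("shuffle/r{}/".format(reduce_id)):
        for key, value in store.get(obj_key):
            totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    store = InMemoryStore()
    map_task(store, 0, [("a", 1), ("b", 2)], num_reducers=2)
    map_task(store, 1, [("a", 3), ("c", 4)], num_reducers=2)
    merged = {}
    for r in range(2):
        merged.update(reduce_task(store, r))
    print(merged)
```

Mappers and reducers never talk to each other directly: each mapper writes one object per reducer, and each reducer lists and reads only the objects under its own prefix.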
- Python >= 3.4
  The Python version must be 3.4 or above.
- IBM Cloud account
  Detailed information about IBM Cloud account configuration can be found at the IBM-PyWren GitHub repository.
  - Sign up on https://cloud.ibm.com/
  - Copy `config.json.template` to `config.json`.
  - Create a new IBM COS bucket and insert its public and private endpoint URLs and IBM COS keys into `config.json`.
  - Set the new bucket's name as pywren's "storage_bucket" in `config.json`.
  - Create a new IBM Cloud Functions namespace and a CloudFoundry organization and insert your access keys into `config.json`.
- Dataset placement
  Datasets to be processed with Primula must be located in an IBM COS bucket.
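After filling in `config.json`, a quick check for values that were left empty can catch configuration mistakes early. The sketch below is only an illustration: the helper `find_unfilled` is hypothetical, the convention that unfilled template values are empty or wrapped in angle brackets is an assumption, and the section names in the example (other than "storage_bucket", which is mentioned above) are placeholders. The authoritative key names are those in `config.json.template`.

```python
import json


def find_unfilled(config, prefix=""):
    """Return dotted paths of settings that are empty or look like placeholders."""
    missing = []
    for name, value in config.items():
        path = prefix + name
        if isinstance(value, dict):
            # Recurse into nested sections.
            missing.extend(find_unfilled(value, path + "."))
        elif value is None:
            missing.append(path)
        elif isinstance(value, str):
            stripped = value.strip()
            # Assumption: placeholders look like "<SOMETHING>".
            if not stripped or (stripped.startswith("<") and stripped.endswith(">")):
                missing.append(path)
    return missing


if __name__ == "__main__":
    # Hypothetical layout; use the key names from config.json.template.
    example = json.loads("""
    {
      "pywren": {"storage_bucket": "my-primula-bucket"},
      "ibm_cos": {"endpoint": "<COS_PUBLIC_ENDPOINT>", "private_endpoint": ""},
      "ibm_cf": {"namespace": "my-namespace"}
    }
    """)
    print(find_unfilled(example))
```

To check a real file, load it with `json.load(open("config.json"))` and pass the result to `find_unfilled`.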
Installation of Primula in a Python environment is straightforward.
- Clone this repository into your Python environment.
- Move to our extended IBM-PyWren project's folder:
  `cd pywren-ibm-primula`
- Install IBM-PyWren along with Primula:
  `pip install -e .`
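Once the installation finishes, a short script can confirm that the interpreter meets the version requirement and that the package is importable. This is a sketch under assumptions: the helper `check_environment` is hypothetical, and the module name `pywren_ibm_cloud` is assumed from the upstream IBM-PyWren project and may differ in this repository.

```python
import importlib.util
import sys


def check_environment(package, min_version=(3, 4)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python {}.{} or above is required".format(*min_version))
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(package) is None:
        problems.append("package '{}' is not importable".format(package))
    return problems


if __name__ == "__main__":
    # Assumed module name; adjust it if the installed package is named differently.
    for problem in check_environment("pywren_ibm_cloud"):
        print(problem)
```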
Primula can be executed either from the command line or from a Jupyter Notebook. We provide an example workflow at `examples/primula_example_basic.ipynb`.
- Marc Sánchez-Artigas (Universitat Rovira i Virgili)
- Germán T. Eizaguirre (Universitat Rovira i Virgili)
Primula: a Practical Shuffle/Sort Operator for Serverless Computing - ACM/IFIP Middleware 2020
This work has been partially supported by the EU Horizon 2020 programme under grant agreement 825184 and by the Spanish Government (PID2019-106774RB-C22).