Seer was developed on Ubuntu 20.04.4 and tested with Python 3.8, although it should run on any Python version above 3.6. Because it is a research prototype rather than a fully fledged framework, we recommend running it in its native development environment.
Simply download the source code from the GitHub repository and install the library through pip.
wget https://github.com/GEizaguirre/seercloud/archive/refs/heads/main.zip
unzip main.zip
cd seercloud-main
pip3 install .
Seer relies on Lithops to configure its backend components. Configuring cloud functions and object storage is straightforward and can be done with free-tier accounts on most clouds. For IBM Cloud, for instance, fill in a YAML configuration file following the Lithops guidelines.
For Seer's native architecture of cloud functions and object storage, the configuration file should be structured as follows.
lithops:
    backend : ibm_cf
    storage : ibm_cos
ibm_cf:
    endpoint : https://xx-xx.functions.cloud.ibm.com
    namespace : xxxxxxxxx
    api_key : xxxxxxxxxxx
    runtime_memory : mmmm
ibm_cos:
    storage_bucket : bucket_name
    endpoint : https://s3.xx-xx.cloud-object-storage.appdomain.cloud
    private_endpoint : https://s3.private.xx-xx.cloud-object-storage.appdomain.cloud
    access_key : xxxxxxxxxxxxx
    secret_key : xxxxxxxxxxxxx
We recommend loading your YAML configuration file explicitly with the Python yaml library (PyYAML). For instance:
import yaml
with open("../config.yaml", "r") as config_file:
    config = yaml.safe_load(config_file)
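A quick sanity check on the loaded dictionary can catch an incomplete file before any job is submitted. The sketch below is not part of Seer or Lithops: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and treating every key from the template above as mandatory is an assumption, since Lithops may accept a subset of them.

```python
# Hypothetical helper: the sections and keys mirror the configuration
# template above; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "lithops": ["backend", "storage"],
    "ibm_cf": ["endpoint", "namespace", "api_key", "runtime_memory"],
    "ibm_cos": [
        "storage_bucket",
        "endpoint",
        "private_endpoint",
        "access_key",
        "secret_key",
    ],
}


def missing_keys(config):
    """Return the 'section.key' entries that are absent from a config dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        # A section that is absent or empty counts as having no keys.
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing
```

Calling `missing_keys(config)` on the dictionary returned by `yaml.safe_load` gives an empty list when every templated key is present, and otherwise lists what still needs to be filled in.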
Jobs are Seer's pipeline execution managers. Stages and operations must be added to the job programmatically. Configuration parameters must be passed to Seer as an argument of each Job. The code in the example loads the configuration automatically from ../config/config_cloud.yaml.
Example code for a parallel sort
from seercloud.scheduler import Job
from seercloud.operation import Scan, Exchange, Sort, Write
from tests.config import cloud_config
# Create a two-stage job configured through Lithops.
job = Job(num_stages=2, lithops_config=cloud_config())
# Stage 0: Scan and Exchange operations.
job.add(stage=0, op=Scan, file="terasort_1GB.csv", bucket="seer-data2")
job.add(stage=0, op=Exchange)
# Stage 1: Sort and Write operations.
job.add(stage=1, op=Sort, key="0")
job.add(stage=1, op=Write, bucket="seer-data2", key="terasort_1GB_sorted")
# Stage 1 depends on stage 0.
job.dependency(parent=0, child=1)
job.run()
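To give an intuition for what a two-stage parallel sort of this shape computes, here is a single-process sketch of the general scan, exchange, sort pattern. It does not use Seer and makes no claim about Seer's internals: the function names, the hand-picked `boundaries`, and the mapping onto the operations above are all assumptions made for illustration.

```python
from bisect import bisect_right


def scan(rows, num_workers):
    """Stage 0: split the input into one chunk per (simulated) worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def exchange(chunks, boundaries):
    """Between stages: route each row to the partition owning its key range."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for chunk in chunks:
        for row in chunk:
            # bisect_right returns the index of the key range containing row[0].
            partitions[bisect_right(boundaries, row[0])].append(row)
    return partitions


def sort_partitions(partitions):
    """Stage 1: sort every partition independently on column 0."""
    return [sorted(part, key=lambda row: row[0]) for part in partitions]


rows = [(7, "g"), (2, "b"), (9, "i"), (4, "d"), (1, "a"), (6, "f")]
chunks = scan(rows, num_workers=2)
# Hypothetical range boundaries: keys < 3, keys in [3, 6), keys >= 6.
partitions = exchange(chunks, boundaries=[3, 6])
# Concatenating the sorted partitions in order yields a globally sorted result.
result = [row for part in sort_partitions(partitions) for row in part]
print(result)
```

Because the partitions cover disjoint, ordered key ranges, each one can be sorted independently and the outputs simply concatenated, which is what makes the pattern suitable for parallel workers.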
For a straightforward end-to-end run of Seer, from installation and configuration to a basic parallel sort, follow exactly these steps.
- Start from an empty Ubuntu 20.04 VM and install Python 3.8, pip and unzip.
sudo apt update
sudo apt install python3.8
sudo apt install python3-pip
sudo apt install unzip
- Copy the configuration file into a directory named config. The file should be named config_cloud.yaml. The config directory should be placed in the parent directory of the seercloud execution path.
mkdir config
cp config.yaml config/config_cloud.yaml
- Download and install Seer as described above, here renaming the directory to seercloud and installing in editable mode.
wget https://github.com/GEizaguirre/seercloud/archive/refs/heads/main.zip
unzip main.zip
mv seercloud-main seercloud
cd seercloud
pip3 install -e .
- Run the following command.
python3.8 tests/pipelines/cloud/sort.py