Aws docker support #662

Closed
wants to merge 56 commits into from
56 commits
19fc2fc
add redis
kaiyingshan Jul 29, 2022
399c6ee
able to run without mpi
kaiyingshan Jul 30, 2022
ffdfcca
remove useless file
kaiyingshan Jul 30, 2022
5b2da21
separate oob logic from ucx/ucc communicators
kaiyingshan Aug 21, 2022
6001c24
re-enable tests
kaiyingshan Aug 21, 2022
720d71e
code placements
kaiyingshan Aug 22, 2022
3a221bf
minor fixes
kaiyingshan Aug 22, 2022
82f4bda
added python script to run cylon ucx/ucc without mpirun
kaiyingshan Aug 29, 2022
ccb323a
mimic gather with allgather
kaiyingshan Sep 28, 2022
9aa2ce2
Fixes missing MPI_Comm in UCXConfig
mstaylor Mar 16, 2023
aed0a44
Adds CYLON_USE_REDIS flag to allow UCC/UCX builds that don't require …
mstaylor Mar 23, 2023
7670557
Changes ucc_operations to reflect code in cylondata/cylon main
mstaylor Mar 23, 2023
cdadc14
Changes adds defaults for CYLON_UCX, CYLON_UCC and CYLON_USE_REDIS
mstaylor Mar 23, 2023
6eb56f9
Changes adds defaults for CYLON_UCX, CYLON_UCC and CYLON_USE_REDIS
mstaylor Mar 23, 2023
da99851
Changes adds defaults for CYLON_UCX, CYLON_UCC and CYLON_USE_REDIS
mstaylor Mar 23, 2023
2e15781
Changes adds defaults for CYLON_UCX, CYLON_UCC and CYLON_USE_REDIS
mstaylor Mar 23, 2023
6de16fd
Adds missing constructor for UCXUCCCommunicator from main
mstaylor Mar 24, 2023
103eccd
Further resolution of differences between main and branch
mstaylor Mar 31, 2023
d337a68
Fix merge issue where CommType Type() was private and results in a co…
mstaylor Mar 31, 2023
b7b0173
Fix issue with UCX build where CreateChannel in UCX Communicator is n…
mstaylor Mar 31, 2023
9f7e38b
Fixes issue with UCX (non-UCC) tests
mstaylor Mar 31, 2023
2ff91a0
Adds support for redis build git workflow
mstaylor Apr 1, 2023
f7b6fee
fix hiredis workflow
mstaylor Apr 1, 2023
3694a2c
Adds support for redis build git workflow
mstaylor Apr 1, 2023
2ad10fa
Adds support for redis build git workflow
mstaylor Apr 1, 2023
4bcbae3
Adds support for redis build git workflow - root install via sudo
mstaylor Apr 3, 2023
0ce6df6
moves OOBType to separate hpp + cython support
mstaylor Apr 5, 2023
a65a7fe
adds oob_context cython + updates to build.py and setup.py in support…
mstaylor Apr 7, 2023
7211adf
adds oob_context cython + updates to build.py and setup.py in support…
mstaylor Apr 7, 2023
1d45b39
adds oob_context cython + updates to build.py and setup.py in support…
mstaylor Apr 7, 2023
fdbe7c9
adds oob_context cython + updates to build.py and setup.py in support…
mstaylor Apr 9, 2023
2c47ef7
adds oob_context cython + updates to build.py and setup.py in support…
mstaylor Apr 10, 2023
3f9ba96
separates redis oob contexts in separate source files to facilitate c…
mstaylor Apr 11, 2023
5022567
refactoring related to non-redis environments + introducing UCXRedisO…
mstaylor Apr 12, 2023
32005ce
refactoring related to non-redis environments + introducing UCXRedisO…
mstaylor Apr 12, 2023
11d3ed1
introduces UCCRedisOOBContext and adds calls to wrapper class
mstaylor Apr 13, 2023
fbbba2e
adds UCC Config
mstaylor Apr 16, 2023
f836668
adds UCC Config (removes redis hard dependency)
mstaylor Apr 17, 2023
e218ac0
adds necessary hooks in lib.pxd, lib.pyx and context for initDistributed
mstaylor Apr 17, 2023
3d92e3c
includes aws cf scripts for redis and minor change for operator examp…
mstaylor May 3, 2023
ccb6166
minor changes to oob contexts to support Cython initialization + upda…
mstaylor May 9, 2023
3e7e123
adds redis example and minor changes related to redis oob context
mstaylor May 9, 2023
7974db9
fixes for running redis example
mstaylor May 15, 2023
ec3b2f9
updates to redis_example to take argument for world size, redis host …
mstaylor May 18, 2023
a9f9573
adds support for ReduceOp
mstaylor May 24, 2023
33d45f4
fixes circular dependency when using CScalar
mstaylor May 26, 2023
83bc34e
UCC/UCX AllReduce partial
mstaylor May 28, 2023
d9f3abf
UCC/UCX AllReduce partial
mstaylor May 31, 2023
3a5fe29
UCC/UCX AllReduce partial
mstaylor Jun 1, 2023
91778d6
UCC/UCX AllReduce partial
mstaylor Jun 5, 2023
c46e029
UCC/UCX AllReduce partial - adds support for MPICommunicator
mstaylor Jun 14, 2023
d36d093
UCC/UCX AllReduce partial - adds support for UCXCommunicator
mstaylor Jun 14, 2023
06137c9
UCC/UCX AllReduce partial - adds boto3 push to s3 for summary and sto…
mstaylor Jun 18, 2023
c71eed9
UCC/UCX AllReduce partial - adds boto3 push to s3 for summary and sto…
mstaylor Jun 19, 2023
e005afc
partial: adds ucc-ucx dockerfile + minor cmake changes for docker bui…
mstaylor Jun 26, 2023
8c51ba5
cylon git commands
mstaylor Jun 27, 2023
77 changes: 77 additions & 0 deletions .github/workflows/conda-cpp-redis.yml
@@ -0,0 +1,77 @@
name: Conda C++/Python/Redis - gcc,OpenMPI,Redis,UCX/UCC

on:
  push:
    branches:
      - main
      - 0.**
  pull_request:
    branches:
      - main
      - 0.**

jobs:
  build:
    runs-on: ${{ matrix.os }}
    defaults:
      run:
        shell: bash -l {0}
    strategy:
      fail-fast: false
      # explicit include-based build matrix, of known valid options
      matrix:
        include:
          # 20.04 supports CUDA 11.0+
          - os: ubuntu-20.04
            gcc: 9
            ucc: "master"

    steps:
      - uses: actions/checkout@v2

      # Specify the correct host compilers
      - name: Install/Select gcc and g++
        run: |
          sudo apt-get install -y gcc-${{ matrix.gcc }} g++-${{ matrix.gcc }} git
          echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> $GITHUB_ENV
          echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> $GITHUB_ENV

      - uses: conda-incubator/setup-miniconda@v2
        with:
          activate-environment: cylon_dev
          environment-file: conda/environments/cylon.yml

      - name: Activate conda
        run: conda activate cylon_dev

      - name: Install UCC
        run: |
          git clone --single-branch -b ${{ matrix.ucc }} https://github.com/openucx/ucc.git $HOME/ucc
          cd $HOME/ucc
          echo "conda ucx: $(conda list | grep ucx)"
          ./autogen.sh
          ./configure --prefix=$HOME/ucc/install --with-ucx=$CONDA/envs/cylon_dev
          make install

      - name: Install Redis
        run: |
          git clone https://github.com/redis/hiredis.git $HOME/hiredis
          cd $HOME/hiredis
          make
          sudo make install
          git clone https://github.com/sewenew/redis-plus-plus.git $HOME/redis-plus-plus
          cd $HOME/redis-plus-plus
          mkdir build
          cd build
          cmake -DREDIS_PLUS_PLUS_CXX_STANDARD=11 ..
          make
          sudo make install

      - name: Build cylon, pycylon and run cpp test
        run: python build.py -cmake-flags="-DCYLON_UCX=1 -DCYLON_UCC=1 -DUCC_INSTALL_PREFIX=$HOME/ucc/install -DCYLON_USE_REDIS=1" -ipath="$HOME/cylon/install" --cpp --python --test

      - name: Run pytest
        run: python build.py -ipath="$HOME/cylon/install" --pytest

      - name: Build Java
        run: python build.py -ipath="$HOME/cylon/install" --java
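The build step in this workflow passes four CMake flags through build.py in a single quoted argument. As a rough sketch of how that command line is put together, the snippet below rebuilds it in Python. The helper name `cylon_build_command` and the example paths are hypothetical; the flag names and build.py options are copied from the workflow, and the assumption that the Redis flag can simply be left out is mine, not something the diff confirms.

```python
import shlex


def cylon_build_command(ucc_prefix, install_path, use_redis=True):
    """Hypothetical helper: rebuild the build.py call used in the workflow."""
    cmake_flags = [
        "-DCYLON_UCX=1",
        "-DCYLON_UCC=1",
        f"-DUCC_INSTALL_PREFIX={ucc_prefix}",
    ]
    if use_redis:
        # Assumption: omitting this flag is how a non-Redis build is requested.
        cmake_flags.append("-DCYLON_USE_REDIS=1")
    return [
        "python",
        "build.py",
        # All CMake flags travel as one space-separated argument.
        f"-cmake-flags={' '.join(cmake_flags)}",
        f"-ipath={install_path}",
        "--cpp",
        "--python",
        "--test",
    ]


# Example paths are placeholders, not values taken from the CI runner.
cmd = cylon_build_command("/home/runner/ucc/install", "/home/runner/cylon/install")
print(shlex.join(cmd))
```

Building the argument list first and quoting it with `shlex.join` keeps the space-separated CMake flags together as one shell word, which is what the double quotes in the workflow step do.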
5 changes: 5 additions & 0 deletions aws/README.md
@@ -0,0 +1,5 @@
# Running Cylon on AWS ECS

Mills Wellons Staylor, III


71 changes: 71 additions & 0 deletions aws/scripts/cloudformation/cylon-elasticache.yaml
@@ -0,0 +1,71 @@
AWSTemplateFormatVersion: 2010-09-09

Parameters:
  AvailabilityZone1:
    Type: String

  AvailabilityZone2:
    Type: String

  CacheEngine:
    Type: String

  CacheEngineVersion:
    Type: String

  CacheNodeType:
    Type: String

  CacheParameterGroupName:
    Type: String

  CacheSecurityGroup:
    Type: String

  CacheSubnet1:
    Type: String

  CacheSubnet2:
    Type: String

  Prefix:
    Type: String

  RedisPort:
    Type: Number

  ReplicaCount:
    Type: Number


Resources:
  SubnetGroup:
    Type: AWS::ElastiCache::SubnetGroup
    Properties:
      CacheSubnetGroupName: !Sub "${Prefix}-subnetgroup"
      Description: !Sub "${Prefix}-SubnetGroup"
      SubnetIds:
        - !Ref CacheSubnet1
        - !Ref CacheSubnet2
      Tags:
        - Key: "name"
          Value: !Sub "${Prefix}-Redis SubnetGroup"


  CacheCluster:
    Type: AWS::ElastiCache::CacheCluster
    Properties:
      ClusterName: !Sub "${Prefix}-Redis"
      CacheNodeType: !Ref CacheNodeType
      CacheSubnetGroupName: !Ref SubnetGroup
      Engine: !Ref CacheEngine
      EngineVersion: 7.0
      NumCacheNodes: 1 #has to be 1 for redis
      VpcSecurityGroupIds:
        - !Ref CacheSecurityGroup
      Tags:
        - Key: "name"
          Value: !Sub "${Prefix}-Redis Cluster"
    DependsOn: SubnetGroup
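This template declares twelve parameters, but the Resources section references only some of them (for example, `EngineVersion` is a literal `7.0` rather than a reference to `CacheEngineVersion`). The sketch below is a plain text-level check, not a CloudFormation validator: it scans an abridged copy of the Resources section for `!Ref Name` and `${Name}` and lists the declared parameters that never appear. The function name `referenced_names` and the abridged string are illustrative assumptions.

```python
import re

# Parameter names as declared in cylon-elasticache.yaml.
DECLARED = [
    "AvailabilityZone1", "AvailabilityZone2", "CacheEngine",
    "CacheEngineVersion", "CacheNodeType", "CacheParameterGroupName",
    "CacheSecurityGroup", "CacheSubnet1", "CacheSubnet2",
    "Prefix", "RedisPort", "ReplicaCount",
]

# Abridged copy of the Resources section (only the lines carrying values).
RESOURCES_TEXT = """
CacheSubnetGroupName: !Sub "${Prefix}-subnetgroup"
Description: !Sub "${Prefix}-SubnetGroup"
SubnetIds:
  - !Ref CacheSubnet1
  - !Ref CacheSubnet2
ClusterName: !Sub "${Prefix}-Redis"
CacheNodeType: !Ref CacheNodeType
CacheSubnetGroupName: !Ref SubnetGroup
Engine: !Ref CacheEngine
EngineVersion: 7.0
NumCacheNodes: 1
VpcSecurityGroupIds:
  - !Ref CacheSecurityGroup
"""


def referenced_names(text):
    """Collect names used via `!Ref Name` or `${Name}` (hypothetical helper)."""
    refs = set(re.findall(r"!Ref\s+(\w+)", text))
    subs = set(re.findall(r"\$\{(\w+)\}", text))
    return refs | subs


used = referenced_names(RESOURCES_TEXT)
unused = [name for name in DECLARED if name not in used]
print(unused)
```

Unused parameters are legal in CloudFormation, so this is only a prompt to double-check that values such as the engine version and port passed by the parent stack are meant to be ignored.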
90 changes: 90 additions & 0 deletions aws/scripts/cloudformation/cylon-redis.yaml
@@ -0,0 +1,90 @@
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  TemplateBucketName:
    Type: String
    Default: staylor.dev2

  Prefix:
    Type: String
    Default: cylon

  Architecture:
    Type: String
    Default: arm64

  AvailabilityZone1:
    Type: String
    Default: us-east-1c

  AvailabilityZone2:
    Type: String
    Default: us-east-1d

  CacheEngine:
    Type: String
    Default: redis

  CacheEngineVersion:
    Type: String
    Default: 6.2

  CacheNodeType:
    Type: String
    Default: cache.t4g.micro

  CacheParameterGroupName:
    Type: String
    Default: default.redis7.cluster.on

  CacheSecurityGroupName:
    Type: String
    Default: sg-0da3e3dcebe706315

  CacheSubnet1:
    Type: String
    Default: subnet-07995eea6c462cd73

  CacheSubnet2:
    Type: String
    Default: subnet-039df5ab7fd94f516

  ImageId:
    Type: String
    Default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2

  InstanceType:
    Type: String
    Default: t4g.nano

  ReplicaCount:
    Type: Number
    Default: 1

  Runtime:
    Type: String
    Default: python3.8

  RedisPort:
    Type: Number
    Default: 6379


Resources:
  ElastiCacheStack:
    Type: AWS::CloudFormation::Stack
    "DeletionPolicy" : "Delete"
    Properties:
      TemplateURL: !Sub "https://s3.amazonaws.com/${TemplateBucketName}/${Prefix}/${Prefix}-elasticache.yaml"
      Parameters:
        AvailabilityZone1: !Ref AvailabilityZone1
        AvailabilityZone2: !Ref AvailabilityZone2
        CacheEngine: !Ref CacheEngine
        CacheEngineVersion: !Ref CacheEngineVersion
        CacheNodeType: !Ref CacheNodeType
        CacheParameterGroupName: !Ref CacheParameterGroupName
        CacheSecurityGroup: !Ref CacheSecurityGroupName
        CacheSubnet1: !Ref CacheSubnet1
        CacheSubnet2: !Ref CacheSubnet2
        Prefix: !Ref Prefix
        RedisPort: !Ref RedisPort
        ReplicaCount: !Ref ReplicaCount
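The parent template resolves the nested template's location from `TemplateBucketName` and `Prefix`, so the nested file has to be uploaded to that exact S3 key before the stack is created. The sketch below mirrors the `!Sub` expression in plain Python and shapes parameter overrides as a list of `ParameterKey`/`ParameterValue` dictionaries, which is the form boto3's CloudFormation `create_stack` call accepts for its `Parameters` argument. The helper names are hypothetical and no AWS call is made here.

```python
def nested_template_url(bucket, prefix):
    """Mirror the !Sub expression used for TemplateURL (hypothetical helper)."""
    return f"https://s3.amazonaws.com/{bucket}/{prefix}/{prefix}-elasticache.yaml"


def to_stack_parameters(overrides):
    """Turn a dict of overrides into ParameterKey/ParameterValue pairs.

    CloudFormation parameter values are passed as strings, so numbers such
    as the Redis port are converted with str().
    """
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(overrides.items())
    ]


# Defaults taken from the template above.
url = nested_template_url("staylor.dev2", "cylon")
params = to_stack_parameters({"Prefix": "cylon", "RedisPort": 6379})
print(url)
print(params)
```

Any parameter not listed in the overrides falls back to the `Default` declared in the template, which is why the account-specific defaults (security group, subnets, bucket) usually need overriding in another account.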
132 changes: 132 additions & 0 deletions aws/scripts/cylon_scaling.py
@@ -0,0 +1,132 @@
import time
import argparse

import pandas as pd
from numpy.random import default_rng
from pycylon.frame import CylonEnv, DataFrame
from cloudmesh.common.StopWatch import StopWatch
from cloudmesh.common.dotdict import dotdict
from cloudmesh.common.Shell import Shell
from cloudmesh.common.util import writefile
from pycylon.net.ucc_config import UCCConfig
from pycylon.net.redis_ucc_oob_context import UCCRedisOOBContext
from pycylon.net.reduce_op import ReduceOp
import boto3
from botocore.exceptions import ClientError
import os

import logging
def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = os.path.basename(file_name)

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True


def join(data=None):
    global ucc_config
    StopWatch.start(f"join_total_{data['host']}_{data['rows']}_{data['it']}")

    redis_context = UCCRedisOOBContext(data['world_size'], f"tcp://{data['redis_host']}:{data['redis_port']}")

    if redis_context is not None:
        ucc_config = UCCConfig(redis_context)

    if ucc_config is None:
        print("unable to initialize uccconfig")

    env = CylonEnv(config=ucc_config, distributed=True)

    context = env.context

    if context is None:
        print("unable to retrieve cylon context")

    communicator = context.get_communicator()

    u = data['unique']

    if data['scaling'] == 'w':  # weak
        num_rows = data['rows']
        max_val = num_rows * env.world_size
    else:  # 's' strong
        max_val = data['rows']
        num_rows = int(data['rows'] / env.world_size)

    rng = default_rng(seed=env.rank)
    data1 = rng.integers(0, int(max_val * u), size=(num_rows, 2))
    data2 = rng.integers(0, int(max_val * u), size=(num_rows, 2))

    df1 = DataFrame(pd.DataFrame(data1).add_prefix("col"))
    df2 = DataFrame(pd.DataFrame(data2).add_prefix("col"))

    for i in range(data['it']):
        env.barrier()
        StopWatch.start(f"join_{i}_{data['host']}_{data['rows']}_{data['it']}")
        t1 = time.time()
        df3 = df1.merge(df2, on=[0], algorithm='sort', env=env)
        env.barrier()
        t2 = time.time()
        t = (t2 - t1) * 1000
        # sum_t = comm.reduce(t)
        sum_t = communicator.allreduce(t, ReduceOp.SUM)
        # tot_l = comm.reduce(len(df3))
        tot_l = communicator.allreduce(len(df3), ReduceOp.SUM)

        if env.rank == 0:
            avg_t = sum_t / env.world_size
            print("### ", data['scaling'], env.world_size, num_rows, max_val, i, avg_t, tot_l, file=open(data['output_summary_filename'], 'a'))
            StopWatch.stop(f"join_{i}_{data['host']}_{data['rows']}_{data['it']}")

    StopWatch.stop(f"join_total_{data['host']}_{data['rows']}_{data['it']}")

    if env.rank == 0:
        StopWatch.benchmark(tag=str(data), filename=data['output_scaling_filename'])
        upload_file(file_name=data['output_scaling_filename'], bucket=data['s3_bucket'], object_name=data['s3_stopwatch_object_name'])
        upload_file(file_name=data['output_summary_filename'], bucket=data['s3_bucket'],
                    object_name=data['s3_summary_object_name'])

    env.finalize()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="cylon scaling")
    parser.add_argument('-n', dest='rows', type=int, required=True)
    parser.add_argument('-i', dest='it', type=int, default=10)
    parser.add_argument('-u', dest='unique', type=float, default=0.9, help="unique factor")
    parser.add_argument('-s', dest='scaling', type=str, default='w', choices=['s', 'w'],
                        help="s=strong w=weak")
    parser.add_argument('-w', dest='world_size', type=int, help="world size", required=True)
    parser.add_argument("-r", dest='redis_host', type=str, help="redis address, default to 127.0.0.1",
                        default="127.0.0.1")
    parser.add_argument("-p", dest='redis_port', type=int, help="name of redis port", default=6379)
    parser.add_argument('-f1', dest='output_scaling_filename', type=str, help="Output filename for scaling results",
                        required=True)
    parser.add_argument('-f2', dest='output_summary_filename', type=str, help="Output filename for scaling summary results",
                        required=True)
    parser.add_argument('-b', dest='s3_bucket', type=str, help="S3 Bucket Name", required=True)
    parser.add_argument('-o1', dest='s3_stopwatch_object_name', type=str, help="S3 Object Name", required=True)
    parser.add_argument('-o2', dest='s3_summary_object_name', type=str, help="S3 Object Name", required=True)

    args = vars(parser.parse_args())
    args['host'] = "aws"
    join(args)

# os.system(f"{git} branch | fgrep '*' ")
# os.system(f"{git} rev-parse HEAD")
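The `-s` option in this script decides how the `-n` row count is interpreted: under weak scaling every rank generates `rows` rows, while under strong scaling `rows` is the global total split across ranks. The sketch below isolates that arithmetic so it can be checked without pycylon, Redis, or a running cluster. The function name `partition_sizes` is hypothetical; the formulas follow the `join` function in the diff.

```python
def partition_sizes(rows, world_size, scaling, unique=0.9):
    """Return (rows per rank, max_val, exclusive upper bound for random keys).

    Standalone copy of the sizing logic in join(); illustrative only.
    """
    if scaling == "w":  # weak: the per-rank size stays fixed
        num_rows = rows
        max_val = rows * world_size
    elif scaling == "s":  # strong: the global size stays fixed
        max_val = rows
        num_rows = int(rows / world_size)
    else:
        raise ValueError("scaling must be 'w' or 's'")
    # Keys are drawn from [0, key_bound), so a smaller unique factor
    # means more duplicate keys and a larger join result.
    key_bound = int(max_val * unique)
    return num_rows, max_val, key_bound


weak = partition_sizes(1000, 4, "w")
strong = partition_sizes(1000, 4, "s")
print("weak  :", weak)
print("strong:", strong)
```

With `-n 1000 -w 4`, weak scaling joins 1000 rows per rank (4000 in total), while strong scaling joins 250 rows per rank, so the two modes are not comparable on per-iteration time alone.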