Skip to content
This repository has been archived by the owner on Apr 24, 2020. It is now read-only.
/ spread Public archive

Spread is a missing link in the unix toolchain to do simple distributed data mining.

License

Notifications You must be signed in to change notification settings

wavii/spread

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spread

Spread is a missing link in the unix toolchain to do simple distributed data mining. Spread partitions data according to key across a fleet. Spread shines when used in conjunction with fleet management/coordination/persistence tools like:

Quickstart

You'll need CMake and Boost to build spread:

sudo apt-get install -y cmake libboost-all-dev

Then you can fetch and install spread:

git clone [email protected]:wavii/spread.git
cd spread
cmake . && make && sudo make install

Examples

Given a hostfile on each machine with the format host1\nhost2\nhost3\n..., here is a distributed wordcount using spread:

tr ' ' '\n' < input | spread -f hosts | sort | uniq -c > output

Spread can also be used as a library for writing your own map programs in C++:

void map(tcp::endpoint& local_endpoint, vector<tcp::endpoint>& remote_endpoints)
{
   asio::io_service io;
   spreader<> s(io, local_endpoint, remote_endpoints, cout);
   string line;
   while (getline(cin, line))
      s << line.substr(line.find("USER: ")); // map the user in a log line
}

Spread works well with cloud concepts. Here's an example of a bash script you could run on each host to operate on S3 data in parallel:

#!/bin/bash

# grab fleet index and size, perhaps this was provided on each machine by Chef
fleetsize=$(wc -l .fleet-hosts | awk '{print $1}')
fleetsize=$(echo "obase=16; $fleetsize" | bc)
id=$(cat .fleet-id)

for uri in `s3cmd ls s3://data-warehouse/ | awk '{print $4}'`; do
   md5id=$(echo $uri | md5sum | awk '{print toupper($1)}')
   remainder=$(echo "ibase=16; $md5id % $fleetsize" | bc)
   if [ $remainder -eq $id ]; then
      s3cmd get --no-progress --force $uri /mnt/input/
   fi
done

# now spread the input we've collected on each machine
cat /mnt/input/* | tr ' ' '\n' | spread -f hosts | sort | uniq -c > output

License

Spread via the MIT License. Have fun!

About

Spread is a missing link in the unix toolchain to do simple distributed data mining.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published