Tutorial
Spartan is still a work in progress. You may encounter issues while running this tutorial; if this happens, please file an issue with the bug tracker.
This tutorial walks you through fetching the Spartan code, setting it up on your machine, and writing and running a simple application (linear regression).
The source distribution of Spartan requires Cython to be installed. If you do not have it already, you can install it via pip:
pip install [--user] cython
or, on Debian systems, via:
apt-get install cython
The newest version of Spartan is available via the GitHub repo; clone it to your machine using:
git clone https://github.com/rjpower/spartan.git
To install Spartan and its dependencies, use setup.py:
cd spartan
python setup.py develop --user
We're now ready to start using Spartan. To use Spartan in an application, just import it and call initialize:
import spartan as sp
sp.initialize()
By default, Spartan runs in a multi-threaded mode. This is convenient for testing, but because of the Python GIL, it won't run any faster than a single process. If you want your application to run faster, you'll have to start Spartan in cluster mode.
Spartan has built-in support for running on a cluster of machines via ssh. To run Spartan in cluster mode, we just change our call to initialize (alternatively, we can specify options via command-line flags, as shown below):
sp.initialize(['--cluster=1', '--hosts=localhost:4,foxtrot:8,bobcat:8'])
This tells Spartan to run 4 worker processes on the local machine, and 8 processes on each of foxtrot and bobcat. This assumes you have passwordless ssh access to the machines (i.e. you are using public key authentication and ssh-agent).
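If we prefer, the same options can instead be given on the command line when launching our script. A sketch (my_app.py is just a placeholder name here, and this assumes Spartan picks up command-line flags when initialize() is called without arguments, as the note above suggests):
python my_app.py --cluster=1 --hosts=localhost:4,foxtrot:8,bobcat:8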
The full list of flags understood by Spartan can be found by running:
python -c 'import spartan; spartan.initialize(["--help"])'
Specifying flags on the command line or via initialize is a pain. Instead of doing this every time, we can put any flags we want to use into a spartan.ini file in our home directory: $HOME/.config/spartan/spartan.ini. Flags will automatically be pulled in from this file (command-line options override the ini settings).
# spartan.ini
[flags]
hosts=a:8,b:8,c:8
cluster=1
An important note before we start: Spartan looks like NumPy, but it uses lazy evaluation, capturing operations into an expression graph before running them. This results in a few differences from normal NumPy code. For example, suppose we run the following code (as a script, or interactively via IPython):
# test_simple.py
import spartan as sp
sp.initialize()
x = sp.rand(10000, 10000)
y = sp.rand(10000, 10000)
z = x + y
We find that it returns much faster than we'd expect. What's happening is that the operation is being deferred, which we can see if we print z:
print z
MapExpr {
children = DictExpr {
vals = {'k3': MapExpr {
children = DictExpr {
vals = {'k1': NdArrayExpr {
_shape = (10000, 10000),
sparse = False,
dtype = <type 'float'>,
tile_hint = None,
reduce_fn = None,
expr_id = 3,
...
If we want to make sure a Spartan expression is evaluated, we can force it:
z.force()
After we do this, our console will stall for a bit while computing the result. We can inspect z using the normal slicing operators:
zslice = z[0:10, 0:10]
If we print zslice, we see that it's another expression node. We can see the actual result by calling glom():
print zslice.glom()
[[ 0.40431615 0.78758898 0.64372971 0.83738517 0.35252063 0.61085179
0.50201212 0.77996823 1.01946723 1.54100078]
[ 0.8255713 0.9784094 0.5944809 0.9151916 1.62231947 0.6985127
1.05003632 1.10276565 0.50976401 1.79484165]
[ 1.54347696 0.91283842 1.21791409 1.56077292 0.81929879 1.21397101
0.7277431 1.19146302 1.08149324 1.30490862]
[ 0.82468134 0.63385957 1.38083906 1.4475998 1.55722686 1.59542322
0.71032193 1.22207764 1.39695799 0.56424774]
[ 1.92879978 1.07464252 0.54652076 0.60779678 1.4911869 0.7863396
0.77091178 0.41473159 1.78402857 1.46132885]
[ 1.3920112 0.71718343 0.04712277 1.78117627 0.53857002 0.85893516
0.57882432 0.85399033 1.28200041 1.4449996 ]
[ 1.04510724 0.99072941 0.65680299 1.10509358 1.17346329 0.87073785
1.1710321 0.55426738 1.36207195 0.29851448]
[ 1.68384304 0.39496023 1.61920443 0.06775426 1.45594822 1.28999251
1.09191703 0.20535368 0.43640492 0.52627781]
[ 0.62870181 1.15012164 0.62304233 0.90594462 1.05958128 0.64907288
0.93111492 1.3595818 0.84221813 1.60843973]
[ 1.64512868 1.20342383 1.66162832 1.27969195 1.21537476 0.52412064
1.00017709 1.32339968 0.64233495 1.34834738]]
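Since glom() appears to hand the evaluated data back as an ordinary NumPy array (as the printed output above suggests), we can feed the result directly into regular NumPy code. A small sketch:
local = zslice.glom()   # local should be a plain 10x10 NumPy array
print local.shape
print local.mean()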
We're now ready to write a real application using Spartan; in this case, we're going to implement linear regression on a made-up dataset.
import spartan as sp
sp.initialize()
N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6
x = 100 * sp.ones((N_EXAMPLES, N_DIM)) + sp.rand(N_EXAMPLES, N_DIM)
y = sp.ones((N_EXAMPLES, 1))
# put weights on one server
w = sp.rand(N_DIM, 1)
for i in range(50):
    yp = sp.dot(x, w)                                 # predictions
    diff = x * (yp - y)                               # per-example gradient contributions
    grad = sp.sum(diff, axis=0).reshape((N_DIM, 1))   # gradient of the squared error
    w = w - (grad / N_EXAMPLES * EPSILON)             # gradient descent step
    print grad.sum().glom()                           # force evaluation and print progress
We can run this with the log level set to WARN to cut down on the amount of log output:
python lreg.py --log_level=WARN
For a simple problem like this one, with such a small amount of data, Spartan can end up being slower than NumPy. As the dataset size increases, we expect Spartan's relative performance to improve.
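For reference, here is a minimal single-machine NumPy version of the same loop that can serve as a rough timing baseline (a sketch only, kept deliberately close to the Spartan version above for comparison):
import numpy as np

N_DIM = 10
N_EXAMPLES = 1000 * 1000
EPSILON = 1e-6

x = 100 * np.ones((N_EXAMPLES, N_DIM)) + np.random.rand(N_EXAMPLES, N_DIM)
y = np.ones((N_EXAMPLES, 1))
w = np.random.rand(N_DIM, 1)

for i in range(50):
    yp = np.dot(x, w)                                          # predictions
    grad = np.sum(x * (yp - y), axis=0).reshape((N_DIM, 1))    # gradient of the squared error
    w = w - (grad / N_EXAMPLES * EPSILON)                      # gradient descent step
    print grad.sum()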