This project contains the code to interact with the SFrame open-source project from within Apache Spark. Currently, the jar created by this project is included in the GraphLab Create Python egg to enable translation between Apache Spark DataFrames and GraphLab Create SFrames. Users can also use this project in the Scala Spark shell to export DataFrames as SFrames.
The spark-sframe package relies on several shared libraries that are built directly from the open-source SFrame package: cy_callback.so, cy_cpp_utils.so, cy_flexible_type.so, and cy_spark_unity.so.
To build the spark-unity.jar, all you need is a Java JDK installed on your platform; then run the pre-bundled mvn:
cd spark-sframe
build/mvn package
This will both test and build the spark-unity.jar on your platform.
To use GraphLab Create within PySpark, you need to set the $SPARK_HOME and $PYTHONPATH environment variables on the driver. A common usage:
export SPARK_HOME=<your-spark-home-dir>
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
cd $SPARK_HOME
bin/pyspark
Alternatively, you can work from a plain IPython session. Make sure you have exported the PYTHONPATH and SPARK_HOME environment variables, then run (for example):
ipython
Then you need to start Spark:
from pyspark import SparkContext
from pyspark.sql import SQLContext
# Launch spark by creating a spark context
sc = SparkContext()
# Create a SparkSQL context to manage dataframe schema information.
sql = SQLContext(sc)
To translate an RDD into an SFrame:
from sframe import SFrame
rdd = sc.parallelize([(x,str(x), "hello") for x in range(0,5)])
sframe = SFrame.from_rdd(rdd, sc)
print sframe
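To sanity-check the inferred schema, you can inspect the resulting SFrame's columns. A minimal sketch, assuming the open-source sframe package mirrors the GraphLab Create SFrame API (which provides column_names and column_types):
# Inspect the schema that from_rdd inferred from the tuples.
print sframe.column_names()  # e.g. ['X1', 'X2', 'X3']
print sframe.column_types()  # e.g. [int, str, str]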
To translate a Spark DataFrame into an SFrame (preserving the schema information that SparkSQL tracks):
from sframe import SFrame
rdd = sc.parallelize([(x,str(x), "hello") for x in range(0,5)])
df = sql.createDataFrame(rdd)
sframe = SFrame.from_rdd(df, sc)
print sframe
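The reverse direction may also be available: GraphLab Create's SFrame exposes a to_rdd method, so if your build of the sframe package includes it you can convert back. Treat the exact signature as an assumption and check your version's API:
# Hedged: to_rdd may not be present in every build of the sframe package.
back_rdd = sframe.to_rdd(sc)
print back_rdd.collect()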
To export a DataFrame as an SFrame from the Scala Spark shell, launch the shell with the spark_unity jar:
cd $SPARK_HOME
bin/spark-shell --jars spark-sframe/target/spark_unity-0.1.jar
import org.graphlab.create.GraphLabUtil
import sqlContext.implicits._ // needed for the toDF conversion
val df = sc.parallelize(Array(1, 2, 3)).toDF // Must be a DataFrame
val outputDir = "/tmp/graphlab_testing" // Must be an HDFS path unless running in local mode
val prefix = "test"
val sframeFileName = GraphLabUtil.toSFrame(df, outputDir, prefix)
println(sframeFileName)
To read the saved SFrame back into a Spark RDD:
import org.graphlab.create.GraphLabUtil
val newRDD = GraphLabUtil.toRDD(sc, "/tmp/graphlab_testing/test.frame_idx")
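Since toSFrame writes an ordinary binary SFrame to disk, the exported frame should also be openable from Python. A minimal sketch, assuming the sframe package ships a load_sframe helper mirroring graphlab.load_sframe; whether it takes the .frame_idx file or its containing directory may depend on your version:
from sframe import load_sframe
# Hypothetical round trip: open the frame the Scala shell just wrote.
sf = load_sframe('/tmp/graphlab_testing/test.frame_idx')
print sf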
The current release requires Python 2.7 and Spark 1.3 or later, and the hadoop binary must be on the PATH of the driver when running on a cluster or interacting with Hadoop (e.g., you should be able to run hadoop classpath).
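A quick way to verify this from the driver is to invoke hadoop classpath from Python before starting Spark; this minimal sketch simply fails loudly if the binary is missing:
import subprocess
# Verify the hadoop binary is reachable from the driver's PATH.
try:
    subprocess.check_output(['hadoop', 'classpath'])
    print 'hadoop found on PATH'
except OSError:
    print 'hadoop is NOT on PATH; fix PATH before running on a cluster'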
We currently support only Mac and Linux platforms, but Windows support is coming soon.
We recommend downloading the "Pre-built for Hadoop 2.4 and later" version of Apache Spark.
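Once a SparkContext is up, you can confirm the driver meets the 1.3-or-later requirement (sc.version reports the running Spark version):
# Confirm the running Spark version satisfies the 1.3+ requirement.
print sc.version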