Running Shark on EC2

Shark can be launched on EC2 using the Spark EC2 scripts that come with Spark. These scripts let you launch, pause and destroy clusters that come automatically configured with HDFS, Spark, Apache Mesos, and Shark.

Launching a Cluster

To run a Shark cluster on EC2, first sign up for an Amazon EC2 account on the Amazon Web Services site. Then, download Spark to your local machine using git clone git://github.com/mesos/spark.git, go into its ec2 directory, and follow the instructions in the Spark EC2 guide to launch a cluster. In a nutshell, you will need to do:

$ ./spark-ec2 -k <keypair-name> -i <key-file> -s <num-slaves> launch <cluster-name>

Where <keypair> is the name of your EC2 key pair (that you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), and <cluster-name> is the name to give to your cluster.

Login to the master using the spark-ec2 login command:

$ ./spark-ec2 -k key -i key.pem login <cluster-name>

Then, launch Shark by going into the shark directory:

$ cd shark
$ bin/shark-withinfo

(Note that you currently need to run Shark out of the shark directory to have it find the default metastore.)

The "with info" script prints INFO level log messages to the console. If you prefer, you can also leave these out by running bin/shark.

Accessing Data in S3

You can use Hive's CREATE EXTERNAL TABLE command to access data in a directory in S3. First, configure your S3 credentials by adding the following properties into ~/ephemeral-hdfs/conf/core-site.xml:

<property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>ID</value>
</property>

<property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>SECRET</value>
</property>

Then create an S3-backed table in bin/shark as described in the Hive S3 guide:

shark> CREATE EXTERNAL TABLE table_name (col1 type1, col2 type2, ...) <storage info> LOCATION 's3n://bucket/directory/';

Creating Tables in HDFS

spark-ec2 automatically sets up two HDFS file systems: ephemeral-hdfs, which uses the ephemeral disks attached to your VMs that go away when a VM is stopped, and persistent-hdfs, which is EBS-backed and persists across pausing and starting the same cluster. By default, Shark stores its tables in ephemeral-hdfs, which provides a lot of space and is excellent for temporary tables, but is not meant for long-term storage. You can change HADOOP_HOME in conf/shark-env.sh to change this, or explicitly upload data to S3 or to the persistent-hdfs.

Like Hive, Shark stores its tables in /user/hive/warehouse on the HDFS instance it's configured with. You can either create a table there with CREATE TABLE and upload data into /user/hive/warehouse/<table_name>, or load elsewhere in HDFS and use CREATE EXTERNAL TABLE.

Configuring Memory Size

On the master node, edit ~/shark/conf/shark-env.sh to set the SPARK_MEM property, which sets how much RAM to use per node. The default is 3 GB, but on machines with more RAM, you should set it to the total memory minus about 2 GB for the operating system.

Example: Wikipedia Data

To make it easy to try Shark, we've made available both a small and a large dump of Wikipedia collected by Freebase. They are available in S3 directories spark-data/wikipedia-sample (40 MB) and spark-data/wikipedia-2010-09-12 (50 GB). Both are stored as tab-separated files containing one record for each article in Wikipedia, with five fields: article ID, title, date modified, XML, and plain text.

Let's first create an external table for the smaller sample dataset:

shark> CREATE EXTERNAL TABLE wiki_sample (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';

Now we can query it as follows:

shark> SELECT COUNT(1) FROM wiki_small WHERE TEXT LIKE '%Berkeley%';
shark> SELECT title FROM wiki_small WHERE TEXT LIKE '%Berkeley%';

We can also cache the table in memory by using the CREATE TABLE AS SELECT statement with a table name that ends in _cached:

shark> CREATE TABLE wiki_small_cached AS SELECT * FROM wiki_small;

And then query the cached data for faster access:

shark> SELECT COUNT(1) FROM wiki_small_cached WHERE TEXT LIKE '%Berkeley%';

Or, we can cache just a subset of the table, such as just two of the columns (or any other SQL expression we wish):

shark> CREATE TABLE title_and_text_cached AS SELECT title, text FROM wiki_small;

Finally, you can try the same commands on the full 50 GB dataset by using:

shark> CREATE EXTERNAL TABLE wiki_full (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-2010-09-12/';

Note that to process this dataset quickly, you'll probably need at least 15 m1.xlarge EC2 nodes in your cluster. (Pass -s 15 -t m1.xlarge to spark-ec2 for example.) In our tests, a 15-node cluster launched with these settings can scan the dataset from S3 in about 80 seconds and can easily cache the text and title columns in memory and speed up queries to about 2 seconds.

Provide feedback

Saved searches