Shark Release 0.2

Shark 0.2.1 Release

Release date: Nov 22, 2012 (Happy Thanksgiving!)

Shark 0.2.1 is a minor release focused on bug fixes.

Spark 0.6.1: We upgraded the Spark version from 0.6 to 0.6.1. The new version of Spark fixes a number of stability and reliability issues. See the Spark 0.6.1 changelog for more information.
Allow spilling large tables to disk: Shark 0.2.1 can now spill tables to disk when they are larger than the collective memory of the cluster.

Shark 0.2 Release

Release date: Oct 15, 2012

Shark 0.2 is the first Shark release since the original 0.1 prototype release. The new version brings new features, performance improvements, and stability to Shark. See the documentation on the GitHub wiki to get started: https://github.com/amplab/shark/wiki

Major changes are documented below:

Hive Compatibility

Shark now works with Hive 0.9, which introduces numerous features over the original Hive 0.7.
Hive UDFs and UDAFs are fully supported now.
Shark 0.2 also supports distributing resource files (e.g. jars) to the slaves using Hive's ADD FILE command.

Simpler Deployment

We have significantly simplified the deployment process.
For example, the Running Shark Locally guide shows how to launch Shark 0.2 locally in about 5 minutes.
In addition to running on Mesos, Shark now supports Spark's standalone deploy mode that lets you quickly launch a cluster without installing an external cluster manager. The standalone mode only needs Java installed on each machine, with Spark deployed to it.

Hive Thrift Server

Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server, which is compatible with Hive's Thrift server.
The Thrift server runs as a long-lived process and supports multiple clients connecting to it. These clients can access the same warehouse, using the same set of cached tables.
To start the server on the default port (10000), run:

$ bin/shark --service sharkserver

Query Execution and Performance Improvements

Map side aggregation is now turned on by default. If not enough reduction is observed, Shark turns it off automatically. The user no longer needs to explicitly set hive.map.aggr.
We have rewritten Shark's join and group by code. For queries that have a large number of distinct keys, join and group by performance can increase by 2X.
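The automatic fallback for map side aggregation can be illustrated with a minimal Python sketch. This is not Shark's actual implementation: the function name map_side_sum and the parameters check_interval and min_reduction are hypothetical, chosen only to show the idea of sampling the first rows, measuring how much the aggregation reduces the data, and switching to pass-through when the reduction is too small.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_side_sum(
    rows: Iterable[Tuple[str, int]],
    check_interval: int = 4,     # hypothetical: rows to observe before deciding
    min_reduction: float = 0.5,  # hypothetical: max allowed distinct-keys / rows ratio
) -> Iterator[Tuple[str, int]]:
    """Yield partial (key, sum) pairs produced by a single map task."""
    partial: Dict[str, int] = defaultdict(int)
    seen = 0
    aggregating = True

    for key, value in rows:
        if not aggregating:
            # Aggregation was turned off: forward rows unchanged.
            yield key, value
            continue

        partial[key] += value
        seen += 1

        # After check_interval rows, compare distinct keys to rows seen.
        if seen == check_interval and len(partial) / seen > min_reduction:
            # Too many distinct keys, so the hash table is not paying off.
            # Flush what has been aggregated so far and stop aggregating.
            aggregating = False
            yield from partial.items()
            partial.clear()

    # Emit whatever is still held in the hash table (empty after a flush).
    yield from partial.items()
```

In both modes the reduce side still has to sum the values per key, so the final query result is the same. Only the volume of data sent to the reducers differs.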

Spark Compatibility

Shark 0.2 requires Spark 0.6 as it takes advantage of the new features and performance improvements from the new Spark release.

Miscellaneous

If you feel the _cached table name suffix is a hacky way to indicate whether a table should be cached in memory, Shark 0.2 supports specifying the boolean flag using table properties when the table is created. For example:

CREATE TABLE myTable TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM myInput;

Credits

Shark 0.2 was the work of a large set of new contributors from Berkeley and outside.

Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server.
Harvey Feng contributed the Hive 0.9 upgrade and improved map join implementation.
Antonio Lupher contributed the map side aggregation tuning implementation.
Denny Britz contributed support for ADD FILE and UDF/UDAF dynamic class loading.
Patrick Wendell contributed the revamped documentation and extensive testing.
Paul Ruan helped with testing.
