Home
Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
There is a small collection of helpful documents on Shark. The shark-users mailing list is also very active and is a good resource for beginners. We use JIRA to track development and issues. You can report bugs through either the mailing list or JIRA.
Running Shark Locally: Get Shark up and running on a single node for a quick spin in about 5 minutes.
Running Shark on EC2: Launch a Shark cluster on Amazon EC2 in about 10 minutes, including examples of how to query data in S3.
Running Shark on a Cluster: Get Shark up and running on your own cluster.
Shark User Guide: An introduction to running Shark and its API.
Building Shark from Source Code
Compatibility with Apache Hive: Deploying Shark in existing Hive Warehouses.
shark-0.2.1-bin.tgz — Shark 0.2.1 binary with patched Hive 0.9 and Spark 0.6.1 jars - Hadoop1/CDH3
shark-0.2.1-bin-hadoop2.tgz — Shark 0.2.1 binary with patched Hive 0.9 and Spark 0.6.2 jars - Hadoop2/CDH4
shark-0.2-bin.tgz — Shark 0.2 binary with patched Hive 0.9 and Spark 0.6.2 jars
hive-0.9.0-bin.tar.gz — Patched Hive 0.9
Shark Release 0.2.1 - Nov 22, 2012
Shark Release 0.2 - Oct 15, 2012
Developer Guide: For people who are interested in contributing.
Startup Tasks for New Contributors
Hive Patches: Patches we made to Hive.
Shark is developed in the UC Berkeley AMP Lab. The research and development is supported in part by NSF CISE Expeditions award CCF-1139158; by gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Splunk, and VMware; and by DARPA (contract #FA8650-11-C-7136).
YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.
Spark: The in-memory cluster computing framework that powers Shark.
Apache Hive: Apache Hive data warehouse system.
Apache Mesos: A cluster manager that provides efficient resource isolation and sharing across distributed applications.