-
Notifications
You must be signed in to change notification settings - Fork 327
Running Shark on EC2
Shark can be launched on EC2 through the same Spark EC2 scripts that come with Spark. These scripts let you launch, pause and destroy clusters that come automatically configured with Spark, Shark, HDFS, and Apache Mesos.
To run a Shark cluster on EC2, first sign up for an Amazon EC2 account on the Amazon Web Services site. Then, download Spark to your local machine using git clone git://github.com/mesos/spark.git
, and follow the instructions in the Spark EC2 guide to launch a cluster using the script in the ec2
directory. In a nutshell, you will need to do:
$ ./spark-ec2 -k <keypair-name> -i <key-file> -s <num-slaves> launch <cluster-name>
Where <keypair>
is the name of your EC2 key pair (that you gave it when you created it), <key-file>
is the private key file for your key pair, <num-slaves>
is the number of slave nodes to launch (try 1 at first), and <cluster-name>
is the name to give to your cluster.
Login to the master using the spark-ec2 login
command:
$ ./spark-ec2 -k key -i key.pem login <cluster-name>
Then, launch Shark by going into the shark
directory:
$ cd shark
$ bin/shark-withinfo
(Note that you currently need to run Shark out of the shark
directory to have it find the default metastore.)
The "with info" script prints INFO level log messages to the console. If you prefer, you can also leave these out by running bin/shark
.
You can use Hive's CREATE EXTERNAL TABLE
command to access data in a directory in S3. First, configure your S3 credentials by adding the following properties into ~/ephemeral-hdfs/conf/core-site.xml
:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
Then create an S3-backed table in bin/shark
as described in the Hive S3 guide:
shark> CREATE EXTERNAL TABLE table_name (col1 type1, col2 type2, ...) <storage info> LOCATION 's3n://bucket/directory/';
spark-ec2
automatically sets up two HDFS file systems: ephemeral-hdfs
, which uses the ephemeral disks attached to your VMs that go away when a VM is stopped, and persistent-hdfs
, which is EBS-backed and persists across pausing and starting the same cluster. By default, Shark stores its tables in ephemeral-hdfs
, which provides a lot of space and is excellent for temporary tables, but is not meant for long-term storage. You can change HADOOP_HOME
in conf/shark-env.sh
to change this, or explicitly upload data to S3 or to the persistent-hdfs
.
Like Hive, Shark stores its tables in /user/hive/warehouse
on the HDFS instance it's configured with. You can either create a table there with CREATE TABLE
and upload data into /user/hive/warehouse/<table_name>
, or load elsewhere in HDFS and use CREATE EXTERNAL TABLE
.
On the master node, edit ~/shark/conf/shark-env.sh
to set the SPARK_MEM
property, which is how much memory Spark will use per node. The default is 3 GB, but on machines with more RAM, you should set it to the total memory minus about 2 GB for the operating system.
To make it easy to try Shark, we've made available both a small and a large dump of Wikipedia collected by Freebase. They are available in S3 directories spark-data/wikipedia-sample
(40 MB) and spark-data/wikipedia-2010-09-12
(50 GB). Both are stored as tab-separated files containing one record for each article in Wikipedia, with five fields: article ID, title, date modified, XML, and plain text.
Let's first create an external table for the smaller sample dataset:
shark> CREATE EXTERNAL TABLE wiki_sample (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
Now we can query it as follows:
shark> SELECT COUNT(1) FROM wiki_small WHERE TEXT LIKE '%Berkeley%';
shark> SELECT title FROM wiki_small WHERE TEXT LIKE '%Berkeley%';
We can also cache the table in memory by using the CREATE TABLE AS SELECT
statement with a table name that ends in _cached
:
shark> CREATE TABLE wiki_small_cached AS SELECT * FROM wiki_small;
And then query the cached data for faster access:
shark> SELECT COUNT(1) FROM wiki_small_cached WHERE TEXT LIKE '%Berkeley%';
Or, we can cache just a subset of the table, such as just two of the columns (or any other SQL expression we wish):
shark> CREATE TABLE title_and_text_cached AS SELECT title, text FROM wiki_small;
Finally, you can try the same commands on the full 50 GB dataset by using:
shark> CREATE EXTERNAL TABLE wiki_full (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-2010-09-12/';
Note that to process this dataset quickly, you'll probably need at least 15 m1.xlarge
EC2 nodes in your cluster. (Pass -s 15 -t m1.xlarge
to spark-ec2
for example.) In our tests, a 15-node cluster launched with these settings can scan the dataset from S3 in about 80 seconds and can easily cache the text
and title
columns in memory and speed up queries to about 2 seconds.