Since Apache Spark 1.4 came out I've been wanting to deploy it on my computer to test it out. I hit a few bumps along the way, but at this point I think I've got a pretty good idea of how it's done.

Download Spark

The first step is to download the newest version of Spark onto your machine. Go to the Spark downloads page and select Spark 1.4, pre-built for Hadoop 2.6; that's the version I've installed. Once it's downloaded, untar it either with your file browser or with the command tar -xzf spark-1.4.0-bin-hadoop2.6.tgz, then move the resulting directory somewhere safe on your computer.
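If you'd rather do the whole thing from a terminal, here's a minimal sketch of the same steps; the archive URL and the target directory are assumptions, so adjust them for your own setup.

# download the pre-built package (URL is an assumption; use whatever the downloads page gives you)
wget https://archive.apache.org/dist/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz

# unpack it and move it somewhere it won't get deleted by accident
tar -xzf spark-1.4.0-bin-hadoop2.6.tgz
mv spark-1.4.0-bin-hadoop2.6 ~/Projects/spark/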

Setting up Your Environment

Once we have the directory all safe and secure, we need to set a few environment variables, most notably $SPARK_HOME.

In your .bashrc, add the following line:

export SPARK_HOME=/home/william/Projects/spark/spark-1.4.0-bin-hadoop2.6

replacing the path with whichever directory Spark resides in on your machine.
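To sanity-check that the variable actually took effect (assuming the path above), open a new terminal or re-source your .bashrc:

source ~/.bashrc
echo $SPARK_HOME      # should print /home/william/Projects/spark/spark-1.4.0-bin-hadoop2.6
ls $SPARK_HOME/sbin   # should list start-all.sh, stop-all.sh, and friends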

Configuring Spark Options

Since 1.3 the method for deploying a standalone cluster has changed a little bit, so we need to configure the worker options ourselves. First, copy the template file provided at $SPARK_HOME/conf/spark-env.sh.template to $SPARK_HOME/conf/spark-env.sh. Now edit the new file with your favorite editor and add the following line:

SPARK_WORKER_INSTANCES=4

This will tell the scripts to create 4 workers. You can adjust this as necessary, as well as adjust any other parameters in the configuration file.
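Put together, the configuration step looks roughly like this. The extra worker settings are just examples of other knobs that live in the same file, with made-up values; only SPARK_WORKER_INSTANCES is needed for this walkthrough.

# create your own config from the provided template
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

# then add worker settings to $SPARK_HOME/conf/spark-env.sh, for example:
SPARK_WORKER_INSTANCES=4   # number of worker processes to start
SPARK_WORKER_CORES=1       # cores per worker (example value)
SPARK_WORKER_MEMORY=1g     # memory per worker (example value)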

The Run Script

The file $SPARK_HOME/sbin/start-all.sh will start the master and all of your workers. It's kinda clunky to start and stop by hand, so I wrote a Python script to make things work a bit nicer.

#!/usr/bin/env python3

import os
import argparse
import time

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--python', action='store_true', default=False,
                    help='also launch an IPython notebook with the pyspark profile')
args = parser.parse_args()

try:
    # load-spark-env.sh pulls in the settings from conf/spark-env.sh
    # (start-all.sh sources it too, so this is mostly belt and braces)
    os.system('bash $SPARK_HOME/bin/load-spark-env.sh')
    # start the master and the workers configured above
    os.system('bash $SPARK_HOME/sbin/start-all.sh')
    print("Spark Running")
    if args.python:
        # launch an IPython notebook with my pyspark profile
        os.system('ipython2.7 notebook --profile=pyspark')
    else:
        # otherwise just idle until Ctrl+C
        while True:
            time.sleep(3600)
except KeyboardInterrupt:
    pass
finally:
    # shut the cluster down on the way out, including after Ctrl+C
    os.system('bash $SPARK_HOME/sbin/stop-all.sh')

And that's it! If you put that script in $SPARK_HOME and run it, it will load the rest of the Spark environment variables and start Spark. Once you press Ctrl+C it will quit gracefully, shutting Spark down with it.
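For the record, here's how I use it. The script name start_spark.py is just what I happen to call it (name it whatever you like), and your hostname will differ.

# start the cluster and idle until Ctrl+C
python3 $SPARK_HOME/start_spark.py

# or start the cluster plus an IPython notebook
python3 $SPARK_HOME/start_spark.py -p

# while it's running, the master's web UI is at http://localhost:8080, and you
# can point a shell at the cluster with something like:
$SPARK_HOME/bin/pyspark --master spark://yourhostname:7077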