Running Apache Spark on Google Cloud

Learn how to run a single-node Apache Spark instance on Google Compute Engine that reads files directly from Google Cloud Storage.

Recently we announced that we had implemented a setup that allowed us to bypass Hadoop and connect Apache Spark to Google Cloud Storage.

As far as we know this has rarely, if ever, been done before, though it is important to point out that the information has always been available on the internet. It was only a matter of time before someone put the pieces together and got it to work.

I’ve put together a quick tutorial to show you how to set up a single-node instance on Google Compute Engine so Spark can read files directly from Google Cloud Storage.

Tutorial

1. The first thing you need to do is create an instance on Google Compute Engine. There are two important parameters to watch out for: service_account and service_account_scope. You have to make sure the service account has access to your Google Cloud Storage bucket.

gcutil --project=PROJECTID addinstance test-01 --wait_until_running \
  --image=debian-7 --zone=us-central1-a --machine_type=n1-standard-1 \
  --service_account=USERID@developer.gserviceaccount.com \
  --service_account_scope=storage-full

2. SSH into the instance and update the package lists.

gcutil ssh test-01
sudo apt-get update
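
The scopes are baked in at instance creation time, so before going further it can be worth double-checking from inside the instance that the storage scope really is attached. One way to do that (assuming the v1 metadata endpoint is available on your image) is to ask the metadata server:

# List the OAuth scopes attached to the instance's default service account;
# devstorage.full_control should be among them.
wget -q -O - --header="Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"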

3. Install Java, then Git; you will need Git later when building the Spark assembly.

sudo apt-get -y install openjdk-7-jre
sudo apt-get -y install openjdk-7-jdk
sudo apt-get -y install git
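
A quick sanity check that the tools landed where we expect before moving on:

java -version
git --version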

4. Fetch and uncompress Scala, then do the same for Spark.

wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz && tar -xvf scala-2.10.4.tgz
wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1.tgz && tar -xvf spark-0.9.1.tgz

5. Compile Spark.

cd spark-0.9.1
sbt/sbt assembly
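
The assembly build can take a while on an n1-standard-1. When it finishes, the fat jar should land under assembly/target/scala-2.10/ (the exact file name depends on the Hadoop version it was built against); you can confirm with:

ls assembly/target/scala-2.10/spark-assembly-*.jar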

6. Now let’s create our core-site.xml config file in ./spark-0.9.1/conf/ with the following content, replacing PROJECTID and BUCKETNAME with your own values.

<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>PROJECTID</value>
  </property>
  <property>
    <name>fs.gs.system.bucket</name>
    <value>BUCKETNAME</value>
  </property>
</configuration>

7. Then fetch the Google Cloud Storage connector for Hadoop into Spark’s lib_managed/jars directory.

cd lib_managed/jars
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar
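
As a quick sanity check, you can confirm that the jar contains the filesystem class referenced by fs.gs.impl in core-site.xml (the jar tool comes with the JDK installed in step 3):

# Should print the path of the GoogleHadoopFileSystem class inside the jar.
jar tf gcs-connector-1.2.4.jar | grep GoogleHadoopFileSystem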

8. Next, set the environment variables (replace USER with your username on the instance).

export PATH=/home/USER/scala-2.10.4/bin:$PATH
export SCALA_HOME=/home/USER/scala-2.10.4
export SPARK_CLASSPATH=/home/USER/spark-0.9.1/lib_managed/jars/gcs-connector-1.2.4.jar
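
These exports only live in the current shell session. If you want them to survive a new login, one option is to append them to ~/.bashrc as well:

cat >> ~/.bashrc <<'EOF'
export PATH=/home/USER/scala-2.10.4/bin:$PATH
export SCALA_HOME=/home/USER/scala-2.10.4
export SPARK_CLASSPATH=/home/USER/spark-0.9.1/lib_managed/jars/gcs-connector-1.2.4.jar
EOF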

9. Now create a Python file called test.py to test the setup, with this content.

from pyspark import SparkContext

# Point Spark at the gzipped files in the bucket via the gs:// scheme.
logFile = "gs://BUCKETNAME/test/*gz"
sc = SparkContext("local", "test_script")
logData = sc.textFile(logFile).cache()
# Count the number of lines across all matched files.
print logData.count()

10. Now execute it.

/home/USER/spark-0.9.1/bin/pyspark test.py

You should now see output along these lines in the console.

...
14/05/26 01:05:48 INFO BlockManagerMaster: Updated info of block rdd_1_359
14/05/26 01:05:48 INFO PythonRDD: Times: total = 276, boot = 2, init = 274, finish = 0
14/05/26 01:05:48 INFO Executor: Serialized size of result for 359 is 603
14/05/26 01:05:48 INFO Executor: Sending result for 359 directly to driver
14/05/26 01:05:48 INFO Executor: Finished task ID 359
14/05/26 01:05:48 INFO TaskSetManager: Finished TID 359 in 290 ms on localhost (progress: 360/360)
14/05/26 01:05:48 INFO DAGScheduler: Completed ResultTask(0, 359)
14/05/26 01:05:48 INFO DAGScheduler: Stage 0 (count at test.py:6) finished in 78.970 s
14/05/26 01:05:48 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
14/05/26 01:05:48 INFO SparkContext: Job finished: count at test.py:6, took 79.189878283 s
...
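
If the INFO logging is too noisy for your taste, Spark ships a log4j template in its conf directory that you can copy and tone down (assuming the template’s root category is the usual INFO, console):

cd /home/USER/spark-0.9.1
cp conf/log4j.properties.template conf/log4j.properties
# Only show warnings and errors on the console from now on.
sed -i 's/log4j.rootCategory=INFO/log4j.rootCategory=WARN/' conf/log4j.properties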

And that’s all there is to it.

Leave a comment or contact us if you’d like to know more about running more complex instances.