Install Apache Spark on Ubuntu-14.04

Update: For Apache Spark 2, refer to the latest post.

One of the previous posts described installing Apache Spark-0.8.0 on Ubuntu-12.04. This post explains the detailed steps to set up Apache Spark-1.1.0 on Ubuntu. To run Spark, the Ubuntu machine must have Java installed. Java can be installed easily with the following commands.

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

To check that the Java installation was successful:

$ java -version

It shows the installed Java version:

java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

The next step is to install Scala. Follow the instructions below to set it up. First, download Scala from here.

Copy the downloaded file to some location, for example /usr/local/src, untar the file, and set the path variable:

$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/
$ vi .bashrc

And add the following at the end of the file:

export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

Reload .bashrc:

$ . .bashrc

To check that Scala is installed successfully:

$ scala -version

It shows the installed Scala version: Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL

Or just type scala to go to the Scala interactive shell:

$ scala
scala>
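
As a quick sanity check, you can evaluate an expression in the REPL and then exit:

scala> 1 + 1
res0: Int = 2
scala> :quit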

The next step is to install Git; the Spark build depends on it.

$ sudo apt-get install git

Finally, download the Spark distribution from here:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar xvf spark-1.1.0.tgz 

Building

SBT (Simple Build Tool), which is bundled with Spark, is used to build it. To compile the code:

$ cd spark-1.1.0

$ sbt/sbt assembly

The build takes some time. After it completes successfully, you can test a sample program:

$ ./bin/run-example SparkPi 10

Then you get output like Pi is roughly 3.14634. Spark is ready to fire.

For more details, visit

Spark Interactive Shell

You can run Spark interactively through the Scala shell

$ ./bin/spark-shell
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
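
A few more transformations can be chained in the same way; for example, counting only the lines that mention Spark:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> linesWithSpark.count()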

You can also check particular sections of Spark using the shell. For example, to run MQTT interactively: MQTT support is defined under external/, so to import it into spark-shell, follow these instructions.

$ sbt/sbt "streaming-mqtt/package"

Then add this package to the classpath:

$ bin/spark-shell --driver-class-path external/mqtt/target/scala-2.10/spark-streaming-mqtt_2.10-1.1.0.jar
scala> import org.apache.spark.streaming.mqtt._

Using this you can check your code line by line.
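
For instance, with the MQTT classes imported you could sketch a small streaming word count directly in the shell. This is only an illustration: the broker URL tcp://localhost:1883 and the topic name test are placeholder values, so replace them with your own broker and topic.

scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
scala> val ssc = new StreamingContext(sc, Seconds(5))   // 5-second batches
scala> // placeholder broker URL and topic, adjust to your setup
scala> val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "test")
scala> lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
scala> ssc.start()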

Accessing Hadoop Filesystems

If you have already built the source package, rebuild it against your Hadoop version as follows:

$ sbt/sbt clean

You can set the Hadoop version with the SPARK_HADOOP_VERSION variable. Here Hadoop 2.0.0-cdh4.3.0 is used:

$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.3.0 sbt/sbt assembly

After a successful build, you can read and write data to cdh4.3.0 clusters.

$ ./bin/spark-shell
scala> val file = sc.textFile("hdfs://IP:8020/path/to/textfile.txt")
scala> val count = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_+_)
scala> count.saveAsTextFile("hdfs://IP:8020/path/to/output")
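
To verify the result, you can read the output directory back and print a few records; this uses the same placeholder HDFS path as above:

scala> sc.textFile("hdfs://IP:8020/path/to/output").take(10).foreach(println)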

You may find more examples here.
