Install Apache Spark on Ubuntu 14.04

By Prabeesh Keezhathra

October 31, 2014 - 3 minutes read - 441 words

Update: For Spark 2 see the Ubuntu 16.04 and macOS post.

This post walks through Spark 1.1.0 on Ubuntu 14.04. For the latest version, see Install Apache Spark 3.5 on Linux.

Prerequisites

Start with Java:

	$ sudo apt-add-repository ppa:webupd8team/java
	$ sudo apt-get update
	$ sudo apt-get install oracle-java7-installer

To check that the Java installation is successful:

	$ java -version

This prints the installed Java version:

	java version "1.7.0_72"
	Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
	Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
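If a setup script needs to confirm the Java version rather than reading it by eye, the banner can be parsed. The following is a minimal sketch and not part of the original setup: the function name `java_major_minor` is hypothetical, and it assumes the first banner line has the `java version "1.7.0_72"` shape shown above.

```shell
# Hypothetical helper: extract "major.minor" from a `java -version` banner line.
java_major_minor() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([0-9]*\.[0-9]*\).*/\1/p'
}

# `java -version` writes to stderr, so redirect it before reading the first line.
if command -v java >/dev/null 2>&1; then
  banner=$(java -version 2>&1 | head -n 1)
  echo "Detected Java $(java_major_minor "$banner")"
fi
```

For the banner shown above, the helper yields 1.7.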

Next, install Scala.

Download Scala from the official Scala website, extract it to a location such as /usr/local/src, and set the path variables:

	$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
	$ sudo mkdir /usr/local/src/scala
	$ sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/

	$ vi ~/.bashrc

Add the following at the end of the file:

	export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
	export PATH=$SCALA_HOME/bin:$PATH

Reload .bashrc so the changes take effect in the current shell:

	$ . ~/.bashrc
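Editing .bashrc by hand works, but repeating the setup can leave duplicate export lines behind. As a hedged alternative, the sketch below appends a line only when it is not already present. The helper name `append_once` is hypothetical, and the calls against the real file are commented out so that nothing is modified by accident.

```shell
# Hypothetical helper: append a line to a file only if that exact line is absent.
append_once() {
  file=$1
  line=$2
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string (so $ is literal).
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage against the real file (uncomment to apply):
# append_once "$HOME/.bashrc" 'export SCALA_HOME=/usr/local/src/scala/scala-2.10.4'
# append_once "$HOME/.bashrc" 'export PATH=$SCALA_HOME/bin:$PATH'
```

Running the same call twice leaves a single copy of the line in the file.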

To check that Scala is installed:

	$ scala -version

Or just type scala to enter the Scala interactive shell:

	$ scala
	scala>

Next, install git, which the Spark build depends on:

	$ sudo apt-get install git

Finally, download the Spark 1.1.0 distribution:

	$ wget https://archive.apache.org/dist/spark/spark-1.1.0/spark-1.1.0.tgz
	$ tar xvf spark-1.1.0.tgz

Building Spark

Spark is built with SBT (Simple Build Tool), which is bundled with the Spark source. To compile the code:

	$ cd spark-1.1.0

	$ sbt/sbt assembly

The build takes some time. Once it finishes, test it with a sample program:

	$ ./bin/run-example SparkPi 10

The output should include a line like Pi is roughly 3.14634 (the exact digits vary from run to run). Spark is ready to fire.
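The result line is easy to lose among Spark's log messages. The sketch below filters the example's output down to the estimate itself. The helper name `extract_pi` is hypothetical, and the redirect assumes the default logging configuration, which sends log messages to stderr while the result goes to stdout.

```shell
# Hypothetical helper: pull the numeric estimate out of SparkPi's result line.
extract_pi() {
  sed -n 's/^Pi is roughly \([0-9.]*\).*/\1/p'
}

# Run the example only when invoked from the Spark source directory.
if [ -x ./bin/run-example ]; then
  # Assumption: log output is on stderr, so discarding it leaves the result line.
  ./bin/run-example SparkPi 10 2>/dev/null | extract_pi
fi
```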

For more detail visit

Spark interactive shell

You can run Spark interactively through the Scala shell:

	$ ./bin/spark-shell

	scala> val textFile = sc.textFile("README.md")
	scala> textFile.count()

To try out a particular Spark module in the shell, for example to run MQTT interactively, build that module first. The MQTT code is defined under external, so package it before importing it into spark-shell:

	$ sbt/sbt "streaming-mqtt/package"

Then add the packaged jar to the classpath:

	$ bin/spark-shell --driver-class-path external/mqtt/target/scala-2.10/spark-streaming-mqtt_2.10-1.1.0.jar
	scala> import org.apache.spark.streaming.mqtt._

Using this you can check your code line by line.
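If the import fails, the usual cause is that the jar was never built or sits at a different path. The sketch below builds the expected path from the Scala and Spark versions used in this post and checks that the file exists before the shell is launched. The helper name `mqtt_jar` is hypothetical, and the path layout is taken from the command above.

```shell
# Hypothetical helper: build the MQTT jar path for given Scala and Spark versions.
mqtt_jar() {
  scala_ver=$1
  spark_ver=$2
  printf 'external/mqtt/target/scala-%s/spark-streaming-mqtt_%s-%s.jar\n' \
    "$scala_ver" "$scala_ver" "$spark_ver"
}

# Check for the jar relative to the Spark source directory.
jar=$(mqtt_jar 2.10 1.1.0)
if [ -f "$jar" ]; then
  echo "Found $jar"
else
  echo "Missing $jar; run sbt/sbt \"streaming-mqtt/package\" first" >&2
fi
```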

Accessing Hadoop filesystems

If you have already built the source package, clean it and rebuild it against your Hadoop version as follows:

	$ sbt/sbt clean

The Hadoop version is selected by setting the SPARK_HADOOP_VERSION variable. This example uses CDH 4.3.0 (version string 2.0.0-mr1-cdh4.3.0):

	$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.3.0 sbt/sbt assembly

After a successful build, you can read and write data on CDH 4.3.0 clusters:

	$ ./bin/spark-shell

	scala> val file = sc.textFile("hdfs://IP:8020/path/to/textfile.txt")
	scala> val counts = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
	scala> counts.saveAsTextFile("hdfs://IP:8020/path/to/output")
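The Scala pipeline above splits each line on commas and sums a count per token. To sanity-check its output on a small sample, roughly the same computation can be run locally with standard tools. This is only a sketch: the helper name `count_tokens` is hypothetical, and edge cases such as empty fields may be handled differently than by Scala's `split`.

```shell
# Hypothetical helper: count comma-separated tokens read from stdin.
count_tokens() {
  awk -F, '{ for (i = 1; i <= NF; i++) c[$i]++ }
           END { for (w in c) print w, c[w] }' | LC_ALL=C sort
}

# Example on inline data; use `count_tokens < sample.txt` for a local file.
printf 'a,b\na,c\n' | count_tokens
```

For the inline data this prints one line per token: a 2, b 1 and c 1.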

You may find more quick start

Related Posts

Install Apache Spark 3.5 on Linux (Ubuntu, CentOS)

Install Apache Spark 2 on Ubuntu 16.04 and macOS