Install Apache Spark 3.5 on Linux
By Prabeesh Keezhathra
A walkthrough of installing Apache Spark 3.5 on modern Linux, from prerequisites through a working standalone cluster. Earlier versions are covered in Install Apache Spark 1.0 on Ubuntu 14.04 and Install Apache Spark 2 on Ubuntu 16.04 and macOS.
Prerequisites:
| Requirement | Version / Recommendation |
|---|---|
| Java | OpenJDK 17 (Spark 3.5 supports 8, 11, 17) |
| Python | 3.8+ for PySpark |
| Memory | 4 GB minimum, 8 GB+ for comfortable work |
| Storage | 10 GB free for install + logs |
| OS | Ubuntu 20.04+, CentOS 7+, or equivalent |
Install Java (OpenJDK 17):
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-17-jdk

# CentOS/RHEL
sudo yum install java-17-openjdk-devel

# Verify installation
java -version
javac -version
Set JAVA_HOME environment variable:
# Add to ~/.bashrc or ~/.profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# Verify
echo $JAVA_HOME
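If your JDK landed somewhere other than the path above (for example on CentOS or an arm64 machine), you can derive the correct value instead of hard-coding it; the command below is a generic sketch:

# Resolve JAVA_HOME from the javac on your PATH
readlink -f "$(which javac)" | sed 's|/bin/javac||'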
Install Python and essential packages (for PySpark):
# Ubuntu/Debian
sudo apt install python3 python3-pip python3-dev

# CentOS/RHEL
sudo yum install python3 python3-pip python3-devel

# Install essential Python packages
pip3 install py4j pandas numpy matplotlib
Download the latest distribution:
# Navigate to your preferred installation directory
cd /opt

# Download Spark 3.5+ (check spark.apache.org for the latest version)
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Alternative: pre-built with Scala 2.13 instead of the default 2.12
# sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3-scala2.13.tgz
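It is worth verifying the download against the checksum Apache publishes alongside each release; the file names below assume the 3.5.0 artifact used above:

# Fetch the published SHA-512 checksum and compare it with the local file
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz.sha512
sha512sum spark-3.5.0-bin-hadoop3.tgz
cat spark-3.5.0-bin-hadoop3.tgz.sha512   # the two digests should match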
Extract and set up the files:
# Extract the archive
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Create a symbolic link for easier management
sudo ln -sf spark-3.5.0-bin-hadoop3 spark

# Set proper permissions
sudo chown -R $USER:$USER /opt/spark-3.5.0-bin-hadoop3
Configure environment variables:
# Add to ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON=python3' >> ~/.bashrc

# Apply changes
source ~/.bashrc
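A quick sanity check that the new environment variables took effect:

# Should resolve to /opt/spark/bin and report version 3.5.0
which spark-submit
spark-submit --version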
Alternatively, users who need specific configurations or the latest development features can build Spark from source:
# Install build dependencies
sudo apt install git maven scala

# Clone the Spark repository
git clone https://github.com/apache/spark.git
cd spark

# Build against a specific Hadoop version (add profiles such as -Pyarn or -Pkubernetes as needed)
./build/mvn -DskipTests clean package -Dhadoop.version=3.3.4

# This process takes 30-60 minutes depending on your system
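If you want a relocatable tarball rather than running out of the build tree, the repository also ships a packaging helper; the flags below are a sketch, so check ./dev/make-distribution.sh --help for the current options:

# Produce a binary distribution similar to the official downloads
./dev/make-distribution.sh --name custom --tgz -Dhadoop.version=3.3.4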
Create Spark configuration directory:
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
Essential spark-defaults.conf settings:
# Edit spark-defaults.conf
nano spark-defaults.conf

# Add these essential configurations:
spark.master spark://localhost:7077
spark.eventLog.enabled true
spark.eventLog.dir /tmp/spark-events
spark.history.fs.logDirectory /tmp/spark-events
spark.sql.warehouse.dir /tmp/spark-warehouse

# Performance optimizations
spark.executor.memory 2g
spark.executor.cores 2
spark.executor.instances 2
spark.driver.memory 1g
spark.driver.maxResultSize 1g

# Enable dynamic allocation
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 4
# Without an external shuffle service, standalone mode also needs shuffle tracking
spark.dynamicAllocation.shuffleTracking.enabled true

# Kryo serialization for better performance
spark.serializer org.apache.spark.serializer.KryoSerializer
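Two practical notes: the event log and warehouse directories must exist before jobs start, and any of these settings can be overridden per job on the spark-submit command line. Also, with spark.master pointing at the standalone master, the interactive shells in the test section below need that master running first (or pass --master "local[*]"). The application jar below is a placeholder:

# Create the directories referenced in spark-defaults.conf
mkdir -p /tmp/spark-events /tmp/spark-warehouse

# Example of a per-job override
spark-submit --master "local[*]" --conf spark.executor.memory=4g your-app.jar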
Configure spark-env.sh:
# Edit spark-env.sh
nano spark-env.sh

# Add essential environment variables:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
Test 1: Spark Shell (Scala)
# Start the Spark shell
spark-shell

# Run in the Spark shell:
scala> val data = 1 to 10000
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()
scala> :quit
Test 2: PySpark Shell
# Start the PySpark shell
pyspark

# Run in PySpark:
>>> data = range(1, 10000)
>>> distData = sc.parallelize(data)
>>> distData.filter(lambda x: x < 10).collect()
>>> exit()
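The same check can be run non-interactively by writing it to a small PySpark script and handing it to spark-submit (covered next); the file name and contents here are only illustrative:

# Create and run a minimal PySpark smoke test
cat > /tmp/smoke_test.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
small = spark.sparkContext.parallelize(range(1, 10000)).filter(lambda x: x < 10)
print(small.collect())
spark.stop()
EOF

spark-submit --master "local[2]" /tmp/smoke_test.py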
Test 3: Submit Application
# Run the classic Pi estimation example
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  10
Expected output includes a line like: Pi is roughly 3.14159... (the estimate varies slightly from run to run)
Test with different configurations:
# Test with a larger local configuration
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --driver-memory 2g \
  --executor-memory 1g \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
Start Spark Master:
# Start the master node
start-master.sh

# Verify the master is running
# Open a browser to http://localhost:8080
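If the UI does not come up, the master writes its logs under $SPARK_HOME/logs; the exact file name includes your user name and host, so a wildcard is used here:

# Inspect the most recent master log
ls -t $SPARK_HOME/logs/ | head
tail -n 50 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out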
Start Spark Worker(s):
# Start a worker node
start-worker.sh spark://localhost:7077

# For multiple workers on the same machine, set SPARK_WORKER_INSTANCES
# (in spark-env.sh or inline); each instance gets its own ports
SPARK_WORKER_INSTANCES=2 start-worker.sh -c 1 -m 1g spark://localhost:7077
Test cluster deployment:
# Submit a job to the cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --executor-memory 1g \
  --total-executor-cores 2 \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
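When you are done, the standalone daemons can be stopped with the matching scripts from $SPARK_HOME/sbin:

# Stop worker(s) and master individually, or everything at once
stop-worker.sh
stop-master.sh
# stop-all.sh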
For existing Hadoop clusters:
# Ensure Spark is built for the correct Hadoop version
# Check your Hadoop version
hadoop version

# Download Spark pre-built for your Hadoop version
# Example for Hadoop 3.3:
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Test HDFS integration:
# Start spark-shell with HDFS access
spark-shell

# Read from HDFS
scala> val textFile = sc.textFile("hdfs://namenode:9000/path/to/input.txt")
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.saveAsTextFile("hdfs://namenode:9000/path/to/output")
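Assuming the Hadoop client tools are on your PATH, the output can be confirmed directly from the shell; the paths below are the same placeholders used above:

# List and inspect the job output written to HDFS
hdfs dfs -ls hdfs://namenode:9000/path/to/output
hdfs dfs -cat hdfs://namenode:9000/path/to/output/part-00000 | head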
IntelliJ IDEA setup: create an sbt project with a build.sbt like the following:
name := "SparkApplication"
version := "0.1"
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-mllib" % "3.5.0"
)
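To run the project outside the IDE, package it and hand the jar to spark-submit. The class and jar names below are illustrative (the artifact name depends on your sbt settings), and for cluster deployments the Spark dependencies are usually marked "provided":

# Build the jar and submit it locally
sbt package
spark-submit \
  --class com.example.Main \
  --master "local[2]" \
  target/scala-2.12/sparkapplication_2.12-0.1.jar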
VS Code setup for PySpark:
# Install the Python extension
# Install Pylint for Python linting
pip3 install pylint

# Create a virtual environment for Spark projects
python3 -m venv spark-env
source spark-env/bin/activate
pip install pyspark pandas numpy jupyter
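A quick check that the virtual environment picks up PySpark:

# Should print the installed PySpark version
python -c "import pyspark; print(pyspark.__version__)"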
Optional security hardening, added to spark-defaults.conf:
# Enable authentication
spark.authenticate true
spark.authenticate.secret yourSecretKey

# SSL configuration
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore.jks
spark.ssl.keyStorePassword yourKeystorePassword
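If you do not already have a keystore, a self-signed one can be generated with the JDK's keytool; the alias, password, validity, and distinguished name here are placeholders:

# Create a self-signed keystore for the SSL configuration above
keytool -genkeypair -alias spark -keyalg RSA -keysize 2048 -validity 365 \
  -keystore /path/to/keystore.jks -storepass yourKeystorePassword \
  -dname "CN=spark.local, OU=dev, O=example, C=US"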
Monitoring and logging:
# Enable the history server
start-history-server.sh

# Access the history UI at http://localhost:18080

# Configure log levels (Spark 3.3+ ships a Log4j 2 template)
cp log4j2.properties.template log4j2.properties
# Edit log4j2.properties for appropriate log levels
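For example, console noise can be reduced by lowering the root logger level in the copied file; the property name below follows the Log4j 2 template shipped with recent Spark releases, so adjust if your template differs:

# Reduce console output from INFO to WARN (line assumed present in the template)
sed -i 's/^rootLogger.level = info/rootLogger.level = warn/' log4j2.properties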
Running on an existing cluster manager:
# For YARN integration
export HADOOP_CONF_DIR=/path/to/hadoop/conf
spark-submit --master yarn --deploy-mode cluster your-application.jar

# For Kubernetes deployment
spark-submit \
  --master k8s://https://kubernetes-master-url:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark-py:latest \
  your-application.py
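The Kubernetes master URL in the example above is cluster-specific; it can be read from your kubeconfig:

# Print the API server address to use after the k8s:// prefix
kubectl cluster-info
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'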
Troubleshooting out-of-memory errors:
# Increase driver memory
spark-submit --driver-memory 4g your-app.jar

# Configure executor memory
spark-submit --executor-memory 2g your-app.jar
Java version conflicts:
# Ensure a consistent Java version
update-alternatives --config java
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Network binding issues on multi-homed hosts:
# Configure a specific network interface
export SPARK_MASTER_HOST=192.168.1.100
export SPARK_LOCAL_IP=192.168.1.100
Housekeeping and web UIs:
# Clean up event logs periodically
find /tmp/spark-events -type f -mtime +30 -delete

# Master UI: http://localhost:8080
# History server UI: http://localhost:18080
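Instead of an external cleanup job, the history server can also prune old event logs itself; the settings below are a sketch of that built-in cleaner (see the Spark monitoring docs for the full list of options):

# Append the built-in cleaner settings to spark-defaults.conf
echo 'spark.history.fs.cleaner.enabled true' >> $SPARK_HOME/conf/spark-defaults.conf
echo 'spark.history.fs.cleaner.maxAge 30d' >> $SPARK_HOME/conf/spark-defaults.conf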