Install Apache Spark 3.5 on Linux
By Prabeesh Keezhathra
- 7 minutes read - 1280 words

To install Apache Spark 3.5 on Linux, install OpenJDK 17, download the Spark 3.5 binary from archive.apache.org/dist/spark, extract it to /opt, set SPARK_HOME and PATH, then verify with spark-shell. The full process takes about 15 minutes on a fresh Ubuntu or CentOS machine.
This is a walkthrough of installing Apache Spark 3.5 on modern Linux, from prerequisites through a working standalone cluster. Earlier versions are covered in Install Apache Spark 1.0 on Ubuntu 14.04 and Install Apache Spark 2 on Ubuntu 16.04 and macOS.
| Requirement | Version / Recommendation |
|---|---|
| Java | OpenJDK 17 (Spark 3.5 supports 8, 11, 17) |
| Python | 3.8+ for PySpark |
| Memory | 4 GB minimum, 8 GB+ for comfortable work |
| Storage | 10 GB free for install + logs |
| OS | Ubuntu 20.04+, CentOS 7+, or equivalent |
```bash
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-17-jdk

# CentOS/RHEL
sudo yum install java-17-openjdk-devel

# Verify installation
java -version
javac -version
```
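Spark refuses to start on an unsupported JVM, so it can be worth checking the Java major version programmatically (e.g. in a provisioning script). A small pure-Python sketch; the function names and sample banners here are illustrative, not part of any Spark tooling:

```python
import re
import subprocess

SUPPORTED = {8, 11, 17}  # Java versions Spark 3.5 officially supports

def parse_java_major(version_line: str) -> int:
    """Extract the major Java version from a `java -version` banner line."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if not m:
        raise ValueError(f"unrecognized version line: {version_line!r}")
    major = int(m.group(1))
    # Pre-9 JDKs report "1.8.0_392"-style strings, where the second
    # component is the real major version.
    if major == 1 and m.group(2):
        major = int(m.group(2))
    return major

def java_ok() -> bool:
    # `java -version` prints its banner to stderr, not stdout.
    banner = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    ).stderr.splitlines()[0]
    return parse_java_major(banner) in SUPPORTED
```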
Set JAVA_HOME environment variable:
```bash
# Add to ~/.bashrc or ~/.profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# Verify
echo $JAVA_HOME
```
Install Python and essential packages (for PySpark):
```bash
# Ubuntu/Debian
sudo apt install python3 python3-pip python3-dev

# CentOS/RHEL
sudo yum install python3 python3-pip python3-devel

# Install essential Python packages
pip3 install py4j pandas numpy matplotlib
```
Download the latest distribution:
```bash
# Navigate to your preferred installation directory
cd /opt

# Download Spark 3.5+ (check spark.apache.org for the latest version)
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Alternative: the build pre-built with Scala 2.13
# sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3-scala2.13.tgz
```
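Apache publishes a .sha512 checksum file alongside each archive; verifying it catches corrupted or truncated downloads. A Python sketch (the function names are my own, and the parser is a best effort at tolerating both the `sha512sum`-style layout and the older wrapped Apache checksum layout):

```python
import hashlib
import re

def sha512_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a multi-hundred-MB archive
    never has to fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def expected_digest(sha512_file_text: str, filename: str) -> str:
    """Extract the hex digest from a .sha512 file: drop the filename,
    then keep only hex characters (handles wrapped/grouped digests)."""
    cleaned = sha512_file_text.replace(filename, "")
    return "".join(re.findall(r"[0-9a-fA-F]", cleaned)).lower()

def verify_archive(path: str, sha512_file_text: str, filename: str) -> bool:
    return sha512_of(path) == expected_digest(sha512_file_text, filename)
```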
Extract and set up the files:
```bash
# Extract the archive
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Create symbolic link for easier management
sudo ln -sf spark-3.5.0-bin-hadoop3 spark

# Set proper permissions
sudo chown -R $USER:$USER /opt/spark-3.5.0-bin-hadoop3
```
Configure environment variables:
```bash
# Add to ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON=python3' >> ~/.bashrc

# Apply changes
source ~/.bashrc
```
For users who need specific configurations or the latest development features, Spark can also be built from source:

```bash
# Install build dependencies (Spark's build script bootstraps its own Maven)
sudo apt install git maven scala

# Clone the Spark repository
git clone https://github.com/apache/spark.git
cd spark

# Build against a specific Hadoop version
./build/mvn -DskipTests clean package -Phadoop-3 -Dhadoop.version=3.3.4

# This process takes 30-60 minutes depending on your system
```
Create Spark configuration directory:
```bash
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
```
Essential spark-defaults.conf settings:
```bash
# The event log directory must exist before Spark writes to it
mkdir -p /tmp/spark-events

# Edit spark-defaults.conf
nano spark-defaults.conf
```

Add these settings:

```
spark.master                     spark://localhost:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /tmp/spark-events
spark.history.fs.logDirectory    /tmp/spark-events
spark.sql.warehouse.dir          /tmp/spark-warehouse

# Performance settings
spark.executor.memory            2g
spark.executor.cores             2
spark.executor.instances         2
spark.driver.memory              1g
spark.driver.maxResultSize       1g

# Enable dynamic allocation (shuffle tracking is required in standalone
# mode unless an external shuffle service is running)
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             4

# Kryo serialization for better performance
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```

Note: with spark.master set here, spark-shell and spark-submit will try to connect to the standalone master by default; start it first (see the cluster section below) or override with --master local[*].
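For scripting or sanity checks it can help to read this file programmatically. The format is simple: one key/value pair per line separated by whitespace, with # starting a comment. A minimal Python sketch (parse_spark_defaults is an illustrative name, not a Spark API):

```python
def parse_spark_defaults(text: str) -> dict:
    """Parse spark-defaults.conf text: one `key value` pair per line,
    separated by whitespace; blank lines and # comments are ignored."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        # A key with no value is legal in the file; store it as empty.
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf
```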
Configure spark-env.sh:
```bash
# Edit spark-env.sh
nano spark-env.sh

# Add essential environment variables:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
```
Test 1: Spark Shell (Scala)
```bash
# Start Spark shell
spark-shell

# Run in Spark shell:
scala> val data = 1 to 10000
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()
scala> :quit
```
Test 2: PySpark Shell
```bash
# Start PySpark shell
pyspark

# Run in PySpark:
>>> data = range(1, 10000)
>>> distData = sc.parallelize(data)
>>> distData.filter(lambda x: x < 10).collect()
>>> exit()
```
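For reference, the nine elements that collect() should return can be computed locally in plain Python, no Spark needed; if the distributed result differs, something is wrong with the installation:

```python
# Local, driver-side equivalent of the distributed filter above
data = range(1, 10000)
small_values = [x for x in data if x < 10]
print(small_values)  # the same nine elements collect() returns
```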
Test 3: Submit Application
```bash
# Run the classic Pi estimation example
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  10
```
Expected output: a line like Pi is roughly 3.14159..., though the exact digits vary from run to run, since SparkPi estimates Pi by random sampling.
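SparkPi estimates Pi by Monte Carlo sampling: each task throws random points at the unit square and counts how many land inside the circle. A single-process sketch of the same computation in plain Python (estimate_pi is a name chosen here, not part of Spark):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi, the method SparkPi distributes:
    the fraction of random points in the unit square that fall inside
    the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 < 1.0
    )
    return 4.0 * inside / num_samples
```

More samples give a tighter estimate, which is why the spark-submit examples pass 10 or 100 as the number of sampling partitions.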
Test with different configurations:
```bash
# Test with local cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --driver-memory 2g \
  --executor-memory 1g \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
```
Start Spark Master:
```bash
# Start master node
start-master.sh

# Verify master is running
# Open browser to http://localhost:8080
```
Start Spark Worker(s):
```bash
# Start a worker and attach it to the master
start-worker.sh spark://localhost:7077

# For multiple workers on the same machine, set SPARK_WORKER_INSTANCES=2
# in spark-env.sh, or give the second worker its own web UI port:
start-worker.sh -c 1 -m 1g --webui-port 8082 spark://localhost:7077
```
Test cluster deployment:
```bash
# Submit job to cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --executor-memory 1g \
  --total-executor-cores 2 \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
```
For existing Hadoop clusters:
```bash
# Ensure Spark is built with the correct Hadoop version
# Check your Hadoop version
hadoop version

# Download Spark pre-built for your Hadoop version
# Example for Hadoop 3.3:
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
```
Test HDFS integration:
```bash
# Start spark-shell with HDFS access
spark-shell

# Read from HDFS
scala> val textFile = sc.textFile("hdfs://namenode:9000/path/to/input.txt")
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.saveAsTextFile("hdfs://namenode:9000/path/to/output")
```
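To see what the three-step pipeline computes, here is a local single-process equivalent in plain Python; Counter stands in for reduceByKey, and this is a sketch of the logic, not a Spark API:

```python
from collections import Counter

def word_count(lines):
    """Local equivalent of flatMap(split) -> map((word, 1)) -> reduceByKey(_ + _):
    split each line on spaces and tally every word."""
    counts = Counter()
    for line in lines:
        counts.update(word for word in line.split(" ") if word)
    return dict(counts)
```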
IntelliJ IDEA setup (build.sbt for an sbt-based Spark project):

```scala
name := "SparkApplication"
version := "0.1"
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-mllib" % "3.5.0"
)
```
VS Code setup for PySpark:
```bash
# Install the Python extension, then Pylint for linting
pip3 install pylint

# Create virtual environment for Spark projects
python3 -m venv spark-env
source spark-env/bin/activate
pip install pyspark pandas numpy jupyter
```
Security configuration (add to spark-defaults.conf):

```
# Enable authentication
spark.authenticate true
spark.authenticate.secret yourSecretKey

# SSL configuration
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore.jks
spark.ssl.keyStorePassword yourKeystorePassword
```
Monitoring and logging:

```bash
# Enable history server
start-history-server.sh

# Access history UI at http://localhost:18080

# Configure log levels (Spark 3.3+ uses Log4j 2)
cd $SPARK_HOME/conf
cp log4j2.properties.template log4j2.properties
# Edit log4j2.properties for appropriate log levels
```
Running on YARN or Kubernetes:

```bash
# For YARN integration
export HADOOP_CONF_DIR=/path/to/hadoop/conf
spark-submit --master yarn --deploy-mode cluster your-application.jar

# For Kubernetes deployment
spark-submit \
  --master k8s://https://kubernetes-master-url:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark-py:latest \
  your-application.py
```
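Flag lists like these are easier to manage when assembled programmatically, e.g. from a deployment script. A hypothetical helper (build_spark_submit is not a Spark API, just a sketch of how --master, --deploy-mode, and --conf flags compose):

```python
def build_spark_submit(master, app, deploy_mode=None, conf=None, app_args=()):
    """Assemble a spark-submit argv list. `conf` maps Spark config keys
    to values, each emitted as a separate --conf key=value flag."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes last, followed by its own arguments.
    return cmd + [app] + list(app_args)
```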
If jobs fail with OutOfMemoryError:

```bash
# Increase driver memory
spark-submit --driver-memory 4g your-app.jar

# Increase executor memory
spark-submit --executor-memory 2g your-app.jar
```
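When sizing memory, note that each executor consumes more than its heap: Spark adds an off-heap overhead (spark.executor.memoryOverhead) that by default is max(384 MiB, 10% of the heap). A small sketch of that arithmetic (the function name is illustrative):

```python
def executor_container_memory_gb(heap_gb: float,
                                 overhead_fraction: float = 0.10,
                                 min_overhead_gb: float = 0.384) -> float:
    """Memory an executor actually consumes: the JVM heap requested with
    --executor-memory plus Spark's default off-heap overhead,
    max(384 MiB, 10% of the heap)."""
    return heap_gb + max(min_overhead_gb, overhead_fraction * heap_gb)
```

So four 2 GB executors need roughly 9.5 GB, not 8 GB, which matters when a host or YARN container limit is tight.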
If you hit Java version mismatch errors:

```bash
# Ensure a consistent Java version
sudo update-alternatives --config java
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
```
If the master or workers bind to the wrong interface:

```bash
# Configure a specific network interface
export SPARK_MASTER_HOST=192.168.1.100
export SPARK_LOCAL_IP=192.168.1.100
```
Ongoing maintenance and web UIs:

```bash
# Clean up event logs periodically
find /tmp/spark-events -type f -mtime +30 -delete

# Master UI: http://localhost:8080
# History server UI: http://localhost:18080
```
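The find command above can also be expressed in Python, which is easier to extend (e.g. to log what was removed). A sketch mirroring -type f -mtime +30 -delete; clean_event_logs is an illustrative name:

```python
import time
from pathlib import Path

def clean_event_logs(log_dir: str, max_age_days: int = 30) -> int:
    """Delete files under log_dir not modified within max_age_days,
    the equivalent of `find <dir> -type f -mtime +30 -delete`.
    Returns the number of files removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for path in Path(log_dir).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```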
Which Java version should I use?
Apache Spark 3.5 officially supports Java 8, 11, and 17. OpenJDK 17 is the recommended choice for new installs because it is actively maintained and has the best performance characteristics for Spark workloads.

How do I install PySpark?
Install Python 3.8+ with sudo apt install python3 python3-pip, then install Spark as described above and set PYSPARK_PYTHON=python3. You can also install the pyspark pip package (pip install pyspark) for local development without a full Spark installation.

Can I run Spark without a cluster?
Yes. Use --master local[*] when submitting jobs, which runs Spark in local mode using all available CPU cores. This is how most developers test and prototype before deploying to a cluster.

How do I check which Spark version is installed?
Run spark-shell --version or pyspark --version from the terminal. Both print the Spark version, Scala version, and Java version being used.

Should I use standalone mode, YARN, or Kubernetes?
Standalone mode is Spark's built-in cluster manager and the simplest to set up. YARN integrates with an existing Hadoop cluster and shares resources with other Hadoop workloads. Kubernetes runs Spark executors as pods, which is the standard for cloud-native deployments. Choose standalone for learning and small teams, YARN for Hadoop shops, and Kubernetes for containerized environments.