Install Apache Spark 3.5+ on Linux: Complete Setup Guide
By Prabeesh Keezhathra
Apache Spark has evolved dramatically since its early releases, becoming the de facto standard for large-scale data processing and analytics. This comprehensive guide covers installing the latest Apache Spark 3.5+ on modern Linux distributions with best practices for both development and production environments.
Update Notice: This guide covers modern Apache Spark 3.5+ installation. For historical reference, our previous guides covered Apache Spark 1.0 installation and Apache Spark 2.x setup.
Apache Spark is an open-source, distributed computing framework designed for fast processing of large datasets across clusters. Originally developed at UC Berkeley’s AMPLab, Spark provides unified analytics capabilities including batch processing, real-time streaming, machine learning, and graph processing with clean APIs in Scala, Java, Python, and R.
Install Java (OpenJDK 17 recommended):
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-17-jdk

# CentOS/RHEL
sudo yum install java-17-openjdk-devel

# Verify installation
java -version
javac -version
Set JAVA_HOME environment variable:
# Add to ~/.bashrc or ~/.profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# Verify
echo $JAVA_HOME
Install Python and essential packages (for PySpark):
# Ubuntu/Debian
sudo apt install python3 python3-pip python3-dev

# CentOS/RHEL
sudo yum install python3 python3-pip python3-devel

# Install essential Python packages
pip3 install py4j pandas numpy matplotlib
Download the latest Spark distribution:
# Navigate to your preferred installation directory
cd /opt

# Download Spark 3.5+ (check spark.apache.org for the latest version)
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Alternative: pre-built with Scala 2.13 instead of the default 2.12
# sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3-scala2.13.tgz
Extract and setup Spark:
# Extract the archive
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Create a symbolic link for easier upgrades
sudo ln -sf /opt/spark-3.5.0-bin-hadoop3 /opt/spark

# Set proper permissions
sudo chown -R $USER:$USER /opt/spark-3.5.0-bin-hadoop3
Configure environment variables:
# Add to ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON=python3' >> ~/.bashrc

# Apply changes
source ~/.bashrc
Optional: build Spark from source if you need specific configurations or the latest development features:
# Install build dependencies
sudo apt install git maven scala

# Clone the Spark repository
git clone https://github.com/apache/spark.git
cd spark

# Build with a specific Hadoop version (Spark 3.3+ uses the hadoop-3 profile)
./build/mvn -DskipTests clean package -Phadoop-3 -Dhadoop.version=3.3.4

# This process takes 30-60 minutes depending on your system
Create the Spark configuration files from their templates:
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
Essential spark-defaults.conf settings:
# Edit spark-defaults.conf
nano spark-defaults.conf

# The event log directory must exist before Spark starts
mkdir -p /tmp/spark-events

# Add these essential configurations:
spark.master                     spark://localhost:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /tmp/spark-events
spark.history.fs.logDirectory    /tmp/spark-events
spark.sql.warehouse.dir          /tmp/spark-warehouse

# Performance optimizations
spark.executor.memory            2g
spark.executor.cores             2
spark.executor.instances         2
spark.driver.memory              1g
spark.driver.maxResultSize       1g

# Enable dynamic allocation (shuffle tracking is required when
# no external shuffle service is running)
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             4

# Kryo serialization for better performance
spark.serializer                 org.apache.spark.serializer.KryoSerializer
Configure spark-env.sh:
# Edit spark-env.sh
nano spark-env.sh

# Add essential environment variables:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
Test 1: Spark Shell (Scala)
# Start the Spark shell
spark-shell

# Run in the Spark shell:
scala> val data = 1 to 10000
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()
scala> :quit
Test 2: PySpark Shell
# Start the PySpark shell
pyspark

# Run in PySpark:
>>> data = range(1, 10000)
>>> distData = sc.parallelize(data)
>>> distData.filter(lambda x: x < 10).collect()
>>> exit()
Test 3: Submit Application
# Run the classic Pi estimation example
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  10
Expected output: a line like Pi is roughly 3.14157 (the estimate is randomized, so the trailing digits vary from run to run).
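The digits vary because SparkPi is a Monte Carlo estimator: it samples random points in a square and counts how many fall inside the inscribed circle, a fraction that tends to Pi/4. The same idea in plain Python (SparkPi simply parallelizes the sampling across executors):

```python
# Monte Carlo Pi estimation, the technique behind the SparkPi example.
# The fraction of random points in the unit square that land inside the
# quarter circle (x^2 + y^2 <= 1) approaches Pi/4 as samples grow.
import random

def estimate_pi(samples, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

Increasing the argument to spark-submit (10 above) increases the number of sampled partitions and tightens the estimate.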
Test Spark with different configurations:
# Test with a local cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --driver-memory 2g \
  --executor-memory 1g \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
Start Spark Master:
# Start the master node
start-master.sh

# Verify the master is running:
# open a browser to http://localhost:8080
Start Spark Worker(s):
# Start a worker node
start-worker.sh spark://localhost:7077

# For multiple workers on the same machine, set SPARK_WORKER_INSTANCES
# in spark-env.sh (the start script refuses to launch a second daemon
# otherwise), then start them with per-worker resources:
# export SPARK_WORKER_INSTANCES=2
start-worker.sh -c 1 -m 1g spark://localhost:7077
Test cluster deployment:
# Submit a job to the cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --executor-memory 1g \
  --total-executor-cores 2 \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
For existing Hadoop clusters:
# Ensure Spark is built against the correct Hadoop version
# Check your Hadoop version
hadoop version

# Download Spark pre-built for your Hadoop version
# Example for Hadoop 3.3:
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Test HDFS integration:
# Start spark-shell with HDFS access
spark-shell

# Read from HDFS
scala> val textFile = sc.textFile("hdfs://namenode:9000/path/to/input.txt")
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.saveAsTextFile("hdfs://namenode:9000/path/to/output")
IntelliJ IDEA Setup (Spark dependencies for an sbt project):
// build.sbt
name := "SparkApplication"
version := "0.1"
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-mllib" % "3.5.0"
)
VS Code Setup for PySpark:
# Install the Python extension
# Install Pylint for Python linting
pip3 install pylint

# Create a virtual environment for Spark projects
python3 -m venv spark-env
source spark-env/bin/activate
pip install pyspark pandas numpy jupyter
Security hardening (add to spark-defaults.conf):
# Enable authentication
spark.authenticate true
spark.authenticate.secret yourSecretKey

# SSL configuration
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore.jks
spark.ssl.keyStorePassword yourKeystorePassword
Monitoring and logging:
# Enable the history server
start-history-server.sh

# Access the history UI at http://localhost:18080

# Configure log levels (in $SPARK_HOME/conf; Spark 3.3+ uses Log4j 2)
cp log4j2.properties.template log4j2.properties
# Edit log4j2.properties for appropriate log levels
Cluster manager integration (YARN and Kubernetes):
# For YARN integration
export HADOOP_CONF_DIR=/path/to/hadoop/conf
spark-submit --master yarn --deploy-mode cluster your-application.jar

# For Kubernetes deployment
spark-submit \
  --master k8s://https://kubernetes-master-url:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark-py:latest \
  your-application.py
Troubleshooting out-of-memory errors:
# Increase driver memory
spark-submit --driver-memory 4g your-app.jar

# Increase executor memory
spark-submit --executor-memory 2g your-app.jar
Troubleshooting Java version mismatches:
# Ensure a consistent Java version across the cluster
update-alternatives --config java
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Troubleshooting network binding issues:
# Bind to a specific network interface
export SPARK_MASTER_HOST=192.168.1.100
export SPARK_LOCAL_IP=192.168.1.100
Now that you have Spark installed and configured, you can explore more advanced topics such as Structured Streaming, machine learning with MLlib, and graph processing with GraphX.
Regular maintenance tasks:
# Update Spark (back up configurations first)
# Download the new version and update the SPARK_HOME symlink

# Clean up event logs periodically
find /tmp/spark-events -type f -mtime +30 -delete

# Monitor cluster health
# Check the master UI: http://localhost:8080
# Check the history server: http://localhost:18080
This comprehensive installation guide provides a solid foundation for Apache Spark development and deployment. Whether you’re building data analytics pipelines, machine learning models, or real-time streaming applications, this setup will serve as your reliable starting point.
For more advanced Spark tutorials and best practices, explore our complete Apache Spark tutorial series covering performance optimization, application development, and production deployment strategies.