Install Apache Spark 3.5+ on Linux: Complete Setup Guide
By Prabeesh Keezhathra
Apache Spark has evolved dramatically since its early releases, becoming the de facto standard for large-scale data processing and analytics. This comprehensive guide covers installing the latest Apache Spark 3.5+ on modern Linux distributions with best practices for both development and production environments.
Update Notice: This guide covers modern Apache Spark 3.5+ installation. For historical reference, our previous guides covered Apache Spark 1.0 installation and Apache Spark 2.x setup.
Apache Spark is an open-source, distributed computing framework designed for fast processing of large datasets across clusters. Originally developed at UC Berkeley’s AMPLab, Spark provides unified analytics capabilities including batch processing, real-time streaming, machine learning, and graph processing with clean APIs in Scala, Java, Python, and R.
Install Java (OpenJDK 17 recommended):
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-17-jdk

# CentOS/RHEL
sudo yum install java-17-openjdk-devel

# Verify installation
java -version
javac -version
Set JAVA_HOME environment variable:
# Add to ~/.bashrc or ~/.profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# Verify
echo $JAVA_HOME
Install Python and essential packages (for PySpark):
# Ubuntu/Debian
sudo apt install python3 python3-pip python3-dev

# CentOS/RHEL
sudo yum install python3 python3-pip python3-devel

# Install essential Python packages
pip3 install py4j pandas numpy matplotlib
Download the latest Spark distribution:
# Navigate to your preferred installation directory
cd /opt

# Download Spark 3.5+ (check spark.apache.org for the latest version)
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Alternative: pre-built with Scala 2.13 instead of the default 2.12
# sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3-scala2.13.tgz
Extract and setup Spark:
# Extract the archive
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Create a symbolic link for easier upgrades
sudo ln -sf /opt/spark-3.5.0-bin-hadoop3 /opt/spark

# Set proper permissions
sudo chown -R $USER:$USER /opt/spark-3.5.0-bin-hadoop3
Configure environment variables:
# Add to ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON=python3' >> ~/.bashrc

# Apply changes
source ~/.bashrc
Optional: build Spark from source if you need specific configurations or the latest development features:
# Install build dependencies
sudo apt install git maven scala

# Clone the Spark repository
git clone https://github.com/apache/spark.git
cd spark

# Build with a specific Hadoop version (Spark 3.3+ uses the hadoop-3 profile)
./build/mvn -DskipTests clean package -Phadoop-3 -Dhadoop.version=3.3.4

# This process takes 30-60 minutes depending on your system
Create the Spark configuration files from their templates:
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
Essential spark-defaults.conf settings:
# Edit spark-defaults.conf
nano spark-defaults.conf

# The event log directory must exist before Spark starts
mkdir -p /tmp/spark-events

# Add these essential configurations:
spark.master                     spark://localhost:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /tmp/spark-events
spark.history.fs.logDirectory    /tmp/spark-events
spark.sql.warehouse.dir          /tmp/spark-warehouse

# Performance optimizations
spark.executor.memory            2g
spark.executor.cores             2
spark.executor.instances         2
spark.driver.memory              1g
spark.driver.maxResultSize       1g

# Enable dynamic allocation (shuffle tracking is required when
# no external shuffle service is running)
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             4

# Kryo serialization for better performance
spark.serializer                 org.apache.spark.serializer.KryoSerializer
Configure spark-env.sh:
# Edit spark-env.sh
nano spark-env.sh

# Add essential environment variables:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
Test 1: Spark Shell (Scala)
# Start the Spark shell
spark-shell

# Run in the Spark shell:
scala> val data = 1 to 10000
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()
scala> :quit
Test 2: PySpark Shell
# Start the PySpark shell
pyspark

# Run in PySpark:
>>> data = range(1, 10000)
>>> distData = sc.parallelize(data)
>>> distData.filter(lambda x: x < 10).collect()
>>> exit()
Test 3: Submit Application
# Run the classic Pi estimation example
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  10
Expected output: a line like Pi is roughly 3.14157 (the estimate is randomized, so the trailing digits vary from run to run).
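The digits vary because SparkPi is a Monte Carlo estimator: it samples random points in a square and counts how many fall inside the inscribed circle, a fraction that tends to Pi/4. The same idea in plain Python (SparkPi simply parallelizes the sampling across executors):

```python
# Monte Carlo Pi estimation, the technique behind the SparkPi example.
# The fraction of random points in the unit square that land inside the
# quarter circle (x^2 + y^2 <= 1) approaches Pi/4 as samples grow.
import random

def estimate_pi(samples, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

Increasing the argument to spark-submit (10 above) increases the number of sampled partitions and tightens the estimate.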
Test Spark with different configurations:
# Test with a local cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --driver-memory 2g \
  --executor-memory 1g \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
Start Spark Master:
# Start the master node
start-master.sh

# Verify the master is running:
# open a browser to http://localhost:8080
Start Spark Worker(s):
# Start a worker node
start-worker.sh spark://localhost:7077

# For multiple workers on the same machine, set SPARK_WORKER_INSTANCES
# in spark-env.sh (the start script refuses to launch a second daemon
# otherwise), then start them with per-worker resources:
# export SPARK_WORKER_INSTANCES=2
start-worker.sh -c 1 -m 1g spark://localhost:7077
Test cluster deployment:
# Submit a job to the cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --executor-memory 1g \
  --total-executor-cores 2 \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
For existing Hadoop clusters:
# Ensure Spark is built against the correct Hadoop version
# Check your Hadoop version
hadoop version

# Download Spark pre-built for your Hadoop version
# Example for Hadoop 3.3:
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Test HDFS integration:
# Start spark-shell with HDFS access
spark-shell

# Read from HDFS
scala> val textFile = sc.textFile("hdfs://namenode:9000/path/to/input.txt")
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.saveAsTextFile("hdfs://namenode:9000/path/to/output")
IntelliJ IDEA Setup (Spark dependencies for an sbt project):
// build.sbt
name := "SparkApplication"
version := "0.1"
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-mllib" % "3.5.0"
)
VS Code Setup for PySpark:
# Install the Python extension
# Install Pylint for Python linting
pip3 install pylint

# Create a virtual environment for Spark projects
python3 -m venv spark-env
source spark-env/bin/activate
pip install pyspark pandas numpy jupyter
Security hardening (add to spark-defaults.conf):
# Enable authentication
spark.authenticate true
spark.authenticate.secret yourSecretKey

# SSL configuration
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore.jks
spark.ssl.keyStorePassword yourKeystorePassword
Monitoring and logging:
# Enable the history server
start-history-server.sh

# Access the history UI at http://localhost:18080

# Configure log levels (in $SPARK_HOME/conf; Spark 3.3+ uses Log4j 2)
cp log4j2.properties.template log4j2.properties
# Edit log4j2.properties for appropriate log levels
Cluster manager integration (YARN and Kubernetes):
# For YARN integration
export HADOOP_CONF_DIR=/path/to/hadoop/conf
spark-submit --master yarn --deploy-mode cluster your-application.jar

# For Kubernetes deployment
spark-submit \
  --master k8s://https://kubernetes-master-url:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark-py:latest \
  your-application.py
Troubleshooting out-of-memory errors:
# Increase driver memory
spark-submit --driver-memory 4g your-app.jar

# Increase executor memory
spark-submit --executor-memory 2g your-app.jar
Troubleshooting Java version mismatches:
# Ensure a consistent Java version across the cluster
update-alternatives --config java
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Troubleshooting network binding issues:
# Bind to a specific network interface
export SPARK_MASTER_HOST=192.168.1.100
export SPARK_LOCAL_IP=192.168.1.100
Now that you have Spark installed and configured, you can explore more advanced topics such as Structured Streaming, machine learning with MLlib, and graph processing with GraphX.
Regular maintenance tasks:
# Update Spark (back up configurations first)
# Download the new version and update the SPARK_HOME symlink

# Clean up event logs periodically
find /tmp/spark-events -type f -mtime +30 -delete

# Monitor cluster health
# Check the master UI: http://localhost:8080
# Check the history server: http://localhost:18080
This comprehensive installation guide provides a solid foundation for Apache Spark development and deployment. Whether you’re building data analytics pipelines, machine learning models, or real-time streaming applications, this setup will serve as your reliable starting point.
For more advanced Spark tutorials and best practices, explore our complete Apache Spark tutorial series covering performance optimization, application development, and production deployment strategies.