Install Apache Spark 3.5 on Linux
By Prabeesh Keezhathra
- 7 minutes read - 1280 words

To install Apache Spark 3.5 on Linux, install OpenJDK 17, download the Spark 3.5 binary from archive.apache.org/dist/spark, extract it to /opt, set SPARK_HOME and PATH, then verify with spark-shell. The full process takes about 15 minutes on a fresh Ubuntu or CentOS machine.
This is a walkthrough of installing Apache Spark 3.5 on modern Linux, from prerequisites through a working standalone cluster. Earlier versions are covered in Install Apache Spark 1.0 on Ubuntu 14.04 and Install Apache Spark 2 on Ubuntu 16.04 and macOS.
| Requirement | Version / Recommendation |
|---|---|
| Java | OpenJDK 17 (Spark 3.5 supports 8, 11, 17) |
| Python | 3.8+ for PySpark |
| Memory | 4 GB minimum, 8 GB+ for comfortable work |
| Storage | 10 GB free for install + logs |
| OS | Ubuntu 20.04+, CentOS 7+, or equivalent |
```bash
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-17-jdk

# CentOS/RHEL
sudo yum install java-17-openjdk-devel

# Verify installation
java -version
javac -version
```
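Spark refuses to start on an unsupported JVM, so it can be worth checking the Java major version programmatically (e.g. in a provisioning script). A small pure-Python sketch; the function names and sample banners here are illustrative, not part of any Spark tooling:

```python
import re
import subprocess

SUPPORTED = {8, 11, 17}  # Java versions Spark 3.5 officially supports

def parse_java_major(version_line: str) -> int:
    """Extract the major Java version from a `java -version` banner line."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if not m:
        raise ValueError(f"unrecognized version line: {version_line!r}")
    major = int(m.group(1))
    # Pre-9 JDKs report "1.8.0_392"-style strings, where the second
    # component is the real major version.
    if major == 1 and m.group(2):
        major = int(m.group(2))
    return major

def java_ok() -> bool:
    # `java -version` prints its banner to stderr, not stdout.
    banner = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    ).stderr.splitlines()[0]
    return parse_java_major(banner) in SUPPORTED
```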
Set JAVA_HOME environment variable:
```bash
# Add to ~/.bashrc or ~/.profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# Verify
echo $JAVA_HOME
```
Install Python and essential packages (for PySpark):
```bash
# Ubuntu/Debian
sudo apt install python3 python3-pip python3-dev

# CentOS/RHEL
sudo yum install python3 python3-pip python3-devel

# Install essential Python packages
pip3 install py4j pandas numpy matplotlib
```
Download the latest distribution:
```bash
# Navigate to your preferred installation directory
cd /opt

# Download Spark 3.5+ (check spark.apache.org for the latest version)
sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Alternative: the build pre-built with Scala 2.13
# sudo wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3-scala2.13.tgz
```
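Apache publishes a .sha512 checksum file alongside each archive; verifying it catches corrupted or truncated downloads. A Python sketch (the function names are my own, and the parser is a best effort at tolerating both the `sha512sum`-style layout and the older wrapped Apache checksum layout):

```python
import hashlib
import re

def sha512_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a multi-hundred-MB archive
    never has to fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def expected_digest(sha512_file_text: str, filename: str) -> str:
    """Extract the hex digest from a .sha512 file: drop the filename,
    then keep only hex characters (handles wrapped/grouped digests)."""
    cleaned = sha512_file_text.replace(filename, "")
    return "".join(re.findall(r"[0-9a-fA-F]", cleaned)).lower()

def verify_archive(path: str, sha512_file_text: str, filename: str) -> bool:
    return sha512_of(path) == expected_digest(sha512_file_text, filename)
```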
Extract and set up the files:
```bash
# Extract the archive
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Create symbolic link for easier management
sudo ln -sf spark-3.5.0-bin-hadoop3 spark

# Set proper permissions
sudo chown -R $USER:$USER /opt/spark-3.5.0-bin-hadoop3
```
Configure environment variables:
```bash
# Add to ~/.bashrc
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
echo 'export PYSPARK_DRIVER_PYTHON=python3' >> ~/.bashrc

# Apply changes
source ~/.bashrc
```
For users who need specific configurations or the latest development features, Spark can also be built from source:

```bash
# Install build dependencies (Spark's build script bootstraps its own Maven)
sudo apt install git maven scala

# Clone the Spark repository
git clone https://github.com/apache/spark.git
cd spark

# Build against a specific Hadoop version
./build/mvn -DskipTests clean package -Phadoop-3 -Dhadoop.version=3.3.4

# This process takes 30-60 minutes depending on your system
```
Create Spark configuration directory:
```bash
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
```
Essential spark-defaults.conf settings:
```bash
# The event log directory must exist before Spark writes to it
mkdir -p /tmp/spark-events

# Edit spark-defaults.conf
nano spark-defaults.conf
```

Add these settings:

```
spark.master                     spark://localhost:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /tmp/spark-events
spark.history.fs.logDirectory    /tmp/spark-events
spark.sql.warehouse.dir          /tmp/spark-warehouse

# Performance settings
spark.executor.memory            2g
spark.executor.cores             2
spark.executor.instances         2
spark.driver.memory              1g
spark.driver.maxResultSize       1g

# Enable dynamic allocation (shuffle tracking is required in standalone
# mode unless an external shuffle service is running)
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             4

# Kryo serialization for better performance
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```

Note: with spark.master set here, spark-shell and spark-submit will try to connect to the standalone master by default; start it first (see the cluster section below) or override with --master local[*].
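For scripting or sanity checks it can help to read this file programmatically. The format is simple: one key/value pair per line separated by whitespace, with # starting a comment. A minimal Python sketch (parse_spark_defaults is an illustrative name, not a Spark API):

```python
def parse_spark_defaults(text: str) -> dict:
    """Parse spark-defaults.conf text: one `key value` pair per line,
    separated by whitespace; blank lines and # comments are ignored."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        # A key with no value is legal in the file; store it as empty.
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf
```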
Configure spark-env.sh:
```bash
# Edit spark-env.sh
nano spark-env.sh

# Add essential environment variables:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_MASTER_HOST=localhost
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
```
Test 1: Spark Shell (Scala)
```bash
# Start Spark shell
spark-shell

# Run in Spark shell:
scala> val data = 1 to 10000
scala> val distData = sc.parallelize(data)
scala> distData.filter(_ < 10).collect()
scala> :quit
```
Test 2: PySpark Shell
```bash
# Start PySpark shell
pyspark

# Run in PySpark:
>>> data = range(1, 10000)
>>> distData = sc.parallelize(data)
>>> distData.filter(lambda x: x < 10).collect()
>>> exit()
```
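For reference, the nine elements that collect() should return can be computed locally in plain Python, no Spark needed; if the distributed result differs, something is wrong with the installation:

```python
# Local, driver-side equivalent of the distributed filter above
data = range(1, 10000)
small_values = [x for x in data if x < 10]
print(small_values)  # the same nine elements collect() returns
```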
Test 3: Submit Application
```bash
# Run the classic Pi estimation example
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  10
```
Expected output: a line like Pi is roughly 3.14159..., though the exact digits vary from run to run, since SparkPi estimates Pi by random sampling.
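SparkPi estimates Pi by Monte Carlo sampling: each task throws random points at the unit square and counts how many land inside the circle. A single-process sketch of the same computation in plain Python (estimate_pi is a name chosen here, not part of Spark):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi, the method SparkPi distributes:
    the fraction of random points in the unit square that fall inside
    the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 < 1.0
    )
    return 4.0 * inside / num_samples
```

More samples give a tighter estimate, which is why the spark-submit examples pass 10 or 100 as the number of sampling partitions.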
Test with different configurations:
```bash
# Test with local cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --driver-memory 2g \
  --executor-memory 1g \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
```
Start Spark Master:
```bash
# Start master node
start-master.sh

# Verify master is running
# Open browser to http://localhost:8080
```
Start Spark Worker(s):
```bash
# Start a worker and attach it to the master
start-worker.sh spark://localhost:7077

# For multiple workers on the same machine, set SPARK_WORKER_INSTANCES=2
# in spark-env.sh, or give the second worker its own web UI port:
start-worker.sh -c 1 -m 1g --webui-port 8082 spark://localhost:7077
```
Test cluster deployment:
```bash
# Submit job to cluster
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --executor-memory 1g \
  --total-executor-cores 2 \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar \
  100
```
For existing Hadoop clusters:
```bash
# Ensure Spark is built with the correct Hadoop version
# Check your Hadoop version
hadoop version

# Download Spark pre-built for your Hadoop version
# Example for Hadoop 3.3:
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
```
Test HDFS integration:
```bash
# Start spark-shell with HDFS access
spark-shell

# Read from HDFS
scala> val textFile = sc.textFile("hdfs://namenode:9000/path/to/input.txt")
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.saveAsTextFile("hdfs://namenode:9000/path/to/output")
```
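To see what the three-step pipeline computes, here is a local single-process equivalent in plain Python; Counter stands in for reduceByKey, and this is a sketch of the logic, not a Spark API:

```python
from collections import Counter

def word_count(lines):
    """Local equivalent of flatMap(split) -> map((word, 1)) -> reduceByKey(_ + _):
    split each line on spaces and tally every word."""
    counts = Counter()
    for line in lines:
        counts.update(word for word in line.split(" ") if word)
    return dict(counts)
```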
IntelliJ IDEA setup (build.sbt for an sbt-based Spark project):

```scala
name := "SparkApplication"
version := "0.1"
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0",
  "org.apache.spark" %% "spark-mllib" % "3.5.0"
)
```
VS Code setup for PySpark:
```bash
# Install the Python extension, then Pylint for linting
pip3 install pylint

# Create virtual environment for Spark projects
python3 -m venv spark-env
source spark-env/bin/activate
pip install pyspark pandas numpy jupyter
```
Security configuration (add to spark-defaults.conf):

```
# Enable authentication
spark.authenticate true
spark.authenticate.secret yourSecretKey

# SSL configuration
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore.jks
spark.ssl.keyStorePassword yourKeystorePassword
```
Monitoring and logging:

```bash
# Enable history server
start-history-server.sh

# Access history UI at http://localhost:18080

# Configure log levels (Spark 3.3+ uses Log4j 2)
cd $SPARK_HOME/conf
cp log4j2.properties.template log4j2.properties
# Edit log4j2.properties for appropriate log levels
```
Running on YARN or Kubernetes:

```bash
# For YARN integration
export HADOOP_CONF_DIR=/path/to/hadoop/conf
spark-submit --master yarn --deploy-mode cluster your-application.jar

# For Kubernetes deployment
spark-submit \
  --master k8s://https://kubernetes-master-url:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark-py:latest \
  your-application.py
```
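Flag lists like these are easier to manage when assembled programmatically, e.g. from a deployment script. A hypothetical helper (build_spark_submit is not a Spark API, just a sketch of how --master, --deploy-mode, and --conf flags compose):

```python
def build_spark_submit(master, app, deploy_mode=None, conf=None, app_args=()):
    """Assemble a spark-submit argv list. `conf` maps Spark config keys
    to values, each emitted as a separate --conf key=value flag."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes last, followed by its own arguments.
    return cmd + [app] + list(app_args)
```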
If jobs fail with OutOfMemoryError:

```bash
# Increase driver memory
spark-submit --driver-memory 4g your-app.jar

# Increase executor memory
spark-submit --executor-memory 2g your-app.jar
```
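When sizing memory, note that each executor consumes more than its heap: Spark adds an off-heap overhead (spark.executor.memoryOverhead) that by default is max(384 MiB, 10% of the heap). A small sketch of that arithmetic (the function name is illustrative):

```python
def executor_container_memory_gb(heap_gb: float,
                                 overhead_fraction: float = 0.10,
                                 min_overhead_gb: float = 0.384) -> float:
    """Memory an executor actually consumes: the JVM heap requested with
    --executor-memory plus Spark's default off-heap overhead,
    max(384 MiB, 10% of the heap)."""
    return heap_gb + max(min_overhead_gb, overhead_fraction * heap_gb)
```

So four 2 GB executors need roughly 9.5 GB, not 8 GB, which matters when a host or YARN container limit is tight.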
If you hit Java version mismatch errors:

```bash
# Ensure a consistent Java version
sudo update-alternatives --config java
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
```
If the master or workers bind to the wrong interface:

```bash
# Configure a specific network interface
export SPARK_MASTER_HOST=192.168.1.100
export SPARK_LOCAL_IP=192.168.1.100
```
Ongoing maintenance and web UIs:

```bash
# Clean up event logs periodically
find /tmp/spark-events -type f -mtime +30 -delete

# Master UI: http://localhost:8080
# History server UI: http://localhost:18080
```
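The find command above can also be expressed in Python, which is easier to extend (e.g. to log what was removed). A sketch mirroring -type f -mtime +30 -delete; clean_event_logs is an illustrative name:

```python
import time
from pathlib import Path

def clean_event_logs(log_dir: str, max_age_days: int = 30) -> int:
    """Delete files under log_dir not modified within max_age_days,
    the equivalent of `find <dir> -type f -mtime +30 -delete`.
    Returns the number of files removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for path in Path(log_dir).rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed
```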
Which Java version should I use?
Apache Spark 3.5 officially supports Java 8, 11, and 17. OpenJDK 17 is the recommended choice for new installs because it is actively maintained and has the best performance characteristics for Spark workloads.

How do I install PySpark?
Install Python 3.8+ with sudo apt install python3 python3-pip, then install Spark as described above and set PYSPARK_PYTHON=python3. You can also install the pyspark pip package (pip install pyspark) for local development without a full Spark installation.

Can I run Spark without a cluster?
Yes. Use --master local[*] when submitting jobs, which runs Spark in local mode using all available CPU cores. This is how most developers test and prototype before deploying to a cluster.

How do I check which Spark version is installed?
Run spark-shell --version or pyspark --version from the terminal. Both print the Spark version, Scala version, and Java version being used.

Should I use standalone mode, YARN, or Kubernetes?
Standalone mode is Spark's built-in cluster manager and the simplest to set up. YARN integrates with an existing Hadoop cluster and shares resources with other Hadoop workloads. Kubernetes runs Spark executors as pods, which is the standard for cloud-native deployments. Choose standalone for learning and small teams, YARN for Hadoop shops, and Kubernetes for containerized environments.