Building Self-Contained PySpark Applications
By Prabeesh Keezhathra
Apache Spark works well in a notebook: you get iterative development, inline plots, and the ability to poke at intermediate DataFrames. Docker makes the setup reproducible and removes the “works on my machine” problem. This post walks through running PySpark in an IPython Notebook via the prabeeshk/pyspark-notebook Docker image.
Docker is a containerization platform that allows you to package and deploy your applications in a predictable and isolated environment.
To install Docker, use the following command. This command was run on an Ubuntu 14.04 instance; you can find more installation options on the official Docker site.
# This command installs Docker on your machine
wget -qO- https://get.docker.com/ | sh
To run the PySpark Notebook, use the following command on any machine with Docker installed.
# This command runs the pyspark-notebook Docker container and exposes port 8888 for access to the notebook
docker run -d -t -p 8888:8888 prabeeshk/pyspark-notebook
After the pyspark-notebook Docker container is up and running, you can access the PySpark Notebook by directing your web browser to http://127.0.0.1:8888 or http://localhost:8888.
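If you prefer to drive the container from Python rather than typing the command by hand, the same `docker run` invocation can be assembled with the standard library. This is a minimal sketch (the helper name `docker_run_command` is just illustrative, not part of the image):

```python
import subprocess

def docker_run_command(image, host_port=8888, container_port=8888):
    """Build the argument list for `docker run` with the notebook port published."""
    return [
        "docker", "run",
        "-d",  # detach: run the container in the background
        "-t",  # allocate a pseudo-TTY
        "-p", "%d:%d" % (host_port, container_port),  # publish the notebook port
        image,
    ]

cmd = docker_run_command("prabeeshk/pyspark-notebook")
print(" ".join(cmd))
# To actually start the container (requires Docker to be installed):
# subprocess.check_call(cmd)
```

Building the command as a list and handing it to `subprocess` avoids shell quoting problems if you later parameterise the image name or ports.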
For more information on the Docker image, check out the Docker Hub repository.
The source code can be found in the GitHub repository. Below you will find the custom PySpark startup script and the Dockerfile.
## PySpark Startup Script

# Import required modules
import os
import sys

# Get the value of the SPARK_HOME environment variable
spark_home = os.environ.get('SPARK_HOME', None)

# If SPARK_HOME is not set, raise an error
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Add the paths to the Python libraries for Spark and py4j to the system path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Execute the pyspark shell script to launch PySpark
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
This script reads SPARK_HOME, adds the Spark Python libraries and py4j to sys.path, and then runs the PySpark shell initialiser so that sc (a SparkContext) is available when the notebook starts.
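Note that execfile is Python 2 only, which matches the Python that ships with this Trusty-based image. If you adapt the startup script for Python 3, you can replace it by reading, compiling, and exec-ing the file yourself. A sketch (the helper name `exec_file` is my own, not from the original script):

```python
def exec_file(path, namespace=None):
    """Python 3 replacement for execfile(): compile and execute a script file."""
    if namespace is None:
        namespace = {}
    with open(path) as f:
        source = f.read()
    # compile() keeps the file name in tracebacks, just as execfile() did
    code = compile(source, path, "exec")
    exec(code, namespace)
    return namespace

# Example: execute a tiny script and inspect the names it defines
import os
import tempfile
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("answer = 6 * 7\n")
ns = exec_file(tmp.name)
print(ns["answer"])  # → 42
os.unlink(tmp.name)
```

Passing an explicit namespace dict mirrors how execfile injected names (like sc) into the interactive session.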
## Dockerfile
FROM ubuntu:trusty

MAINTAINER Prabeesh Keezhathra

# Update the package list and install Java
RUN \
  apt-get -y update &&\
  echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" > /etc/apt/sources.list.d/webupd8team-java.list &&\
  echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" >> /etc/apt/sources.list.d/webupd8team-java.list &&\
  apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 &&\
  apt-get -y update &&\
  echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections &&\
  apt-get install -y oracle-java7-installer &&\
  apt-get install -y curl

# Set the version of Spark to install and the installation directory
ENV SPARK_VERSION 1.4.0
ENV SPARK_HOME /usr/local/src/spark-$SPARK_VERSION

# Download and extract Spark to the installation directory and build Spark
RUN \
  mkdir -p $SPARK_HOME &&\
  curl -s http://d3kbcqa49mib13.cloudfront.net/spark-$SPARK_VERSION.tgz | tar -xz -C $SPARK_HOME --strip-components=1 &&\
  cd $SPARK_HOME &&\
  build/mvn -DskipTests clean package

# Set the Python path to include the Spark installation
ENV PYTHONPATH $SPARK_HOME/python/:$PYTHONPATH

# Install build essentials, Python, and the Python package manager pip
RUN apt-get install -y build-essential \
    python \
    python-dev \
    python-pip \
    python-zmq

# Install Python libraries for interacting with Spark
RUN pip install py4j \
    ipython[notebook]==3.2 \
    jsonschema \
    jinja2 \
    terminado \
    tornado

# Create an IPython profile for PySpark
RUN ipython profile create pyspark

# Copy the custom PySpark startup script to the IPython profile directory
COPY pyspark-notebook.py /root/.ipython/profile_pyspark/startup/pyspark-notebook.py

# Create a volume for the notebook directory
VOLUME /notebook

# Set the working directory to the notebook directory
WORKDIR /notebook

# Expose port 8888 for the IPython Notebook server
EXPOSE 8888

# Run IPython with the PySpark profile and bind to all interfaces
CMD ipython notebook --no-browser --profile=pyspark --ip='*'
The image is based on Ubuntu 14.04 (Trusty), installs Java 7, downloads and builds Spark 1.4.0, then layers on IPython Notebook 3.2 with a custom PySpark startup profile. Port 8888 is exposed for the notebook server.
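One fragile detail above is the hardcoded py4j-0.8.2.1-src.zip path in the startup script, which breaks whenever a Spark upgrade changes the bundled py4j version. A more robust variant globs for the archive under $SPARK_HOME. This is a sketch of my own, not part of the original script:

```python
import glob
import os

def py4j_zip(spark_home):
    """Locate the bundled py4j source zip regardless of its version number."""
    matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if not matches:
        raise ValueError("no py4j archive found under %s" % spark_home)
    return matches[0]

# In the startup script, this would replace the hardcoded insert:
# sys.path.insert(0, py4j_zip(spark_home))
```

With this change the startup script survives bumping SPARK_VERSION in the Dockerfile without a matching edit.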
Note: this Dockerfile targets Spark 1.4 on Ubuntu 14.04. For a current setup, see the Spark 3 install post or use the maintained jupyter/pyspark-notebook Docker image.