Building Self-Contained PySpark Applications
By Prabeesh Keezhathra
Apache Spark works well in a notebook: you get iterative development, inline plots, and the ability to poke at intermediate DataFrames. Docker makes the setup reproducible and removes the “works on my machine” problem. This post walks through running PySpark in an IPython Notebook via the prabeeshk/pyspark-notebook Docker image.
Docker is a containerization platform that allows you to package and deploy your applications in a predictable and isolated environment.
To install Docker, use the following command. This command was run on an Ubuntu 14.04 instance; you can find more installation options on the official Docker site.
# This command installs Docker on your machine
wget -qO- https://get.docker.com/ | sh
To run the PySpark Notebook, use the following command on any machine with Docker installed.
# This command runs the pyspark-notebook Docker container and exposes port 8888 for access to the notebook
docker run -d -t -p 8888:8888 prabeeshk/pyspark-notebook
After the pyspark-notebook Docker container is up and running, you can access the PySpark Notebook by directing your web browser to http://127.0.0.1:8888 or http://localhost:8888.
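If you prefer to drive the container from Python rather than typing the command by hand, the same `docker run` invocation can be assembled with the standard library. This is a minimal sketch (the helper name `docker_run_command` is just illustrative, not part of the image):

```python
import subprocess

def docker_run_command(image, host_port=8888, container_port=8888):
    """Build the argument list for `docker run` with the notebook port published."""
    return [
        "docker", "run",
        "-d",  # detach: run the container in the background
        "-t",  # allocate a pseudo-TTY
        "-p", "%d:%d" % (host_port, container_port),  # publish the notebook port
        image,
    ]

cmd = docker_run_command("prabeeshk/pyspark-notebook")
print(" ".join(cmd))
# To actually start the container (requires Docker to be installed):
# subprocess.check_call(cmd)
```

Building the command as a list and handing it to `subprocess` avoids shell quoting problems if you later parameterise the image name or ports.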
For more information on the Docker image, check out the Docker Hub repository.
The source code can be found in the GitHub repository. Below you will find the custom PySpark startup script and the Dockerfile.
## PySpark Startup Script

# Import required modules
import os
import sys

# Get the value of the SPARK_HOME environment variable
spark_home = os.environ.get('SPARK_HOME', None)

# If SPARK_HOME is not set, raise an error
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Add the paths to the Python libraries for Spark and py4j to the system path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Execute the pyspark shell script to launch PySpark
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
This script reads SPARK_HOME, adds the Spark Python libraries and py4j to sys.path, and then runs the PySpark shell initialiser so that sc (a SparkContext) is available when the notebook starts.
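Note that execfile is Python 2 only, which matches the Python that ships with this Trusty-based image. If you adapt the startup script for Python 3, you can replace it by reading, compiling, and exec-ing the file yourself. A sketch (the helper name `exec_file` is my own, not from the original script):

```python
def exec_file(path, namespace=None):
    """Python 3 replacement for execfile(): compile and execute a script file."""
    if namespace is None:
        namespace = {}
    with open(path) as f:
        source = f.read()
    # compile() keeps the file name in tracebacks, just as execfile() did
    code = compile(source, path, "exec")
    exec(code, namespace)
    return namespace

# Example: execute a tiny script and inspect the names it defines
import os
import tempfile
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("answer = 6 * 7\n")
ns = exec_file(tmp.name)
print(ns["answer"])  # → 42
os.unlink(tmp.name)
```

Passing an explicit namespace dict mirrors how execfile injected names (like sc) into the interactive session.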
## Dockerfile
FROM ubuntu:trusty

MAINTAINER Prabeesh Keezhathra

# Update the package list and install Java
RUN \
  apt-get -y update &&\
  echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" > /etc/apt/sources.list.d/webupd8team-java.list &&\
  echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" >> /etc/apt/sources.list.d/webupd8team-java.list &&\
  apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 &&\
  apt-get -y update &&\
  echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections &&\
  apt-get install -y oracle-java7-installer &&\
  apt-get install -y curl

# Set the version of Spark to install and the installation directory
ENV SPARK_VERSION 1.4.0
ENV SPARK_HOME /usr/local/src/spark-$SPARK_VERSION

# Download and extract Spark to the installation directory and build Spark
RUN \
  mkdir -p $SPARK_HOME &&\
  curl -s http://d3kbcqa49mib13.cloudfront.net/spark-$SPARK_VERSION.tgz | tar -xz -C $SPARK_HOME --strip-components=1 &&\
  cd $SPARK_HOME &&\
  build/mvn -DskipTests clean package

# Set the Python path to include the Spark installation
ENV PYTHONPATH $SPARK_HOME/python/:$PYTHONPATH

# Install build essentials, Python, and the Python package manager pip
RUN apt-get install -y build-essential \
    python \
    python-dev \
    python-pip \
    python-zmq

# Install Python libraries for interacting with Spark
RUN pip install py4j \
    ipython[notebook]==3.2 \
    jsonschema \
    jinja2 \
    terminado \
    tornado

# Create an IPython profile for PySpark
RUN ipython profile create pyspark

# Copy the custom PySpark startup script to the IPython profile directory
COPY pyspark-notebook.py /root/.ipython/profile_pyspark/startup/pyspark-notebook.py

# Create a volume for the notebook directory
VOLUME /notebook

# Set the working directory to the notebook directory
WORKDIR /notebook

# Expose port 8888 for the IPython Notebook server
EXPOSE 8888

# Run IPython with the PySpark profile and bind to all interfaces
CMD ipython notebook --no-browser --profile=pyspark --ip='*'
The image is based on Ubuntu 14.04 (Trusty), installs Java 7, downloads and builds Spark 1.4.0, then layers on IPython Notebook 3.2 with a custom PySpark startup profile. Port 8888 is exposed for the notebook server.
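One fragile detail above is the hardcoded py4j-0.8.2.1-src.zip path in the startup script, which breaks whenever a Spark upgrade changes the bundled py4j version. A more robust variant globs for the archive under $SPARK_HOME. This is a sketch of my own, not part of the original script:

```python
import glob
import os

def py4j_zip(spark_home):
    """Locate the bundled py4j source zip regardless of its version number."""
    matches = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if not matches:
        raise ValueError("no py4j archive found under %s" % spark_home)
    return matches[0]

# In the startup script, this would replace the hardcoded insert:
# sys.path.insert(0, py4j_zip(spark_home))
```

With this change the startup script survives bumping SPARK_VERSION in the Dockerfile without a matching edit.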
Note: this Dockerfile targets Spark 1.4 on Ubuntu 14.04. For a current setup, see the Spark 3 install post or use the maintained jupyter/pyspark-notebook Docker image.