In my previous post, I wrote about installing Spark and using the Scala interactive shell. In this post, we’ll see how to do the same in Python.
Similar to the Scala interactive shell, there is an interactive shell available for Python. You can run it with the command below from the Spark root folder:
./bin/pyspark
Now you can use Spark from the Python interactive shell.
This shell might be sufficient for experimentation and development. However, for production use, we should write a standalone application.
I talked about a standalone Spark application in Scala in one of my previous posts. Here is the same thing written in Python, known as a self-contained PySpark application; you can find more about it on the official Spark site.
First, refer to this post to build Spark using sbt assembly. Then add the PySpark library to the system Python path by adding the following to the end of your .bashrc file:
export SPARK_HOME=<path to Spark home>
PySpark depends on the py4j Python package, which lets the Python interpreter dynamically access Spark objects in the JVM.
Don’t forget to export SPARK_HOME. Restart Bash once this is done.
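If you would rather not touch .bashrc, the same path setup can be sketched from inside Python. This is only a rough sketch: the helper name add_pyspark_to_path is made up for illustration, and it assumes the usual Spark layout where the PySpark sources live in $SPARK_HOME/python and the bundled py4j archive sits in $SPARK_HOME/python/lib as py4j-<version>-src.zip. Check your own Spark folder before relying on it.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home=None):
    """Prepend PySpark and its bundled py4j archive to sys.path.

    Hypothetical helper. Assumes the layout <spark_home>/python and
    <spark_home>/python/lib/py4j-<version>-src.zip.
    Returns the list of entries that were newly added.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")

    python_dir = os.path.join(spark_home, "python")
    # The py4j version in the file name differs between Spark releases,
    # so match it with a wildcard instead of hard-coding it.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))

    # Take at most one py4j archive (the last match after sorting).
    entries = [python_dir] + py4j_zips[-1:]
    added = []
    for entry in entries:
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added
```

Calling add_pyspark_to_path() before the first import of pyspark in a script should have the same effect as the exports, but only for that one Python process.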
PySpark should now be available on the system path. After writing the Python code, you can run it with the plain python command; it then runs in a local Spark instance with the default configuration.
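To give an idea of what such a self-contained application can look like, here is a minimal sketch loosely modelled on the simple app from the Spark quick start. The file name simple_app.py, the app name and the helper contains_letter are placeholders I picked for illustration. It counts the lines of a text file that contain the letters a and b.

```python
import sys


def contains_letter(line, letter):
    """Return True when `letter` occurs somewhere in `line`."""
    return letter in line


def main(path):
    # Imported inside main so the helper above stays usable
    # even on a machine where PySpark is not on the Python path.
    from pyspark import SparkConf, SparkContext

    # No master is set here on purpose, so it can be chosen at launch time.
    conf = SparkConf().setAppName("SimpleApp")  # placeholder app name
    sc = SparkContext(conf=conf)
    try:
        lines = sc.textFile(path).cache()
        num_a = lines.filter(lambda l: contains_letter(l, "a")).count()
        num_b = lines.filter(lambda l: contains_letter(l, "b")).count()
        print("Lines with a: %d, lines with b: %d" % (num_a, num_b))
    finally:
        # Always release the Spark context, even if a job fails.
        sc.stop()


# Only start Spark when an input file is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved as simple_app.py, it could be run with something like python simple_app.py README.md from a folder that contains that file.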
It is better to use the spark-submit script if you want to pass configuration values at runtime:
./bin/spark-submit --master local[*] <python_file.py>
For more details about spark-submit, refer here. As that page shows, configuration values can be passed at runtime. They can also be set in the conf/spark-defaults.conf file; once set there, the changes also take effect when running PySpark applications with the plain python command.
The reason why there is no pip install for pyspark can be found in this JIRA ticket.
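As a small illustration of passing configuration values at runtime, the sketch below assembles a spark-submit command line from a Python dictionary. The helper build_submit_command is hypothetical, and spark.executor.memory is used purely as an example property; the --master and --conf key=value options are the ones spark-submit itself accepts.

```python
def build_submit_command(script, master="local[*]", conf=None,
                         spark_submit="./bin/spark-submit"):
    """Build the argument list for a spark-submit call.

    Hypothetical helper: `conf` is a dict of Spark properties, each
    of which is passed as a separate `--conf key=value` option.
    """
    cmd = [spark_submit, "--master", master]
    # Sort the keys so the generated command is reproducible.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "%s=%s" % (key, value)]
    cmd.append(script)
    return cmd


# Example: one property override for a script named app.py (placeholder).
command = build_submit_command("app.py", conf={"spark.executor.memory": "2g"})
print(" ".join(command))
```

The resulting list can be handed to subprocess.call from the Spark root folder. Properties passed this way apply to that run only, while values in conf/spark-defaults.conf act as defaults for every run.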
If you are a fan of IPython, you also have the option of running PySpark in an IPython notebook. Refer to this blog post for more detail.