Below you will find pages that utilize the taxonomy term “PySpark”
Blogs
Performance Tuning on Apache Spark
Performance tuning is an important aspect of working with Apache Spark, as it helps ensure that your data processing jobs run efficiently. In this blog post, we will examine the common issues to consider when tuning Apache Spark performance: spill, skew, shuffle, storage, and serialization.
Spill
One problem that can occur is spill: the writing of temporary files to disk because a task runs out of memory.
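As a quick illustration, here is a hedged sketch of one common mitigation: raising the shuffle partition count so each task works on a smaller slice of data. The specific values are illustrative, not recommendations.
from pyspark.sql import SparkSession

# Illustrative settings only: more shuffle partitions mean smaller
# per-task working sets, which reduces the chance of spilling to disk.
spark = (SparkSession.builder
         .appName("spill-tuning-sketch")
         .config("spark.sql.shuffle.partitions", "400")  # default is 200
         .config("spark.memory.fraction", "0.8")         # default is 0.6
         .getOrCreate())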
read more
Blogs
How to Run a PySpark Notebook with Docker
Apache Spark is a powerful big data processing engine that is well-suited for use in a distributed environment. One way to interact with Spark is through an IPython Notebook, which lets you run and debug your Spark code interactively. This tutorial will guide you through the process of setting up and running a PySpark Notebook using Docker.
Installing Docker
Docker is a containerization platform that allows you to package and deploy your applications in a predictable and isolated environment.
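As a rough sketch, assuming the community jupyter/pyspark-notebook image from Docker Hub, a single command is enough to start a notebook server with PySpark preinstalled:
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
The server prints a URL containing an access token; open it in a browser and create a new Python notebook to start writing PySpark code.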
read more
Blogs
Self Contained PySpark Application
In my previous post, I wrote about installing Spark and using the Scala interactive shell. In this post, we'll see how to do the same in Python.
Similar to the Scala interactive shell, an interactive shell is available for Python. You can launch it with the following command from the Spark root folder:
./bin/pyspark
Now you can explore Spark using the Python interactive shell.
This shell is sufficient for experimentation and development. For production, however, you should build a self-contained application.
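For example, a minimal self-contained application could look like the sketch below. It uses the modern SparkSession API, and the script name and input path are placeholders for illustration.
# line_count.py -- a minimal self-contained PySpark application (illustrative)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LineCount").getOrCreate()

# Count the lines of a text file; the path is a placeholder.
count = spark.read.text("README.md").count()
print("Lines: %d" % count)

spark.stop()
You would then submit it with spark-submit from the Spark root folder:
./bin/spark-submit line_count.py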