Blogs
Install Apache Spark 3.5 on Linux (Ubuntu, CentOS)
A walkthrough of installing Apache Spark 3.5 on modern Linux, from prerequisites through a working standalone cluster. Earlier versions are covered in Install Apache Spark 1.0 on Ubuntu 14.04 and Install Apache Spark 2 on Ubuntu 16.04 and macOS.
Prerequisites

Requirement   Version / Recommendation
Java          OpenJDK 17 (Spark 3.5 supports 8, 11, 17)
Python        3.8+ for PySpark
Memory        4 GB minimum, 8 GB+ for comfortable work
Storage       10 GB free for install + logs
OS            Ubuntu 20.04+, CentOS 7+, or equivalent

…
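The install itself follows the usual download-and-unpack pattern; a minimal sketch, assuming a 3.5.x tarball name and the main Apache mirror (check the Spark downloads page for the current release and URL):

```shell
# Download and unpack Spark 3.5 (version and mirror are assumptions; verify on the downloads page)
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

# Make spark-shell and spark-submit available in the current shell
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
```

Add the two `export` lines to your shell profile to make them permanent.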
Continue Reading
Apache Spark Performance Tuning: Spill, Skew, Shuffle, Storage
Performance tuning decides whether a Spark job runs in 10 minutes or 10 hours. Most slowdowns you’ll hit in production come from the same five areas: spill, skew, shuffle, storage, and serialization. This guide walks through each one with the cause, how to spot it in the Spark UI, and the PySpark code to fix it.
The examples use PySpark, but the concepts apply to Scala and Java Spark equally well.
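As a taste of the skew fix, the key-salting idea can be sketched without Spark at all: append a random suffix to each key so one hot key spreads across several partitions, aggregate on the salted keys, then fold the partial results back together. A plain-Python sketch of the two-pass aggregation (function names like `salt_key` are illustrative, not Spark API):

```python
import random
from collections import Counter

def salt_key(key: str, n_salts: int) -> str:
    # Spread a hot key across n_salts buckets by appending a random suffix
    return f"{key}#{random.randrange(n_salts)}"

def unsalt_key(salted: str) -> str:
    # Recover the original key after the first aggregation pass
    return salted.rsplit("#", 1)[0]

# Heavily skewed data: one key dominates
events = ["user_a"] * 1000 + ["user_b"] * 3

# Pass 1: aggregate on salted keys (in Spark, the wide, parallel stage)
partial = Counter(salt_key(k, n_salts=8) for k in events)

# Pass 2: fold salted partials back to the true keys (a much smaller shuffle)
totals = Counter()
for salted, count in partial.items():
    totals[unsalt_key(salted)] += count

print(totals["user_a"], totals["user_b"])  # 1000 3
```

In PySpark the same idea is expressed by concatenating a random column onto the join or group key before the wide transformation.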
Continue Reading
Install Apache Spark 2 on Ubuntu 16.04 and macOS
Earlier posts covered Spark 1.1.0 on Ubuntu 14.04. This one walks through Spark 2.0.2 on Ubuntu 16.04 and macOS Sierra. For the latest version, see Install Apache Spark 3.5 on Linux.
Continue Reading
How to Run a PySpark Notebook with Docker
Apache Spark works well in a Jupyter notebook: you get iterative development, inline plots, and the ability to poke at intermediate DataFrames. Docker makes the setup reproducible and removes the “works on my machine” problem. This post walks through running PySpark in Jupyter via the official jupyter/pyspark-notebook image.
Installing Docker
Docker is a containerization platform that lets you package and deploy applications in a predictable, isolated environment.
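Once Docker is running, the whole stack comes up with one command; a sketch using the image named in the post (the notebook URL, including its access token, is printed in the container logs):

```shell
# Map the notebook port and mount the current directory so notebooks persist
# (/home/jovyan/work is the working directory inside the Jupyter images)
docker run -it --rm \
  -p 8888:8888 \
  -v "$PWD":/home/jovyan/work \
  jupyter/pyspark-notebook
```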
Continue Reading
Building Self-Contained PySpark Applications
The earlier install post covered the Scala interactive shell. This one shows how to do the same in Python, then how to turn an experiment into a standalone application you can run with spark-submit.
$ ./bin/pyspark

Now you can use Spark from the Python interactive shell.
The shell is fine for experimentation and development, but for production work you should build a standalone application.
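The jump from shell to application mostly amounts to moving your session setup into a script and handing it to spark-submit; a sketch of the submit step, assuming your code lives in a file named app.py:

```shell
# Run a self-contained PySpark script locally on 4 cores
# (swap the --master value for a cluster URL in production)
./bin/spark-submit --master "local[4]" app.py
```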
Continue Reading
Install Apache Spark on Ubuntu 14.04
Update: For Spark 2 see the Ubuntu 16.04 and macOS post.
This post walks through Spark 1.1.0 on Ubuntu 14.04. For the latest version, see Install Apache Spark 3.5 on Linux.
Prerequisites
Start with Java:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

To check that the Java installation is successful:
$ java -version

This prints the installed Java version:

java version "1.7.0_72"
Java(TM) SE Runtime Environment (build …
Continue Reading
Creating Uber JARs for Spark Projects with sbt-assembly
sbt-assembly packages a Spark project plus all its dependencies into one runnable “uber” JAR, which you can hand to spark-submit without worrying about classpath. This follows on from the standalone Spark application in Scala post.
Adding the sbt-assembly plugin
The first step in creating an assembled JAR for your Spark application is to add the sbt-assembly plugin. To do this, add the following line to the project/plugins.sbt file: …
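The plugin line itself is a single `addSbtPlugin` call; a sketch (the version number here is an assumption — pin it to a current release from the sbt-assembly project page):

```scala
// project/plugins.sbt — version is illustrative, check the plugin's releases
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
```

With the plugin on the build, `sbt assembly` produces the uber JAR under `target/scala-*/`, ready to pass to spark-submit.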
Continue Reading
Standalone Spark Application in Scala: Twitter Streaming Example
This post walks through building a Spark Streaming application in Scala that extracts popular hashtags from the Twitter firehose, packaged with sbt, and runnable from the Eclipse IDE via the sbteclipse plugin.
Building a Spark application using SBT
A standalone Scala application is built against the Apache Spark API and packaged with sbt (Simple Build Tool).
To create a standalone app, take the Twitter popular-tags example.
This program calculates popular hashtags (popular topics) over sliding 10 …
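The windowed count at the heart of the example is easy to picture outside Spark: keep a counter per micro-batch, and at each step sum the counters from the trailing window. A plain-Python sketch of that sliding-window reduction (the names are illustrative; the actual post uses Spark Streaming's `reduceByKeyAndWindow` in Scala):

```python
from collections import Counter, deque

def sliding_hashtag_counts(batches, window_size):
    """Yield hashtag counts over the trailing `window_size` batches."""
    window = deque(maxlen=window_size)  # old batches fall off automatically
    for batch in batches:
        # Count hashtags appearing in this batch of tweets
        window.append(Counter(
            word for tweet in batch for word in tweet.split()
            if word.startswith("#")
        ))
        # Sum the per-batch counters still inside the window
        yield sum(window, Counter())

batches = [
    ["#spark is great", "love #spark"],
    ["#scala #spark"],
    ["#mesos rocks"],
]
for counts in sliding_hashtag_counts(batches, window_size=2):
    print(counts.most_common(2))
```

In the streaming version the "batches" arrive on a fixed interval and the same windowed sum runs continuously over the live tweet stream.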
Continue Reading
Running Mesos 0.13 on Ubuntu 12.04
Apache Mesos is a cluster manager that abstracts CPU, memory, and storage across machines so frameworks (like Spark, Marathon, or Chronos) can run workloads without worrying about which physical node they land on. This post covers installing 0.13 on Ubuntu 12.04 and bringing up a minimal master + slave cluster.
Prerequisites
Install the build dependencies and make sure Java is available:

$ sudo apt-get install python2.7-dev g++ libcppunit-dev libunwind7-dev git libcurl4-nss-dev

You need to have …
Continue Reading