The Five Most Useful Design Patterns for PySpark Data Pipelines
The five most useful design patterns for PySpark data pipelines are Factory (swap data sources without changing pipeline code), Singleton (one shared SparkSession), Builder (compose transformations step by step), Observer (monitor pipeline events), and Pipeline (chain stages together). This tutorial shows each with a compact, annotated PySpark example.
If you want to write PySpark data pipelines that stay clean as they grow, design patterns are the most reliable tool to reach for. Pipelines get complex quickly: new data sources appear, transformations multiply, one-off scripts turn into production systems. Classic software design patterns give you proven structures to keep that complexity under control.
This tutorial walks through the five most useful patterns for a PySpark pipeline: factory, singleton, builder, observer, and pipeline. Each comes with a compact, annotated example.
Each example uses Python’s abc module to define abstract base classes and concrete subclasses. If you haven’t used abc.ABC before: it lets you declare a class contract via @abstractmethod, and any subclass must implement those methods or Python will refuse to instantiate it.
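Here is a minimal sketch of that contract in isolation (the class names are illustrative and don't appear in the pattern examples below):

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Abstract contract: every Reader must implement read()."""

    @abstractmethod
    def read(self):
        pass

class CSVReader(Reader):
    def read(self):
        return ["row1", "row2"]

# Reader() raises TypeError because read() is abstract;
# CSVReader() works because it implements the full contract.
rows = CSVReader().read()
```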
| Pattern | Use when you need to… | Example |
|---|---|---|
| Factory | Create data sources or transforms without hard-coding the concrete type | DataSourceFactory → CSV, Parquet, JSON |
| Singleton | Share one expensive object (SparkSession, connection, sink) | Single DataSink, single SparkSession |
| Builder | Construct objects with many optional parameters | DataTransformBuilder().set_param1(...).build() |
| Observer | Let several components react when data changes | Multiple sinks notified on a new batch |
| Pipeline | Chain ordered stages in an ETL flow | extract → clean → enrich → load |
The factory pattern is a creational pattern: a parent class defines an interface for creating objects, and subclasses decide which concrete class to instantiate. In a data pipeline, this lets you plug in different input formats (CSV, Parquet, JSON) without the pipeline code caring about the underlying format.
```python
from abc import ABC, abstractmethod

class DataSourceFactory(ABC):
    """Abstract factory for generating data sources."""

    @abstractmethod
    def create_data_source(self):
        pass

class CSVDataSourceFactory(DataSourceFactory):
    """Concrete factory for generating CSV data sources."""

    def create_data_source(self):
        return CSVDataSource()

class ParquetDataSourceFactory(DataSourceFactory):
    """Concrete factory for generating Parquet data sources."""

    def create_data_source(self):
        return ParquetDataSource()

class DataSource(ABC):
    """Abstract base class for data sources."""

    @abstractmethod
    def load_data(self):
        pass

class CSVDataSource(DataSource):
    """Concrete implementation of a CSV data source."""

    def load_data(self):
        # Use spark.read.csv(...) to load data from a CSV file
        pass

class ParquetDataSource(DataSource):
    """Concrete implementation of a Parquet data source."""

    def load_data(self):
        # Use spark.read.parquet(...) to load data from a Parquet file
        pass

# Example usage
factory = CSVDataSourceFactory()
data_source = factory.create_data_source()
data = data_source.load_data()
```
DataSourceFactory is the abstract factory; CSVDataSourceFactory and ParquetDataSourceFactory are concrete factories that each produce a specific DataSource. Pick the factory that matches the input format and call create_data_source(). The rest of the pipeline works against the DataSource interface and stays decoupled from the format.
When to use it: any time your pipeline needs to support more than one input format or runtime configuration.
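The overview table also mentions JSON. A hypothetical JSONDataSourceFactory (not part of the code above, shown here only to make the extension point concrete) would build on the same DataSource and DataSourceFactory base classes:

```python
class JSONDataSource(DataSource):
    """Hypothetical data source for JSON input."""

    def load_data(self):
        # Would call spark.read.json(...) in a real pipeline
        pass

class JSONDataSourceFactory(DataSourceFactory):
    """Hypothetical factory that produces JSON data sources."""

    def create_data_source(self):
        return JSONDataSource()

# The calling code is unchanged; only the factory choice differs
factory = JSONDataSourceFactory()
data = factory.create_data_source().load_data()
```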
The singleton pattern restricts a class to a single instance. In Spark, the canonical example is the SparkSession: you want exactly one, shared across the whole job. Output sinks and connection pools are similar cases.
```python
from threading import Lock

class DataSink:
    """Class for writing data to a sink."""

    _instance = None
    _lock = Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance

    def write_data(self, data):
        # Write data to sink
        pass

# Example usage
sink1 = DataSink()
sink2 = DataSink()

# sink1 and sink2 are the same instance
assert sink1 is sink2
```
__new__ checks whether an instance already exists and returns it if so. _lock makes it safe under concurrent access, which matters when multiple threads bootstrap the pipeline at once.
When to use it: anything expensive to create and safe to share (SparkSession, Kafka producer, HTTP client, metrics emitter).
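Spark itself leans on this idea: SparkSession.builder.getOrCreate() returns the already-active session if one exists. A minimal sketch of a shared-session helper (assuming pyspark is installed; the app name is illustrative):

```python
from pyspark.sql import SparkSession

def get_spark():
    """Return the shared SparkSession, creating it on first use."""
    # getOrCreate() reuses the active session if there is one,
    # so every caller gets the same object.
    return (
        SparkSession.builder
        .appName("pipeline-demo")  # illustrative name
        .getOrCreate()
    )

spark_a = get_spark()
spark_b = get_spark()
assert spark_a is spark_b
```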
The builder pattern separates construction of a complex object from its representation. It shines when a class has many optional parameters. Rather than a constructor with ten arguments, you get a fluent API the reader can scan top-to-bottom.
```python
class DataTransform:
    """Class for transforming data."""

    def __init__(self, **kwargs):
        self.param1 = kwargs.get("param1")
        self.param2 = kwargs.get("param2")
        self.param3 = kwargs.get("param3")

    def transform(self, data):
        # Transform data using specified parameters
        pass

class DataTransformBuilder:
    """Builder for creating DataTransform objects."""

    def __init__(self):
        self.param1 = None
        self.param2 = None
        self.param3 = None

    def set_param1(self, param1):
        self.param1 = param1
        return self

    def set_param2(self, param2):
        self.param2 = param2
        return self

    def set_param3(self, param3):
        self.param3 = param3
        return self

    def build(self):
        return DataTransform(
            param1=self.param1,
            param2=self.param2,
            param3=self.param3,
        )

# Example usage
transform = (
    DataTransformBuilder()
    .set_param1("value1")
    .set_param3("value3")
    .build()
)
```
Each setter returns self, so the calls chain naturally and the caller sets only the parameters that matter for their use case.
When to use it: data readers, writers, and transforms with 5+ optional knobs (partitioning, compression, schema overrides, retry policy, etc.).
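PySpark's own reader and writer APIs already follow this fluent style, which is why the builder pattern feels natural in Spark code. A quick illustration (paths, options, and the partition column are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrameReader: option() calls chain and return the reader;
# the terminal csv()/parquet()/json() call plays the role of build().
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/events.csv")  # placeholder path
)

# DataFrameWriter works the same way on the output side.
(
    df.write
    .mode("overwrite")
    .partitionBy("event_date")  # placeholder column
    .parquet("/warehouse/events")  # placeholder path
)
```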
The observer pattern sets up a one-to-many dependency: when a subject changes state, it notifies every registered observer. In a data pipeline, this is useful when one successful batch should trigger several follow-up actions (write to the warehouse, update a dashboard, emit a metric) without hard-wiring them into the pipeline.
```python
from abc import ABC, abstractmethod

class DataEvent(ABC):
    """Abstract base class for data events."""

    @abstractmethod
    def get_data(self):
        pass

class DataUpdatedEvent(DataEvent):
    """Concrete event fired when data is updated."""

    def __init__(self, data):
        self.data = data

    def get_data(self):
        return self.data

class DataObserver(ABC):
    """Abstract base class for data observers."""

    @abstractmethod
    def update(self, event):
        pass

class DataTransformObserver(DataObserver):
    """Observer that applies a transform when data is updated."""

    def __init__(self, transform):
        self.transform = transform

    def update(self, event):
        data = event.get_data()
        transformed_data = self.transform.transform(data)
        # Do something with the transformed data

class DataSubject:
    """Manages observers and fires data events."""

    def __init__(self):
        self.observers = []

    def register_observer(self, observer):
        self.observers.append(observer)

    def remove_observer(self, observer):
        self.observers.remove(observer)

    def notify_observers(self, event):
        for observer in self.observers:
            observer.update(event)

# Example usage (assumes `transform` and `data` from the earlier examples)
subject = DataSubject()
transform_observer = DataTransformObserver(transform)
subject.register_observer(transform_observer)
subject.notify_observers(DataUpdatedEvent(data))
```
The DataSubject keeps the list of DataObserver instances and calls each one's update() when notify_observers() fires. Adding a new reaction means writing one more observer and registering it; the subject and the rest of the pipeline don't change.
When to use it: fan-out reactions to a pipeline event (audit logs, cache invalidation, alerting).
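For example, a hypothetical LoggingObserver (not in the original code) can be registered alongside the transform observer without touching the subject or the pipeline:

```python
class LoggingObserver(DataObserver):
    """Hypothetical observer that records each batch it sees."""

    def update(self, event):
        batch = event.get_data()
        # In production this might emit a metric or write an audit row
        print(f"received batch: {batch}")

# Reuses the DataSubject instance from the example above
subject.register_observer(LoggingObserver())
subject.notify_observers(DataUpdatedEvent([1, 2, 3]))
```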
The pipeline pattern (sometimes called chain-of-responsibility for data) chains a series of processing steps so the output of each stage feeds the next. It maps naturally onto ETL: extract, transform, load.
```python
from abc import ABC, abstractmethod

class DataTransform(ABC):
    """Abstract base class for data transforms in a pipeline."""

    def __init__(self, next=None):
        self.next = next

    def set_next(self, next):
        self.next = next
        return next

    @abstractmethod
    def transform(self, data):
        pass

class TransformA(DataTransform):
    def transform(self, data):
        # Apply transform A
        transformed_data = data
        if self.next:
            return self.next.transform(transformed_data)
        return transformed_data

class TransformB(DataTransform):
    def transform(self, data):
        # Apply transform B
        transformed_data = data
        if self.next:
            return self.next.transform(transformed_data)
        return transformed_data

class TransformC(DataTransform):
    def transform(self, data):
        # Apply transform C
        transformed_data = data
        if self.next:
            return self.next.transform(transformed_data)
        return transformed_data

# Example usage (assumes `data` was loaded earlier, e.g. by the factory example)
transform_a = TransformA()
transform_b = TransformB()
transform_c = TransformC()

transform_a.set_next(transform_b).set_next(transform_c)
transformed_data = transform_a.transform(data)
```
set_next() returns the next stage so you can wire the chain in one readable line. Each concrete transform only knows about its own logic and the handoff; adding, removing, or reordering steps has zero ripple effect.
When to use it: any multi-stage ETL/ELT flow, especially when stages need to be composed differently per job.
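If your stages are plain DataFrame-to-DataFrame functions, PySpark's built-in DataFrame.transform (available since Spark 3.0) gives you the same chaining with less machinery. A sketch with made-up column names:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def clean(df: DataFrame) -> DataFrame:
    # Drop rows missing an id (column names are illustrative)
    return df.dropna(subset=["id"])

def enrich(df: DataFrame) -> DataFrame:
    # Add a derived column (illustrative)
    return df.withColumn("amount_doubled", F.col("amount") * 2)

df = spark.createDataFrame([(1, 10.0), (None, 20.0)], ["id", "amount"])

# Each stage feeds the next, mirroring transform_a -> transform_b -> transform_c
result = df.transform(clean).transform(enrich)
result.show()
```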
These five patterns combine well. A production pipeline typically looks like:
- a singleton SparkSession shared by every component,
- a factory that picks the right DataSource based on config,
- a builder that assembles each configurable transform,
- a pipeline that chains the stages in order, and
- observers that react as each batch lands.

What are design patterns in PySpark? They are reusable solutions to recurring problems in PySpark code, for example, how to create data sources flexibly (factory), how to share one SparkSession (singleton), or how to compose transforms (pipeline). The same shapes show up across almost every non-trivial pipeline.
Which pattern should you learn first? Start with factory and pipeline. Factory removes hard-coded data formats; pipeline gives you a clean ETL skeleton. The others slot in once those two are in place.
What's the difference between the factory and singleton patterns? Factory produces many instances of different concrete classes that share an interface. Singleton restricts a class to exactly one instance. They solve opposite problems and are often used together: a factory object itself may be implemented as a singleton.
Is an abstract base class the same thing as an “abc class” in Python? Yes. In Python, abstract base classes live in the abc module; any class that inherits from abc.ABC and uses @abstractmethod is what people casually call an “abc class”. It defines a contract that subclasses must implement.
Keep going: Advanced PySpark Design Patterns: Real-World Implementation Examples covers strategy, decorator, and template method patterns. For a one-page cheat sheet, see the PySpark Design Patterns Quick Reference. And if you’re tuning an existing pipeline, Apache Spark Performance Tuning pairs well with these patterns.
The five patterns above cover most production needs: flexible data sources (factory), shared resources (singleton), configurable transforms (builder), event fan-out (observer), and readable ETL chains (pipeline). The next step is to try each one on a small PySpark job of your own; nothing locks the lessons in faster than running the code.