The Five Most Useful Design Patterns for PySpark Data Pipelines
The five most useful design patterns for PySpark data pipelines are Factory (swap data sources without changing pipeline code), Singleton (one shared SparkSession), Builder (compose transformations step by step), Observer (monitor pipeline events), and Pipeline (chain stages together). This tutorial shows each with a compact, annotated PySpark example.
If you want to write PySpark data pipelines that stay clean as they grow, design patterns are the most reliable tool to reach for. Pipelines get complex quickly: new data sources appear, transformations multiply, one-off scripts turn into production systems. Classic software design patterns give you proven structures to keep that complexity under control.
This tutorial walks through the five most useful patterns for a PySpark pipeline: factory, singleton, builder, observer, and pipeline. Each comes with a compact, annotated example.
Each example uses Python’s abc module to define abstract base classes and concrete subclasses. If you haven’t used abc.ABC before: it lets you declare a class contract via @abstractmethod, and any subclass must implement those methods or Python will refuse to instantiate it.
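Here is a minimal sketch of that contract in isolation (the class names are illustrative and don't appear in the pattern examples below):

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Abstract contract: every Reader must implement read()."""

    @abstractmethod
    def read(self):
        pass

class CSVReader(Reader):
    def read(self):
        return ["row1", "row2"]

# Reader() raises TypeError because read() is abstract;
# CSVReader() works because it implements the full contract.
rows = CSVReader().read()
```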
| Pattern | Use when you need to… | Example |
|---|---|---|
| Factory | Create data sources or transforms without hard-coding the concrete type | DataSourceFactory → CSV, Parquet, JSON |
| Singleton | Share one expensive object (SparkSession, connection, sink) | Single DataSink, single SparkSession |
| Builder | Construct objects with many optional parameters | DataTransformBuilder().set_param1(...).build() |
| Observer | Let several components react when data changes | Multiple sinks notified on a new batch |
| Pipeline | Chain ordered stages in an ETL flow | extract → clean → enrich → load |
The factory pattern is a creational pattern: a parent class defines an interface for creating objects, and subclasses decide which concrete class to instantiate. In a data pipeline, this lets you plug in different input formats (CSV, Parquet, JSON) without the pipeline code caring about the underlying format.
```python
from abc import ABC, abstractmethod

class DataSourceFactory(ABC):
    """Abstract factory for generating data sources."""

    @abstractmethod
    def create_data_source(self):
        pass

class CSVDataSourceFactory(DataSourceFactory):
    """Concrete factory for generating CSV data sources."""

    def create_data_source(self):
        return CSVDataSource()

class ParquetDataSourceFactory(DataSourceFactory):
    """Concrete factory for generating Parquet data sources."""

    def create_data_source(self):
        return ParquetDataSource()

class DataSource(ABC):
    """Abstract base class for data sources."""

    @abstractmethod
    def load_data(self):
        pass

class CSVDataSource(DataSource):
    """Concrete implementation of a CSV data source."""

    def load_data(self):
        # Use spark.read.csv(...) to load data from a CSV file
        pass

class ParquetDataSource(DataSource):
    """Concrete implementation of a Parquet data source."""

    def load_data(self):
        # Use spark.read.parquet(...) to load data from a Parquet file
        pass

# Example usage
factory = CSVDataSourceFactory()
data_source = factory.create_data_source()
data = data_source.load_data()
```
DataSourceFactory is the abstract factory; CSVDataSourceFactory and ParquetDataSourceFactory are concrete factories that each produce a specific DataSource. Pick the factory that matches the input format and call create_data_source(). The rest of the pipeline works against the DataSource interface and stays decoupled from the format.
When to use it: any time your pipeline needs to support more than one input format or runtime configuration.
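The overview table also mentions JSON. A hypothetical JSONDataSourceFactory (not part of the code above, shown here only to make the extension point concrete) would build on the same DataSource and DataSourceFactory base classes:

```python
class JSONDataSource(DataSource):
    """Hypothetical data source for JSON input."""

    def load_data(self):
        # Would call spark.read.json(...) in a real pipeline
        pass

class JSONDataSourceFactory(DataSourceFactory):
    """Hypothetical factory that produces JSON data sources."""

    def create_data_source(self):
        return JSONDataSource()

# The calling code is unchanged; only the factory choice differs
factory = JSONDataSourceFactory()
data = factory.create_data_source().load_data()
```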
The singleton pattern restricts a class to a single instance. In Spark, the canonical example is the SparkSession: you want exactly one, shared across the whole job. Output sinks and connection pools are similar cases.
```python
from threading import Lock

class DataSink:
    """Class for writing data to a sink."""

    _instance = None
    _lock = Lock()

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance

    def write_data(self, data):
        # Write data to sink
        pass

# Example usage
sink1 = DataSink()
sink2 = DataSink()

# sink1 and sink2 are the same instance
assert sink1 is sink2
```
__new__ checks whether an instance already exists and returns it if so. _lock makes it safe under concurrent access, which matters when multiple threads bootstrap the pipeline at once.
When to use it: anything expensive to create and safe to share (SparkSession, Kafka producer, HTTP client, metrics emitter).
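Spark itself leans on this idea: SparkSession.builder.getOrCreate() returns the already-active session if one exists. A minimal sketch of a shared-session helper (assuming pyspark is installed; the app name is illustrative):

```python
from pyspark.sql import SparkSession

def get_spark():
    """Return the shared SparkSession, creating it on first use."""
    # getOrCreate() reuses the active session if there is one,
    # so every caller gets the same object.
    return (
        SparkSession.builder
        .appName("pipeline-demo")  # illustrative name
        .getOrCreate()
    )

spark_a = get_spark()
spark_b = get_spark()
assert spark_a is spark_b
```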
The builder pattern separates construction of a complex object from its representation. It shines when a class has many optional parameters. Rather than a constructor with ten arguments, you get a fluent API the reader can scan top-to-bottom.
```python
class DataTransform:
    """Class for transforming data."""

    def __init__(self, **kwargs):
        self.param1 = kwargs.get("param1")
        self.param2 = kwargs.get("param2")
        self.param3 = kwargs.get("param3")

    def transform(self, data):
        # Transform data using specified parameters
        pass

class DataTransformBuilder:
    """Builder for creating DataTransform objects."""

    def __init__(self):
        self.param1 = None
        self.param2 = None
        self.param3 = None

    def set_param1(self, param1):
        self.param1 = param1
        return self

    def set_param2(self, param2):
        self.param2 = param2
        return self

    def set_param3(self, param3):
        self.param3 = param3
        return self

    def build(self):
        return DataTransform(
            param1=self.param1,
            param2=self.param2,
            param3=self.param3,
        )

# Example usage
transform = (
    DataTransformBuilder()
    .set_param1("value1")
    .set_param3("value3")
    .build()
)
```
Each setter returns self, so the calls chain naturally and the caller sets only the parameters that matter for their use case.
When to use it: data readers, writers, and transforms with 5+ optional knobs (partitioning, compression, schema overrides, retry policy, etc.).
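PySpark's own reader and writer APIs already follow this fluent style, which is why the builder pattern feels natural in Spark code. A quick illustration (paths, options, and the partition column are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrameReader: option() calls chain and return the reader;
# the terminal csv()/parquet()/json() call plays the role of build().
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/events.csv")  # placeholder path
)

# DataFrameWriter works the same way on the output side.
(
    df.write
    .mode("overwrite")
    .partitionBy("event_date")  # placeholder column
    .parquet("/warehouse/events")  # placeholder path
)
```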
The observer pattern sets up a one-to-many dependency: when a subject changes state, it notifies every registered observer. In a data pipeline, this is useful when one successful batch should trigger several follow-up actions (write to the warehouse, update a dashboard, emit a metric) without hard-wiring them into the pipeline.
```python
from abc import ABC, abstractmethod

class DataEvent(ABC):
    """Abstract base class for data events."""

    @abstractmethod
    def get_data(self):
        pass

class DataUpdatedEvent(DataEvent):
    """Concrete event fired when data is updated."""

    def __init__(self, data):
        self.data = data

    def get_data(self):
        return self.data

class DataObserver(ABC):
    """Abstract base class for data observers."""

    @abstractmethod
    def update(self, event):
        pass

class DataTransformObserver(DataObserver):
    """Observer that applies a transform when data is updated."""

    def __init__(self, transform):
        self.transform = transform

    def update(self, event):
        data = event.get_data()
        transformed_data = self.transform.transform(data)
        # Do something with the transformed data

class DataSubject:
    """Manages observers and fires data events."""

    def __init__(self):
        self.observers = []

    def register_observer(self, observer):
        self.observers.append(observer)

    def remove_observer(self, observer):
        self.observers.remove(observer)

    def notify_observers(self, event):
        for observer in self.observers:
            observer.update(event)

# Example usage (assumes `transform` and `data` from the earlier examples)
subject = DataSubject()
transform_observer = DataTransformObserver(transform)
subject.register_observer(transform_observer)
subject.notify_observers(DataUpdatedEvent(data))
```
The DataSubject keeps the list of DataObserver instances and calls each one's update() when notify_observers() fires. Adding a new reaction means writing one more observer and registering it; the subject and the rest of the pipeline don't change.
When to use it: fan-out reactions to a pipeline event (audit logs, cache invalidation, alerting).
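For example, a hypothetical LoggingObserver (not in the original code) can be registered alongside the transform observer without touching the subject or the pipeline:

```python
class LoggingObserver(DataObserver):
    """Hypothetical observer that records each batch it sees."""

    def update(self, event):
        batch = event.get_data()
        # In production this might emit a metric or write an audit row
        print(f"received batch: {batch}")

# Reuses the DataSubject instance from the example above
subject.register_observer(LoggingObserver())
subject.notify_observers(DataUpdatedEvent([1, 2, 3]))
```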
The pipeline pattern (sometimes called chain-of-responsibility for data) chains a series of processing steps so the output of each stage feeds the next. It maps naturally onto ETL: extract, transform, load.
```python
from abc import ABC, abstractmethod

class DataTransform(ABC):
    """Abstract base class for data transforms in a pipeline."""

    def __init__(self, next=None):
        self.next = next

    def set_next(self, next):
        self.next = next
        return next

    @abstractmethod
    def transform(self, data):
        pass

class TransformA(DataTransform):
    def transform(self, data):
        # Apply transform A
        transformed_data = data
        if self.next:
            return self.next.transform(transformed_data)
        return transformed_data

class TransformB(DataTransform):
    def transform(self, data):
        # Apply transform B
        transformed_data = data
        if self.next:
            return self.next.transform(transformed_data)
        return transformed_data

class TransformC(DataTransform):
    def transform(self, data):
        # Apply transform C
        transformed_data = data
        if self.next:
            return self.next.transform(transformed_data)
        return transformed_data

# Example usage (assumes `data` was loaded earlier, e.g. by the factory example)
transform_a = TransformA()
transform_b = TransformB()
transform_c = TransformC()

transform_a.set_next(transform_b).set_next(transform_c)
transformed_data = transform_a.transform(data)
```
set_next() returns the next stage so you can wire the chain in one readable line. Each concrete transform only knows about its own logic and the handoff; adding, removing, or reordering steps has zero ripple effect.
When to use it: any multi-stage ETL/ELT flow, especially when stages need to be composed differently per job.
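If your stages are plain DataFrame-to-DataFrame functions, PySpark's built-in DataFrame.transform (available since Spark 3.0) gives you the same chaining with less machinery. A sketch with made-up column names:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def clean(df: DataFrame) -> DataFrame:
    # Drop rows missing an id (column names are illustrative)
    return df.dropna(subset=["id"])

def enrich(df: DataFrame) -> DataFrame:
    # Add a derived column (illustrative)
    return df.withColumn("amount_doubled", F.col("amount") * 2)

df = spark.createDataFrame([(1, 10.0), (None, 20.0)], ["id", "amount"])

# Each stage feeds the next, mirroring transform_a -> transform_b -> transform_c
result = df.transform(clean).transform(enrich)
result.show()
```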
These five patterns combine well. A production pipeline typically looks like:
- a singleton SparkSession shared by every component,
- a factory that picks the right DataSource based on config,
- a builder that assembles each configurable transform,
- a pipeline that chains the stages in order, and
- observers that react as each batch lands.

What are design patterns in PySpark? They are reusable solutions to recurring problems in PySpark code, for example, how to create data sources flexibly (factory), how to share one SparkSession (singleton), or how to compose transforms (pipeline). The same shapes show up across almost every non-trivial pipeline.
Which pattern should you learn first? Start with factory and pipeline. Factory removes hard-coded data formats; pipeline gives you a clean ETL skeleton. The others slot in once those two are in place.
What's the difference between the factory and singleton patterns? Factory produces many instances of different concrete classes that share an interface. Singleton restricts a class to exactly one instance. They solve opposite problems and are often used together: a factory object itself may be implemented as a singleton.
Is an abstract base class the same thing as an “abc class” in Python? Yes. In Python, abstract base classes live in the abc module; any class that inherits from abc.ABC and uses @abstractmethod is what people casually call an “abc class”. It defines a contract that subclasses must implement.
Keep going: Advanced PySpark Design Patterns: Real-World Implementation Examples covers strategy, decorator, and template method patterns. For a one-page cheat sheet, see the PySpark Design Patterns Quick Reference. And if you’re tuning an existing pipeline, Apache Spark Performance Tuning pairs well with these patterns.
The five patterns above cover most production needs: flexible data sources (factory), shared resources (singleton), configurable transforms (builder), event fan-out (observer), and readable ETL chains (pipeline). The next step is to try each one on a small PySpark job of your own; nothing locks the lessons in faster than running the code.