Apache Spark Training — Raffael Hühnerschulte

What We Cover

Distributed data processing at scale — batch, SQL, and streaming

Day 1

Spark Architecture & DataFrame API

Spark architecture: driver, executors, cluster manager, DAG scheduler
RDD vs DataFrame vs Dataset: why DataFrames are the right default
DataFrame operations: select, filter, groupBy, agg, join, window
Spark SQL: creating views, complex SQL queries, UDFs
Schema inference vs explicit schema definition
Reading and writing: Parquet, ORC, CSV, JSON, Delta Lake
Partitioning: partition pruning, repartition vs coalesce
Broadcast joins vs sort-merge joins: choosing based on data size
Catalyst optimizer and the physical plan: reading EXPLAIN output
Caching and persistence: storage levels, when to cache

Day 2

Structured Streaming, Tuning & Cloud Deployment

Structured Streaming: source, sink, trigger, output modes
Kafka source: reading from Kafka topics, end-to-end exactly-once semantics with checkpointing and idempotent sinks
Watermarking: handling late data in event-time processing
Delta Lake: ACID transactions on Spark, time travel, schema evolution
Spark performance tuning: executor memory, parallelism, shuffle configuration
Adaptive Query Execution (AQE): re-optimizing query plans at runtime using shuffle statistics
Spark on Kubernetes: operator deployment, dynamic resource allocation
Databricks: Delta Live Tables, Unity Catalog, job orchestration
PySpark vs Scala Spark: choosing the right language for your team
Spark UI: understanding stages, tasks, shuffle read/write

Learning Outcomes

What your team walks away with

Data engineers who can design and build Spark pipelines, from raw data ingestion through transformation to analytics-ready output, and deploy them on Kubernetes or a managed cloud platform.

Write efficient Spark DataFrame transformations and understand the optimizer's execution plan
Implement Structured Streaming pipelines that consume from Kafka and handle late data with watermarking
Apply Delta Lake for ACID-compliant data lake operations with schema evolution
Tune Spark jobs for performance using the Spark UI and AQE

Book the Apache Spark training

Available as a focused 2-day course, or combined with Apache Flink for a side-by-side comparison of stream and batch processing.

Get in touch