Training Agenda

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing: batch ETL, streaming, SQL queries, machine learning, and graph processing in a single distributed computing framework. It runs on managed cloud platforms (Databricks, AWS EMR, GCP Dataproc), integrates with Delta Lake, and underpins many modern data engineering pipelines. This training covers Spark architecture, the DataFrame API, Spark SQL, Structured Streaming, and production deployment on Kubernetes and cloud platforms.

2 days | On-site, remote, or hybrid | Up to 20 participants | German or English
What We Cover
Distributed data processing at scale — batch, SQL, and streaming
Day 1

Spark Architecture & DataFrame API

  • Spark architecture: driver, executors, cluster manager, DAG scheduler
  • RDD vs DataFrame vs Dataset: why DataFrames are the right default
  • DataFrame operations: select, filter, groupBy, agg, join, window
  • Spark SQL: creating views, complex SQL queries, UDFs
  • Schema inference vs explicit schema definition
  • Reading and writing: Parquet, ORC, CSV, JSON, Delta Lake
  • Partitioning: partition pruning, repartition vs coalesce
  • Broadcast joins vs sort-merge joins: choosing based on data size
  • Catalyst optimizer and the physical plan: reading EXPLAIN
  • Caching and persistence: storage levels, when to cache
Day 2

Structured Streaming, Tuning & Cloud Deployment

  • Structured Streaming: source, sink, trigger, output modes
  • Kafka source: reading from Kafka topics, exactly-once semantics
  • Watermarking: handling late data in event-time processing
  • Delta Lake: ACID transactions on Spark, time travel, schema evolution
  • Spark performance tuning: executor memory, parallelism, shuffle configuration
  • Adaptive Query Execution (AQE): runtime re-optimization (coalescing shuffle partitions, switching join strategies, handling skewed joins)
  • Spark on Kubernetes: operator deployment, dynamic resource allocation
  • Databricks: Delta Live Tables, Unity Catalog, job orchestration
  • PySpark vs Scala Spark: choosing the right language for your team
  • Spark UI: understanding stages, tasks, shuffle read/write
Learning Outcomes
What your team walks away with

Data engineers who can design and build Spark pipelines — from raw data ingestion through transformation to analytics-ready output, deployed on Kubernetes or a managed cloud platform.

Book the Apache Spark training

Available as a focused 2-day course or combined with Apache Flink for a side-by-side comparison of stream and batch processing.

Get in touch