Spark Architecture & DataFrame API
- Spark architecture: driver, executors, cluster manager, DAG scheduler
- RDD vs DataFrame vs Dataset: why DataFrames are the right default
- DataFrame operations: select, filter, groupBy, agg, join, window
- Spark SQL: creating views, complex SQL queries, UDFs
- Schema inference vs explicit schema definition
- Reading and writing: Parquet, ORC, CSV, JSON, Delta Lake
- Partitioning: partition pruning, repartition vs coalesce
- Broadcast joins vs sort-merge joins: choosing based on data size
- Catalyst optimizer and the physical plan: reading EXPLAIN
- Caching and persistence: storage levels, when to cache
Structured Streaming, Tuning & Cloud Deployment
- Structured Streaming: source, sink, trigger, output modes
- Kafka source: reading from Kafka topics, exactly-once semantics
- Watermarking: handling late data in event-time processing
- Delta Lake: ACID transactions on Spark, time travel, schema evolution
- Spark performance tuning: executor memory, parallelism, shuffle configuration
- Adaptive Query Execution (AQE): auto-tuning at runtime
- Spark on Kubernetes: operator deployment, dynamic resource allocation
- Databricks: Delta Live Tables, Unity Catalog, job orchestration
- PySpark vs Scala Spark: choosing the right language for your team
- Spark UI: understanding stages, tasks, shuffle read/write
After this training, data engineers can design and build Spark pipelines, from raw data ingestion through transformation to analytics-ready output, deployed on Kubernetes or a managed cloud platform. They will be able to:
- Write efficient Spark DataFrame transformations and understand the optimizer's execution plan
- Implement Structured Streaming pipelines that consume from Kafka and use watermarking to handle late data
- Apply Delta Lake for ACID-compliant data lake operations with schema evolution
- Tune Spark jobs for performance using the Spark UI and AQE
Book the Apache Spark training
Available as a focused 2-day course or combined with Apache Flink for a complete stream and batch processing comparison.