Apache Spark 3.0 Adaptive Query Execution
Today, Spark SQL is one of the most valuable components of Apache Spark. It powers both SQL queries and the DataFrame API. At its core is the Catalyst optimizer, an extensible and powerful query optimizer built on advanced Scala features.
In this article, we first walk through the workflow of the Catalyst optimizer and then dive deep into the new optimizations that Adaptive Query Execution enables.
Adaptive Query Execution (AQE) is a new feature in Apache Spark 3.0 that lets Spark re-optimize and adjust query plans based on runtime statistics collected while the query is running. To understand how it works, let’s first look at the optimization stages that the Catalyst optimizer performs.
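For readers who want to follow along: in Spark 3.0, AQE is turned off by default and is controlled by the `spark.sql.adaptive.enabled` property. A minimal configuration fragment, assuming cluster settings are kept in `spark-defaults.conf`:

```properties
# spark-defaults.conf
# Turn on Adaptive Query Execution (disabled by default in Spark 3.0)
spark.sql.adaptive.enabled  true
```

The same property can also be set per session, for example with `SET spark.sql.adaptive.enabled=true;` in Spark SQL.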
Catalyst Optimizer 101
The Catalyst optimizer applies optimizations during the logical and physical planning stages. It first optimizes the logical plan, then generates a range of physical plans and selects the most efficient one based on a cost model.
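The idea of cost-based selection can be sketched in a few lines of Python. This is a toy illustration only, not Catalyst's actual cost model: the `PhysicalPlan` class, the `choose_plan` helper, and every number below are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalPlan:
    """Hypothetical stand-in for a candidate physical plan."""
    name: str
    estimated_rows: int   # row-count estimate available before execution
    cost_per_row: float   # made-up relative weight for this strategy

    def cost(self) -> float:
        # Toy cost model: estimated rows times a per-row weight.
        return self.estimated_rows * self.cost_per_row


def choose_plan(candidates):
    """Return the candidate with the lowest estimated cost."""
    if not candidates:
        raise ValueError("no candidate plans")
    return min(candidates, key=lambda plan: plan.cost())


# Three invented candidates for the same logical join.
candidates = [
    PhysicalPlan("sort_merge_join", estimated_rows=1_000_000, cost_per_row=3.0),
    PhysicalPlan("broadcast_hash_join", estimated_rows=1_000_000, cost_per_row=1.0),
    PhysicalPlan("shuffle_hash_join", estimated_rows=1_000_000, cost_per_row=2.0),
]

best = choose_plan(candidates)
print(best.name)
```

Note that the choice depends entirely on estimates made before the query runs. If those estimates are off, so is the selected plan.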
- Unresolved Logical Plan: For a SQL query or a DataFrame, the optimizer accepts the unresolved logical plan and checks…