
Explore the Apache Spark architecture and how it runs on a cluster. Learn cluster components and how Spark distributes data workloads, plus the responsibilities of a Spark application.
Explore how Apache Spark runs on a cluster by coordinating the cluster manager and node managers to allocate resources, start the Spark driver and executors, and achieve parallelism.
Create an Azure Databricks cluster by provisioning the workspace, selecting a Databricks runtime, choosing standard mode, and configuring worker and driver memory and cores with auto-scaling.
Learn how to create a free Databricks account and create your first cluster.
Install the dataset and Databricks notebooks by uploading files to a dedicated data folder and importing the notebooks into the workspace, then attach to a cluster and validate the customer JSON data.
Configure a Spark session in Databricks, read JSON data into a data frame, inspect its schema, and understand distribution across partitions and executors.
Define and view a dataframe schema with printSchema, using a DDL string, a StructType, or Spark inference to set data types like integer and date.
Learn to select columns from a dataframe in Spark, using string names, column objects, and expressions, including struct fields and the year function in select expressions.
Rename a data frame column using withColumnRenamed by specifying the existing and new names. Confirm by inspecting the data frame columns, and note that a non-existent column causes no change.
Learn how to cast and change column data types in Spark dataframes, including integer to long and date to string, using cast and selectExpr, and verify the result with printSchema.
Add columns to a data frame using withColumn and lit. Concatenate first and last names with a space to form full name, and use expr to compute address id plus one.
Remove columns from a dataframe with the drop method, producing a dataframe. Drop single or multiple columns, handle non-existent columns, and understand string versus column object overloads in Apache Spark.
Compute basic arithmetic on DataFrame columns to derive expected net paid, using the withColumn method and an expression, and display results in the sales performance DataFrame.
Explore how Apache Spark treats dataframes as immutable objects, using transformations to create new dataframes, and how actions trigger lineage-graph computations across a distributed cluster.
Learn how to filter a dataframe in Apache Spark using the filter and where methods with boolean expressions, including examples filtering by birth year, birth month, and birth country.
Explore how narrow dependencies in Apache Spark data frames link parent and child partitions, with examples like filter and select transforming one input partition into one output partition.
Remove duplicate rows in a Spark dataframe by using distinct, dropDuplicates on specific columns, or dropDuplicates with a column list; with no arguments, dropDuplicates behaves like distinct.
Explore how to handle null values in Apache Spark using isNull and isNotNull, and filter with where or filter to manage nulls in a dataframe.
Drop rows with nulls using DataFrameNaFunctions, choosing all or any nulls and targeting specific columns, then replace nulls with fill or a replacement map and view results in Databricks.
Sort dataframe rows with sort or orderBy, which are equivalent, while handling nulls and sorting by one or multiple columns or expressions in ascending or descending order.
Learn aggregations with groupBy on the web series purchases data, counting items per customer and aliasing the result as item count, while understanding shuffle and wide dependencies.
Apply aggregation over a data frame without groupBy to obtain max, min, average, and count values for a column like sales price; use describe or summary to explore statistics.
Perform a right outer join on the web series data frame and customer data frame in Apache Spark to identify customers with no purchases and use nulls for unmatched rows.
Perform a left outer join between two data frames in Apache Spark, matching on customer id and bill customer, and observe nulls in the right-side columns when no match.
Learn to append rows to dataframes using union and unionByName in Spark, compare matching by column position versus by column name, and apply distinct to remove duplicates after merging.
Cache a data frame in memory and on disk to avoid recomputing joins, using cache or persist with configurable storage levels; run actions like count to materialize cached data.
Explore how the data frame writer API saves a data frame to external storage, including csv restrictions, overwrite options, repartitioning for file layout, and default parquet with compression options.
Partition a data frame by a column during write to create per-category folders. Trim trailing spaces to ensure folder accuracy and read from partitioned folders.
Learn how to write and register user defined functions for Apache Spark, use them in data frame operations and Spark SQL, and manage serialization and null safety.
Learn how Spark translates dataframes and sql queries into logical plans, resolves names with the internal catalog, optimizes and builds efficient physical plans with pushdown, broadcast joins, and code generation.
Explore how Apache Spark executes actions within an execution hierarchy, forming a job that splits into stages at shuffle boundaries and then into tasks; view the DAG and adjust partitions with the shuffle configuration.
Learn how to partition a dataframe in Spark, using repartition and coalesce to control partitions, understand when shuffles occur, and compare stage behavior and data exchange.
Enable adaptive query execution in Apache Spark 3 to optimize the physical plan after each stage, enabling coalescing of small partitions and dynamic join strategy choices.
Register for the Databricks Certified Associate Developer for Apache Spark 3 exam via Databricks Learn Certifications, and review prerequisites and exam logistics such as 60 questions, a 70% passing score, and online proctoring.
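The narrow and wide dependencies mentioned in the lecture list above can be pictured with a small toy model. This is a sketch in plain Python, not the Spark API: the function names (narrow_filter, shuffle_by_key, count_per_key), the sample rows, and the modulo-based partitioning are all assumptions made for illustration. A filter touches each partition on its own, while counting items per customer first needs rows with the same key to be moved into the same partition.

```python
# Toy model in plain Python (NOT the Spark API): a "dataframe" is a list of
# partitions, and each partition is a list of (customer_id, item) rows.
from collections import Counter


def narrow_filter(partitions, predicate):
    """Narrow dependency: each output partition comes from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in partitions]


def shuffle_by_key(partitions, num_partitions):
    """Wide dependency: rows move between partitions so equal keys end up together."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            # Hypothetical partitioner: integer key modulo the partition count.
            shuffled[key % num_partitions].append((key, value))
    return shuffled


def count_per_key(partitions):
    """After the shuffle, every partition can count its own keys independently."""
    return [dict(Counter(key for key, _ in part)) for part in partitions]


# Made-up sample data: two input partitions of (customer_id, item) rows.
purchases = [
    [(1, "ep1"), (2, "ep1"), (1, "ep2")],
    [(3, "ep1"), (2, "ep2"), (1, "ep3")],
]

kept = narrow_filter(purchases, lambda row: row[0] != 3)
item_count = count_per_key(shuffle_by_key(kept, num_partitions=2))
print(item_count)
```

In this model the filter keeps the same number of partitions it started with, whereas the count has to redistribute every row first, which is the step Spark performs as a shuffle between stages.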
Do you want to learn how to handle massive amounts of data at scale?
Learn Apache Spark 3 and pass the Databricks Certified Associate Developer for Apache Spark 3.0 exam.
Hi, my name is Wadson, and I’m a Databricks Certified Associate Developer for Apache Spark.
Apache Spark has become the standard big-data cluster processing framework in today's data-driven world.
Apache Spark is used for Data Engineering, Data Science, and Machine Learning.
I will teach you everything you need to know about starting with Apache Spark.
You will learn the Architecture of Apache Spark and use its Core APIs to manipulate complex data.
You will write queries to perform transformations such as Join, Union, GroupBy, and more.
This course is for beginners.
You don't need any previous knowledge of Apache Spark.
Notebooks are available to download so that you can follow along with me in the videos.
The Notebooks contain all the source code I use in the course.
There are also Quizzes to help you assess your understanding of the topics.
Check out some of the top reviews and enroll in the course.
"This course is really helpful with all the necessary details needed for the Certification: Databricks Certified Associate Developer for Apache Spark 3.0.
I've cleared the certification with 80% score and I'd suggest to check all the Course contents thoroughly"
"Very good course. Gives a good overview of all the necessary components of the spark application which are required for the test and that too in very short span of time. will highly recommend this course.
worth spending time !!"