
Use PySpark to group by gender and city, compute average salaries, identify the highest-paying city, and compare male and female pay with deltas and aliases.
Inspect DataFrames in PySpark: apply a schema when reading a CSV file, then view the column types, the first rows, and descriptive statistics to spot nulls, duplicates, and data ranges.
Apply group-by and aggregation techniques in PySpark: clean salary data, cast it to float, and compute total, average, minimum, and maximum salaries by gender and by city.
Finish the course confident in PySpark and start seeing its value in your work. Ask questions in the Q&A, and I’ll be happy to help you out.
Spark is one of the most in-demand Big Data processing frameworks right now.
This course will take you through the core concepts of PySpark. By the end, you will be able to do most of the things you’d do in SQL or the Python pandas library, namely:
Getting hold of data
Handling missing data and cleaning data up
Aggregating your data
Filtering it
Pivoting it
And writing it back
Together, these skills will let you use Spark on large datasets and start getting value from your data.
Let’s get started.