Data Engineering using Databricks on AWS and Azure

Build Data Engineering Pipelines using Databricks core features such as Spark, Delta Lake, cloudFiles, etc.

Ratings: 4.30 / 5.00




Description

As part of this course, you will learn Data Engineering using Databricks, a cloud platform-agnostic technology.

About Data Engineering

Data Engineering is the processing of data to meet downstream needs. As part of Data Engineering, we need to build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering. Conventionally, these roles were known as ETL Development, Data Warehouse Development, etc.

About Databricks

Databricks is one of the most popular cloud platform-agnostic data engineering tech stacks. Its engineers are among the committers of the Apache Spark project. The Databricks Runtime provides Spark while leveraging the elasticity of the cloud. With Databricks, you pay for what you use. Over time, Databricks introduced the Lakehouse concept, which provides the features required for traditional BI as well as AI and ML. Here are some of the core features of Databricks.

  • Spark - Distributed Computing

  • Delta Lake - Perform CRUD Operations. It is primarily used to build capabilities such as inserting, updating, and deleting data in files in the Data Lake.

  • cloudFiles - Get the files in an incremental fashion in the most efficient way leveraging cloud features.

  • Databricks SQL - A Photon-based interface that is fine-tuned for running queries submitted for reporting and visualization by reporting tools. It is also used for Ad-hoc Analysis.

Course Details

As part of this course, you will be learning Data Engineering using Databricks.

  • Getting Started with Databricks

  • Setup Local Development Environment to develop Data Engineering Applications using Databricks

  • Using Databricks CLI to manage files, jobs, clusters, etc. related to Data Engineering Applications

  • Spark Application Development Cycle to build Data Engineering Applications

  • Databricks Jobs and Clusters

  • Deploy and Run Data Engineering Jobs on Databricks Job Clusters as Python Application

  • Deploy and Run Data Engineering Jobs on Databricks Job Clusters using Notebooks

  • Deep Dive into Delta Lake using Dataframes on Databricks Platform

  • Deep Dive into Delta Lake using Spark SQL on Databricks Platform

  • Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters

  • Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles

  • Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications

  • Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications

  • Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.

  • Overview of Databricks SQL for Data Analysis and reporting.

We will be adding a few more modules related to PySpark, Spark with Scala, Spark SQL, and Streaming Pipelines in the coming weeks.

Desired Audience

Here is the desired audience for this advanced course.

  • Experienced application developers with prior knowledge and experience of Spark who want to gain expertise in Data Engineering.

  • Experienced Data Engineers who want to gain enough skills to add Databricks to their profile.

  • Testers who want to improve their testing capabilities related to Data Engineering applications using Databricks.

Prerequisites

  • Logistics

    • Computer with decent configuration (At least 4 GB RAM, however 8 GB is highly desired)

    • Dual Core is required and Quad-Core is highly desired

    • Chrome Browser

    • High-Speed Internet

    • Valid AWS Account

    • Valid Databricks Account (free Databricks Account is not sufficient)

  • Experience as Data Engineer especially using Apache Spark

  • Knowledge about some of the cloud concepts such as storage, users, roles, etc.

Associated Costs

As part of the training, you will only get the material. You need to practice using your own or a corporate cloud account and Databricks account.

  • You need to take care of the associated AWS or Azure costs.

  • You need to take care of the associated Databricks costs.

Training Approach

Here are the details related to the training approach.

  • It is self-paced, with reference material, code snippets, and videos provided through Udemy.

  • One needs to sign up for their own Databricks environment to practice all the core features of Databricks.

  • We would recommend completing 2 modules every week by spending 4 to 5 hours per week.

  • It is highly recommended to complete all the tasks so that one can get real experience of Databricks.

  • Support will be provided through Udemy Q&A.

Here is the detailed course outline.

Getting Started with Databricks on Azure

As part of this section, we will go through the details of signing up for Azure and setting up the Databricks cluster on Azure.

  • Getting Started with Databricks on Azure

  • Signup for the Azure Account

  • Login and Increase Quotas for regional vCPUs in Azure

  • Create Azure Databricks Workspace

  • Launching Azure Databricks Workspace or Cluster

  • Quick Walkthrough of Azure Databricks UI

  • Create Azure Databricks Single Node Cluster

  • Upload Data using Azure Databricks UI

  • Overview of Creating Notebook and Validating Files using Azure Databricks

  • Develop Spark Application using Azure Databricks Notebook

  • Validate Spark Jobs using Azure Databricks Notebook

  • Export and Import of Azure Databricks Notebooks

  • Terminating Azure Databricks Cluster and Deleting Configuration

  • Delete Azure Databricks Workspace by deleting Resource Group

Azure Essentials for Databricks - Azure CLI

As part of this section, we will go through the details about setting up Azure CLI to manage Azure resources using relevant commands.

  • Azure Essentials for Databricks - Azure CLI

  • Azure CLI using Azure Portal Cloud Shell

  • Getting Started with Azure CLI on Mac

  • Getting Started with Azure CLI on Windows

  • Warming up with Azure CLI - Overview

  • Create Resource Group using Azure CLI

  • Create ADLS Storage Account within Resource Group

  • Add Container as part of Storage Account

  • Overview of Uploading the data into ADLS File System or Container

  • Setup Data Set locally to upload into ADLS File System or Container

  • Upload local directory into Azure ADLS File System or Container

  • Delete Azure ADLS Storage Account using Azure CLI

  • Delete Azure Resource Group using Azure CLI

Mount ADLS on to Azure Databricks to access files from Azure Blob Storage

As part of this section, we will go through the details related to mounting Azure Data Lake Storage (ADLS) on to Azure Databricks Clusters.

  • Mount ADLS on to Azure Databricks - Introduction

  • Ensure Azure Databricks Workspace

  • Setup Databricks CLI on Mac or Windows using Python Virtual Environment

  • Configure Databricks CLI for new Azure Databricks Workspace

  • Register an Azure Active Directory Application

  • Create Databricks Secret for AD Application Client Secret

  • Create ADLS Storage Account

  • Assign IAM Role on Storage Account to Azure AD Application

  • Setup Retail DB Dataset

  • Create ADLS Container or File System and Upload Data

  • Start Databricks Cluster to mount ADLS

  • Mount ADLS Storage Account on to Azure Databricks

  • Validate ADLS Mount Point on Azure Databricks Clusters

  • Unmount the mount point from Databricks

  • Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks

Setup Local Development Environment for Databricks

As part of this section, we will go through the details related to setting up a local development environment for Databricks using tools such as PyCharm, Databricks Connect, Databricks dbutils, etc.

  • Setup Single Node Databricks Cluster

  • Install Databricks Connect

  • Configure Databricks Connect

  • Integrating PyCharm with Databricks Connect

  • Integrate Databricks Cluster with Glue Catalog

  • Setup AWS s3 Bucket and Grant Permissions

  • Mounting s3 Buckets into Databricks Clusters

  • Using Databricks dbutils from IDEs such as PyCharm

Using Databricks CLI

As part of this section, we will get an overview of Databricks CLI to interact with Databricks File System or DBFS.

  • Introduction to Databricks CLI

  • Install and Configure Databricks CLI

  • Interacting with Databricks File System using Databricks CLI

  • Getting Databricks Cluster Details using Databricks CLI

Databricks Jobs and Clusters

As part of this section, we will go through the details related to Databricks Jobs and Clusters.

  • Introduction to Databricks Jobs and Clusters

  • Creating Pools in Databricks Platform

  • Create Cluster on Azure Databricks

  • Request to Increase CPU Quota on Azure

  • Creating Job on Databricks

  • Submitting Jobs using Databricks Job Cluster

  • Create Pool in Databricks

  • Running Job using Interactive Databricks Cluster Attached to Pool

  • Running Job Using Databricks Job Cluster Attached to Pool

  • Exercise - Submit the application as a job using Databricks interactive cluster

Deploy and Run Spark Applications on Databricks

As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications.

  • Prepare PyCharm for Databricks

  • Prepare Data Sets

  • Move files to ghactivity

  • Refactor Code for Databricks

  • Validating Data using Databricks

  • Setup Data Set for Production Deployment

  • Access File Metadata using Databricks dbutils

  • Build Deployable bundle for Databricks

  • Running Jobs using Databricks Web UI

  • Get Job and Run Details using Databricks CLI

  • Submitting Databricks Jobs using CLI

  • Setup and Validate Databricks Client Library

  • Resetting the Job using Databricks Jobs API

  • Run Databricks Job programmatically using Python

  • Detailed Validation of Data using Databricks Notebooks

Deploy and Run Spark Jobs using Notebooks

As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications using Databricks Notebooks.

  • Modularizing Databricks Notebooks

  • Running Job using Databricks Notebook

  • Refactor application as Databricks Notebooks

  • Run Notebook using Databricks Development Cluster

Deep Dive into Delta Lake using Spark Data Frames on Databricks

As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark Data Frames.

  • Introduction to Delta Lake using Spark Data Frames on Databricks

  • Creating Spark Data Frames for Delta Lake on Databricks

  • Writing Spark Data Frame using Delta Format on Databricks

  • Updating Existing Data using Delta Format on Databricks

  • Delete Existing Data using Delta Format on Databricks

  • Merge or Upsert Data using Delta Format on Databricks

  • Deleting using Merge in Delta Lake on Databricks

  • Point in Snapshot Recovery using Delta Logs on Databricks

  • Deleting unnecessary Delta Files using Vacuum on Databricks

  • Compaction of Delta Lake Files on Databricks
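
The merge (upsert) and delete-using-merge topics listed above can be pictured with a small plain-Python model. This is only a conceptual sketch, not the Delta Lake or Spark API: the `merge` function, the `is_deleted` flag, and the sample order rows are hypothetical names chosen for illustration. It assumes rows are matched on an `id` key.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def merge(target: Dict[object, Row], updates: Iterable[Row],
          key: str = "id", delete_flag: str = "is_deleted") -> Dict[object, Row]:
    """Return a new table after applying the updates with merge semantics."""
    result = dict(target)  # copy so the original "table" is left untouched
    for row in updates:
        k = row[key]
        if k in result:
            if row.get(delete_flag):
                # matched and flagged for deletion: delete the target row
                del result[k]
            else:
                # matched: update the target row with the incoming values
                result[k] = {**result[k], **row}
        elif not row.get(delete_flag):
            # not matched: insert the incoming row
            result[k] = dict(row)
    return result


# Hypothetical sample data
orders = {
    1: {"id": 1, "status": "NEW"},
    2: {"id": 2, "status": "NEW"},
}
changes = [
    {"id": 1, "status": "COMPLETE"},  # updates an existing row
    {"id": 2, "is_deleted": True},    # deletes an existing row
    {"id": 3, "status": "NEW"},       # inserts a new row
]
merged = merge(orders, changes)
```

A single pass over the change set handles inserts, updates, and deletes together, which is the main reason a merge is convenient compared with issuing the three operations separately.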

Deep Dive into Delta Lake using Spark SQL on Databricks

As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark SQL.

  • Introduction to Delta Lake using Spark SQL on Databricks

  • Create Delta Lake Table using Spark SQL on Databricks

  • Insert Data to Delta Lake Table using Spark SQL on Databricks

  • Update Data in Delta Lake Table using Spark SQL on Databricks

  • Delete Data from Delta Lake Table using Spark SQL on Databricks

  • Merge or Upsert Data into Delta Lake Table using Spark SQL on Databricks

  • Using Merge Function over Delta Lake Table using Spark SQL on Databricks

  • Point in Snapshot Recovery using Delta Lake Table using Spark SQL on Databricks

  • Vacuuming Delta Lake Tables using Spark SQL on Databricks

  • Compaction of Delta Lake Tables using Spark SQL on Databricks
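
The snapshot recovery and vacuum topics above rest on the idea of a versioned log of file changes. The following is a simplified plain-Python sketch of that idea, not Delta Lake's actual log format or API: the `files_at_version` function, the commit layout, and the file names are hypothetical. It assumes each commit records the data files it adds and the ones it logically removes.

```python
from typing import Dict, List, Set

# Each commit lists the data files it adds and the ones it logically removes.
Commit = Dict[str, List[str]]


def files_at_version(log: List[Commit], version: int) -> Set[str]:
    """Replay commits 0..version and return the data files visible at that version."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    live: Set[str] = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


# Hypothetical log with three versions
log = [
    {"add": ["part-0.parquet"]},                                # v0: initial write
    {"add": ["part-1.parquet"]},                                # v1: append
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # v2: update rewrites a file
]
current = files_at_version(log, 2)
as_of_v0 = files_at_version(log, 0)
```

In this model, an older version stays readable only while the files it references still exist. Physically deleting `part-0.parquet`, which is roughly what a vacuum does for files no longer referenced by the current version, would make version 0 unrecoverable.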

Accessing Databricks Cluster Terminal via Web as well as SSH

As part of this section, we will see how to access the terminal of a Databricks Cluster via Web as well as SSH.

  • Enable Web Terminal in Databricks Admin Console

  • Launch Web Terminal for Databricks Cluster

  • Setup SSH for the Databricks Cluster Driver Node

  • Validate SSH Connectivity to the Databricks Driver Node on AWS

  • Limitations of SSH and comparison with Web Terminal related to Databricks Clusters

Installing Software on Databricks Clusters using init scripts

As part of this section, we will see how to bootstrap Databricks clusters by installing relevant third-party libraries for our applications.

  • Setup gen_logs on Databricks Cluster

  • Overview of Init Scripts for Databricks Clusters

  • Create Script to install software from git on Databricks Cluster

  • Copy init script to dbfs location

  • Create Databricks Standalone Cluster with init script

Quick Recap of Spark Structured Streaming

As part of this section, we will get a quick recap of Spark Structured Streaming.

  • Validate Netcat on Databricks Driver Node

  • Push log messages to Netcat Webserver on Databricks Driver Node

  • Reading Web Server logs using Spark Structured Streaming

  • Writing Streaming Data to Files

Incremental Loads using Spark Structured Streaming on Databricks

As part of this section, we will understand how to perform incremental loads using Spark Structured Streaming on Databricks.

  • Overview of Spark Structured Streaming

  • Steps for Incremental Data Processing on Databricks

  • Configure Databricks Cluster with Instance Profile

  • Upload GHArchive Files to AWS s3 using Databricks Notebooks

  • Read JSON Data using Spark Structured Streaming on Databricks

  • Write using Delta file format using Trigger Once on Databricks

  • Analyze GHArchive Data in Delta files using Spark on Databricks

  • Add New GHActivity JSON files on Databricks

  • Load Data Incrementally to Target Table on Databricks

  • Validate Incremental Load on Databricks

  • Internals of Spark Structured Streaming File Processing on Databricks

Incremental Loads using Auto Loader cloudFiles on Databricks

As part of this section, we will see how to perform incremental loads using Auto Loader cloudFiles on Databricks Clusters.

  • Overview of Auto Loader cloudFiles on Databricks

  • Upload GHArchive Files to s3 on Databricks

  • Write Data using Auto Loader cloudFiles on Databricks

  • Add New GHActivity JSON files on Databricks

  • Load Data Incrementally to Target Table on Databricks

  • Add New GHActivity JSON files on Databricks

  • Overview of Handling S3 Events using AWS Services on Databricks

  • Configure IAM Role for cloudFiles file notifications on Databricks

  • Incremental Load using cloudFiles File Notifications on Databricks

  • Review AWS Services for cloudFiles Event Notifications on Databricks

  • Review Metadata Generated for cloudFiles Checkpointing on Databricks
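
The idea behind incremental loads with directory listing can be illustrated with a short plain-Python sketch. This is a conceptual model only, not the cloudFiles API: the `discover_new_files` function, the in-memory `checkpoint` set, and the file names are hypothetical, and a real pipeline persists its checkpoint rather than keeping it in memory.

```python
import tempfile
from pathlib import Path
from typing import List, Set


def discover_new_files(source_dir: Path, processed: Set[str]) -> List[Path]:
    """List the source directory and return files not yet recorded as processed."""
    new_files = sorted(
        p for p in source_dir.glob("*.json") if p.name not in processed
    )
    # Record the newly found files so the next run skips them
    processed.update(p.name for p in new_files)
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)           # stands in for the landing folder in cloud storage
    checkpoint: Set[str] = set()  # stands in for persisted checkpoint metadata

    (landing / "2021-01-13-0.json").write_text("{}")
    first_batch = [p.name for p in discover_new_files(landing, checkpoint)]

    (landing / "2021-01-13-1.json").write_text("{}")
    second_batch = [p.name for p in discover_new_files(landing, checkpoint)]

    # Nothing new has arrived, so the third run picks up no files
    third_batch = [p.name for p in discover_new_files(landing, checkpoint)]
```

Each run lists the whole folder, which gets more expensive as the folder grows. File notification mode avoids the repeated listing by learning about new files from cloud storage events instead.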

Overview of Databricks SQL Clusters

As part of this section, we will get an overview of Databricks SQL Clusters.

  • Overview of Databricks SQL Platform - Introduction

  • Run First Query using SQL Editor of Databricks SQL

  • Overview of Dashboards using Databricks SQL

  • Overview of Databricks SQL Data Explorer to review Metastore Databases and Tables

  • Use Databricks SQL Editor to develop scripts or queries

  • Review Metadata of Tables using Databricks SQL Platform

  • Overview of loading data into retail_db tables

  • Configure Databricks CLI to push data into the Databricks Platform

  • Copy JSON Data into DBFS using Databricks CLI

  • Analyze JSON Data using Spark APIs

  • Analyze Delta Table Schemas using Spark APIs

  • Load Data from Spark Data Frames into Delta Tables

  • Run Adhoc Queries using Databricks SQL Editor to validate data

  • Overview of External Tables using Databricks SQL

  • Using COPY Command to Copy Data into Delta Tables

  • Manage Databricks SQL Endpoints

What You Will Learn!

  • Data Engineering leveraging Databricks features
  • Databricks CLI to manage files, Data Engineering jobs and clusters for Data Engineering Pipelines
  • Deploying Data Engineering applications developed using PySpark on job clusters
  • Deploying Data Engineering applications developed using PySpark using Notebooks on job clusters
  • Perform CRUD Operations leveraging Delta Lake using Spark SQL for Data Engineering Applications or Pipelines
  • Perform CRUD Operations leveraging Delta Lake using PySpark for Data Engineering Applications or Pipelines
  • Setting up development environment to develop Data Engineering applications using Databricks
  • Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
  • Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
  • Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
  • Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
  • Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.

Who Should Attend!

  • Beginner or Intermediate Data Engineers who want to learn Databricks for Data Engineering
  • Intermediate Application Engineers who want to explore Data Engineering using Databricks
  • Data and Analytics Engineers who want to learn Data Engineering using Databricks
  • Testers who want to learn Databricks to test Data Engineering applications built using Databricks