
Azure Databricks A-Z Complete Guide | Azure Dp 203 Databricks Learning Material


This article will provide readers with a comprehensive understanding of Azure Databricks, its architecture, cluster management, notebook usage, data access, secrets management, and securing access to Delta Lake, empowering them to make the most of this powerful platform for their data-related tasks.

Overview

  1. Databricks Overview
  2. Azure Databricks Architecture
  3. Azure Databricks Clusters and Their Types
  4. Azure Databricks Cluster Configurations & Policies
  5. Azure Databricks Cluster Pools
  6. Azure Databricks Notebooks
  7. Databricks Magic Commands
  8. Databricks Utilities
  9. Accessing Azure Delta Lake from Azure Databricks
  10. Securing Access to Azure Delta Lake

Databricks Overview


Azure Databricks is a powerful, fully managed analytics and machine learning service provided by Azure, designed for big data processing and advanced analytics. It combines the capabilities of Apache Spark with a collaborative environment for data science, making it a preferred choice for data engineers, data scientists, and analysts.
It also integrates seamlessly with various Azure services, including Azure Data Lake Storage, Azure SQL Data Warehouse, and more, for building end-to-end data pipelines.

Azure Databricks Architecture

Azure Databricks architecture operates out of a control plane and a data plane.


Control plane and data plane

  • The control plane includes the backend services that Azure Databricks manages in its own Azure account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.
  • The data plane is managed by your Azure account and is where your data resides; it is also where data is processed. Use Azure Databricks connectors to connect clusters to external data sources outside of your Azure account to ingest data, or for storage. You can also ingest data from external streaming sources, such as event data, streaming data, IoT data, and more.

Azure Databricks Cluster and Their Types

An Azure Databricks cluster is a collection of virtual machines (VMs) that work together to process data and run computations. A cluster contains a driver node and worker nodes for processing tasks.
Driver node: assigns tasks to the worker nodes and coordinates their execution.
Worker nodes: execute the tasks assigned by the driver node.
Azure Databricks has two types of clusters; the difference between them is summarized below (a sample cluster definition using the Clusters REST API is sketched after the list).
1. All-purpose cluster: created manually, can be shared by multiple users, and keeps running until it is terminated or auto-terminates.
2. Job cluster: created automatically to run a scheduled job and terminated as soon as the job completes.
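Clusters can be created interactively from the Compute UI or programmatically through the Databricks Clusters REST API. Below is a minimal sketch of creating an all-purpose cluster with the API; the workspace URL, personal access token, runtime version string, and VM size are placeholders you would replace with values from your own workspace.

import requests

# Placeholder workspace URL and personal access token (PAT).
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
DATABRICKS_TOKEN = "dapiXXXXXXXXXXXXXXXX"

# Minimal all-purpose cluster spec; field names follow the Clusters API.
cluster_spec = {
    "cluster_name": "demo-all-purpose",
    "spark_version": "13.3.x-scala2.12",   # assumed LTS runtime string; list the versions in your workspace
    "node_type_id": "Standard_DS3_v2",     # assumed Azure VM size
    "num_workers": 2,
    "autotermination_minutes": 30,         # shut down the cluster when idle to save cost
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id

A job cluster, by contrast, is not created directly: you attach a similar cluster spec to a job definition, and the cluster is created when the job runs and removed when it finishes.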


Azure Databricks Cluster Configurations

Azure Databricks provides a number of cluster configuration options. Let’s discuss the main ones in detail.


Policies are a set of rules used by admins to limit the configuration options available to users when they create a cluster. To configure a cluster according to a policy, select a policy from the Policy dropdown.

Access Mode: Cluster access mode is a security feature that determines who can use a cluster and what data they can access via the cluster. When you create a cluster in Azure Databricks, you must select an access mode. Azure Databricks offers several access modes:

  1. Single User: supports a single user, with Python, SQL, Scala, and R.
  2. Shared: supports multiple users and requires the Premium tier, with support for the languages above.
  3. No Isolation Shared: also supports multiple users, but does not require the Premium tier.
  4. Custom: a legacy option that follows older cluster configurations.

Databricks Runtime

Databricks Runtime is an optimized environment for running data processing and analysis workloads, particularly those involving Apache Spark.
Which Databricks Runtime version should you use?

  • For all-purpose compute, Databricks recommends using the latest Databricks Runtime version. Using the most current version will ensure you have the latest optimizations and most up-to-date compatibility between your code and preloaded packages.
  • For job clusters running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime version. Using the LTS version will ensure you don’t run into compatibility issues and can thoroughly test your workload before upgrading.
  • For advanced machine learning use cases, consider the specialized Databricks Runtime version.

Databricks Photon

Photon is a high-performance, Azure Databricks-native vectorized query engine that runs your SQL workloads and DataFrame API calls faster, reducing your total cost per workload. The following are key features and advantages of using Photon (a sketch of enabling Photon on a cluster follows the list).

  • Support for SQL and equivalent DataFrame operations with Delta and Parquet tables.
  • Accelerated queries that process data faster and include aggregations and joins.
  • Faster performance when data is accessed repeatedly from the disk cache.
  • Robust scan performance on tables with many columns and many small files.
  • Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, including wide tables that contain thousands of columns.
  • Replaces sort-merge joins with hash-joins.
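Photon is enabled per cluster: in the cluster UI it is the Photon acceleration option, and in the Clusters API it is expressed as a field on the cluster spec. A minimal sketch, reusing the placeholder values from the earlier cluster example:

# Hypothetical cluster-spec fragment; "runtime_engine" selects Photon instead of the standard engine.
photon_cluster_spec = {
    "cluster_name": "photon-demo",          # placeholder name
    "spark_version": "13.3.x-scala2.12",    # assumed runtime string
    "node_type_id": "Standard_DS3_v2",      # assumed Azure VM size
    "num_workers": 2,
    "runtime_engine": "PHOTON",
}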

Cluster Pool

A cluster pool is a set of idle, ready-to-use virtual machine instances that clusters draw on to start up and autoscale quickly, according to need.
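A pool can be created from the Compute UI or through the Instance Pools REST API, and a cluster then attaches to it by referencing the pool’s ID. A minimal sketch with placeholder names and sizes:

# Hypothetical pool definition; field names follow the Instance Pools API.
pool_spec = {
    "instance_pool_name": "demo-pool",
    "node_type_id": "Standard_DS3_v2",            # assumed Azure VM size
    "min_idle_instances": 2,                      # VMs kept warm for fast cluster starts
    "idle_instance_autotermination_minutes": 60,  # release idle VMs after an hour
}

# A cluster draws its nodes from the pool by referencing the pool ID
# (returned by POST /api/2.0/instance-pools/create).
cluster_from_pool = {
    "cluster_name": "demo-pooled-cluster",
    "spark_version": "13.3.x-scala2.12",
    "instance_pool_id": "<pool-id-returned-by-the-api>",  # placeholder
    "num_workers": 2,
}

Because the pool’s VMs are already provisioned, clusters that start from the pool skip most of the VM acquisition time.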

Databricks Policies

Policies are a set of rules used by admins to limit the configuration options available to users when they create a cluster. To configure a cluster according to a policy, select a policy from the Policy dropdown.
Policies have access control lists that regulate which users and groups have access to the policies.
If a user doesn’t have the unrestricted cluster creation entitlement, they can only create clusters using the policies granted to them. By default, all users have access to the Personal Compute policy, allowing them to create single-machine compute resources. If you don’t see the Personal Compute policy as an option when you create a cluster, then you haven’t been given access to the policy.
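A policy is defined as a JSON document that fixes, limits, or hides individual cluster attributes. The sketch below shows what such a definition might look like, written as a Python dict; the attribute names mirror cluster-spec fields and the specific values are placeholders.

# Hypothetical policy definition: pin the runtime, restrict VM sizes, and cap auto-termination.
policy_definition = {
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12", "hidden": True},
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
}

An admin pastes this JSON into the policy editor in the workspace UI (or sends it through the Cluster Policies API); users who create clusters under the policy then only see the allowed choices.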

Azure Databricks Notebook

A Databricks notebook is a Jupyter-like interface made up of runnable cells. You can run cells in several supported languages, i.e. Python, Scala, R, SQL, etc., and write your code to interact with other services.
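For example, a single Python cell can read a file and render it as an interactive table, because the SparkSession (spark) and the display function are predefined in the notebook environment. The path below points at one of the sample datasets mounted in most workspaces; adjust it if yours differs.

# Read a sample CSV with Spark and show the first rows as an interactive table.
df = (spark.read
      .option("header", "true")
      .csv("/databricks-datasets/airlines/part-00000"))  # sample path; verify it exists in your workspace
display(df.limit(10))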

Databricks Magic Command

The primary purpose of a magic command is to override the default language of the notebook: you can switch a cell from Python to Scala, or from Scala to SQL. Magic commands also help you interact with the shell and write documentation. Below is a list of magic commands with descriptions (a minimal example follows the list).

%sql: runs SQL inside the notebook cell.
%scala: runs Scala in the cell.
%python: runs Python in the cell.
%r: runs R in the cell.
%md: writes Markdown documentation.
%fs: interacts with the default filesystem (DBFS); you use ls and other Linux-style commands with %fs.
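As a minimal sketch, the cell below runs in a notebook whose default language is Python; the commented lines show how the same query would look in a dedicated %sql cell, where the magic command must be the first line of the cell.

# Default-language (Python) cell: run SQL through the SparkSession API.
# The equivalent dedicated SQL cell would be written as:
#   %sql
#   SELECT current_date() AS today
result = spark.sql("SELECT current_date() AS today")
display(result)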

Databricks Utility

Databricks utilities (dbutils) are useful when you need to combine multiple tasks into a single notebook. The utility functionality is supported in Python, Scala, and R, but not in SQL. Here are some of the utilities used in notebooks (a short sketch follows the list).
1. File system utility: used to interact with the file system.
2. Secrets utility: used to work with secrets that are stored in Azure Key Vault-backed or Databricks-backed secret scopes.
3. Widgets utility: used to parameterize a notebook at runtime.
4. Notebook workflow utility: allows you to invoke and chain one notebook from another.
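The sketch below exercises the file system, widgets, and notebook workflow utilities from a Python cell; the paths, widget name, and child notebook are placeholders (the secrets utility is shown in the Key Vault section further down).

# File system utility: list files under a DBFS path.
files = dbutils.fs.ls("/databricks-datasets")
print([f.name for f in files[:5]])

# Widgets utility: define a text parameter and read its value at run time.
dbutils.widgets.text("run_date", "2024-01-01", "Run date")
run_date = dbutils.widgets.get("run_date")

# Notebook workflow utility: run a child notebook (hypothetical relative path),
# passing parameters and waiting up to 600 seconds for it to finish.
result = dbutils.notebook.run("./child_notebook", 600, {"run_date": run_date})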

Accessing Azure Delta Lake from Azure Databricks

In Azure we can access Delta Lake tables or blob containers using various methods (a sketch of the first two methods follows the list):

  1. Access Data Lake Storage Gen2 or Blob Storage with an Azure service principal in Azure Databricks.
  2. Access Azure Data Lake using access keys.
  3. Access Azure Delta tables using a SAS token.
  4. Access Azure Delta tables using cluster-scoped authentication.
  5. Access Azure Data Lake using credential passthrough.
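As a minimal sketch of the first two methods, the session-scoped Spark configuration below authenticates to an ADLS Gen2 account either with the storage account access key or with a service principal (OAuth 2.0); use one of the two, not both. The storage account, container, secret scope, key names, and tenant ID are all placeholders, and the credentials are pulled from a secret scope rather than hard-coded.

storage_account = "mydatalake"  # placeholder storage account name

# Method 2: authenticate with the storage account access key.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="kv-scope", key="storage-account-key"))

# Method 1: authenticate with an Azure service principal via OAuth 2.0.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="kv-scope", key="sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Read a Delta table stored in the lake (container and path are placeholders).
df = spark.read.format("delta").load(
    f"abfss://mycontainer@{storage_account}.dfs.core.windows.net/path/to/delta-table")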

Securing Access to Azure Delta Lake

Securing access to Azure Delta Lake is crucial to protect your data and maintain data integrity. You can use various Azure services and best practices to secure access to your Delta Lake. Here are the steps to help you secure access effectively:

Azure Data Lake Storage Permissions:

Start by securing the underlying Azure Data Lake Storage account where your Delta Lake is stored. You can use Azure RBAC (Role-Based Access Control) to grant permissions to specific Azure AD identities or service principals.

Assign roles like “Storage Blob Data Contributor” or “Storage Blob Data Owner” to those who need read or write access to the data lake.

Azure Key Vault for Secure Access:

  1. Set up an Azure Key Vault.
  2. Configure secrets or keys in Azure Key Vault for authentication.
  3. Grant appropriate permissions to the Databricks cluster or other services to access the Key Vault.
  4. Modify your Databricks or Delta Lake access configuration to use Azure Key Vault for secure authentication (a minimal sketch follows).
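Once a Key Vault-backed secret scope exists (created beforehand in the workspace UI or with the Databricks CLI), notebooks read its secrets through the secrets utility; secret values are redacted if you try to print them. The scope and key names below are placeholders.

# List the secret scopes available in the workspace.
print(dbutils.secrets.listScopes())

# List the secret names (not values) in a Key Vault-backed scope.
print(dbutils.secrets.list("kv-scope"))

# Retrieve a secret value for use in configuration, e.g. a service principal secret.
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")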

Azure Databricks Cluster Configuration:

If you’re using Azure Databricks, configure your Databricks cluster to use the appropriate identity and permissions to access the data lake:

a. Use Managed Identity: If possible, use Azure Managed Identity (formerly known as Managed Service Identity) for your Databricks cluster. Managed Identity allows the cluster to authenticate directly with Azure services without storing credentials.

b. Service Principals: If you can’t use Managed Identity, configure your Databricks cluster to use a service principal to authenticate with the Azure Data Lake Storage account, as described in the earlier section on accessing the lake. One way to wire this up at the cluster level is sketched below.
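A sketch (not the only possible configuration): put the service principal settings in the cluster’s Spark config under Advanced options, using the {{secrets/<scope>/<key>}} reference syntax so the actual values stay in a secret scope. The storage account, scope, key names, and tenant ID below are placeholders, shown as a Python dict of the key-value pairs you would enter.

# Hypothetical cluster-level Spark config: every notebook on the cluster can then
# read from the storage account without per-session credentials.
cluster_spark_conf = {
    "fs.azure.account.auth.type.mydatalake.dfs.core.windows.net": "OAuth",
    "fs.azure.account.oauth.provider.type.mydatalake.dfs.core.windows.net":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id.mydatalake.dfs.core.windows.net":
        "{{secrets/kv-scope/sp-client-id}}",
    "fs.azure.account.oauth2.client.secret.mydatalake.dfs.core.windows.net":
        "{{secrets/kv-scope/sp-client-secret}}",
    "fs.azure.account.oauth2.client.endpoint.mydatalake.dfs.core.windows.net":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}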

Network Security:

Implement network security to control which IP addresses or virtual networks can access your Azure Data Lake Storage. You can use Azure Firewall, Azure Virtual Network service endpoints, and network security groups (NSGs) to restrict access.

Encryption:

Enforce data encryption both in transit and at rest. Azure Storage provides encryption options, including Azure Blob Storage encryption for data at rest and HTTPS for data in transit.

Audit and Monitoring:

Enable Azure Monitor and Azure Security Center to monitor your Azure resources, including the data lake. Set up diagnostic logs and auditing to track access and changes to your data.

Access Control Lists (ACLs):

Use ACLs to set fine-grained permissions on files and directories within the data lake. ACLs can help restrict access to specific data even further.

Delta Lake Transaction Log Security:

Protect the Delta Lake transaction log, which contains metadata and transaction history. Ensure that only authorized users have access to this critical component of Delta Lake.

Authentication and Authorization:

Implement strong authentication and authorization mechanisms for applications and users accessing the data lake. Azure AD and OAuth 2.0 are commonly used for identity and access management.

Data Masking and Row-Level Security:

Consider implementing data masking and row-level security policies if you need to protect sensitive data within your Delta Lake.

Regular Updates and Patch Management:

Keep all Azure services, including Azure Databricks and the data lake, up to date with the latest security patches.

Security Policies and Compliance:

Follow industry-specific compliance standards and establish security policies aligned with your organization’s requirements.

In conclusion, this journey through the intricacies of Azure Databricks, its architecture, clusters, notebooks, and the secure handling of Delta Lake has equipped us with the knowledge and tools to harness the full potential of this cutting-edge platform, all while ensuring the utmost security and compliance in our data operations. With Azure Databricks, we’re not only empowered with data analytics but also fortified with the means to safeguard our most valuable digital resources.
