Azure Databricks

Basics

Introduction

Azure Databricks, a cloud-based analytics platform built on Apache Spark, offers a unified solution for big data processing, ETL workflows, machine learning, and analytics. The platform ingests and processes data from sources such as Azure Storage, Azure Synapse, DB2, Oracle, SQL Server, Kafka, and Event Hubs, using tools like Azure Data Factory, Event Hubs, and REST APIs. Data is transformed in Databricks Notebooks using PySpark, Scala, SQL, or R, with Delta Lake ensuring reliability. Processed data can then be used for reporting in Power BI or Synapse, or for developing machine learning models with Azure ML.

Architecture

Azure Databricks operates on a two-plane architecture—Control Plane and Data Plane—to separate management operations from data processing. This design enhances security, scalability, and operational flexibility.

Control Plane

Managed by Azure Databricks, the Control Plane handles orchestration, configuration, and user management. It does not access your data, ensuring a secure separation of responsibilities.

Key Components:

  • Databricks Workspace: A collaborative environment for notebooks, dashboards, jobs, and code.
  • Cluster Manager: Provisions, scales, and monitors Spark clusters.
  • Job Scheduler: Manages scheduled and triggered job executions.
  • Unity Catalog: Provides centralized data governance, access control, and lineage tracking across workspaces.

Data Plane

The Data Plane runs within your Azure subscription; this is where your data resides and where Spark jobs execute, ensuring that sensitive data never leaves your cloud environment.

Key Components:

  • Clusters: Spark-based compute environments that execute data engineering, machine learning, and analytics workloads.
  • Databricks Runtime: A performance-optimized Spark runtime with built-in libraries for ETL, ML, and streaming.
  • Storage: Integrates with Azure Data Lake Storage (ADLS), Blob Storage, and SQL Database.
  • Networking: Ensures secure data flow using features like VNet Injection, Private Link, and secure cluster connectivity.

Advantages & Disadvantages

Advantages:

  • Optimized Apache Spark processing for performance and cost-efficiency.
  • Auto-scaling clusters to handle workloads dynamically.
  • Delta Lake ensures transaction integrity and data versioning.
  • Seamless integration with Azure services.
  • Collaborative tools like shared notebooks for team-based analytics.

Limitations:

  • Cluster Management: Complex for beginners; requires expertise for optimization.
  • Costs: Advanced features and high-performance clusters can increase expenses.

Key Components

Security Features

  • Azure Active Directory (AAD): Ensures role-based access control.
  • Data Encryption: Handled via Azure Key Vault.
  • Private Link and VNet Injection: Secure data traffic.
  • Unity Catalog: Provides centralized data governance.

Cluster Types

  • All-Purpose Clusters: Suitable for interactive development.
  • Job Clusters: Temporary clusters for scheduled tasks.
  • High-Concurrency Clusters: Ideal for shared workloads.

Deployment Models

  • Standard: Basic setup.
  • Premium: Includes advanced security features.
  • Enterprise: Best suited for large-scale and secure workloads.

Workflows

Workflows enable the creation of jobs and tasks, which can include notebooks, Python scripts, SQL queries, Delta Live Tables, and more. They support hierarchical structures such as task dependencies, parallel execution, and sequential execution. Jobs can be scheduled using three types of triggers: time-based scheduling triggers for regular intervals, file arrival triggers to start jobs when files are detected, and continuous triggers for ongoing, real-time processing. Additionally, workflows offer notifications (via email, Teams, or Slack) to track job statuses, making them similar to tools like Airflow and ADF.
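
Below is a minimal sketch of defining such a workflow through the Jobs REST API (Jobs API 2.1): two notebook tasks with a dependency, a daily cron schedule, and a failure notification. The workspace URL, secret scope, notebook paths, cluster ID, and email address are illustrative assumptions.

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # workspace URL (assumed)
token = dbutils.secrets.get(scope="my_databricks_scope", key="databricks_pat")  # personal access token stored in a secret scope (assumed)

job_spec = {
    "name": "daily_etl",
    "schedule": {  # time-based trigger: run daily at 02:00 UTC
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "email_notifications": {"on_failure": ["data-team@example.com"]},
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/IngestNotebook"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # sequential execution: runs only after "ingest" succeeds
            "notebook_task": {"notebook_path": "/Repos/etl/TransformNotebook"},
            "existing_cluster_id": "1234-567890-abcde123",
        },
    ],
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(response.json())  # returns the new job_id on success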

Comparison: Azure Databricks vs. Azure Data Factory

Azure Databricks (ADB), built on Apache Spark, offers advanced capabilities for distributed computing, real-time analytics, machine learning, and complex transformations, and supports multiple languages such as Python, Scala, SQL, and R. With efficient job cluster utilization, it reduces execution time and cost, making it ideal for large-scale ETL and ML workloads. Azure Data Factory (ADF), by contrast, is primarily an orchestration and data-movement service with a low-code, visual interface: it excels at scheduling pipelines and copying data between sources, and it commonly delegates heavy transformations to Databricks notebooks through its Databricks activities.

DBFS (Databricks File System)

DBFS (Databricks File System) is a distributed file system integrated with Databricks, designed for efficient data storage and management within Databricks environments. It acts as an abstraction layer on top of cloud storage solutions such as Amazon S3 and Azure Data Lake Storage, offering essential features for data handling and access control.

Key Features of DBFS:

  1. File Management: Users can perform file and directory operations like upload, download, delete, and manipulation using DBFS commands or APIs.
  2. Data Persistence: DBFS ensures durable storage of data, maintaining availability and accessibility even if clusters are terminated or restarted.
  3. Access Control: Administrators can set and manage access permissions to files and directories stored in DBFS, enforcing data security and compliance with governance policies.
  4. Integration: DBFS seamlessly integrates with various Databricks components such as notebooks, jobs, and clusters, facilitating unified data access and processing workflows.

Usage and Path Convention:

DBFS paths are prefixed with "dbfs:" to distinguish them from paths to local file systems or cloud storage. This convention ensures consistent and clear identification of DBFS resources across different contexts within Databricks.
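
A short sketch of the convention, using the built-in /databricks-datasets sample files that ship with most workspaces; the /dbfs local path works on clusters where the DBFS FUSE mount is available:

# Spark APIs address DBFS with the dbfs: scheme
df = spark.read.text("dbfs:/databricks-datasets/README.md")
df.show(3, truncate=False)

# The same files are exposed to driver-local (plain Python) code under the /dbfs mount
with open("/dbfs/databricks-datasets/README.md") as f:
    print(f.readline())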

%fs Commands:

The %fs command is a Databricks-specific magic command used in notebooks to interact with the Databricks File System (DBFS). It allows users to perform various file system operations directly within the notebook.

  • %fs cp:
    Copies files between locations, for example from DBFS to the driver's local file system (file:/).
    Syntax: %fs cp source_path destination_path
    Example:
    %fs cp dbfs:/path/in/dbfs/file.csv file:/tmp/file.csv
  • %fs ls:
    Lists files and directories in DBFS.
    Syntax: %fs ls dbfs_path
  • %fs mv:
    Moves files within DBFS.
    Syntax: %fs mv source_path destination_path
  • %fs rm:
    Deletes files and directories in DBFS.
    Syntax: %fs rm dbfs_path
  • %fs mkdirs:
    Creates directories in DBFS.
    Syntax: %fs mkdirs dbfs_path
  • %fs head:
    Displays the beginning of a file in DBFS, up to a maximum number of bytes (65,536 by default).
    Syntax: %fs head dbfs_file_path
  • %fs mounts:
    Lists all mount points currently defined in the workspace.
    Syntax: %fs mounts
  • %fs mount:
    Mounts an external storage system to DBFS. Because mounts usually need credentials passed through extra_configs, dbutils.fs.mount() is normally used instead (see the Mount Point section below).
    Syntax: %fs mount source_path dbfs_mount_point
  • %fs unmount:
    Unmounts a storage system from DBFS.
    Syntax: %fs unmount dbfs_mount_point
  • %fs help:
    Displays help information about %fs commands.
    Syntax: %fs help command_name

dbutils (Databricks Utilities)

dbutils provides a set of methods and APIs that enable users to perform common tasks such as file management, secret handling, widget-based parameterization, notebook chaining, and library management within Databricks notebooks and jobs. It simplifies interactions with the Databricks environment and enhances productivity for data engineers, data scientists, and analysts.

Features:

  • File Management: dbutils.fs lets users list, copy, move, delete, and read files in DBFS and in mounted cloud storage.
  • Secrets: dbutils.secrets retrieves credentials such as connection strings and API keys from secret scopes without exposing them in code.
  • Widgets: dbutils.widgets creates input parameters so notebooks can be parameterized and reused.
  • Notebook Workflows: dbutils.notebook runs other notebooks, passes parameters to them, and captures their results, enabling modular pipelines.

Users can access dbutils methods within Databricks notebooks and jobs by prefixing the calls with dbutils (for example, dbutils.fs.ls). The available modules cover file operations, widgets, secrets, notebook workflows, and library management.
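
To explore what is available interactively, every dbutils module exposes a help method:

# List the available dbutils modules (fs, notebook, widgets, secrets, library, jobs, ...)
dbutils.help()

# Drill into a specific module, or a single command within it
dbutils.fs.help()
dbutils.fs.help("cp")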

Common dbutils.fs Commands:

  • dbutils.fs.ls:
    Lists files in a specified directory.
    files = dbutils.fs.ls("/mnt/data")
  • dbutils.fs.mkdirs:
    Creates directories in the specified path.
    dbutils.fs.mkdirs("/mnt/data/new_directory")
  • dbutils.fs.cp:
    Copies files from source to destination.
    dbutils.fs.cp("dbfs:/mnt/data/file.txt", "dbfs:/mnt/data/backup/file_backup.txt")
  • dbutils.fs.mv:
    Moves files from source to destination.
    dbutils.fs.mv("dbfs:/mnt/data/file.txt", "dbfs:/mnt/data/archive/file.txt")
  • dbutils.fs.rm:
    Removes files (or directories, when the second argument recurse is True) from the specified path.
    dbutils.fs.rm("dbfs:/mnt/data/obsolete_file.txt")
  • dbutils.fs.head:
    Returns up to the first 65,536 bytes of a file as a string (the limit can be changed with the maxBytes argument).
    content = dbutils.fs.head("dbfs:/mnt/data/sample_file.txt")
  • dbutils.fs.help:
    Provides help and documentation on file system commands.
    dbutils.fs.help()

Note: dbutils is a powerful tool for interacting with various components of the Databricks environment, enabling users to perform a wide range of tasks efficiently within notebooks and jobs.
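
A minimal sketch tying several of these calls together; the dbfs:/tmp/demo path and file contents are illustrative assumptions:

# Write a small text file to DBFS (the final argument True overwrites an existing file)
dbutils.fs.put("dbfs:/tmp/demo/example.txt", "id,name\n1,alpha\n2,beta", True)

# List the directory and show the file's contents
display(dbutils.fs.ls("dbfs:/tmp/demo"))
print(dbutils.fs.head("dbfs:/tmp/demo/example.txt"))

# Clean up (True removes the directory and its contents recursively)
dbutils.fs.rm("dbfs:/tmp/demo", True)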

Library Management:

In Databricks, dbutils.library historically provided commands for managing notebook-scoped libraries; on recent runtimes, %pip is the recommended way to manage Python libraries, while cluster-level libraries are managed through the workspace UI or the Libraries REST API. Libraries are external packages or modules that we can install and use in our Databricks environment.

Install a Library:

  • Install a Python library from PyPI for the current notebook session (notebook-scoped):
    %pip install pandas
  • On older runtimes, the legacy utility can be used instead:
    dbutils.library.installPyPI("pandas")
  • Scala/Java libraries (for example the Maven coordinates com.databricks:spark-csv_2.11:1.5.0) are installed as cluster libraries through the cluster's Libraries UI or the Libraries REST API rather than through dbutils.
  • Restart the Python process to apply library changes:
    dbutils.library.restartPython()

List Installed Libraries:

%pip list

On older runtimes, dbutils.library.list() returns the notebook-scoped libraries installed in the current session.

Uninstall a Library:

%pip uninstall -y pandas

Attach a Library to a Cluster:

Cluster-level libraries are attached through the cluster's Libraries tab in the workspace UI or with the Libraries REST API:

POST /api/2.0/libraries/install
{"cluster_id": "123456789012345", "libraries": [{"pypi": {"package": "pandas"}}]}

Detach a Library from a Cluster:

Libraries are detached with the same API; the library is removed the next time the cluster restarts:

POST /api/2.0/libraries/uninstall
{"cluster_id": "123456789012345", "libraries": [{"pypi": {"package": "pandas"}}]}

List Attached Libraries on a Cluster:

GET /api/2.0/libraries/cluster-status?cluster_id=123456789012345

This returns the libraries installed on the cluster together with their installation status.

Note: These approaches cover installing, uninstalling, attaching, detaching, and listing libraries at notebook and cluster scope, and ensure that required dependencies are available for data processing and analysis.

Notebook Utilities:

dbutils.notebook.run:
Users can execute another notebook within the current notebook context using this method. The specified notebook is executed, and any returned result is captured for further processing.

result = dbutils.notebook.run("MyOtherNotebook", 600)  # second argument is the timeout in seconds
print(result)

dbutils.notebook.exit:
Users can stop the execution of the current notebook programmatically using this method. This is useful for terminating execution when certain conditions are met or when the notebook completes its tasks.

dbutils.notebook.exit("Notebook execution completed.")

dbutils.notebook.entry_point:
This exposes the low-level execution context of the current notebook. It is commonly used to read context information, such as the current notebook's path, which helps when tracking the flow of notebook executions and managing dependencies between notebooks.

context = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
print("Current notebook path:", context.notebookPath().get())

dbutils.notebook.help:
Users can access help and documentation for notebook utilities using this method. It provides information about available notebook utility methods and their usage.
Usage: To get help and documentation on notebook utilities:

dbutils.notebook.help()

Note: These notebook utilities provide additional functionality for managing notebook execution flow, inter-notebook communication, and termination of notebook execution as needed.
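
A minimal sketch of how run and exit work together across two notebooks; the notebook name "ChildNotebook" and the returned payload are illustrative assumptions:

# --- In the child notebook: finish by returning a JSON string to the caller ---
import json
dbutils.notebook.exit(json.dumps({"status": "succeeded", "rows_processed": 1250}))

# --- In the parent notebook: run the child and parse whatever it returned ---
import json
result = json.loads(dbutils.notebook.run("ChildNotebook", 600))
print(result["status"], result["rows_processed"])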

Widgets:

In Databricks notebooks, widgets are interactive elements that allow users to input values or make selections dynamically. They are useful for creating dynamic and interactive data visualizations or for parameterizing our notebooks.

Combobox Widget:
Usage: Allows users to select a value from a list of options or input a new value.

dbutils.widgets.combobox("paramcombo", "apple", ["apple", "banana", "orange"], "Fruits combobox")

Dropdown Widget:
Usage: Allows users to select a single value from a list of options.

dbutils.widgets.dropdown("paramdown", "apple", ["apple", "banana", "orange"], "Fruits dropdown")

Multiselect Widget:
Usage: Allows users to select multiple values from a list of options.

dbutils.widgets.multiselect("parammultiselect", "apple", ["apple", "banana", "orange"], "Fruits multiselect")

Text Widget:
Usage: Allows users to input text.

dbutils.widgets.text("paramtext", "apple", "Fruits text")

List all widgets:

On recent runtimes (Databricks Runtime 13.3 LTS and above), dbutils.widgets.getAll() returns the current widget names and values as a dictionary:

all_widgets = dbutils.widgets.getAll()
print(all_widgets)

Remove Widgets:

dbutils.widgets.remove('widget_name')
dbutils.widgets.removeAll()

Get Widget Value:
To retrieve the value of a widget in Databricks:

param_value = dbutils.widgets.get("input_param")

Example: Using Parameters in PySpark Query:

# Create a widget to accept input (state)
dbutils.widgets.text("state", "CA", "State")

# Retrieve the value of the widget
state = dbutils.widgets.get("state")

# Use the parameter in a PySpark SQL query
df = spark.sql(f"SELECT * FROM customers WHERE state = '{state}'")

Passing Parameters to Notebooks:
In the Parent Notebook, you can pass parameters to the Child Notebook:

# Pass parameters to the Child Notebook (path, timeout in seconds, parameter map)
dbutils.notebook.run("ChildNotebook", 60, {"param1": "Value1", "param2": "Value2"})

Child Notebook (Receiving Parameters):
Inside the Child Notebook, you define widgets to receive the parameters:

# Create widgets to receive parameters
dbutils.widgets.text("param1", "")
dbutils.widgets.text("param2", "")

# Retrieve the values passed from the parent notebook
param1_value = dbutils.widgets.get("param1")
param2_value = dbutils.widgets.get("param2")

# Print the received values
print(f"Received: {param1_value}, {param2_value}")

Passing Parameters to Databricks Jobs:
When scheduling a Databricks Job, you can pass parameters to the notebook task through base_parameters:

{
  "notebook_task": {
    "notebook_path": "/Repos/MyNotebook",
    "base_parameters": {
      "param1": "Value1",
      "param2": "Value2"
    }
  }
}

getArgument for Retrieving Parameters in Jobs:
getArgument is a legacy helper available directly inside notebooks for reading parameters passed when the notebook runs as a job task or is triggered by dbutils.notebook.run. Because job base_parameters are surfaced as widget values, dbutils.widgets.get() is the recommended way to read them; getArgument remains available for backward compatibility and accepts an optional default value.

# Inside the notebook (job or task context)
param1 = getArgument("param1", "default1")
param2 = getArgument("param2", "default2")

# Print the received parameters
print(f"Received param1: {param1}, param2: {param2}")

Creating Widgets using SQL:

CREATE WIDGET MULTISELECT multiSelectWidget DEFAULT 'option1' CHOICES SELECT option FROM widget_choices
CREATE WIDGET DROPDOWN dropdownWidget DEFAULT 'option1' CHOICES SELECT option FROM widget_choices
CREATE WIDGET COMBOBOX comboboxWidget DEFAULT 'option1' CHOICES SELECT option FROM widget_choices
CREATE WIDGET TEXT textWidget DEFAULT 'option1'

Secret Scope:

What is a Databricks secret scope in Azure Databricks?
Databricks secret scopes in Azure Databricks securely store and manage sensitive information like database connection strings and API keys. Integrated into the workspace, they provide unified management with access control, versioning, auditing, and encryption, ensuring security and compliance. They can be created via the CLI, REST API, or workspace UI.

Here are two types of Secret Scopes in Databricks:

Azure Key Vault backed Secret Scope

This type of secret scope is backed by Azure Key Vault. It allows you to store secrets in Azure Key Vault and then reference them in Databricks notebooks, jobs, and clusters.

Create an Azure Key Vault-backed Secret Scope:
Azure Key Vault-backed scopes cannot be created or written with dbutils. The scope is created through the workspace UI (https://<databricks-instance>#secrets/createScope), the Databricks CLI, or the REST API, supplying the Key Vault's DNS name and resource ID. The secrets themselves are created and rotated in Azure Key Vault (for example, in the Azure portal or with the Azure CLI); Databricks only reads them.

Retrieve a Secret from an Azure Key Vault-backed Secret Scope:

# Retrieve the secret from the Azure Key Vault-backed secret scope
secret_value = dbutils.secrets.get("my_azure_keyvault_scope", "my_secret_key")
print(secret_value)  # secret values are shown as [REDACTED] in notebook output

Databricks Secret Scope

This type of secret scope is backed by Databricks. It allows you to store secrets directly within Databricks, and the secrets are encrypted using Databricks' own encryption mechanisms.

Create a Databricks Secret Scope:
Databricks-backed scopes and their secrets are created with the Databricks CLI or the Secrets REST API rather than with dbutils. For example, with the legacy CLI:

databricks secrets create-scope --scope my_databricks_scope
databricks secrets put --scope my_databricks_scope --key my_secret_key

(the put command opens an editor for the secret value, or the value can be supplied with --string-value)

Retrieve a Secret from a Databricks Secret Scope:

# Retrieve the secret from the Databricks secret scope
secret_value = dbutils.secrets.get("my_databricks_scope", "my_secret_key")
print(secret_value)  # shown as [REDACTED] in output

# The scope and key can also be passed as keyword arguments
secret_value = dbutils.secrets.get(scope="my_databricks_scope", key="my_secret_key")
print(secret_value)

List Secrets in a Scope:

secrets_list = dbutils.secrets.list(scope="my_databricks_scope")
print(secrets_list)

Remove a Secret:

Secrets are deleted with the Databricks CLI or the Secrets REST API rather than with dbutils, for example:

databricks secrets delete --scope my_databricks_scope --key my_secret_key

List Available Scopes:

scopes_list = dbutils.secrets.listScopes()
print(scopes_list)

Scopes:
Secrets are organized into scopes, which provide a way to manage and organize related secrets.

Security:
Secrets are stored securely and are encrypted at rest.

Usage in Notebooks:
Secrets can be easily used within Databricks notebooks and scripts.
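
A minimal sketch of using a secret inside a notebook without exposing its value; the scope and key reuse the examples above, while the storage account and container names are illustrative assumptions:

# Read a storage account key from a secret scope and use it in the Spark configuration
storage_key = dbutils.secrets.get(scope="my_databricks_scope", key="my_secret_key")
spark.conf.set("fs.azure.account.key.ourstorageaccount.dfs.core.windows.net", storage_key)

# The secret can now be used to read data without ever appearing in notebook output
df = spark.read.csv("abfss://ourcontainer@ourstorageaccount.dfs.core.windows.net/raw/", header=True)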

Mount Point

In Databricks, a mount point is a way to attach external storage to the Databricks File System (DBFS), making it accessible within our Databricks workspace. This enables us to interact with external data sources as if they were part of the DBFS.

Mounting Azure Blob Storage

Replace these placeholders with our actual values:

  • storage_account_name = "<ourstorageaccountname>"
  • container_name = "<ourcontainername>"
  • sas_token = "<oursastoken>"

Mount Azure Blob Storage as a DBFS mount point

dbutils.fs.mount(source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/", mount_point="/mnt/mymountpoint", extra_configs={"fs.azure.sas...blob.core.windows.net": sas_token})

Mounting an Azure Data Lake Storage Gen2

Replace the placeholders with our actual values:

  • storage_account_name = "our_storage_account_name"
  • container_name = "our_container_name"
  • directory_key = "our_directory_key"
  • mount_point = "/mnt/adlsgen2"

Define the storage account key (this example assumes the key has already been set in the cluster's Spark configuration; it could also be read from a secret scope):

storage_account_key = spark.conf.get("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net")

Mount ADLS Gen2:

dbutils.fs.mount(source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/", mount_point=mount_point, extra_configs={f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net": storage_account_key})

List the contents of the mount point:

display(dbutils.fs.ls(mount_point))
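
Once mounted, the storage is read like any other DBFS path; the sales/2024 folder below is an illustrative assumption:

# Read Parquet data directly from the mount point
df = spark.read.parquet(f"{mount_point}/sales/2024/")
display(df)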

Connecting to a SQL Server database

Relational databases such as SQL Server cannot be mounted to DBFS; they are accessed through the Spark JDBC data source instead. Set the necessary variables:

  • jdbcUsername = "our_username"
  • jdbcPassword = "our_password"
  • jdbcHostname = "our_sql_server_host"
  • jdbcPort = "our_sql_server_port"
  • jdbcDatabase = "our_database_name"

Build the JDBC URL:

jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase};user={jdbcUsername};password={jdbcPassword};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
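
A minimal sketch of reading a table over this connection; the table name dbo.customers is an illustrative assumption:

# Read a SQL Server table into a DataFrame using the Spark JDBC data source
df = (spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "dbo.customers")
      .load())
display(df)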

List the existing mount points:

%fs mounts

Unmount:

%fs unmount /mnt/mymountpoint

Update Mount

In Databricks, if we want to update the configuration of an existing mount point, we can unmount it and then remount it with the updated configuration, or use dbutils.fs.updateMount(), which takes the same arguments as dbutils.fs.mount() and updates the mount in place.
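
A minimal sketch of both approaches, reusing the ADLS Gen2 variables defined above:

# Option 1: unmount and remount with the new configuration
dbutils.fs.unmount(mount_point)
dbutils.fs.mount(
  source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point=mount_point,
  extra_configs={f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net": storage_account_key}
)

# Option 2: update the existing mount in place
dbutils.fs.updateMount(
  source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
  mount_point=mount_point,
  extra_configs={f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net": storage_account_key}
)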

Catalog Functions

The Catalog provides a set of functions and methods to interact with metadata in the SparkSession, including managing databases, tables, and views.

Create Database:

Syntax: spark.sql("CREATE DATABASE IF NOT EXISTS new_database")
Description: The Catalog API does not create databases directly; databases (schemas) are created with SQL. IF NOT EXISTS suppresses the error if the database already exists, and optional LOCATION and COMMENT clauses set a path and description.

List Databases:

Syntax:

databases = spark.catalog.listDatabases()
for db in databases:
    print(db.name)

Description: Returns a list of Database objects (name, description, location URI) available in the SparkSession.

Drop Database:

Syntax: spark.sql("DROP DATABASE IF EXISTS old_database CASCADE")
Description: Drops the database with the specified name. IF EXISTS suppresses the error if the database does not exist; CASCADE drops all tables in the database as well.

Create Table:

Syntax: spark.catalog.createTable("new_table", path="dbfs:/mnt/data/new_table", source="parquet")
Description: Creates a table with the specified name, optionally over data at the given path, using the given source format (e.g., "parquet"), schema, and additional options; when a path is supplied it registers an external table over the existing data.

List Tables:

Syntax:

tables = spark.catalog.listTables()
for table in tables:
    print(table.name, table.tableType)

Description: Returns a list of Table objects in the specified database, or in the current database if no database is given.

Drop Table:

Syntax: spark.sql("DROP TABLE IF EXISTS old_table")
Description: Drops the table with the specified name. IF EXISTS suppresses the error if the table does not exist.

List Functions:

Syntax:

functions = spark.catalog.listFunctions()
for fn in functions:
    print(fn.name)

Description: Returns a list of Function objects visible in the SparkSession, including built-in functions and any registered user-defined functions (UDFs).

dropTempView:

Syntax: spark.catalog.dropTempView("temp_view")
Description: Drops the temporary view with the specified name.

dropGlobalTempView:

Syntax: spark.catalog.dropGlobalTempView("global_temp_view")
Description: Drops the global temporary view with the specified name.
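
A short sketch combining a few additional Catalog calls; the table name "customers" is an illustrative assumption:

# Check the current database and whether a table exists before querying it
print(spark.catalog.currentDatabase())

if spark.catalog.tableExists("customers"):
    # Cache the table in memory and inspect its columns
    spark.catalog.cacheTable("customers")
    for col in spark.catalog.listColumns("customers"):
        print(col.name, col.dataType)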

Best Practices

Cost Optimization

  • Use cluster auto-scaling, Spot Instances, and optimize job scheduling to reduce costs.

Governance

  • Implement Unity Catalog for data governance, AAD for access control, and enable audit logging.

Large-Scale Workloads

  • Use Databricks pools, Delta Lake for ACID transactions, and secure networking with VNet Injection.