Azure Data Factory
What is Azure Data Factory?
Definition
Azure Data Factory is a cloud-based data integration service offered by Microsoft. It provides a platform for creating, scheduling, and managing data workflows, referred to as pipelines. These data pipelines enable the efficient extraction of data from diverse sources, necessary transformation, and subsequent loading into various data stores.
Types
- Data Factory V1:
- No GitHub integration.
- Missing many activities and transformations.
- No triggers for scheduling.
- Data Factory V2:
- GitHub integration available.
- Triggers available for scheduling.
- Enhanced activities and transformations.
- Improved workflow, scheduling, and compute infrastructure.
Main Components in Azure Data Factory
Author & Monitor
This is the user interface where we create, schedule, and monitor data pipelines.
Data Pipelines
Data Pipelines will define the data movement and transformation activities.
Data Sets
Data Sets represent the data structure of tables or files. Datasets are the inputs and outputs of data pipelines.
Linked Services
Linked Services contain the connection information for data sources and destinations, such as databases and storage accounts.
Integration Runtime (IR)
Integration runtime is the compute infrastructure used by Azure Data Factory (ADF) to provide various data integration capabilities across different network environments. It is a key component in data integration and data processing solutions. It's responsible for connecting and moving data between different systems. There are three main types:
Self-hosted Integration Runtime
Self-hosted Integration Runtime is installed on an on-premises machine and facilitates connections and data movement between on-premises data sources and cloud services.
Azure Integration Runtime
Azure Integration Runtime, managed by Azure, is designed for cloud data integration and for connecting various Azure services.
SSIS Integration Runtime
SSIS Integration Runtime is employed for executing SQL Server Integration Services packages in Azure, facilitating ETL (Extract, Transform, Load) processes.
Activities
Copy Activity
Copy Activity in Azure Data Factory facilitates seamless data transfer between source and destination, enabling efficient data movement and transformation tasks within the data integration workflow. The "Copy Data" activity primarily focuses on data movement but also supports a few basic data transformation tasks, including:
Column Mapping
We can map columns from the source dataset to the destination dataset. This allows us to rename, reorder, or omit columns during the data transfer.
Data Type Conversion
We can perform simple data type conversions during the copy operation, like converting a string column to a date column.
Data Filtering
We can apply basic filters to the data, filtering rows based on specific conditions to get relevant data.
Other Options
- Recursively: Reads all files inside subfolders automatically.
- Enable Partition Discovery: Detects and processes partitioned data in a folder structure.
- Partition Root Path: Defines where partition detection starts in the folder structure.
- Max Concurrent Connections: Controls how many files or rows are processed at the same time for better speed.
- Skip Line Count: Skips the first few lines in a file (useful for removing headers).
- Additional Columns: Adds extra columns to the output, such as timestamps or custom values.
- Data Integration Unit (DIU): Determines the computing power used for data movement. More DIUs mean faster processing but higher cost.
- Degree of Parallelism: Defines how many parallel tasks ADF runs at once to speed up data transfer. Increasing this value improves performance but may overload the source system. The maximum degree of parallelism is 32, and a value between 10 and 20 is typically sufficient.
- Column Pattern: A dynamic way to map multiple columns based on rules instead of defining them manually. Useful for handling large datasets with changing structures.
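To make these options concrete, here is a minimal, illustrative sketch of a Copy activity definition written as a Python dict that mirrors the pipeline JSON. The activity name, datasets, and values are hypothetical, and only a subset of the options above is shown.

```python
import json

# Illustrative Copy activity definition (names and values are hypothetical).
copy_activity = {
    "name": "CopyCsvToSql",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "recursive": True,                # read files inside subfolders
                "enablePartitionDiscovery": True  # detect partitioned folder structures
            }
        },
        "sink": {"type": "AzureSqlSink"},
        "dataIntegrationUnits": 8,  # DIUs: compute power allocated to the copy
        "parallelCopies": 10        # degree of parallelism
    }
}

print(json.dumps(copy_activity, indent=2))
```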
Data Flow Activity
The "Data Flow" activity in Azure Data Factory is a crucial component for extracting, transforming, and loading data from various sources to customized destinations. Its visual interface simplifies the design of data transformation processes, enabling users to effortlessly clean, enrich, and prepare data for analytical insights or reporting. It supports a broad spectrum of data sources and transformations.
Mapping Data Flows in Azure Data Factory (ADF) do not directly support on-premises SQL Server as a source or sink. The reason is that Mapping Data Flows rely on Azure Integration Runtime (IR) for execution, which is not optimized for handling on-premises data sources.
However, you can still work with on-premises SQL Server in ADF by using a Self-hosted Integration Runtime (SHIR), but it won't be a part of the Mapping Data Flows. Instead, you would need to use Copy Activity or other pipeline activities, where the self-hosted IR acts as a bridge to move data from on-premises to cloud-based services like Azure Data Lake, SQL Database, or Synapse Analytics.
Log
- Data Flow Logs – Provide transformation-level insights, including row counts at each transformation, partitioning details, execution stages, and transformation times, but they do not explicitly show the number of files processed or DIU usage the way Copy Activity logs do.
Allow Schema Drift
Schema drift allows changes in columns from source to target dynamically, making it useful when schema changes frequently. This helps avoid pipeline failures due to unexpected schema modifications.
Availability in ADF:
- Available in Mapping Data Flow – Allows processing data with changing schemas in Source, Sink, and Select transformations.
- Not available in Copy Activity – Instead, enable "Auto Mapping" to handle schema changes dynamically.
- Not available in other ADF activities – Schema drift does not apply to Lookup, Stored Procedure, ForEach, etc.
Example: You have a data pipeline that processes a CSV file. Initially, the file contains columns ID, Name, and Age. Later, the file is updated to include a new column Email. With schema drift enabled, the pipeline automatically accommodates the new Email column without requiring modifications to the pipeline.
Validation Schema
- The Validation Schema feature in Mapping Data Flows ensures that the incoming data matches the predefined schema of the source dataset.
- If there are any schema changes (such as column additions, removals, or data type changes), the pipeline will fail during execution.
- It enforces strict schema validation to maintain data consistency and prevent unexpected issues during transformations.
Infer Drifted Column
- The Infer Drifted Column Types option in Mapping Data Flows allows Azure Data Factory to automatically detect new (drifted) columns in incoming data and assign appropriate data types based on the data values.
- By default, drifted columns are treated as string data types. Enabling this option ensures that the system infers their actual data types (e.g., integer, float, date, etc.).
Web Activity
A "WebActivity" in Azure Data Factory is used to make HTTP requests to external Webservices, APIs, or Websites. It allows to retrieve or send data from and to Web resources as part of our data workflows. This activity is commonly used for data extraction, triggering actions or interaction with external web services.
HDInsight Spark Job Activity
The HDInsight Spark Job Activity in Azure Data Factory is responsible for executing Apache Spark jobs on HDInsight clusters, enabling robust big data processing, performance optimization, and customization options for specific big data processing requirements.
Databricks Notebook Activity
The Databricks Notebook Activity in Azure Data Factory executes Databricks notebooks, facilitating data transformation and analysis within the Azure Databricks environment.
Stored Procedure Activity
The "Stored Procedure" activity in Azure Data Factory is used to execute stored procedures within a relational database, like Azure SQL Database. It enables the passing of parameters and initiation of database actions defined in the stored procedure as part of a data pipeline workflow. This facilitates data processing and manipulation within the database.
It's important to note that stored procedures executed using this activity do not produce output for subsequent activities.
Lookup Activity
The "Lookup" activity in Azure Data Factory is used to query and retrieve data from any of the azure data factory dataset (e.g., a database or a file) and the output from lookup activity can be used in a subsequent activity. it reads and returns the content of a configuration file or table, and it also returns the result of executing a query or stored procedure.
The Lookup activity can return up to 5000 rows, if the result set contains more records, the first 5000 rows will be returned. The Lookup activity output support up to 4 MB in size, activity will fail if the size exceeds the limit.
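For illustration, Lookup output is typically consumed in later activities through expressions like the following; the activity and column names are hypothetical.

```python
# "First row only" enabled: a single object is returned under output.firstRow.
first_row_value = "@activity('LookupConfig').output.firstRow.WatermarkValue"

# "First row only" disabled: the full result set is returned as an array under output.value.
all_rows = "@activity('LookupConfig').output.value"
```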
If Condition Activity
The "If Condition" activity in Azure Data Factory is used to implement conditional logic within a data pipeline. It allows to execute different activities based on a specified condition. If the condition is true, one set of activities is executed; if false, another set of activities is executed. This helps control the flow of our data pipeline based on specific criteria or business rules.
ForEach Activity
The "ForEach" activity in Azure Data Factory is used to iterate over a collection of items, such as files in a folder or tables in database and perform a set of actions for each item. It's useful for automating repetitive tasks, like processing multiple files, by applying the same data flow or activity to each item in the collection.
Limitations:
- We can't nest a ForEach loop inside another ForEach loop (or an Until loop). Instead, design a two-level pipeline where the outer pipeline contains the ForEach loop and calls an inner pipeline with the nested loop.
- SetVariable cannot be used inside a ForEach activity that runs in parallel because variables are global to the entire pipeline and are not scoped to a ForEach or any other activity.
- The ForEach activity supports a maximum batchCount of 50 for parallel processing and can iterate over a maximum of 100,000 items.
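As a sketch, the snippet below shows a ForEach that iterates over the files returned by a hypothetical Get Metadata activity named GetFileList and runs its inner activity in parallel.

```python
foreach_activity = {
    "name": "ProcessEachFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@activity('GetFileList').output.childItems",
                  "type": "Expression"},
        "isSequential": False,  # run iterations in parallel
        "batchCount": 20,       # parallel iterations (maximum 50)
        "activities": [
            # Inside the loop, "@item().name" refers to the current file.
            {"name": "CopyOneFile", "type": "Copy",
             "typeProperties": {"source": {"type": "DelimitedTextSource"},
                                "sink": {"type": "DelimitedTextSink"}}}
        ]
    }
}
```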
Delete Activity
The "Delete" activity in Azure Data Factory is used to remove files or objects from a data store or location, facilitating data cleanup and maintenance tasks.
Wait Activity
The "Wait" activity in Azure Data Factory is used to introduce a pause or delay in a data pipeline. It allows to wait for a specified period before proceeding to the next activity, useful for scheduling and coordination in data workflows. The "Wait" activity does not perform any data processing but adds temporary control to our pipeline.
Get Metadata Activity
The "Get Metadata" activity in Azure Data Factory retrieves metadata information about files, datasets, or folders from a data store such as Azure Blob Storage, Azure SQL Data Warehouse, Azure Data Lake Storage, and others. It provides details like file names, sizes, modification dates, and more. This information can be used to make data-driven decisions within our data pipeline.
Set Variable
Variables are used to store data during the execution of our data workflows. The "Set Variable" activity allows us to assign values to these variables, and those values can be used in subsequent activities or expressions within the pipeline.
Append Variable
The Append Variable activity is used to add a value to an existing array variable defined in a Data Factory or Synapse Analytics pipeline. The appended variable value does not appear in debug output unless you use a Set Variable activity to explicitly set a new variable with its value.
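For illustration, assuming an array variable named processedFiles defined on the pipeline, an Append Variable activity inside a loop could look roughly like this:

```python
append_variable = {
    "name": "CollectFileName",
    "type": "AppendVariable",
    "typeProperties": {
        "variableName": "processedFiles",  # hypothetical array variable
        "value": "@item().name"            # current item of the enclosing ForEach
    }
}
```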
Until Activity
The Until activity in Azure Data Factory executes a set of activities in a loop until a specified condition evaluates to true or a defined timeout period is reached. By default, it continues executing even if inner activities fail unless failure conditions or error handling are explicitly defined.
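A sketch of an Until loop that keeps polling a hypothetical status endpoint until a flag variable reports completion or a one-hour timeout expires. In practice, a Set Variable activity inside the loop would update the flag from the response.

```python
until_activity = {
    "name": "WaitForExtract",
    "type": "Until",
    "typeProperties": {
        "expression": {"value": "@equals(variables('extractStatus'), 'done')",
                       "type": "Expression"},
        "timeout": "0.01:00:00",  # d.hh:mm:ss - stop looping after 1 hour
        "activities": [
            # A Set Variable activity would typically follow to update 'extractStatus'.
            {"name": "CheckStatus", "type": "WebActivity",
             "typeProperties": {"url": "https://example.com/api/status",  # hypothetical
                                "method": "GET"}}
        ]
    }
}
```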
Execute Pipeline Activity
The "Execute Pipeline" activity in Azure Data Factory enables a Data Factory pipeline to invoke and execute another pipeline.
Fail Activity
The Fail activity in Azure Data Factory is used to intentionally fail a pipeline by throwing a custom error message and error code. This is useful in scenarios where failure must be explicitly triggered, such as when a Lookup activity does not find any matching data at all or when a Custom activity encounters an internal error. This allows for customized error handling and meaningful error messages for better troubleshooting.
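A minimal sketch of a Fail activity raising a custom error; the message and error code below are hypothetical.

```python
fail_activity = {
    "name": "FailOnEmptyLookup",
    "type": "Fail",
    "typeProperties": {
        "message": "No source rows found for the current load window.",
        "errorCode": "ERR_NO_DATA"  # custom code surfaced in pipeline monitoring
    }
}
```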
Validation Activity
The Validation activity in Azure Data Factory is used to check whether a specified dataset exists or meets certain conditions before proceeding to the next activity. It is commonly used to validate the availability of files, tables, or other data sources before executing downstream activities.
Webhook Activity
The Webhook activity in Azure Data Factory is used to interact with external web services, APIs, or websites via HTTPS requests to send or retrieve data. After sending the HTTP request, it does not actively wait for a response; instead, it remains idle until the external system calls back on the supplied callback URL. This enables asynchronous interaction, making it useful for long-running external processes.
Switch Activity
The Switch activity in Azure Data Factory is used to control the flow of execution based on multiple conditions or cases. Each case corresponds to a specific value or condition, and the activity will direct the process to the matching activity. If none of the cases match, the flow will proceed to the default activity.
Transformations
Azure Data Factory Data Flow provides several transformations for data processing. Here's a short list of some common transformations and their usages:
Source Transformation
Source transformation is used to configure the source dataset for a data flow. Each data flow must have at least one source transformation, but multiple source transformations can be added. Each source transformation is associated with a single linked service and dataset.
Select Transformation
Use the select transformation to rename, drop, or reorder columns. This transformation doesn't alter row data, but chooses which columns are propagated downstream.
- In a select transformation, we can specify fixed mappings, use patterns to do rule-based mapping, or enable auto mapping.
- Fixed and rule-based mappings can both be used within the same select transformation.
- If a column doesn't match one of the defined mappings, it will be dropped.
Derived Column Transformation
Use the derived column transformation to generate new columns in our data flow or to modify existing fields.
- When creating a derived column, we can either generate a new column or update an existing one.
- In cases where our schema is not explicitly defined or if we want to update a set of columns in bulk, we will want to create column patterns. Column patterns allow us to match columns using rules based upon the column metadata and create derived columns for each matched column.
Filter Transformation
The Filter transformation allows row filtering based upon a condition. The output stream includes all rows that match the filtering condition.
Aggregate Transformation
Aggregate transformation is used to group and aggregate data using functions like SUM, MIN, MAX, AVG, and COUNT, based on a specified column or expression. It supports multiple aggregations within a single transformation. If no group-by column is specified, the entire dataset is treated as a single group for aggregation.
Join Transformation
Use the join transformation to combine data from two sources or streams in a mapping data flow. The output stream will include all columns from both sources matched based on a join condition.
Mapping data flows currently supports five different join types:
- Inner Join: Inner join only outputs rows that have matching values in both tables.
- Left Outer: Left outer join returns all rows from the left stream and matched records from the right stream. If a row from the left stream has no match, the output columns from the right stream are set to NULL.
- Right Outer: Right outer join returns all rows from the right stream and matched records from the left stream. If a row from the right stream has no match, the output columns from the left stream are set to NULL.
- Full Outer: Full outer join outputs all columns and rows from both sides with NULL values for columns that aren't matched.
- Custom Cross Join: For custom cross joins, the operation combines two data streams based on specified conditions. If the condition isn't based on equality, you can define a custom expression. The resulting stream includes all rows that meet the join criteria. This method supports non-equi joins and OR conditions.
For a full Cartesian product, use the Derived Column transformation to create a synthetic key in each stream. For example, create a new column called SyntheticKey in a Derived Column transformation in each stream and set it equal to 1. Then use a.SyntheticKey == b.SyntheticKey as the custom join expression.
Make sure to include at least one column from each side of our left and right relationship in a custom cross join. Executing cross joins with static values instead of columns from each side results in full scans of the entire dataset, causing our data flow to perform poorly.
Note: The Spark engine used by data flows will occasionally fail due to possible cartesian products in our join conditions. If this occurs, we can switch to a custom cross join and manually enter our join condition. This may result in slower performance in our data flows as the execution engine may need to calculate all rows from both sides of the relationship and then filter rows.
Flowlet Transformation
The Flowlet transformation encapsulates reusable data transformation logic in Mapping Data Flows. It allows you to create transformation logic that can be used across multiple data flows, reducing redundancy and improving maintainability.
Flatten Transformation
Flatten Transformation is used to convert nested array structures, such as JSON arrays or complex data types, into individual rows in a tabular format in Mapping Data Flows.
Alter Row
The Alter Row transformation in Azure Data Factory allows us to define row-level policies for insert, update, delete, and upsert operations based on specified conditions. It is particularly useful for handling incremental data loading and performing complex data transformations.
External Call Transformation
External Call Transformation is used in Mapping Data Flows to send or retrieve data from an external REST API or web service during data transformation. It enables interaction with external systems to fetch additional data, validate records, or enrich data in real time within Azure Data Factory.
Exists Transformation
The Exists transformation is used in Mapping Data Flows to check whether records in one dataset exist in another dataset, allowing filtering based on matching or non-matching records between the two datasets.
It's commonly used for scenarios like upsert operations, where you need to determine if a record already exists in the target dataset before deciding whether to update or insert the record.
Cast Transformation
Use the cast transformation to easily modify the data types of individual columns in a data flow. The cast transformation also enables an easy way to check for casting errors.
Sort Transformation
The Sort transformation allows us to sort the rows in the data stream. We can choose individual columns and sort them in ascending or descending order.
Conditional Split Transformation
The Conditional Split transformation is used in Mapping Data Flows to route data into multiple output streams based on defined conditions. It acts like an IF-ELSE statement, allowing different subsets of data to be processed separately.
Lookup Transformation
Lookup Transformation is used to retrieve data from another dataset based on a matching condition. It works similarly to a left outer join, where all records from the input dataset are retained, and matching records from the lookup dataset are added. If no match is found, the result contains NULL values for the lookup columns.
Pivot Transformation
The Pivot transformation in Azure Data Factory is used to convert rows into multiple columns. It involves aggregating data by selecting group-by columns and generating pivot columns using aggregate functions.
Unpivot Transformation
The Unpivot transformation in Azure Data Factory is utilized to convert columns into rows, serving as the inverse operation of the Pivot transformation. This transformation is frequently employed for data normalization in reporting or analytics scenarios, especially when aggregating or analyzing data distributed across multiple columns.
New Branch
New Branch Transformation is used in Mapping Data Flows to duplicate the input stream into multiple branches within the same data flow. This allows applying different transformations to the same dataset without modifying the original data flow.
Sink Transformation
Sink Transformation is used in Mapping Data Flows to write or store the transformed data into a destination, such as a database, data lake, or file storage. It defines the final output location and supports settings like insert, update, upsert, and delete operations based on business requirements.
Each sink transformation is associated with exactly one dataset object or linked service. The sink transformation determines the shape and location of the data we want to write to.
Window Transformation
Window Transformation is used to perform aggregations and analytical operations across a set of rows related to the current row within a defined window. It supports functions like LEAD, LAG, RANK, DENSE_RANK, ROW_NUMBER, SUM, AVG, MIN, MAX, and others, enabling complex calculations without grouping the data.
Rank Transformation
Rank Transformation is used to assign a ranking value to each row based on a specified sorting order. It helps in ordering and ranking data within partitions.
Union Transformation
The Union transformation in Azure Data Factory combines multiple data streams or datasets into one unified dataset, integrating sources from different origins or existing transformations within the data flow. We can merge any number of streams in the settings table by selecting the '+' icon next to each configured row.
Surrogate Key Transformation
Surrogate Key Transformation is used to generate unique incremental numeric values for each row in a data flow. It is commonly used to create surrogate keys in dimensional modeling or when loading data into a Data Warehouse.
HDInsight Spark Transformation
Execute custom Spark code for advanced data processing.
Other Concepts
Triggers
Triggers schedule pipelines to run automatically at a specific time or in response to an event.
Types of triggers:
- Schedule Trigger – Executes the workflow at specific time intervals, such as hourly, daily, or monthly, based on a predefined schedule.
- Tumbling Window Trigger – Executes the workflow at fixed, non-overlapping time intervals (e.g., every 15 minutes, hourly). Each window processes data independently and does not overlap with the next window.
- Event-Based Trigger – Executes the workflow based on storage events, such as file arrival, deletion, or modification in Azure Blob Storage or Azure Data Lake Storage.
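As an example of the first type above, a Schedule trigger defined for a hypothetical DailyLoad pipeline might look roughly like this:

```python
import json

# Illustrative Schedule trigger; the pipeline name and start time are hypothetical.
schedule_trigger = {
    "name": "DailyLoadTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",    # also: Minute, Hour, Week, Month
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "DailyLoad",
                                   "type": "PipelineReference"}}
        ]
    }
}

print(json.dumps(schedule_trigger, indent=2))
```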
Partitions
Partitions are essential for optimizing performance while querying, as they allow processing only relevant partitions, reducing resource consumption and speeding up data retrieval. In ADF, data movement is automatically optimized based on available resources, and users can also define partitioning explicitly.
Partitions in Copy Activity:
If the source is a table, Azure Data Factory's Copy Activity supports two types of partitioning: Physical Partitioning and Dynamic Range Partitioning. These partitioning strategies work for the source but not directly for the sink in a Copy Activity.
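As an illustrative sketch, dynamic range partitioning on an Azure SQL source inside a Copy activity is configured roughly as follows; the column name and bounds are hypothetical.

```python
# Source block of a Copy activity using dynamic range partitioning.
sql_source = {
    "type": "AzureSqlSource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "OrderId",  # numeric/date column used to split ranges
        "partitionLowerBound": "1",
        "partitionUpperBound": "1000000"
    }
}
```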
If the source is a file and the sink is either a table or another file, explicit partitioning is not available. Instead, ADF achieves parallelism through the Degree of Parallelism setting, available DIUs, and by automatically reading multiple files in parallel.
Performance Optimization:
While ADF can automatically read multiple files in parallel, for better performance and fine-grained control, you should explicitly configure:
- Set the Degree of Parallelism to control how many files are processed in parallel.
- Ensure your Integration Runtime has enough DIUs to handle the workload effectively.
Partitions in Mapping Data Flow:
If you're using Mapping Data Flows in Azure Data Factory, partitioning plays a crucial role in optimizing performance during data transformations.
By default, Data Flows automatically partition data for parallel processing using Apache Spark under the hood. However, for better control and performance tuning, ADF allows you to configure custom partitioning strategies at both the transformation and sink levels.
Partitioning Strategies in Data Flows:
ADF Data Flows support the following partitioning strategies:
- Hash Partitioning – Distributes rows based on the hash of a specified column. Useful for joins and aggregations.
- Round Robin Partitioning – Evenly distributes rows across partitions. Great for load balancing.
- Range Partitioning – Distributes rows based on a range of values in a column.
- Fixed Partitioning – Manually sets the number of partitions for fine-grained control.
- Key Partitioning – Partitions based on key columns, useful for consistent joins and lookups across datasets.
These strategies can be configured via the Optimize tab in transformations such as Join, Aggregate, Lookup, Sink, etc.
Default Behavior:
By default, ADF uses Spark's built-in partitioning logic based on the activity and dataset size.
However, explicit partitioning is recommended when working with large datasets or expensive operations like:
- Joins
- Lookups
- Aggregations
Sink Parallelism in Data Flows:
Sink transformations also support partitioning strategies to optimize parallel data writes.
You can apply the same strategies (Hash, Round Robin, etc.) at the sink level to evenly distribute writes into your sink (such as a file or table), improving load performance.
Important Note:
- If your sink is a file and is configured to generate a single output file, parallelism is restricted, as Spark cannot write to a single file with multiple threads.
- In this case, either avoid partitioning or set the partition count to 1.
- If the sink supports multiple output files, ADF can write in parallel according to the partitioning strategy, which enhances performance.
Performance Tuning Tips:
To maximize performance in Mapping Data Flows:
- Choose the appropriate partitioning strategy per transformation.
- Set the degree of parallelism in the Data Flow settings.
- Ensure your Integration Runtime (IR) has sufficient DIUs (Data Integration Units) to support the workload.
Note:
Manually setting the partitioning scheme reshuffles the data, which can negate the benefits of the Spark optimizer. As a best practice, avoid manual partitioning unless necessary.
If the source is a SQL database, the Source transformation's Optimize tab exposes source partitioning options.
Except for the Source, all other transformations in Mapping Data Flows support custom partitioning strategies.
Source partitioning settings apply automatically based on the source type and integration runtime, but explicit partitioning can still be applied after the source to optimize performance further in transformations and sink operations.
Change Data Capture (CDC)
Change Data Capture (CDC) in ADF is a mechanism used to capture and track changes in a data source and propagate those changes to a target system for processing, analysis, or storage. It enables real-time or incremental data updates without requiring a full data reload. By capturing only the changes, CDC optimizes data processing, minimizes resource consumption, and improves efficiency in data pipelines.
Dependent on ADF Activity?
- Success:
- Definition: The dependent activity runs only if the preceding activity completes successfully.
- Use Case: Proceed to the next step only if the current step succeeds.
- Failure:
- Definition: The dependent activity runs only if the preceding activity fails.
- Use Case: Execute a specific action or send a notification if an error occurs in the current step.
- Skip:
- Definition: The dependent activity runs only if the preceding activity is skipped.
- Use Case: Handle scenarios where a particular step is intentionally bypassed.
- Completion:
- Definition: The dependent activity runs regardless of whether the preceding activity succeeds or fails.
- Use Case: Perform cleanup tasks or logging operations that should always run after the current step, irrespective of the outcome.
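In the pipeline JSON, these conditions appear in an activity's dependsOn list. Here is a sketch, assuming hypothetical activities named CopyData and SendFailureAlert:

```python
# Activity that runs only if CopyData fails; use "Succeeded", "Skipped",
# or "Completed" in dependencyConditions for the other cases.
on_failure_alert = {
    "name": "SendFailureAlert",
    "type": "WebActivity",
    "dependsOn": [
        {"activity": "CopyData", "dependencyConditions": ["Failed"]}
    ],
    "typeProperties": {
        "url": "https://example.com/api/alert",  # hypothetical alerting endpoint
        "method": "POST",
        "body": {"message": "CopyData failed"}
    }
}
```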
Debug Mode
Debug mode in Azure Data Factory is used to test and troubleshoot Mapping Data Flows by allowing users to preview results and validate transformation logic before deploying to production. It is specifically designed for Mapping Data Flows and does not apply to other ADF activities.
By default, debug mode does not save output. However, if you want to write output during debugging, you can enable "Allow sink writing" in the debug settings.
Key Points:
- Debug mode is only for Mapping Data Flows, not other ADF activities.
- Helps preview and validate transformation logic before production deployment.
- Allows real-time data inspection while testing.
- "Allow sink writing" must be enabled to write output during debugging.
Wrangling Data Flows
Wrangling Data Flows in Azure Data Factory enable processing and transformation using a code-free, interactive data preparation interface powered by Power Query technology. This feature is ideal for users who prefer a visual interface for cleaning and transforming data directly within Azure Data Factory.
Connector File Formats
Azure Data Factory supports various file formats for data ingestion and output. Some of the key formats include:
- Parquet: A columnar storage format optimized for performance and efficient data storage. It supports compression and is widely used in the Hadoop ecosystem.
- Avro: A row-based format that also supports compression. It is used for data serialization and is part of the Apache Hadoop ecosystem.
- ORC (Optimized Row Columnar): Another columnar storage format that is highly optimized for reading large volumes of data quickly. It supports efficient compression and is part of the Apache Hadoop ecosystem.
- JSON (JavaScript Object Notation): A lightweight, text-based format often used for data interchange. It does not support advanced compression and tends to result in larger file sizes compared to columnar formats.
- Text (Delimited): A format where data is stored as plain text with a delimiter (such as commas or tabs) separating the values. It also lacks advanced compression and typically results in larger file sizes.
Comparison of Formats:
- Compression and Storage Efficiency:
- ORC, Avro, and Parquet: These formats are part of the Apache Hadoop ecosystem and are designed with compression algorithms that significantly reduce file sizes. They enable faster query performance due to their efficient storage and retrieval mechanisms.
- JSON and Text Files: These formats generally have larger file sizes because they do not employ advanced compression algorithms. As a result, queries on data stored in these formats may be slower compared to those stored in ORC, Parquet, or Avro formats.