Passing Data from Azure Databricks Notebooks into Azure Data Factory

Rachit Daga
3 min read · Apr 5, 2021

Azure Databricks has good integration options with Azure Data Factory, from calling notebooks to installing libraries. But when it comes to getting a value or a table back from Databricks, I found myself going back to Spark-dataset-based Lookups, which have their own limitations on the queries we can write to retrieve the required information. Recently, however, I started looking beyond Lookups and found an easy way to pass data from notebooks directly to Data Factory.

Databricks allows a notebook to pass a message back to its caller using the command:

dbutils.notebook.exit('Notebook Return Value')

When the notebook is called from the Azure Data Factory (ADF) Notebook activity (here named Run Notebook), we can retrieve the returned string value ('Notebook Return Value') in subsequent activities using the expression below:

@activity('Run Notebook').output.runOutput
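
For example, a Set Variable activity placed after the notebook could capture this value (the variable name here is just an illustration):

Variable: notebookOutput (String)
Value: @activity('Run Notebook').output.runOutput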

Instead of a plain string, we can also pass JSON values or lists from our notebook to ADF, for example:

dbutils.notebook.exit('{"Json": {"Value": "Output"}}') or dbutils.notebook.exit(['A','B','C']),

and these can be retrieved in downstream activities as:

@activity('Run Notebook').output.runOutput.Json.Value for the JSON string, and

@activity('Run Notebook').output.runOutput[2] for the list.
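
If the payload is built inside the notebook rather than hard-coded, it is safer to serialize it with json.dumps than to hand-write the string; a minimal sketch, using the same example payload as above:

import json
payload = {"Json": {"Value": "Output"}}  # any JSON-serializable structure works here
dbutils.notebook.exit(json.dumps(payload))  # exit with a well-formed JSON string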

Retrieving data tables from Notebooks into ADF

Now that the basics are set up and tested, let us get data from tables into ADF using notebooks.

For this, I have created a table (ADFDataPulls) with 10 records.

I have also created a notebook with the following Python code:

import json
# Convert every row of the table to a JSON string and collect them into a Python list
data = spark.sql("SELECT * FROM ADFDataPulls").toJSON().collect()
# Return the list of row strings to the calling ADF pipeline
dbutils.notebook.exit(data)
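
Each element of data is the JSON string for one row, something like {"ID": 1}; the exact fields depend on your table's columns. Mine includes an ID column, which is what we pick out below.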

This returns the rows to ADF as JSON strings. However, when looping through the output, each row is treated as a plain string on the ADF side, which causes an error if we try to reference a field by name.

ADF Notebook Activity Output

Hence, we need to cast each item to JSON using the @json() function, which then allows us to read the data in the correct format.

Looping over runOutput with a ForEach and applying the JSON cast to each item, @json(item()).ID gives us the output that we require.

To test this, I have created a sample pipeline with a Notebook activity and a ForEach (loop) activity, where, inside the loop, we append the ID values from the table above to a variable (test) of array type.

Sample Pipeline with Notebook and Loop tasks
ID value appended for Row 2
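
For reference, the two settings that drive the loop are roughly as follows (activity and variable names are the ones used above; adjust them to your pipeline):

Items of the ForEach activity: @activity('Run Notebook').output.runOutput

Value of the Append Variable activity (variable test) inside the loop: @json(item()).ID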

With this, we now have an easy way to loop through our records, just as we would with data returned from Lookups. There is one catch, though: the returned data has a size limit of 2 MB, so do not go above that.

Conclusion

This simple hack makes it easy to keep our code in the notebooks, abstracted away from ADF, and still get the output back for further processing. It also moves us away from the Lookup approach, which depends on the access token of a user of our Databricks environment for the connection, since the ADF Databricks Notebook activity can use the Managed Identity option instead.

