Python Library/Code Development for Azure Databricks using Visual Studio Code
Databricks provides a great environment for ad-hoc and interactive use of data. However, setting up a local environment for development and testing can be quite a task. This post is intended to help set up a working local environment for developing and testing Python libraries for Databricks remotely using databricks-connect.
The first thing we require for our development is an IDE. In this post, I have chosen Visual Studio Code because of the wide set of options it provides.
We need a Java Runtime Environment to support Spark commands for Databricks. It is very important to install JRE 8 and not any later version: Java 9 introduced major changes that Spark does not support. I chose AdoptOpenJDK for this.
When installing the JRE, it is important to set up the JAVA_HOME variable. You will generally find this option during installation of the JRE of your choice. Alternatively, we can also set it using a PowerShell command:
[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Java", "Machine")
or manually by going to environment variables and setting the value for the JAVA_HOME as the folder path to the Java installation.
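To double-check that the variable is set correctly before going further, a small Python sketch can help (the function name here is my own, for illustration):

```python
import os

def java_home_looks_valid(env=os.environ):
    """Return True if JAVA_HOME is set and points to an existing folder."""
    java_home = env.get("JAVA_HOME")
    return bool(java_home) and os.path.isdir(java_home)

if __name__ == "__main__":
    print("JAVA_HOME OK" if java_home_looks_valid() else "JAVA_HOME missing or invalid")
```

This only checks that the folder exists; it does not verify the Java version, so still make sure the path points at a JRE 8 installation.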
Next, we need to install Python, as that is what this post is about.
Python can be downloaded from https://www.python.org/downloads/. For our development, it is best to install the same Python version as our Databricks cluster, so that the commands we use locally are also available in that environment.
You can find the version of python for your cluster by running the below command in databricks.
%sh python --version
When installing Python, be sure to select the options to install pip and to add Python to the environment variables.
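To make the comparison with the cluster easy, a short local check can print the version in the same form as `python --version` (the helper name is mine, for illustration):

```python
import sys

def version_string(info=sys.version_info):
    """Format a version tuple the way `python --version` prints it."""
    return f"{info[0]}.{info[1]}.{info[2]}"

# Compare this with the output of `%sh python --version` on your cluster.
print("Local Python:", version_string())
```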
Other Prerequisites for Databricks-Connect
Install Conda (Anaconda or Miniconda). Install in all-users mode and be sure to add it to the environment variables during installation.
Install Hadoop (instructions here). Installing Hadoop suppresses the warning (WARN NativeCodeLoader: Unable to load native-hadoop library for your platform) that we otherwise get every time we run Python commands from our local environment (which can be very annoying at times).
Now that we have all prerequisites done, we can use pip to install the databricks-connect client from our command prompt. There is, however, one last thing to do first, which is to uninstall pyspark.
pip uninstall pyspark
Following this, we can run the below command for databricks-connect. I am using the 7.1 version to match the Databricks Runtime version of my cluster. You must install the version that matches the Databricks cluster you have.
pip install -U databricks-connect==7.1.*
Next, we need to gather a few configuration values for setting up our environment.
- Databricks workspace URL -> the URL for the databricks host
- Databricks token -> The token for accessing databricks. You can generate one from user settings in the azure databricks environment.
- Databricks Cluster ID -> Can be obtained from advanced options in the configuration tab of your cluster.
- Port -> The default value for this would be 15001. Use this unless you have overridden this value by custom configuration of the cluster.
- Org ID -> Found in the URL of the workspace when opened from the Azure portal.
Once you have the values for the above configs, we are ready to set up databricks-connect using the below command.
databricks-connect configure
Read and accept the agreement that is presented once you run the command, then supply the values for the configurations we gathered above, and we are all set. You can test the connection using the below command.
databricks-connect test
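For reference, the configure step persists these values to a .databricks-connect JSON file in your home directory. A sketch of the shape of that configuration, with a small validation helper of my own (the example host value and helper name are illustrative, not from the original post):

```python
# Illustrative configuration values; replace the placeholders with your own.
config = {
    "host": "https://adb-1234567890123456.7.azuredatabricks.net",  # example workspace URL
    "token": "<personal-access-token>",
    "cluster_id": "<cluster-id>",
    "org_id": "<org-id>",
    "port": "15001",
}

def missing_keys(cfg):
    """Return the config keys that are absent or still placeholders."""
    required = ("host", "token", "cluster_id", "org_id", "port")
    return [k for k in required if not cfg.get(k) or cfg[k].startswith("<")]
```

Running `missing_keys(config)` before configuring can save a round of failed connection tests.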
Configuring Visual Studio Code
To use databricks-connect for Python, we first need to create a conda virtual environment for our Python version (which needs to be the same as the Python version of the cluster) from the command prompt using the below command:
conda create --name dbconnect python=3.7.3
We also need to install the Python extension for Visual Studio Code from the marketplace. I would recommend not installing the linter extension that Visual Studio Code prompts for once you install the Python extension, as it can be very annoying while developing your code.
Next, we need to set the virtual environment path for Python in VS Code. To get the value for this path, run the below command in cmd, which lists your conda environments and their locations.
conda env list
Go to settings in VS Code (bottom left of the environment) and search for python.venv. You will see the option Python: Venv Path. Paste the path you got above into this setting.
You can test the connectivity by creating a Python file and adding your Spark code to it. For running this code, however, we need two more things:
1. Import the SparkSession by adding the below code to the top of your file. This is required for any code/library that we create for Databricks. The notebooks we create in Databricks run this code internally and automatically.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
2. Run the code in the virtual environment we created before. This can be done by selecting the Python interpreter at the bottom left of VS Code and choosing the virtual environment.
Creating Library for Databricks
Now that we have python code running in VS Code, we need to do a few additional things to create our library code.
First, we need to create the folder structure for our library. I named my library DatabricksDemoLibrary. We need to create a folder for it and open it using VS Code's open folder option.
Since this folder is our library folder, we need to encapsulate our virtual environment in it. We can do this by opening the folder in the command prompt and running the below set of commands (activate the new environment before running the pip installs):
python -m venv venv
venv\Scripts\activate
pip install wheel
pip install setuptools
pip install twine
Now that we have set up our library environment settings, we can continue with setting up our folder.
First, we can create a setup.py file. This is one of the most important files when creating a Python library: it contains details about the library such as its name, version, and included packages. We shall cover it in a later section for better understanding.
Next, we can create a README.md file. This is where we describe the contents of our library for other users.
Next, we need to create two folders, one for our code and another for our test cases. I shall name them DatabricksDemoLibrary and tests respectively. To indicate to the interpreter that these are modules of our library, we need to create an __init__.py file in both of these folders.
There is a range of options for what to put in an __init__.py file. The most minimal approach is to leave the file completely empty. We can also use it for anything from controlling import order to defining the entire package. While importing a package is a good use of this file, we should not add complexity by putting a lot of code in it. For the purpose of this demo, I choose to keep this file empty.
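The folder structure described above can be scaffolded with a short script. This is a sketch under the naming used in this demo; the `scaffold` function name is my own:

```python
from pathlib import Path

def scaffold(root):
    """Create the library skeleton: the package and tests folders, each with
    an empty __init__.py, plus empty setup.py and README.md placeholders."""
    root = Path(root)
    for folder in ("DatabricksDemoLibrary", "tests"):
        (root / folder).mkdir(parents=True, exist_ok=True)
        (root / folder / "__init__.py").touch()
    for filename in ("setup.py", "README.md"):
        (root / filename).touch()
    return sorted(str(p.relative_to(root)) for p in root.rglob("*"))

if __name__ == "__main__":
    print(scaffold("DatabricksDemoLibraryRoot"))
```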
Now that our basic folder structure is ready, we can go ahead and add our code/test files in these folders.
I went ahead and added a Demo.py file with the basic code below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS DatabricksTest (Column1 string)")
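As a variation on the module-level code above, wrapping the statement in a function that accepts the SparkSession as a parameter makes the module importable and unit-testable without a live cluster. This is a sketch of that pattern; the function name and stub are my own, not from the original post:

```python
def create_demo_table(spark, table="DatabricksTest"):
    """Create the demo table if it does not already exist.

    Accepting the SparkSession as a parameter means tests can pass in a
    stub instead of connecting to a real cluster.
    """
    spark.sql(f"CREATE TABLE IF NOT EXISTS {table} (Column1 string)")
```

In a notebook or script you would still call it with the real session, e.g. `create_demo_table(spark)`.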
Configuring values for setup.py
from setuptools import find_packages, setup

setup(
    name='DatabricksDemoLibrary',
    packages=find_packages(include=['DatabricksDemoLibrary']),
    version='0.1.0',
    description='Python Demo Library',
    install_requires=[],
    setup_requires=[],
    tests_require=[],
    test_suite='tests',
)
Understanding the values of setup.py.
- find_packages(include=['a','b']) includes only the packages that we mention. If we do not pass any parameter, it picks up all the packages in our code.
- In install_requires, we include the packages we installed locally that our code needs to work and that are not part of the Python standard library. We should list only essential packages here.
- setup_requires and tests_require contain the lists of packages required for our build and our tests to run. Again, only include entries if they are absolutely required.
- test_suite refers to the test module that we have created for our code.
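To make the tests folder and test_suite concrete, here is a minimal sketch of what a file such as tests/test_demo.py might look like. To keep the example self-contained and runnable without a cluster, both the function under test and the StubSpark helper are defined inline here; in the real project the function would be imported from DatabricksDemoLibrary instead:

```python
import unittest

def create_demo_table(spark, table="DatabricksTest"):
    """Stand-in for the library function under test; in the real project,
    import this from DatabricksDemoLibrary instead of defining it here."""
    spark.sql(f"CREATE TABLE IF NOT EXISTS {table} (Column1 string)")

class StubSpark:
    """Records SQL statements instead of sending them to a cluster."""
    def __init__(self):
        self.statements = []

    def sql(self, statement):
        self.statements.append(statement)

class TestCreateDemoTable(unittest.TestCase):
    def test_issues_create_statement(self):
        spark = StubSpark()
        create_demo_table(spark)
        self.assertIn("CREATE TABLE IF NOT EXISTS DatabricksTest", spark.statements[0])

# Run with: python -m unittest discover tests
```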
Building our Library
To build a library, we can simply go to the root folder (the folder which contains the setup.py file) of the library from Command Prompt and run the below command to generate our .whl file.
python setup.py bdist_wheel
The .whl file is created in the dist folder inside our project folder. It can be imported and installed directly on our Databricks cluster, or used at notebook level by first placing it in DBFS using the 'Create Library' option (which comes up on right-clicking in the workspace folder) and then installing it from there.
Once installed, we can simply import and use it in our Databricks notebook:
from DatabricksDemoLibrary import Demo
While the process of creating a library for Databricks can seem daunting in the beginning, it helps us achieve much more: it encapsulates our code and its complexities, and produces a piece of code that can easily be moved across multiple environments with a simple installation.