
Managing files from other devices in a Fabric Lakehouse using the Python Azure SDK

In this blog post, you’ll see how to manage files in OneLake programmatically using the Python Azure SDK. Very little coding experience is required, and yet you can create very interesting automations this way.

I started testing this approach while building a proof of concept. I was working on a virtual machine where database tables were exported to Parquet files on a daily basis. These Parquet files needed to be moved to OneLake. However, at the time of writing, the on-premises data gateway for pipelines in Fabric was not yet fully supported.

Instead of using a pipeline, I started thinking about uploading the files directly to OneLake from the machine and then continuing with further processing in Fabric.

The process for doing this is described below.

Authentication and authorization configuration in Azure and Fabric

Step 1: Create an app registration (service principal) in Azure and create a client secret.
Write down the tenant_id, client_id and client_secret. A detailed guide on how to create the service principal can be found here: https://learn.microsoft.com/en-us/entra/identity-platform/quickstart-register-app

Step 2: Add the service principal as contributor to your Fabric workspace.

Manage OneLake from other devices using Python

Establish the connection to OneLake

Step 1: First, create a JSON file with the Service Principal credentials. I put this file in the config folder of my project.

{
    "tenant_id": "<tenant_id>",
    "client_id": "<client_id>",
    "client_secret": "<client_secret>"
}

 

Step 2: Next, import the dependencies. Make sure they’re installed in the Python environment you’ll be using. From the Azure SDK we import the package to manage the data lake (OneLake), as well as the package for authenticating with the Service Principal using a client secret. Lastly, the json module is imported to read the config file created in step 1.

from azure.storage.filedatalake import DataLakeServiceClient
from azure.identity import ClientSecretCredential
import json

 

Step 3: Create a credential object for the Service Principal.

config = json.load(open("config/service_principal.json"))

credential = ClientSecretCredential(
    tenant_id=config.get('tenant_id'),
    client_id=config.get('client_id'),
    client_secret=config.get('client_secret')
)

 

Step 4: Put the names of the workspace, the lakehouse and the files folder in variables so they can easily be reused.

workspace = '<Name of the Fabric workspace>'
lakehouse = '<Name of the lakehouse in the Fabric workspace>'
files_directory = '<Name of the folder under Files in the Fabric lakehouse>'

 

Step 5: Create a DataLakeServiceClient object. This object operates at the OneLake level. Next, use it to create a FileSystemClient object, which operates at the workspace level. Once we have this, the preparation is done and we can start doing stuff. 😊

service_client = DataLakeServiceClient(account_url="https://onelake.dfs.fabric.microsoft.com/", credential=credential)
file_system_client = service_client.get_file_system_client(file_system=workspace)

 

Playtime!

Below are some examples of what we can do with our FileSystemClient object.

Example 1: List all files and folders under a specific path in OneLake

paths = file_system_client.get_paths(path=f'{lakehouse}.Lakehouse/Files/{files_directory}')

for path in paths:
    print(path.name)

 

Example 2: Create a new (sub)folder on OneLake

new_subdirectory_name = 'test'
directory_client = file_system_client.create_directory(f'{lakehouse}.Lakehouse/Files/{files_directory}/{new_subdirectory_name}')

 

Example 3: Upload a file to OneLake

vm_file_path = r'C:\test\onelake\vm_test.csv'
onelake_filename = 'onelake_test.csv'

directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)

with open(file=vm_file_path, mode="rb") as data:
    file_client.upload_data(data, overwrite=True)

 

Example 4: Download a file from OneLake

onelake_filename = 'onelake_test.csv'
vm_file_path = r'C:\test\onelake\download_onelake_test.csv'

directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)

with open(file=vm_file_path, mode="wb") as local_file:
    download = file_client.download_file()
    local_file.write(download.readall())

 

Example 5: Append to a CSV file on OneLake

onelake_filename = 'onelake_test.csv'
text_to_be_appended_to_file = b'append this text!'

directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)

# Append at the end of the existing file, then flush to commit the new total length
file_size = file_client.get_file_properties().size
file_client.append_data(text_to_be_appended_to_file, offset=file_size, length=len(text_to_be_appended_to_file))
file_client.flush_data(file_size + len(text_to_be_appended_to_file))

 

Example 6: Delete a file from OneLake

onelake_filename = 'onelake_test.csv'

directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')
file_client = directory_client.get_file_client(onelake_filename)

file_client.delete_file()

 

Example 7: Delete a directory from OneLake

directory_client = file_system_client.get_directory_client(f'{lakehouse}.Lakehouse/Files/{files_directory}/test')

directory_client.delete_directory()

The data is in OneLake. What’s next?

Let’s go back to the setup of my proof of concept. From my virtual machine, I uploaded the exported tables to a folder on OneLake named ‘Exports’, with a subfolder per table. It looked like this:

Files/
    Exports/
        Table1/*.parquet
        Table2/*.parquet

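Putting it all together, the upload script on the virtual machine looked roughly like the sketch below. It reuses the file_system_client from earlier; the local folder C:\exports and its layout are assumptions for illustration rather than the exact paths from my setup.

from pathlib import Path

# Hypothetical local export layout: C:\exports\<TableName>\*.parquet
local_export_root = Path(r'C:\exports')
onelake_export_root = f'{lakehouse}.Lakehouse/Files/Exports'

for parquet_file in local_export_root.glob('*/*.parquet'):
    table_name = parquet_file.parent.name
    # Create (or reuse) the table's subfolder under Exports, then upload the file
    directory_client = file_system_client.create_directory(f'{onelake_export_root}/{table_name}')
    file_client = directory_client.get_file_client(parquet_file.name)
    with open(parquet_file, mode='rb') as data:
        file_client.upload_data(data, overwrite=True)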
To expose the data in Lakehouse tables, I made a very simple pipeline. It starts by creating a list of all the exported tables by looking at the Exports folder. Next, a ForEach loop runs a Copy activity that copies the data into a Lakehouse table. In my setup, the tables are overwritten on every load.

The structure of the pipeline is shown below:

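Independently of the pipeline, the list of exported tables could also be derived in Python with the same FileSystemClient. A small sketch, assuming the Exports layout shown above:

# List the top-level entries under Exports and keep only the folders (one per table)
export_paths = file_system_client.get_paths(path=f'{lakehouse}.Lakehouse/Files/Exports', recursive=False)
table_names = [p.name.rsplit('/', 1)[-1] for p in export_paths if p.is_directory]
print(table_names)  # e.g. ['Table1', 'Table2']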
Conclusion

As you can see, it’s very easy to manage files on OneLake with Python, especially for people who have used the Python Azure SDK for data lakes before. OneLake can be approached in countless ways, so there is a lot of room for creativity when designing your architectures. I would definitely recommend playing around with it. Have fun!

Author

Johan Hostens, Data Wizard at Kohera.
