Oh, this is cheesy. I can’t believe we would ever dare to mention Mariah Carey in our blog post series. Well, the holiday season is starting, so there is no escape at all 😊.
What I meant by the title was: All I need for Christmas is you… Azure Data Flow. Data what? Azure Data Flow is a new functionality in Azure Data Factory. For more info on Azure Data Factory itself, I would like to refer you to the series of blog posts on Azure Data Factory that I published on our website last July. Please follow this link here, and don’t forget to come back afterwards.
But back on topic: we were talking about Azure Data Flow. As said, Azure Data Flow is a new functionality that still has to be publicly released by Microsoft. Currently I’m as excited as a child waiting for Santa Claus (hence my title) to play with it on a customer’s site, because I got private preview access to it, and that means that I can already share some things with you. How did I manage to do that? Well, let’s say that I know people who know people…
What’s all the fuss about? To be able to explain this, I’ll have to go back half a year. We were happily writing our pipelines in Azure Data Factory V2, using ADLA (Azure Data Lake Analytics) as an engine to clean / transform data while it was travelling down the road. ADLA uses U-SQL, a combination of SQL and C# that we as ‘classical’ ETL developers managed to pick up fairly quickly, as it was within the range we felt comfy in. Happy days…
But then a new kid came on the block. The kid had already made itself quite a reputation in other parts of town, and quickly proved what it is worth in ADF as a welcome addition to our stack. An even more powerful way to transform data in the cloud. Enter Data Bricks. Because of its flexibility, all eyes went to Data Bricks and suddenly ADLA was no longer any good. It didn’t even get a place in the sun when using the Data Lake Gen2.
Okay, as Azure consultants, we love evolutions in technology. What else would we do with our weekends if we couldn’t read blogs anymore 😉. Big problem: this new kid talks Python and Scala, languages that sounded to us as if they had grown out of Perl. And let’s be honest, that’s very far off our radar. So, we panicked, we even screamed for our mothers. Unfortunately, they couldn’t help us either (at least mine thought I had gone crazy: a python coming out of a pearl, on a Spark??? I quickly left before the nurse came.). It became even more fun when we tried to explain this to our BI customers. We have this great new powerful tool, but it only speaks Python… Yeah, complicated… Mostly there was some 50+ year old guy in the room who raised his hand and suddenly felt all in fashion again. “I know that shit…” Python and flat files, they love it 😊.
Enter Azure Data Flow
Azure Data Flow allows you to build data transformation logic using a graphical interface. The way of building your logic is very similar to how we did it back in SSIS, but this time with the very powerful Data Bricks engine underneath. All this without having to write Python or Scala code yourself? Yes Sir! Or Madam 😊. I, at least, was all ears when I heard about it at DataMinds. Because an SSIS-like component on Azure, that’s right up my alley. I know that shit…
Does that mean that you will never ever have to write Python code again when using Azure Data Bricks? Probably not. It’s still early days, so let’s see how far this new tool can take us. But at least you’ve got your basics covered. And maybe one day…
I promised you a sneak preview. You get a sneak preview.
This is the Microsoft Taxi Demo Data Flow:
As you can see, we are importing data, joining it, aggregating it and sinking it while it is travelling down the pipeline.
For the join you can choose between an inner, left, right, full outer, and even a cross join.
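For the curious, here is roughly what those join types mean, as a minimal plain-Python sketch. To be clear: this is not what Data Flow generates or runs underneath, and the `join` helper, the `trips` and `fares` rows and their column names are all made up by me for illustration. They are not taken from the Microsoft demo.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`.

    how: "inner", "left", "right", "outer" (full outer) or "cross".
    """
    if how == "cross":
        # Every left row paired with every right row; the key is ignored.
        # On a column-name clash the right-hand value wins.
        return [{**l, **r} for l in left for r in right]

    rows = []
    matched_right = set()  # positions of right rows that found a partner
    for l in left:
        hits = [i for i, r in enumerate(right) if r[key] == l[key]]
        for i in hits:
            matched_right.add(i)
            rows.append({**l, **right[i]})
        if not hits and how in ("left", "outer"):
            rows.append(dict(l))  # keep the unmatched left row
    if how in ("right", "outer"):
        # keep the right rows that never matched anything on the left
        rows += [dict(r) for i, r in enumerate(right) if i not in matched_right]
    return rows


# Hypothetical sample data, invented for this sketch.
trips = [
    {"trip_id": 1, "distance": 2.5},
    {"trip_id": 2, "distance": 7.0},
    {"trip_id": 3, "distance": 1.2},
]
fares = [
    {"trip_id": 2, "fare": 18.0},
    {"trip_id": 3, "fare": 6.5},
    {"trip_id": 4, "fare": 9.0},
]

for how in ("inner", "left", "right", "outer", "cross"):
    print(how, len(join(trips, fares, "trip_id", how)))
```

Only trips 2 and 3 exist on both sides, so the inner join is the smallest result, while the left, right and full outer joins each add back the rows that found no partner.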
I can even go into debug mode and see a preview of my data. I can optimize the performance of my flow by defining partitions and by choosing to keep one side of my join in memory. When I have set all this, I can inspect my settings and see how each individual column is used and handled by the join component.
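If you wonder why partitions and an in-memory side help, here is a small plain-Python sketch of the idea as I understand it: split the big input into partitions by hashing a key, and keep the small input as an in-memory lookup that every partition can use. The helper names `hash_partition` and `broadcast_join` and the sample rows are hypothetical, and the real engine does this in a distributed and far more sophisticated way.

```python
def hash_partition(rows, key, n_partitions):
    """Spread rows over n_partitions buckets based on the hash of `key`."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions


def broadcast_join(big_partitions, small, key):
    """Inner-join each partition of the big side against an in-memory lookup."""
    lookup = {row[key]: row for row in small}  # the small side, kept in memory
    return [
        [{**row, **lookup[row[key]]} for row in part if row[key] in lookup]
        for part in big_partitions
    ]


# Hypothetical sample data, invented for this sketch.
trips = [
    {"trip_id": 1, "vendor_id": 1},
    {"trip_id": 2, "vendor_id": 2},
    {"trip_id": 3, "vendor_id": 1},
    {"trip_id": 4, "vendor_id": 3},  # vendor 3 is unknown, so this row drops out
]
vendors = [
    {"vendor_id": 1, "vendor_name": "A"},
    {"vendor_id": 2, "vendor_name": "B"},
]

parts = hash_partition(trips, "vendor_id", 2)
joined = broadcast_join(parts, vendors, "vendor_id")
print([len(p) for p in parts], [len(p) for p in joined])
```

Each partition is joined on its own, without looking at any other partition, which is what makes it possible to spread the work over several workers.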
And this is the aggregator, where we first define our Group By:
And then our aggregate clause:
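In code terms, these two steps boil down to something like the plain-Python sketch below: first bucket the rows by the group-by column, then compute the aggregates per bucket. The function name `group_and_aggregate`, the columns `vendor` and `fare` and the chosen aggregates are my own assumptions for illustration, not the settings of the demo flow.

```python
from collections import defaultdict


def group_and_aggregate(rows, group_by, value):
    """Group rows on `group_by`, then count, sum and average `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])  # step 1: the group by
    return {
        group: {                                  # step 2: the aggregate clause
            "trips": len(values),
            "total": sum(values),
            "average": sum(values) / len(values),
        }
        for group, values in groups.items()
    }


# Hypothetical sample data, invented for this sketch.
rides = [
    {"vendor": "A", "fare": 10.0},
    {"vendor": "A", "fare": 20.0},
    {"vendor": "B", "fare": 6.0},
]

summary = group_and_aggregate(rides, "vendor", "fare")
print(summary)
```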
Pretty cool, isn’t it? I personally think it is. Did you know, by the way, that the expression language used in Azure Data Factory, Azure Data Flow and Azure Logic Apps is shared among them? So, if you are familiar with the expression language of Azure Logic Apps, you already know that of Azure Data Factory and the still in private preview Azure Data Flow as well.
Another Data Flow to show. Here a union happens between 2 sources, then we create a calculated column to calculate the new currency rate.
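Written out by hand, a flow like this would look something like the sketch below in plain Python: stack the rows of both sources, then derive an extra column from an existing one. The source names, the `rate` and `new_rate` columns and the factor of 1.5 are invented for the example; the actual formula in my flow is not shown here.

```python
def union_all(*sources):
    """Stack the rows of all sources, keeping duplicates."""
    return [dict(row) for source in sources for row in source]


def add_column(rows, name, expression):
    """Return new rows with an extra column computed from each row."""
    return [{**row, name: expression(row)} for row in rows]


# Hypothetical sample data and adjustment factor, invented for this sketch.
rates_source_1 = [{"currency": "EUR", "rate": 2.0}]
rates_source_2 = [{"currency": "USD", "rate": 0.5}]
ADJUSTMENT = 1.5

all_rates = union_all(rates_source_1, rates_source_2)
with_new_rate = add_column(all_rates, "new_rate", lambda row: row["rate"] * ADJUSTMENT)
print(with_new_rate)
```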
SSIS anyone? But this time with an oh so powerful workhorse underneath the hood.
Having built my Azure Data Flows, I can integrate them in my regular ADF V2 pipeline.
Under Move & Transform I choose the Data Flow component, hook it up to the Data Flow I just created, and link it to a Data Bricks node and an Azure blob storage containing my data. And we are good to go. Not a single line of Python code written 😊. Sweet…
All I want for Christmas is Azure Data Flow. And I really mean it!