How best to describe Azure Data Factory (ADF) V2? A friend of mine described it as the ETL of the skies, and since I have based almost my entire career on ETL tools, this sounded like music to my ears. After running a nice project in ADF I can now look back on that description and conclude there is some truth to it, but ADF is more an ELT than an ETL tool.
ADF is a data integration tool that can manage and orchestrate data streams between different Azure data components like Azure Blob Storage and Azure Data Lake Store, but also Azure Data Lake Analytics and Azure SQL Data Warehouse. It can even grab data from the on-premises data sources in your company. The real transformation work, however, isn't its cup of tea.
But hey, we now have other, more powerful tools for this inside Azure, and we can even use ADF as a runtime environment for SSIS packages that will do all the lookup and transform work for you. Since version 2, ADF also comes with a graphical user interface. Too bad for all the die-hard fans out there writing and debugging JSON code by hand: we can now use drag and drop as well :-)
This series will consist of three blog posts. Below I will go deeper into Linked Services, the glue you'll need to keep all these Azure services together. In a second blog post I will explain how to load the data into our SQL table using PolyBase. In a third post we'll be working with parameters and a parameter table, so you can make your workflow more dynamic.
I have my file in Azure Data Lake Store and I want to load it into my Azure SQL Database using Data Factory.
In order to achieve this we need to set up a linked service between Azure Data Lake Store and Data Factory. You can think of a linked service as a connection string, and of course every connection string needs to be able to authenticate itself to the service. For this we won't be using our own user ID, but a service principal.
The service principal fits in the concept of Role-Based Access Control (RBAC) in Azure. RBAC helps you manage who has access to Azure resources, what they can do with those resources, and which areas they have access to. So instead of giving everyone full access, you can define in a fine-grained way what each user or group can see or do in your environment. For communication between Azure applications, service principals are used: think of them as a user identity for an Azure resource. To the service principal you then assign roles (Reader, Contributor, Owner, ...) that define exactly what it has access to, and a scope that defines where.
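To make that concrete: a role assignment always combines a role definition, a principal and a scope. A minimal sketch in ARM template form, purely for illustration (the GUID is the built-in Contributor role; the principal ID is a placeholder you'd look up for your own service principal, and the scope here is simply whatever resource group the template is deployed to):

```json
{
  "type": "Microsoft.Authorization/roleAssignments",
  "apiVersion": "2018-09-01-preview",
  "name": "[guid(resourceGroup().id, 'contributor-for-my-sp')]",
  "properties": {
    "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')]",
    "principalId": "<object id of your service principal>"
  }
}
```

In this post we'll do the same assignment through the portal instead.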
The first step is to create an ADF instance in your Azure resource group and click Author & Monitor to create your first project. Let's skip the sissy wizards, and go straight to the real-men author section.
Go to Connections, Linked Services, click New and choose to create a linked service to Azure Data Lake Store.
Here you need to provide a name, choose your Azure subscription and pick the Data Lake Store you want to connect to. As the authentication type, choose Service Principal.
Next up, populate the Service Principal ID and Service Principal Key. These are Microsoft's somewhat cryptic names for what are essentially the user ID and password of the service principal.
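Behind the GUI, ADF V2 stores this linked service as JSON. A minimal sketch of what gets generated, using the account name datalakekris that appears later in this post; everything in angle brackets is a placeholder for your own values:

```json
{
  "name": "AzureDataLakeStoreLinkedService",
  "properties": {
    "type": "AzureDataLakeStore",
    "typeProperties": {
      "dataLakeStoreUri": "https://datalakekris.azuredatalakestore.net/webhdfs/v1",
      "servicePrincipalId": "<application id of the service principal>",
      "servicePrincipalKey": {
        "type": "SecureString",
        "value": "<service principal key>"
      },
      "tenant": "<azure ad tenant id>",
      "subscriptionId": "<azure subscription id>",
      "resourceGroupName": "<resource group of the data lake store>"
    }
  }
}
```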
I'll assume you have not created a service principal yet, so let me guide you through the process of creating one.
In the Azure Portal, go to Azure Active Directory and choose App Registrations. There, choose New App Registration.
Provide a name for your application connector. The Sign-on URL can be any value here, as it isn't important in this case. What is important is that you choose Web app / API as the application type.
Your application will now appear in the list of registered applications.
Click on TestAppKris and you arrive at the following screen.
The Application ID is the Service Principal ID you were asked to provide when creating the linked service. The password we still need to generate. To do so, click Settings and then Keys.
In the Passwords screen, provide a description for the new password and set the expiry date.
You can choose between 1 year, 2 years and never expire.
If you now save, you will see the value of the key. This is the service principal key; store it very carefully, as it will only be shown once.
So we now populate the service principal ID and service principal key in the linked service, case closed, and we go home happily?
Not entirely.
[{"code":9083,"message":"Access to https://datalakekris.azuredatalakestore.net/webhdfs/v1/ is denied. Make sure the ACL and firewall rule is correctly configured in the Azure Data Lake Store account. Service request id: 34c05f82-e638-4717-a616-9be3aa7024ff Response details: {\"RemoteException\":{\"exception\":\"AccessControlException\",\"message\":\"Permission denied on / [34c05f82-e638-4717-a616-9be3aa7024ff][2018-06-18T06:23:25.4259617-07:00]\",\"javaClassName\":\"org.apache.hadoop.security.AccessControlException\"}}"}]
That's a nasty error message. Thank you, Microsoft.
The reason for the error is that we generated the service principal, but we never specified that it may access our Data Lake Store.
How can we do that?
In the portal we go back to the resource group and open the Data Lake Store that holds our data.
Click Access Control (IAM) and click Add.
Assign your service principal (here called TestAppKris) the Contributor role and click Save.
All clear? Go back to the Azure Data Factory tab in your browser and test the connection again.
What in the world might be going on? I did all the steps exactly as the dude said, and still the computer says no… Fear not, my friend: we have one more thing to do.
As it turns out, Azure Data Factory now has clearance to read and write in the Data Lake Store account itself, but it still can't access the individual folders. The Contributor role covers the Azure resource; access to the folders and files inside the store is governed by separate POSIX-style ACLs.
Go to Data Explorer and, once there, choose Access.
Choose Add, and next provide the name of the service principal app.
Click Select.
Click OK, and the permissions will be granted.
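Under the hood, these permissions are POSIX-style ACL entries on the folder. If you were to query the ACL status through the store's WebHDFS API after granting access, the response would look roughly like this sketch (the object IDs and the permission values are placeholders; your store will show different ones):

```json
{
  "AclStatus": {
    "entries": [
      "user:<object id of TestAppKris>:rwx",
      "group::r-x",
      "other::---"
    ],
    "owner": "<owner object id>",
    "group": "<owning group object id>",
    "permission": "750",
    "stickyBit": false
  }
}
```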
Looking a lot better :-) Click Finish to create the linked service.
Connecting the Linked Service to the database
We have now completed step 1 out of 3.
We still need to set up the connection to our Azure SQL Database and pump the data over from our CSV file. In this blog post I will explain in detail how to set up the linked service to the Azure SQL Database. Note that this could just as easily be an on-premises database, if you wish.
The connection to the database:
In ADF, go back to Connections, New Linked Service and choose Azure SQL Database.
Click Continue.
Populate the name of the linked service, then choose your Azure subscription, the server name and the database name of the Azure SQL Database.
As the authentication type I choose SQL Authentication. Provide a valid user name and password for your SQL database and click Finish. You now have a working connection to the Azure SQL Database.
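Again, the GUI is just generating JSON behind the scenes. A minimal sketch of the resulting linked service, with every angle-bracketed value being a placeholder for your own settings:

```json
{
  "name": "AzureSqlDatabaseLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "SecureString",
        "value": "Server=tcp:<server name>.database.windows.net,1433;Database=<database name>;User ID=<user name>;Password=<password>;Encrypt=True;Connection Timeout=30;"
      }
    }
  }
}
```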
That wasn’t so hard, right?