COS80023 - Pass Task 3: Extraction, Transformation and Loading

Overview

Practise ETL with an example that starts from a .csv (comma-separated values) file, uses a Hadoop cluster and Apache Hive to transform the data by applying the relational model, and stores the result in a relational database with the help of Apache Sqoop.

Purpose

Demonstrate an understanding of how to use Hadoop tools to automate the transformation of data into the relational model.

Task

Carry out the tasks described below and answer the questions in your submission.

Time

This task should be completed in the fourth lab class or before and submitted to Doubtfire for feedback. It should be discussed and signed off in tutorial 3 or 4.

This task should take no more than 2 – 2 ½ hours to complete.

Resources

  • Presentation (from Blackboard)
  • How to use Azure Data Lake Storage Gen2: https://docs.microsoft.com/fi-fi/azure/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2
  • How to create an HDInsight cluster (not all settings you need are the same in this document): https://docs.microsoft.com/fi-fi/azure/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal
  • Tutorial on Hadoop and Apache Hive: https://docs.microsoft.com/fi-fi/azure/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal
  • Any other online material

Feedback

Discuss your answers with the tutorial instructor.

Next

Get started on module 4.

Pass Task 3 — Submission Details and Assessment Criteria

Write down the questions and answers in a text or Word document and upload to Doubtfire. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.

Task 3.1

Create a Data Lake Storage Gen2 account on Azure, upload a data file in tabular format, clean it, and load it into a relational database using Hive and a Hadoop cluster on Azure.

The process has the following steps:

  1. Creating a user-managed identity to allow the Hadoop cluster to access the data lake
  2. Creating a Data Lake Gen2 account on Azure
  3. Setting up permissions for the managed identity on the Data Lake Storage Gen2
  4. Creating an SQL Database
  5. Creating a Hadoop cluster, using HDInsight
  6. Uploading the data and staging script
  7. Transforming the data using Hive
  8. Loading the data using Sqoop

1.      Creating a Managed Identity

A managed identity avoids having to use passwords for authentication to different services. In this exercise, we use a managed identity to authorise Hadoop Hive to work on the Data Lake we create. Managed identities are stored in Azure Active Directory (AD). AD is a repository of users and their passwords that lets people use the same user name and password for several services. We have a similar setup at Swinburne: you can access all your services at Swinburne with one SIMS password (please do not use your SIMS password for these exercises).

Type Managed Identity in the search field in the top middle of the portal. Click on the entry with the same name under ‘Services’ to start creating a managed identity. Press +Create or click on Create managed identity.

Select the correct subscription (containing COS80023) and your resource group (your student number).

Select Australia East as your region.

In the resource name field, type <yourstudentnumber>managedidentity. Click Review+Create.

Observe the notifications (bell-shaped icon). Once it tells you the resource has been created, click on Refresh. You should now see your new managed identity.
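If you prefer working from the command line, roughly the same resource can be created with the Azure CLI. This is only a sketch, not part of the required steps; <yourresourcegroup> stands for the resource group named after your student number:

az identity create --resource-group <yourresourcegroup> --name <yourstudentnumber>managedidentity --location australiaeast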

Question 1: Follow this link: https://www.blue-granite.com/blog/bid/402596/top-five-differences-between-data-lakes-and-data-warehouses

Answer the questions: what is a data warehouse, and how does a data lake differ from it?

2.      Creating a Data Lake Gen2 on Azure

To use tools on your data, you have to store it in a data store that Hadoop can access. There are two options, Data Lake and Azure Storage. We will use Data Lake Gen2.

In your Azure Portal, go to Storage Accounts. Press +Create or click on the Create storage account in the middle.

Choose the correct subscription and resource group. Name your new storage account <yourstudentnumber>datalake. Set the location to Australia East.

Leave the default for all other settings on this tab.

Move to the Advanced tab. Enable the Data Lake Storage Gen2 hierarchical namespace.

Create the storage account.
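For reference, an equivalent Azure CLI call is sketched below. The important part is enabling the hierarchical namespace (--hns true), which is what makes the account a Data Lake Storage Gen2 account; the kind and SKU shown are only example values, not something this task prescribes:

az storage account create --resource-group <yourresourcegroup> --name <yourstudentnumber>datalake --location australiaeast --kind StorageV2 --sku Standard_LRS --hns true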

3.      Set up permissions for the managed identity on the Data Lake Storage Gen2

Select your data lake (storage account), then click on Access Control (IAM). Click on the Role Assignments tab. Click on +Add. Choose Add Role Assignment.

Select the Storage Blob Data Owner role.

In the box Assign role to, choose User assigned managed identity.

Click +Select members. In the Select managed identities pane, under Managed identity, select “User-assigned managed identity”. In the Select box, choose the managed identity you created earlier and click Select.

Click Review+assign.

You have created an identity that has access to your data lake.
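The same role assignment can also be made with the Azure CLI; a sketch, assuming the identity and storage account names used above:

# look up the identity's principal id and the storage account's resource id
principalId=$(az identity show -g <yourresourcegroup> -n <yourstudentnumber>managedidentity --query principalId -o tsv)
scope=$(az storage account show -g <yourresourcegroup> -n <yourstudentnumber>datalake --query id -o tsv)
# grant the identity Storage Blob Data Owner on the data lake
az role assignment create --role "Storage Blob Data Owner" --assignee "$principalId" --scope "$scope"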

4.      Create an SQL Database

(This is repetition from last week)

Go to your Dashboard. Choose SQL Databases under Azure Services (top middle). Click +Create above the (empty) listing, or click ‘Create SQL database’ in the middle of the screen.

Select the correct subscription (containing COS80023) and your resource group.

Name your database <yourstudentnumber>rdb.

Create a new server <yourstudentnumber>server. Choose Australia East as a location. Leave the default authentication method untouched.

Choose user<yourstudentnumber>, e.g. user12345678, as your admin user. (This cannot start with a number).

Choose a strong password that is more than 8 characters long, has special characters and upper and lower case letters. (The system will tell you immediately whether a password is acceptable).

Click Ok to create the server.

Choose this <yourstudentnumber>server for your database. Tick no to using an Elastic pool.

Leave the other options untouched.

Review the settings and click ‘Review+Create’.

When the database has been deployed, click SQL Databases in the leftmost navigation pane. Choose the <yourstudentnumber>rdb. Click on ‘Overview’ in the menu on the left. Above the entry of your server, there is an option Set server firewall. Click it.

Click the “Selected networks” option.

Under ‘Rule’, click the Add your client IPv4 address option and save.

Alternatively, you can click +Add a firewall rule and then, in the ‘Rule name’ field, type sp_set_firewall_rule. Type your IP address into both the Start IP and End IP fields.

Important: Tick Allow Azure services and resources to access this server. (This is different from last lab).
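If you prefer the command line, the server, database, and firewall settings above can also be created with the Azure CLI. This is only a sketch using the resource names from this task; the Basic service objective is just an example tier, and you would still add a rule for your own client IP to use the query editor:

az sql server create -g <yourresourcegroup> -n <yourstudentnumber>server -l australiaeast -u user<yourstudentnumber> -p '<yourstrongpassword>'
az sql db create -g <yourresourcegroup> -s <yourstudentnumber>server -n <yourstudentnumber>rdb --service-objective Basic
# the 0.0.0.0 rule is the CLI equivalent of ticking 'Allow Azure services and resources to access this server'
az sql server firewall-rule create -g <yourresourcegroup> -s <yourstudentnumber>server -n AllowAzureServices --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0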

Now you can create a table (as you did last week, by logging into the query editor). The table structure is this:

CREATE TABLE [dbo].[accidents](
    [day_week_description] [nvarchar](50) NOT NULL,
    [no_of_vehicles] float,
    CONSTRAINT [PK_accidents] PRIMARY KEY CLUSTERED ([day_week_description] ASC)
)

This is where our ETL result eventually goes.

5.      Create an HDInsight Cluster

In the search field top centre of the page, type HDInsight. Choose HDInsight clusters from the options.

Click on ‘Add’.

Basic tab

In the ‘Basic’ tab, set the cluster name to s<yourstudentnumber>cluster (no upper case letters allowed), e.g. s12345678cluster.

Choose the subscription and resource group you selected in previous tasks. Choose Australia East as the location.

Choose Hadoop 2.7.3 as cluster type. Leave the default cluster username and ssh user. Choose a password with upper case, lower case, numbers and a special character.

Leave the admin users for cluster login and ssh as is. Provide a cluster password (you can use the same password throughout this exercise).

Click Next to proceed to Storage. Choose Azure Data Lake Storage Gen2. Choose the <yourstudentnumber>datalake you created before.

Change the file system name to <yourstudentnumber>filesystem. Select the identity you created earlier.

Click Next.

Security and Networking

In the User-assigned managed identity field, choose the identity you created earlier. If it is not shown, type the name in yourself.

Leave all other options as they are. Click next.

Application tab

Leave all defaults and click Next.

Cluster size

Choose the smallest possible options. No need to add script action.

Click Review+create.

Summary

On the summary page, you get to create the cluster after the settings have been validated. It typically takes a few minutes for the cluster to be up and running.
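While the cluster is deploying, you can also check its status from the command line instead of the notifications pane. A minimal Azure CLI sketch (the resource group placeholder is an assumption; use your own group name):

az hdinsight list --resource-group <yourresourcegroup> --output table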

6.      Upload the data and staging code

Examine the staging script

The resource file on Doubtfire is reporting.zip. Unzip it on your hard drive. It contains reporting.csv and staging.hql.

First, open reporting.csv and find out what data we are working with. What information does the file contain?

The file staging.hql is written in HiveQL, but if you can read SQL, you can guess what the file is doing. We will be studying the theory of Hive a little later. For now, just open the script in a text editor and examine it, to answer the following question:

Question 2: What is the script going to do if you run it?

Run the staging script

In the script staging.hql, change the name of the file system to your file system on Azure, and the name of the data lake to the name of the data lake you created. Be careful to use the correct names.

After this, rezip staging.hql and reporting.csv into a file called reporting.zip. Do not include a directory.
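If the zip command-line tool is available (macOS, Linux, or e.g. Git Bash on Windows), one way to do this is to run the following from the folder that contains the two files, so that no directory paths are stored in the archive:

zip reporting.zip staging.hql reporting.csv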

Open a terminal (cmd on Windows) and upload the data from Doubtfire onto the cluster using scp (secure copy protocol).

You have to navigate to the directory on your own file system where you have stored the .zip file from Doubtfire. Use the cd command to navigate.

(E.g. cd .. takes you up one level in the file structure, cd data takes you into a subdirectory called data.)
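For example (assuming the file was saved to your Downloads folder; adjust the path to wherever you stored it; in Windows cmd the equivalents are cd and dir):

cd ~/Downloads
ls reporting.zip    # confirm the file is here before copying

Once you are in the right directory, run the scp command below.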

scp reporting.zip sshuser@s<yourstudentnumber>cluster-ssh.azurehdinsight.net:reporting.zip

You will be prompted for a password. Type the password you specified when you created the cluster.

Connect to the cluster using ssh (secure shell).

ssh sshuser@s<yourstudentnumber>cluster-ssh.azurehdinsight.net

If your connect string and password were correct, you now see a command prompt:

sshuser@hn0-s12345:$

This means you are talking to the head node (hn0) of your cluster. You can now issue commands to the cluster. First, you unzip your data:

unzip reporting.zip

As you are now working on a Linux system, you can check the content of the current directory by running:

ls -l

You should see the three files reporting.zip, staging.hql and reporting.csv now. (If you do not, you have to troubleshoot before continuing.)

The Data Lake Storage Gen2 file system you specified when creating the cluster is addressed with a URI of the following form; you will use it in the hdfs dfs commands below:

abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/
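Because this URI appears in every command that follows, you may find it convenient (entirely optional) to put it in a shell variable on the head node and let the shell substitute it, for example:

FS='abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net'
hdfs dfs -ls "$FS/"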

Use the following command to create a directory.

hdfs dfs -mkdir -p abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/data

Change the command to create a second directory called script, which is also a subdirectory of accidents.

Use the following command to copy the reporting.csv file to the data directory:

hdfs dfs -put "reporting.csv" abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/data/

Check if the file is really there:

hdfs dfs -ls abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/data

Change the hdfs dfs -put command to put the staging.hql file into the script directory.

7.      Transform the Data

To run the staging.hql script, we have to call Hive:

beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f staging.hql

You should have an idea what the script is doing at this stage. If you don’t, discuss with a mate or your tutor.
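If you want to confirm that the staging script actually did something before moving on, a quick optional check from the head node is sketched below (the table name accidents_in_hive is taken from the query used in the next step):

beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -e 'SHOW TABLES;' -e 'DESCRIBE accidents_in_hive;'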

After the staging.hql script has finished running, use the following command to open an interactive Hive session:

beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'

You should now see an interactive beeline prompt.

Now you can extract data using:

INSERT OVERWRITE DIRECTORY '/accidents/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT regexp_replace(day_week_description, '''', ''), sum(no_of_vehicles)
FROM accidents_in_hive
WHERE no_of_vehicles IS NOT NULL
GROUP BY day_week_description;

Question 3: What is the ‘accidents_in_hive’ object that comes after the ‘FROM’ keyword? What information do you think the query is extracting?

Of course this is no fun if you can’t check the outcome of the HiveQL. Use the hdfs dfs -ls command you used before to inspect the content of the output directory. You should find a file there. Use this command:

hdfs dfs -cat abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/output/<file_name_you_just_found>

to display the content of the file. Is this what you expected to see when you examined the HiveQL you ran interactively?

8.      Loading the data

In this final step, you load the data you just extracted with Hive into the SQL database you created earlier. We are going to use sqoop for this.

First, check if sqoop can see the database and table you created earlier:

sqoop list-databases --connect jdbc:sqlserver://<yourstudentnumber>server.database.windows.net:1433 --username user<yourstudentnumber> --password <yourstrongpassword>

You should see the rdb database you created earlier. If you do, you are good to load the data:

sqoop export --connect 'jdbc:sqlserver://<yourstudentnumber>server.database.windows.net:1433;database=<yourstudentnumber>rdb' --username user<yourstudentnumber> --password <yourstrongpassword> --table 'accidents' --export-dir 'abfs://<yourstudentnumber>filesystem@<yourstudentnumber>datalake.dfs.core.windows.net/accidents/output' --fields-terminated-by '\t' -m 1
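If the export fails because sqoop cannot find the table, you can check which tables it sees in the database (list-tables is a standard sqoop tool; the connection details are the same as above):

sqoop list-tables --connect 'jdbc:sqlserver://<yourstudentnumber>server.database.windows.net:1433;database=<yourstudentnumber>rdb' --username user<yourstudentnumber> --password <yourstrongpassword>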

Now you should be able to go back to your Azure Portal and find the SQL database created at the start. Go back to the Query Editor you used before when you created the accidents table.

Run a select-all query:

SELECT * FROM accidents;

Question 4: On what weekday do the most accidents happen? How many vehicles are involved? Document the answer with a screenshot.

Question 5: What steps and tools were involved in this process? Draw a diagram that shows the tools and files used. You can draw in Powerpoint or by hand and take a photo to document the workflow.
