COS80023 - Task 4: Parallelisation with MapReduce

Overview

Practise the principle of MapReduce and try out an example on Azure HDInsight.

Purpose

Demonstrate an understanding of the potential of MapReduce in speeding up tasks on big data sets.

Task

Carry out the tasks described below and answer the questions in your submission.

Time

This task should be completed in the fourth lab.

This task should take no more than 2 hours to complete.

Resources

Presentation (from Blackboard)
MS Azure tutorial for the creation of clusters: https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-tutorial-get-started (If you use the quickstart link on this page, you will not see all the options discussed in the task.)
This may help clarify the connection between the Hadoop cluster and the Azure Storage: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage
Any other online material

Feedback

Discuss your answers with the tutorial instructor.

Next

Get started on module 5.

Pass Task 4 — Submission Details and Assessment Criteria

Write down the questions and your answers in a text or Word document and upload it to Doubtfire. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.

Subtask 4.1

Run the wordcount MapReduce example that is already installed on Azure to count the words in a file that you choose and upload, using the following steps:

Create a Hadoop cluster.
Use an Azure Storage account and upload your file.
Use ssh to connect to the cluster and analyse the file using the wordcount example.

1. Creating an HDInsight Cluster for MapReduce

As you know, in Hadoop tasks run on a cluster of nodes. First, you have to create the cluster.

Assuming you are already logged in, go to your dashboard. In the search field at the top centre of the page, type HDInsight. Choose HDInsight clusters from the options.

Click on ‘+Create’.

Select the correct subscription (containing COS80023) and your resource group.

Set the cluster name to s<yourstudentnumber>cluster, e.g. s12345678cluster (no upper case letters allowed).

Choose Australia East as the location.

Choose Hadoop 3.1 as cluster type. Leave the default cluster username and ssh user. Choose a password with upper case, lower case, numbers and a special character.

Question 1: Do you think the choice of location matters? Why/why not?

Click Next to proceed to Storage. Select Azure Storage. Click Create new. Name your new storage <yourstudentnumber>storage.

For the container choose <yourstudentnumber>container. Leave the other options as default.
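
The naming convention from the steps above can be checked with a small shell sketch before you type the names into the portal. The function name hdi_names is made up for this illustration and is not part of Azure; it assumes the student number is given as digits only, without the leading "s".

```shell
# Hypothetical helper: derive the three resource names used in this task
# from a student number (digits only, without the leading "s").
hdi_names() {
    student="$1"
    case "$student" in
        ''|*[!0-9]*)
            echo "student number must contain digits only" >&2
            return 1
            ;;
    esac
    echo "cluster=s${student}cluster"
    echo "storage=${student}storage"
    echo "container=${student}container"
}

# Example with the student number used in the task sheet.
hdi_names 12345678
```

Using only digits and lower case letters keeps the names within the "no upper case letters" rule mentioned for the cluster name.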

Click Next to proceed to Security and Networking. Do not change the default options.

Click Next to proceed to Configuration+pricing. Examine the default resources for the cluster. There are head nodes, Zookeeper nodes and worker nodes.

You cannot change the number of nodes for the first two node types, but you must set the number of worker nodes to 2.

For the nodes, choose the following options:

For Head node, select E2 V3 (2 cores, 16 GB RAM).
For Zookeeper, select A1 v2 (1 core, 2 GB RAM).
For Worker node, select A5 (2 cores, 14 GB RAM) and set the number of nodes to 2.
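
Before looking at the available cores, it can help to work out how many cores your own cluster will request. The sketch below is only an illustration: total_cores is a made-up helper, and the head and Zookeeper node counts are placeholder values (assumptions), so replace them with the counts shown on the Configuration+pricing page.

```shell
# Hypothetical helper: total cores = sum over node types of (count x cores).
# Arguments come in pairs: <node count> <cores per node>.
total_cores() {
    total=0
    while [ "$#" -ge 2 ]; do
        total=$((total + $1 * $2))
        shift 2
    done
    echo "$total"
}

# Cores per node come from the list above (head 2, Zookeeper 1, worker 2).
# The worker count (2) is given in the task; the other two counts are
# ASSUMED placeholders, so check them against the portal.
HEAD_NODES=2
ZOOKEEPER_NODES=3
WORKER_NODES=2
total_cores "$HEAD_NODES" 2 "$ZOOKEEPER_NODES" 1 "$WORKER_NODES" 2
```

Comparing this number with the available cores shown for Australia East tells you how much of the quota a single cluster consumes.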

Observe the information about available cores in Australia East.

Question 2: How many cores are available in total in this area? Did you expect more/less?

Do not make changes in the Script actions section.

Click Review+create. On the summary page, you get to create the cluster. It typically takes a few minutes for the cluster to be up and running.

To find out about the progress (and possible errors), click on notifications on the top right (bell-shaped icon).

2. Using Storage on Azure

To analyse a file using MapReduce, you have to put the file where MapReduce can find it. There are two options: Data Lakes and Azure Storage. We will use the Azure Storage account that we created earlier.

Go to the storage account when it has been created. Click on Storage browser (preview).

Click on Blob Container. You should see the container you created earlier. Click on it. Click Upload and select the file on your file system whose words you want to count. This is what the dashboard should look like:

3. Running wordcount

When the deployment has completed, open a command window and type (or copy) the command for an ssh connection:

ssh sshuser@s<yourstudentnumber>cluster-ssh.azurehdinsight.net

Type the password. If you type it correctly, you will see:

This is to tell you that your computer has never had any dealings with this host and does not recognise its signature. You can safely say yes.

Invoke the wordcount example already on HDInsight:

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount

Question 3: What does the interface tell you? How do you think you can fix this?

The file you are using should be here:

wasb://<yourstudentnumber>container@<yourstudentnumber>storage.blob.core.windows.net/<yourfilename>

You can use this as an output directory:

wasb://<yourstudentnumber>container@<yourstudentnumber>storage.blob.core.windows.net/output
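
Because these addresses are long and easy to mistype, you may want to assemble them in shell variables once and reuse them. This is a minimal sketch: wasb_uri is a made-up helper, the student number is the example from this task sheet, and mytext.txt is a made-up file name standing in for your own file.

```shell
# Hypothetical helper: assemble a wasb:// address from its parts.
# Usage: wasb_uri <container> <storage account> <path inside the container>
wasb_uri() {
    printf 'wasb://%s@%s.blob.core.windows.net/%s\n' "$1" "$2" "$3"
}

STUDENT=12345678   # example student number from the task sheet
# mytext.txt is a made-up file name; use the name of the file you uploaded.
INPUT=$(wasb_uri "${STUDENT}container" "${STUDENT}storage" "mytext.txt")
OUTPUT=$(wasb_uri "${STUDENT}container" "${STUDENT}storage" "output")

echo "$INPUT"
echo "$OUTPUT"
```

The variables $INPUT and $OUTPUT can then be used wherever the full addresses are needed during the same ssh session.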

Question 4: What does the wasb prefix mean, and how does it relate to HDFS?

If the wordcount example runs successfully, it creates a file called part-r-00000 (a job that uses more than one reducer, as is common for larger inputs, would create additional files with different numbers).
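
To get a feel for what part-r-00000 contains, the map, shuffle and reduce phases can be imitated on a single machine with standard Unix tools. This is only a local illustration of the principle, not the Hadoop implementation; local_wordcount is a made-up name, and the sketch assumes that words are separated by whitespace and that each output line has the form word, tab, count.

```shell
# Local imitation of the wordcount phases (not the Hadoop implementation).
local_wordcount() {
    tr -s '[:space:]' '\n' |          # "map": emit one word per line
        grep -v '^$' |                 # drop an empty line caused by leading whitespace
        LC_ALL=C sort |                # "shuffle": bring identical words together
        uniq -c |                      # "reduce": count each group of identical words
        awk '{ print $2 "\t" $1 }'     # print as word<TAB>count
}

# Small made-up input to try it out.
printf 'the cat and the hat\nthe end\n' | local_wordcount
```

On a cluster the same three phases run in parallel on different nodes, which is where the speed-up on big data sets comes from.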

Show the part-r-00000 file on the command line. Use the command:

hdfs dfs -cat wasb://<directory-path>/output/part-r-00000
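
The output file can be long, so it may be useful to look only at the most frequent words. The following sketch assumes each line of the file consists of a word, a tab and a count; top_words is a made-up helper and not part of Hadoop.

```shell
# Hypothetical helper: show the N most frequent words (default 10)
# from lines of the form word<TAB>count read on standard input.
top_words() {
    sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-10}"
}

# Small made-up sample in the assumed format.
printf 'and\t1\nthe\t3\ncat\t2\n' | top_words 2
```

On the cluster, the output of the hdfs dfs -cat command shown above can be piped into top_words in the same way.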

Take a screenshot of the command and the beginning of the file and put it into your answer file for Doubtfire. Example: