COS80023 - Task 8 Classifiers

Overview

Using the R commands provided, classify the entries in a simple dataset to familiarise yourself with two classifiers: support vector machine (SVM) and random forest (RF).

Purpose

Learn to use R for classification. Practise the classification topics discussed in the presentation. Learn to interpret the quality of a classification result.

Task

Go through the steps described below. Answer the questions in a separate file.

Time

This task should be completed in your 8th tutorial or the week after and submitted to Canvas for feedback. It should be discussed and signed off in tutorial 9 or 10.

This task should take no more than 1 hour to complete (excluding introductory videos).

Resources

Lecture Presentation
Code listing with more explanations (below)
Any other material you find useful in explaining the results

Feedback

Demonstrate your steps and discuss your answers with the tutorial instructor.

Next

Get started on module 9.

Pass Task 8 — Submission Details and Assessment Criteria

Follow the steps below and answer the questions in a separate file, then upload to Canvas as a PDF. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.

Task 8

Exercise 1

Run these lines in RStudio. They load the Iris dataset, partition it into training and testing sets (stratified by species), and specify 10-fold cross-validation repeated three times.

library("caret")

(This loads the caret package, which provides the functions we need. If you need to know what a function does, you can ask for help using ?, e.g. ?trainControl. Note that R is case-sensitive.)

iris.data <- read.csv(file.choose())
View(iris.data)

iris.data$species <- as.factor(iris.data$species)
set.seed(32984)

indexes <- createDataPartition(iris.data$species, times = 1,
                               p = 0.7, list = FALSE)

train <- iris.data[indexes,]
test <- iris.data[-indexes,]

trainctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
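To check that the partition behaves as described, you can count the classes in each subset. The following is only a minimal sketch: it assumes the built-in iris data frame as a stand-in for the CSV file (its class column is named Species with a capital S, unlike the species column used above), and the names idx, train_demo and test_demo are made up for this illustration.

```r
library(caret)

# Assumption: the built-in iris data frame stands in for the CSV file above.
# Its class column is called Species (capital S).
data(iris)
set.seed(32984)

# Stratified split: about 70% of each species goes into the training set.
idx <- createDataPartition(iris$Species, times = 1, p = 0.7, list = FALSE)
train_demo <- iris[idx, ]
test_demo  <- iris[-idx, ]

# Class counts per subset; a stratified split keeps the classes balanced.
table(train_demo$Species)
table(test_demo$Species)
```

If the counts per species are equal within each table, the split has preserved the class proportions of the full dataset.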

Here you train the SVM model (with a linear kernel, hence “svmLinear”) and then apply the trained model to the test set. Printing svmlin shows the cross-validated accuracy estimated on the training dataset; confusionMatrix shows the accuracy on the test dataset.

svmlin <- train(species ~., data=train, method="svmLinear", trControl=trainctrl,
                preProcess = c("center", "scale"), tuneLength=10)

svmlin

svmresult <- predict(svmlin, newdata=test)
svmresult

confusionMatrix(table(svmresult, test$species))
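If you would rather read the two accuracy figures side by side than pick them out of the printed output, they can be extracted from the fitted objects. This is a hedged sketch rather than part of the required steps: it again assumes the built-in iris data frame (class column Species) instead of the CSV file, and the names fit, cv_acc, cm and test_acc are illustrative only. The svmLinear method needs the kernlab package to be installed.

```r
library(caret)

# Assumption: built-in iris data instead of the CSV file used in the task.
data(iris)
set.seed(32984)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_demo <- iris[idx, ]
test_demo  <- iris[-idx, ]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit <- train(Species ~ ., data = train_demo, method = "svmLinear",
             trControl = ctrl, preProcess = c("center", "scale"))

# Accuracy estimated by cross-validation on the training data.
cv_acc <- max(fit$results$Accuracy)

# Accuracy on the held-out test data.
pred <- predict(fit, newdata = test_demo)
cm <- confusionMatrix(pred, test_demo$Species)
test_acc <- unname(cm$overall["Accuracy"])

c(cross_validated = cv_acc, test = test_acc)
```

The two numbers printed at the end are the ones to compare when you consider the question that follows.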

Question 1. Does this model suffer from overfitting? How can you tell?