7318AFE – Business Data Analytics Assignment Help
Assignment Guide
- Answer ALL questions.
- Total marks: 40.
- Data for this assignment are available from the course L@G website.
- Use Jupyter Notebook/Python to produce your answers.
- Justification is required if your (Python) coding goes beyond what has been taught in class – a lack of clear justification will cost marks (even if the results are correct).
- Type up your answers in a Word file (you may copy & paste some of the Python outcomes to the file).
- Assignment submission: upload two files – [1] the Word file and [2] the Jupyter Notebook/Python file (either PDF (preferred) or .ipynb).
- Note that the Word file should be self-contained (answers + supporting Python outcomes + discussion/interpretation).
- You must complete the Academic Integrity Declaration for online exams and the SafeAssign check before submission.
- Late submission without approval is subject to penalty.
- Due: 1pm (Friday) 09 October 2020.
Question 1 (10 marks)
Use data from H2019.csv to answer this question. The dataset has the following variables:
- Country: country name
- Hscore: happiness score (higher score implies higher degree of happiness)
- logGDP: log GDP per capita
- Freedom: freedom to make life choice (higher score implies higher degree of freedom)
- GQuality: government quality (higher score implies higher degree of government quality)
1. Classify countries with logGDP larger than 9.9 as ‘Rich’ countries and all others as ‘not-Rich’ countries. How many countries are in the Rich group? Test whether the mean of Hscore for Rich countries is the same as that of not-Rich countries. Use the 6 steps of hypothesis testing to report the testing outcomes.
2. Find the 99% confidence interval (CI) of the mean of GQuality for Rich and not-Rich countries separately. Present the CI formula and interpret the results.
3. Report the OLS regression results of Hscore on logGDP, Freedom, and GQuality with statsmodels for all countries. Interpret the estimated slope coefficients (the partial effects) and the R-squared.
4. Obtain the predicted Hscore based on the regression result of Part 3. Is the country with the highest predicted Hscore the same as the one with the highest actual Hscore? Which country’s Hscore is most over-predicted? Which country’s Hscore is most under-predicted?
5. Extend the regression model in Part 3 to assess whether logGDP and GQuality show a synergy (complementary) effect on Hscore. Comment on the results. (A Python sketch covering all parts of this question follows.)
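A minimal sketch for Parts 1-5, assuming H2019.csv is in the working directory and the column names match the list above; the Rich, pred, and resid columns are helper names introduced here. For Part 2, the usual t-based CI formula is x̄ ± t(α/2, n−1) × s/√n.

```python
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv('H2019.csv')

# Part 1: flag Rich countries and run a two-sample t-test on Hscore
df['Rich'] = df['logGDP'] > 9.9          # helper column, not in the raw data
rich, notrich = df[df['Rich']], df[~df['Rich']]
print('Rich countries:', len(rich))
t, p = stats.ttest_ind(rich['Hscore'], notrich['Hscore'])
print('t =', t, ', p-value =', p)        # reject H0 of equal means if p < alpha

# Part 2: 99% t-based CI for the mean of GQuality, by group
for name, g in df.groupby('Rich')['GQuality']:
    ci = stats.t.interval(0.99, len(g) - 1, loc=g.mean(), scale=g.sem())
    print('Rich =', name, '99% CI:', ci)

# Parts 3-4: OLS of Hscore on the three regressors, then prediction errors
ols = smf.ols('Hscore ~ logGDP + Freedom + GQuality', data=df).fit()
print(ols.summary())
df['pred'] = ols.fittedvalues
df['resid'] = df['Hscore'] - df['pred']  # negative residual = over-predicted
print('Highest predicted:', df.loc[df['pred'].idxmax(), 'Country'])
print('Highest actual:   ', df.loc[df['Hscore'].idxmax(), 'Country'])
print('Most over-predicted: ', df.loc[df['resid'].idxmin(), 'Country'])
print('Most under-predicted:', df.loc[df['resid'].idxmax(), 'Country'])

# Part 5: an interaction term captures a synergy effect of logGDP and GQuality
inter = smf.ols('Hscore ~ logGDP + Freedom + GQuality + logGDP:GQuality',
                data=df).fit()
print(inter.summary())
```

Note that ttest_ind assumes equal variances by default; pass equal_var=False for Welch’s version if that is the variant taught in class.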
Question 2 (10 marks)
Load Covid19 data from the following link https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv and use the data to answer this question. Obtain data up to the end of September (2020-09-30). (Alternatively, use covid0930.csv.) Note that the three numerical variables (Deaths, Confirmed, and Recovered) in the dataset are cumulative figures (i.e. running totals up to each date).
1. Find countries with total (cumulative) deaths over 15,000 and obtain, for each of these countries, the first date on which this happened. Find the dates on which the global total of Covid deaths surpassed 500,000 and 1,000,000, respectively.
2. Compute the global daily death rate (i.e. total “Deaths” divided by total “Confirmed” across all countries on each day) and plot it in a line chart. Identify the period during which the overall daily death rate is higher than 4%.
3. Find countries with at least one day having more than 20,000 new confirmed cases. How many countries have never experienced over 100 daily new confirmed cases?
4. Find the increase in deaths globally during each of the following months: March, April, May, June, July, August, and September (i.e. new deaths in each month) and make a bar plot to present these 7 monthly death figures. Comment on the plot.
5. Make 3 line plots (in one graph) of daily new confirmed cases in Australia: original, 7-day rolling (moving average), and 14-day rolling. Comment on the plot and identify the peak date of each series (line). (A Python sketch covering all parts of this question follows.)
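A minimal sketch for Parts 1-5, assuming the GitHub URL is still reachable (otherwise point read_csv at covid0930.csv); the NewConfirmed helper column is introduced here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

url = ('https://raw.githubusercontent.com/datasets/covid-19/'
       'master/data/countries-aggregated.csv')
df = pd.read_csv(url, parse_dates=['Date'])
df = df[df['Date'] <= '2020-09-30']

# Part 1: first date each country's cumulative deaths exceeded 15,000
print(df[df['Deaths'] > 15000].groupby('Country')['Date'].min())
glob = df.groupby('Date')[['Confirmed', 'Deaths']].sum()
print('500k deaths:', glob[glob['Deaths'] > 500000].index.min())
print('1m deaths:  ', glob[glob['Deaths'] > 1000000].index.min())

# Part 2: global daily death rate = cumulative Deaths / cumulative Confirmed
rate = glob['Deaths'] / glob['Confirmed']
rate.plot(title='Global daily death rate')
plt.show()
high = rate[rate > 0.04]                       # assumes one contiguous stretch
print(high.index.min(), 'to', high.index.max())

# Part 3: daily new confirmed cases per country via a within-country diff
df['NewConfirmed'] = df.groupby('Country')['Confirmed'].diff()  # helper column
print(df.loc[df['NewConfirmed'] > 20000, 'Country'].unique())
print('Never over 100 daily new cases:',
      (df.groupby('Country')['NewConfirmed'].max() <= 100).sum())

# Part 4: new global deaths per calendar month, March (3) to September (9)
new_deaths = glob['Deaths'].diff()
new_deaths.groupby(new_deaths.index.month).sum().loc[3:9].plot(kind='bar')
plt.show()

# Part 5: Australia's daily new cases, raw and with 7/14-day rolling means
aus = df[df['Country'] == 'Australia'].set_index('Date')['NewConfirmed']
pd.DataFrame({'daily': aus,
              '7-day': aus.rolling(7).mean(),
              '14-day': aus.rolling(14).mean()}).plot()
plt.show()
print(aus.idxmax(), aus.rolling(7).mean().idxmax(),
      aus.rolling(14).mean().idxmax())
```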
Question 3 (10 marks)
Use the dataset yelp1000.xlsx to answer this question. The dataset contains the following columns: date, stars, text (restaurant text reviews), cool, useful, and funny.
1. Create a pie plot of the number of reviews (texts) for each type of star rating. Find the most frequent word used (excluding stopwords) for each type of star rating.
2. Obtain sentiment scores (compound, neg, neu and pos) for each review (text). Report the most positive text and the most negative one. How many texts have a neu score equal to 1?
3. Find the interquartile range of compound for each stars group (1 to 5) and make a boxplot of compound using a different color for each stars group. Comment on the outcomes.
4. Find the total number of reviews in 2011 and 2012, respectively. Find the proportion of reviews with compound below zero in 2011 and 2012, respectively. Compute the 90% confidence interval of the proportion of compound below zero in 2011 and 2012, respectively. Comment on the results.
5. Use “apply(len)” to create a new column called “length”, which is the number of characters in the text column. Test whether 60% (60 per cent) of the reviews (texts) have more than 500 characters (including spaces). State the null hypothesis and the alternative hypothesis. Comment on the results. (A Python sketch covering all parts of this question follows.)
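A minimal sketch for Parts 1-5, assuming NLTK’s VADER analyzer for the sentiment scores (the question does not name a library) and that openpyxl is installed for read_excel; the year column is a helper added here.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from statsmodels.stats.proportion import proportions_ztest

nltk.download('stopwords')
nltk.download('vader_lexicon')
df = pd.read_excel('yelp1000.xlsx')

# Part 1: pie chart of review counts by stars; top non-stopword per group
df['stars'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.show()
sw = set(stopwords.words('english'))
for s, g in df.groupby('stars'):
    words = g['text'].str.lower().str.split().explode()
    print(s, words[~words.isin(sw)].value_counts().idxmax())

# Part 2: VADER sentiment scores appended as four new columns
sia = SentimentIntensityAnalyzer()
df = pd.concat([df, df['text'].apply(sia.polarity_scores).apply(pd.Series)],
               axis=1)
print('Most positive:', df.loc[df['compound'].idxmax(), 'text'])
print('Most negative:', df.loc[df['compound'].idxmin(), 'text'])
print('neu == 1:', (df['neu'] == 1).sum())

# Part 3: IQR of compound per stars group and a colored boxplot
q = df.groupby('stars')['compound'].quantile
print(q(0.75) - q(0.25))
sns.boxplot(x='stars', y='compound', data=df, palette='Set2')
plt.show()

# Part 4: yearly counts, proportion of negative compound, 90% normal-approx CI
df['year'] = pd.to_datetime(df['date']).dt.year   # helper column
z = stats.norm.ppf(0.95)
for yr in (2011, 2012):
    sub = df[df['year'] == yr]
    n, p = len(sub), (sub['compound'] < 0).mean()
    se = np.sqrt(p * (1 - p) / n)
    print(yr, n, p, (p - z * se, p + z * se))

# Part 5: length column (characters) and a one-sample proportion z-test
df['length'] = df['text'].apply(len)
count = (df['length'] > 500).sum()   # H0: p = 0.60 vs H1: p != 0.60
stat, pval = proportions_ztest(count, len(df), value=0.60)
print('z =', stat, ', p-value =', pval)
```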
Question 4 (10 marks)
Use loans1000.csv for this question. There are 6 variables in the dataset (see below) — it is of interest to use [1]-[5] to predict [6]. Set random_state=1234 throughout (when required).
- [1] credit.policy: 1 if the customer meets the credit underwriting criteria, and 0 otherwise.
- [2] log.annual.inc: The natural log of the self-reported annual income of the borrower.
- [3] dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- [4] fico: The FICO credit score of the borrower.
- [5] delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- [6] not.fully.paid: 1 if not fully paid the loan, and 0 otherwise.
1. How many borrowers did not fully pay the loan? Are the mean values of the FICO credit score significantly different between fully-paid borrowers and not.fully.paid borrowers? Justify.
2. Apply KMeans clustering using [1]-[5] (i.e. credit.policy, log.annual.inc, dti, fico, and delinq.2yrs). Find the optimal number of clusters (set the maximum number of clusters to 10) without scaling. Justify your answer.
3. Form 2 clusters (with KMeans) and use a crosstab to examine whether the clustering outcome is in line with borrowers’ “not.fully.paid” status. Comment on the results.
4. Create a random partition of the loans1000.csv dataset with 70% of observations in the training set and the remaining 30% in the test set. Report the sample mean and standard deviation for each of the variables in the train and test sets separately. Comment on the results.
5. Based on the data split of Part 4, apply Decision Tree, Random Forest, and Gradient Boosting (with n_estimators=1000) to the training set and use the fitted models to predict the test set. Use accuracy, precision, and recall to evaluate the performance of these models. Comment on the results. (A Python sketch covering all parts of this question follows.)
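A minimal sketch for Parts 1-5 with scikit-learn, assuming the elbow (inertia) plot is an acceptable way to pick the number of clusters and setting random_state=1234 wherever randomness enters.

```python
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

df = pd.read_csv('loans1000.csv')
X = df[['credit.policy', 'log.annual.inc', 'dti', 'fico', 'delinq.2yrs']]
y = df['not.fully.paid']

# Part 1: count defaults and t-test fico between the two payment groups
print('Not fully paid:', (y == 1).sum())
t, p = stats.ttest_ind(df.loc[y == 0, 'fico'], df.loc[y == 1, 'fico'])
print('t =', t, ', p-value =', p)

# Part 2: elbow plot of within-cluster sum of squares for k = 1..10 (no scaling)
inertia = [KMeans(n_clusters=k, random_state=1234).fit(X).inertia_
           for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker='o')
plt.show()

# Part 3: 2 clusters cross-tabulated against not.fully.paid status
labels = KMeans(n_clusters=2, random_state=1234).fit_predict(X)
print(pd.crosstab(labels, y))

# Part 4: 70/30 train-test split, then mean and std for each set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234)
print(X_train.describe().loc[['mean', 'std']])
print(X_test.describe().loc[['mean', 'std']])

# Part 5: fit three classifiers and compare test-set accuracy/precision/recall
models = {'Tree': DecisionTreeClassifier(random_state=1234),
          'Forest': RandomForestClassifier(n_estimators=1000, random_state=1234),
          'Boosting': GradientBoostingClassifier(n_estimators=1000,
                                                 random_state=1234)}
for name, m in models.items():
    pred = m.fit(X_train, y_train).predict(X_test)
    print(name, accuracy_score(y_test, pred),
          precision_score(y_test, pred), recall_score(y_test, pred))
```

Scaling is deliberately skipped in the KMeans step because the question says “without scaling”.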