Question 3 (10 marks)
Use the dataset yelp1000.xlsx to answer this question. The dataset contains the following columns: date, stars, text (restaurant text reviews), cool, useful, and funny.
- Create a pie chart of the number of reviews (texts) for each star rating. Find the most frequent word (excluding stopwords) for each star rating.
- Obtain the sentiment scores (compound, neg, neu and pos) of each review (text). Report the most positive text and the most negative one. How many texts have a neu score equal to 1?
- Find the interquartile range of compound for each stars group (1 to 5) and make a boxplot of compound using a different color for each stars group. Comment on the outcomes.
- Find the total number of reviews in 2011 and in 2012. Find the proportion of reviews with compound below zero in each of the two years. Compute the 90% confidence interval of that proportion for 2011 and for 2012. Comment on the results.
- Use "apply(len)" to create a new column called "length", which is the number of characters (including spaces) in the text column. Test whether 60% (60 per cent) of the reviews (texts) have more than 500 characters. State the null hypothesis and the alternative hypothesis. Comment on the results.
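
The confidence-interval and hypothesis-test parts above rest on two standard formulas for a sample proportion, which can be sketched as follows. This is a minimal sketch under stated assumptions, not a model answer: it assumes the normal (Wald) approximation is acceptable, a two-sided alternative for the test, and only the Python standard library. The function names proportion_ci and proportion_z_test are hypothetical, and the counts in the usage lines are made up for illustration; they are not taken from yelp1000.xlsx.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.90):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.645 for a 90% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def proportion_z_test(successes, n, p0):
    """Two-sided one-sample z-test of H0: p = p0 against H1: p != p0."""
    p_hat = successes / n
    # The standard error uses p0 because it is computed under H0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts only (assumed, NOT from the dataset):
# 40 of 400 reviews in some year have compound below zero.
low, high = proportion_ci(40, 400, level=0.90)
print(f"90% CI for the proportion: ({low:.4f}, {high:.4f})")

# Illustrative counts only: 550 of 1000 reviews have length > 500.
z_stat, p_val = proportion_z_test(550, 1000, p0=0.60)
print(f"z = {z_stat:.3f}, p-value = {p_val:.4f}")
```

With real data, the counts would come from the dataset, for example the number of rows with compound below zero in a given year and the number of rows with length above 500. If the two yearly intervals overlap, the data give little evidence that the proportions differ; a small p-value in the test is evidence against the 60% claim.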