Jupyter Notebook/Python Data Questions & Answers
Question 1
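- The formula used for the conventional interval estimate of the mean is:
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
where $\bar{x}$ is the mean, $z_{\alpha/2}$ is the confidence coefficient, $\alpha$ is the significance level (so $1 - \alpha$ is the confidence level), $\sigma$ is the standard deviation and $n$ is the sample size.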
The conventional 95% interval estimates of the mean return show that Amazon has the highest interval, (-0.005548981610798781, 0.27604037484060673), which means the lower estimate of the mean is about -0.55% and the upper estimate is about 27.6%. Google has the lowest stock return, with a lower estimate of -2.37% and an upper estimate of 17.37%. The intervals for Apple, Facebook and Microsoft all lie between these two.
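The interval formula for the mean can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code: the function name `mean_confidence_interval` and the `returns` list are hypothetical, and the sample standard deviation is assumed to stand in for σ.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) for the mean using the normal (z) approximation."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # assumption: sample standard deviation used in place of sigma
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical returns for illustration only, not the assignment's stock data.
returns = [0.012, -0.004, 0.021, 0.003, -0.015, 0.008, 0.017, -0.002]
lower, upper = mean_confidence_interval(returns)
print(f"95% interval for the mean: ({lower:.4f}, {upper:.4f})")
```

The interval is symmetric around the sample mean, and its width shrinks in proportion to the square root of the sample size.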
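- The formula used for the conventional interval estimate of a proportion is:
$$p \pm z_{\alpha/2} \sqrt{\frac{p(1 - p)}{n}}$$
where $p$ is the probability (the sample proportion), $z_{\alpha/2}$ is the confidence coefficient and $n$ is the sample size.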
The conventional interval for a proportion estimates the probability of something occurring, in our case the probability of a positive stock return. Apple, Google, Facebook and Microsoft share the same interval, (0.5003438975423443, 0.5876561024576558), while Amazon has a higher chance of a positive return, with the interval (0.5144697338004759, 0.6015302661995242).
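As a rough sketch of the proportion interval, the snippet below applies the normal approximation to a count of positive-return days. The function name and the counts (272 positive days out of 500) are assumptions. They are not stated in the write-up, but they happen to reproduce the interval reported above for Apple, Google, Facebook and Microsoft.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for a proportion using the normal approximation."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Assumed counts for illustration: 272 positive-return days out of 500.
low, high = proportion_confidence_interval(272, 500)
print(low, high)
```

Under the same assumption of 500 observations, 279 positive days would give Amazon's reported interval.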
- The heatmap with hierarchical clustering based on the pairwise correlation among the five stock returns.
The conventional estimates of the mean show less variation across the stocks in the lower bounds than in the upper bounds.
The conventional estimates of the proportion show that only one stock, Amazon, has a different interval, while all the others have the same values.
Question 2
- Null hypothesis: More movies are money-making than money losing.
Alternate hypothesis: More movies are money-losing than money making.
Significance level: Usually this level is 5%, so there is a 5% chance of rejecting the null hypothesis when it is actually true.
Collection of data: The data was already collected in movies.xlsx.
Test performed on the data: After extracting the data, it was divided into two groups, one with gross greater than or equal to budget and another with budget greater than gross. The total number of movies in each group was then counted.
Conclusion: “More movies are money-losing than money-making”, hence we reject the null hypothesis and accept the alternate hypothesis.
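Counting the two groups gives the direction of the result. One way to attach a p-value to those counts is a one-sided z-test on the proportion of money-making movies. This is a swapped-in illustration rather than the test used in the notebook, and the `movies` list below is hypothetical data, not the contents of the spreadsheet.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p0=0.5):
    """One-sided z-test of H0: p >= p0 against H1: p < p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = NormalDist().cdf(z)  # lower-tail probability
    return z, p_value


# Hypothetical (gross, budget) pairs for illustration only.
movies = [(120, 100), (40, 60), (15, 30), (80, 90), (200, 150),
          (10, 25), (55, 70), (30, 45), (95, 110), (20, 35)]
money_making = sum(1 for gross, budget in movies if gross >= budget)
z, p_value = proportion_z_test(money_making, len(movies))
print(f"money-making: {money_making}, z = {z:.3f}, p-value = {p_value:.4f}")
```

The null hypothesis is rejected at the 5% level when the p-value falls below 0.05.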
- Simple linear regression model of Gross and Budget gives:
coefficient of determination: 0.5696152869871645
intercept: 28.66019341959354
slope: [0.32764754]
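For reference, the three reported quantities (slope, intercept and coefficient of determination) can be computed by ordinary least squares as sketched below. The function name and the sample numbers are hypothetical, and the sketch does not assume which of Gross and Budget was used as the predictor in the notebook.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical predictor and response values for illustration only.
x_values = [10, 20, 30, 40, 50]
y_values = [25, 38, 70, 85, 120]
slope, intercept, r_squared = fit_simple_linear_regression(x_values, y_values)
print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.4f}")
```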
- The scatter plot of gross and budget is:
In the scatter plot an outlier can be clearly seen, where gross is around 1000 and budget is around 250. After removing this outlier from the data, the simple linear regression model gives the following outcome:
coefficient of determination: 1.0
intercept: [-2.13162821e-14 0.00000000e+00]
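slope: [1.51211284e-16 -7.56056420e-17]
A coefficient of determination of exactly 1.0, together with coefficients on the order of 1e-16 and two values each for intercept and slope, suggests that this refit did not use the same single-predictor setup as the first model, so these numbers should be rechecked before they are interpreted.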
Question 3
- Here is the summary of the logistic regression:
Age, income and religion appear to have a negative relationship with Party, as their coefficients (coef) in the summary are negative. Age and income may have only a negligible relationship with Party, but religion seems to have a strong one. Female, married and education have a positive effect, and the strongest positive relationship is between married and Party. The standard error (std err) measures how precisely each coefficient is estimated, not how well the model fits as a whole. Religion, married and female have larger standard errors, so their coefficients are estimated less precisely, while the coefficients for age, income and education are estimated more precisely.
- Predictions : [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
Confusion Matrix : [[105 31]
[ 28 86]]
Test accuracy = 0.764
The logistic regression above predicts, from age, female, married, income, education and religion combined, which party each person will vote for. Recall, precision and accuracy all fall within the range of 72-77%, which shows that the model's predictions were reasonably accurate on the given data.
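The accuracy, precision and recall figures can be recovered directly from the confusion matrix reported above. The sketch assumes the scikit-learn layout, with rows as actual classes and columns as predicted classes, and treats Party 1 as the positive class.

```python
# Confusion matrix as reported above. Assumption: rows are actual classes
# and columns are predicted classes (the scikit-learn convention).
confusion = [[105, 31],
             [28, 86]]

tn, fp = confusion[0]  # actual Party 0: predicted 0, predicted 1
fn, tp = confusion[1]  # actual Party 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted Party 1 that is correct
recall = tp / (tp + fn)     # share of actual Party 1 that is found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.764 precision=0.735 recall=0.754
```

The accuracy of 0.764 matches the reported test accuracy, and precision and recall for Party 1 fall inside the 72-77% range mentioned above.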
- The number of voters in Group A are: 72
The number of voters in Group B are: 50
The conventional 90% interval estimations of the mean return of Group A is: (0.8578899996655144, 0.8836078073098971)
The conventional 90% interval estimations of the mean return of Group B is: (0.8279407718774723, 0.8488267854277971)
As can be seen, Party 1 has more voters than Party 0: Group A contains the voters who will vote for Party 1 with a 75% chance, and Group B is defined the same way for Party 0. The 90% interval estimates do not overlap, and the interval for Group A lies above the interval for Group B, so the mean of the Group A population is estimated to be higher than that of Group B.
- The elbow method outcome was the following. The bend of the elbow can be clearly seen at 3, hence the number of clusters should be 3. To verify this, I plotted a scatter plot of two variables to check that the clusters are clearly separated.
The crosstab results were the following:
Yes, the results are compatible with each other, and the crosstab clearly shows how the parties are divided among these three clusters.
Question 4
- The mean of number of billionaires in rich countries is: 969697
The mean of number of billionaires in not rich countries is: 18.323529
The mean number of billionaires is greater in Rich than in Not rich, as can also be seen in the bar plot. Not rich is more symmetric around its mean value, because the grey line extends almost the same length outside and inside the bar. The Rich data is not symmetric around the mean, because the part of the line extending outside the bar is clearly longer than the part inside; hence the Rich data fluctuates more.
- #### mean-Rich = meanN-Not_Rich ####
p-values [0.36657296 0.31869323 0.50598773 0.35726403]
we are accepting null hypothesis
#### mean-Rich = 2*meanN-Not_Rich ####
p-values [0.38570765 0.03963756 0.71746588 0.36639039]
we are accepting null hypothesis
As can be seen, the null hypothesis is not rejected, so the mean values of Rich and Not rich are not shown to differ significantly, despite the fact that the Rich data is more spread out than the Not rich data.
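The write-up does not say which tests produced the four p-values. As one possible illustration of testing equality of two means without distributional assumptions, the sketch below uses a permutation test. This is a swapped-in technique, and the function name and both data lists are hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical billionaire counts per country, not the assignment's data.
rich = [540, 90, 30, 12, 8, 5, 40]
not_rich = [100, 20, 15, 10, 6, 4, 3, 2]
p_value = permutation_test(rich, not_rich)
print(f"p-value = {p_value:.4f}")
```

A p-value above the chosen significance level means the null hypothesis of equal means is not rejected, which is weaker than showing that the means are equal.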
- The following scatter plots with fitted line:
The correlation coefficient of Number vs GDP is: 0.9617781244817305
The correlation coefficient of Number vs GDP per capita is: 0.07362287444706263
There is a strong relationship between the number of billionaires and GDP. A plausible reason is that companies such as Microsoft and Google contribute to GDP, and the owners of such companies are usually billionaires. With GDP per capita there is almost no relationship; the likely reason is population, since a billionaire may contribute to the economy as a whole but not to the earnings of an individual in the country.
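The two correlation coefficients above are Pearson correlations, which can be computed as in the following sketch. The function name and the small data set are hypothetical and serve only to show how dividing GDP by population can change the correlation.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for four countries, for illustration only.
number = [540, 100, 50, 10]
gdp = [19000, 4000, 2000, 500]
population = [320, 1300, 80, 5]
gdp_per_capita = [g / p for g, p in zip(gdp, population)]

print("Number vs GDP:", pearson_correlation(number, gdp))
print("Number vs GDP per capita:", pearson_correlation(number, gdp_per_capita))
```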
- Here are the results of the multiple regression:
The coefficients in the model with GDP per capita and population appear to indicate a stronger relationship than those in the model with GDP and population, but the standard errors are also greater in the GDP per capita model than in the GDP model. Number has a negative coefficient when combined with GDP rather than with GDP per capita. The results are compatible with the correlations above: Number is much less related to GDP per capita than to GDP, mainly because of population. Hence, once population is included, Number is no longer as strongly related to GDP.
- Here are the scatter plots of the predictions of Number from GDP and population, and of Number from GDP per capita and population.
The US data: 64 United States 540 19197 322.622389 59.503 Rich
The role of the US in the fitted line is that of an influential point: it tilts the fitted curve towards its own value, as indicated by the curve I drew by hand in the second figure.
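One way to quantify how strongly a single point can pull a fitted line is its leverage (hat value). The sketch below uses the closed form for simple linear regression. Only the US GDP figure of 19197 comes from the row above; the other GDP values and the function name are hypothetical.

```python
def leverages(x):
    """Hat values h_i = 1/n + (x_i - mean)^2 / Sxx for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# 19197 is the US GDP quoted above; the remaining values are hypothetical.
gdp = [19197, 2600, 1500, 900, 400, 350, 120]
h = leverages(gdp)
for value, hat in zip(gdp, h):
    print(f"GDP {value:>6}: leverage {hat:.3f}")
```

The hat values of a simple regression always sum to 2, so a single point with leverage close to 1 dominates the fit, which is consistent with the line tilting towards the US.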