“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard

Statistical computing is the process through which data scientists take raw data and create predictions and models. Without an advanced knowledge of statistics, it is difficult to succeed as a data scientist. Accordingly, a good interviewer will likely probe your understanding of the subject with statistics-oriented data science interview questions. Be prepared to answer some fundamental statistics questions as part of your data science interview.

1. You fit a multiple regression to examine the effect of a particular variable a worker in another department is interested in. The variable comes back insignificant, but your co-worker says that this is impossible as it is known to have an effect. What would you say/do?

2. You have 1000 variables and 100 observations. You would like to find the significant variables for a particular response. What would you do?

3. What is the difference between Regression and Logistic Regression? Can you explain the Assumptions/Conditions?

4. If ANOVA is about comparing means, why is it called analysis of variance?

Hints: Suppose we gather the data for the simplest case above. Let’s say we find the weights of (say) 20 single people, 30 married people, 10 cohabitating people, and so on. We can then take the mean of each group. The means will not be the same in the different groups. Nor will all people in any group weigh the same. That is, for example, not all single people weigh the same amount.

We have variation within each group, and we have variation between groups. One way of measuring variation is with a statistic called the variance. If most of the variance is within groups, then we cannot conclude that the groups are different with regard to weight. On the other hand, if most of the variance is between groups that is evidence that the groups differ. How do we decide which variance is bigger, and what it means? We analyze it. That is, we perform an ANOVA, an analysis of variance.
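The comparison described in the hint can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full ANOVA: the function name `one_way_anova` and the weights are hypothetical, and the sketch stops at the F statistic (the ratio of between-group to within-group mean squares) without computing a p-value.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_between, ss_within, f_statistic) for a list of samples.

    Hypothetical helper for illustration only.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individuals around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    # Mean squares: each sum of squares divided by its degrees of freedom.
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ss_between, ss_within, ms_between / ms_within


# Made-up weights (kg) for three groups; not real data.
single = [68, 70, 72]
married = [78, 80, 82]
cohabiting = [73, 75, 77]

ss_b, ss_w, f_stat = one_way_anova([single, married, cohabiting])
print(f"between={ss_b}, within={ss_w}, F={f_stat}")
```

A large F means most of the variation sits between the groups rather than within them, which is the evidence that the group means differ.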

5. Explain the difference between R² and adjusted R². When is R² (or adjusted R²) not useful?

6. What is the Hosmer-Lemeshow test of goodness of fit? Which option requests it in SAS?

7. What do you understand by a standard normal variable?

8. Explain the Poisson distribution with an example.

9. What is the difference between skewness and kurtosis?

10. Why do we use the DESC keyword in proc logistic?

11. What do you understand by the residual chi-square test?

12. What is the difference between exploratory data analysis and confirmatory data analysis?

13. Explain the four scales of measurement.

14. Can we draw a bar diagram for quantitative data? Explain.

15. What is the difference between the chi-square test of independence and correlation?

16. What is a p-value? Why do we fail to reject the null hypothesis when the p-value is greater than the given level of significance?

17. What do you understand by hypothesis testing?

18. What are the different methods of estimation?

19. What do you understand by standard error?

20. What is a random variable?

21. What are odds?

22. What is an odds ratio?

23. What do you understand by the ordinary least squares technique, and where do we apply it?

24. What do you understand by the maximum likelihood technique, and where do we apply it?

25. What are one-tailed and two-tailed tests?

26. Under what circumstances do we apply a t-test or a z-test?

27. What do you understand by stepwise selection in logistic regression?

28. What is the difference between R-square and adjusted R-square?

29. Why do we call a regression simple linear regression or multiple linear regression?

30. What is a confidence interval?

31. What is the level of significance?

32. When do we do factor analysis?

33. What is the KMO-MSA measure?

34. What is an eigenvalue, and what is the mineigen criterion for keeping factors?

36. What is cluster analysis?

37. What are the different methods of clustering?

38. What is the single linkage method?

39. What is a dendrogram?

40. What are the demerits of the chi-square test of independence?

41. What is the purpose of Tukey’s test in two-way ANOVA?

42. Why do we use the model keyword in proc glm or proc anova code?

43. When do we use proc glm over proc anova?

44. What is goodness of fit in regression?

45. What are the assumptions of the classical linear regression model? Explain.

46. What is white noise?

47. How do we check for multicollinearity?

48. How do we overcome the problem of multicollinearity?

49. What is the spec test?

50. What do you understand by autocorrelation in regression?

51. Why do we split the data set into two, i.e. training and validation?

52. What is VIF, and what is it used for?

53. What is ROC?

54. What is the measure of ROC?

55. What is a classification table? Explain.

56. What are false positives and false negatives?

57. What is the error term in a regression equation?

58. What is the Box-Jenkins methodology?

59. What is stationarity?

60. How do we make time series data stationary?

61. What do you understand by mean stationary?

62. What do you understand by variance stationary?

63. What are the parameters of the normal distribution?

64. What is the use of the RANUNI keyword?

65. What are the four models of time series? Explain.

66. What is correlation and partial correlation?

67. What is an individual null hypothesis?

68. What is a global null hypothesis?

69. What is 1-specificity?

70. What is the Durbin-Watson test?

71. What do you understand by a paired sample t-test? What is the assumption for a paired sample t-test? How do we test it?

72. What do you understand by degrees of freedom?

73. Under what conditions do we apply a t-test or ANOVA?

74. What is a scatter plot?

75. What do you mean by an explanatory variable or independent variable?

76. What is a type I error?

77. What is a type II error?

78. Why do we need to standardize the variables before doing cluster analysis?