Descriptive statistics are a building block of data science. Advanced analytics is often incomplete without an analysis of the descriptive statistics of the key metrics. In simple terms, descriptive statistics are measures that summarize a given dataset, and they can be broken down further into measures of central tendency and measures of dispersion.
Measures of central tendency include the mean, the median, and the mode, while measures of dispersion (also called variability) include the standard deviation, the variance, and the interquartile range.
We will cover the topics given below:
- Mean
- Median
- Mode
- Standard Deviation
- Variance
- Interquartile Range
- Skewness
Measures of Central Tendency
Measures of central tendency describe the centre of the data, and are often represented by the mean, the median, and the mode.
The mean is the arithmetic average of the data: the sum of the values divided by their count.
In simple terms, the median is the 50th percentile, or the middle value of the sorted data. It separates the distribution into two halves.
The mode is the most frequent value of a variable in the data. It is the only measure of central tendency that can be used with nominal categorical variables: the mean requires quantitative data, and the median requires values that can at least be ordered.
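As a minimal sketch of the three measures, the example below uses Python's standard `statistics` module. The sample values and the `colours` list are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # arithmetic average: 30 / 6
median_value = statistics.median(values)  # middle of the sorted data: (3 + 5) / 2
mode_value = statistics.mode(values)      # most frequent value

# The mode also works on categories, where an average is not meaningful.
colours = ["red", "blue", "red", "green"]
colour_mode = statistics.mode(colours)

print(mean_value, median_value, mode_value, colour_mode)
```

With an even number of values, `statistics.median` averages the two middle values, which is why the median here need not be one of the observed values.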
Measures of Dispersion
Dispersion, which is also referred to as variability, scatter, or spread, describes how widely the values of a variable are distributed. The most popular measures of dispersion are the standard deviation, the variance, and the interquartile range.
Standard deviation quantifies how much a set of data values varies around its mean. A low standard deviation indicates that the data points tend to lie close to the mean, while a high standard deviation indicates that they are spread over a wider range.
Variance is another measure of dispersion. It is the square of the standard deviation, and it equals the covariance of the random variable with itself.
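The relationship between the two measures can be checked with a short sketch using Python's standard `statistics` module. The sample values are hypothetical. Note one assumption that the text above does not address: `pvariance` and `pstdev` treat the data as a whole population (dividing by n), while `variance` and `stdev` treat it as a sample (dividing by n - 1).

```python
import math
import statistics

# Hypothetical sample values with a mean of 5, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population versions: divide the sum of squared deviations by n.
population_variance = statistics.pvariance(values)
population_std = statistics.pstdev(values)

# Sample versions: divide by n - 1 instead.
sample_variance = statistics.variance(values)
sample_std = statistics.stdev(values)

# In both cases the variance is the square of the standard deviation.
print(math.isclose(population_std ** 2, population_variance))
print(math.isclose(sample_std ** 2, sample_variance))
```

Which version is appropriate depends on whether the data are the full population or a sample drawn from it.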
Interquartile Range (IQR)
The Interquartile Range (IQR) is a measure of statistical dispersion, calculated as the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). The IQR is also an important measure for identifying outliers, and it can be visualized using a boxplot.
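A minimal sketch of the IQR calculation follows, using `statistics.quantiles` from Python's standard library (available from Python 3.8). Two things here are assumptions rather than part of the text above: the `inclusive` quartile method (different methods can give slightly different quartiles) and the 1.5 × IQR fence, which is a common convention for flagging outliers. The sample values are hypothetical.

```python
import statistics

# Hypothetical sample values with one unusually large observation.
values = [1, 2, 3, 4, 5, 6, 7, 8, 40]

# quantiles(..., n=4) returns the three cut points: Q1, the median, and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, outliers)
```

These fences correspond to the whiskers drawn by many boxplot implementations, which is why the boxplot is a natural way to visualize the IQR.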
Skewness
Another useful statistic is skewness, which measures the asymmetry of the distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. In a perfectly symmetrical unimodal distribution, the mean, the median, and the mode all have the same value. However, the variables in our data are not symmetrical, so their measures of central tendency take different values.
The skewness values can be interpreted in the following manner:
- Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
- Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
- Approximately symmetric distribution: If the skewness value is between −½ and +½.
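The interpretation rules above can be sketched in code. The example below makes several assumptions: it computes the moment-based (population) skewness, which is one of several definitions in use; the function names `skewness` and `describe_skewness` are hypothetical; and values exactly at ±½ or ±1 are assigned to the moderately skewed band, since the list above does not say which band the boundaries belong to.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment) ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        # All values are identical, so the skewness is undefined.
        return math.nan
    return m3 / m2 ** 1.5

def describe_skewness(value):
    """Map a skewness value to the bands listed above (boundary handling is assumed)."""
    if math.isnan(value):
        return "undefined"
    magnitude = abs(value)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Hypothetical samples, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(describe_skewness(skewness(symmetric)))
print(describe_skewness(skewness(right_skewed)))
```

A positive value indicates a longer tail on the right, and a negative value indicates a longer tail on the left.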