Descriptive Statistics

Frequency Table

For Categorical Data

Given the following data set:

The resulting frequency table is:

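| Blood Type | Frequency | Relative Frequency |
|------------|-----------|--------------------|
| A          | 6         | 6/15 = 0.4         |
| AB         | 3         | 3/15 = 0.2         |
| B          | 1         | 1/15 = 0.0667      |
| O          | 5         | 5/15 = 0.3333      |
| Total:     | 15        | 1                  |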

Note: Total = n

\boxed{ \text{Relative Frequency} = \frac{ \text{Frequency} }{ \text{\# of Observations} } } \\~\\ \small\textit{\textit{Percentage} = \textit{Relative Frequency} $\times 100$}

Remember: A table should summarize its data.
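
As a quick check of the relative-frequency formula, the sketch below recomputes the last column of the blood-type table from its frequency column. The variable names are illustrative only; the counts are the ones listed in the table.

```python
# Frequencies taken from the blood-type table (A, AB, B, O).
frequency = {"A": 6, "AB": 3, "B": 1, "O": 5}

# Total number of observations (the "Total = n" note).
n = sum(frequency.values())

# Relative frequency = frequency / number of observations.
relative_frequency = {group: f / n for group, f in frequency.items()}

# Percentage = relative frequency x 100.
percentage = {group: 100 * rf for group, rf in relative_frequency.items()}

for group, f in frequency.items():
    print(f"{group:<3} {f:>2}  {relative_frequency[group]:.4f}  {percentage[group]:.2f}%")
print(f"Total: {n}")
```

The relative frequencies always add up to 1 (and the percentages to 100), which is a handy way to catch counting mistakes.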

For Numerical Data

How to construct a frequency table for numerical data:

Rules:

\boxed{ \text{Ideal \# of Classes: } k = [ \sqrt{n} ] }

\boxed{ \text{Width of Each Class: } w = \frac{ \text{largest} - \text{smallest} }{ k } }

Example: Creating a frequency table

Q: Task Time Data: 19, 23, 26, 30, 32, 34, 37, 39, 41, 44, 44, 46, 55

A:

Number of observations: n = 13

Number of classes: k = [ \sqrt{13} ] = 4

Width of intervals: w = \frac{55-19}{4} = 9

So:

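| Class | Interval         | Frequency | Relative Frequency | Percentage |
|-------|------------------|-----------|--------------------|------------|
| 1     | 19 \le x < 28    | 3         | 3/13               | 23.0769    |
| 2     | 28 \le x < 37    | 3         | 3/13               | 23.0769    |
| 3     | 37 \le x < 46    | 5         | 5/13               | 38.4615    |
| 4     | 46 \le x \le 55  | 2         | 2/13               | 15.3846    |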
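
The construction above can be sketched in a few lines of Python. This is only an illustration: it assumes the brackets in the rule for k mean "round up" (rounding to the nearest integer gives the same k = 4 for this data), and it closes the last class on the right so that the largest value is counted.

```python
import math

data = [19, 23, 26, 30, 32, 34, 37, 39, 41, 44, 44, 46, 55]

n = len(data)
k = math.ceil(math.sqrt(n))        # assumption: [ sqrt(n) ] rounds up
w = (max(data) - min(data)) / k    # class width

frequencies = []
for c in range(k):
    lo = min(data) + c * w
    hi = lo + w
    last = c == k - 1
    # Each class is lo <= x < hi; the last class also includes its upper bound.
    count = sum(1 for x in data if lo <= x and (x < hi or last))
    frequencies.append(count)
    print(f"class {c + 1}: [{lo:g}, {hi:g}{']' if last else ')'}  "
          f"freq={count}  rel={count}/{n}  pct={100 * count / n:.4f}")
```

Treating only the final interval as closed keeps every observation in exactly one class.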

Bar and Pie Chart

Bar Graph: Graph made of bars whose heights represent the frequencies of respective categories.

Pie Chart: Circle divided into portions that represent relative frequencies (or percentages) of respective categories.

Histogram

Histogram: Chart of adjacent rectangles whose heights represent the frequencies [1] and whose widths represent the class widths.

Distributions of Data:

  1. Symmetric
  2. Skewed (left/right)
  3. Uniform
  4. Bimodal

Density Histogram

Density Histogram: A histogram that uses density rather than frequency.

\text{Density} = \frac{ \text{Relative Frequency} }{ \text{Width} }

Example: Calculating density

Q: Calculate density

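| Class | Interval         | Frequency | Relative Frequency | Percentage |
|-------|------------------|-----------|--------------------|------------|
| 1     | 19 \le x < 28    | 3         | 3/13               | 23.0769    |
| 2     | 28 \le x < 37    | 3         | 3/13               | 23.0769    |
| 3     | 37 \le x < 46    | 5         | 5/13               | 38.4615    |
| 4     | 46 \le x \le 55  | 2         | 2/13               | 15.3846    |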

w = 9

A:

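| Class | Interval         | Frequency | Relative Frequency | Percentage | Density                 |
|-------|------------------|-----------|--------------------|------------|-------------------------|
| 1     | 19 \le x < 28    | 3         | 3/13               | 23.0769    | 3/13 \div 9 = 0.0256    |
| 2     | 28 \le x < 37    | 3         | 3/13               | 23.0769    | 3/13 \div 9 = 0.0256    |
| 3     | 37 \le x < 46    | 5         | 5/13               | 38.4615    | 5/13 \div 9 = 0.0427    |
| 4     | 46 \le x \le 55  | 2         | 2/13               | 15.3846    | 2/13 \div 9 = 0.0171    |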

Density Histogram: Total area of histogram = 1
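
A short sketch, using the class frequencies and width from the example, shows both the density calculation and why the total area is 1: each rectangle has area density × width = relative frequency, and the relative frequencies sum to 1.

```python
frequencies = [3, 3, 5, 2]   # class frequencies from the example
w = 9                        # common class width
n = sum(frequencies)

# Density = relative frequency / width.
densities = [(f / n) / w for f in frequencies]

# Area of each rectangle = density * width; the areas add up to 1.
total_area = sum(d * w for d in densities)

for c, d in enumerate(densities, start=1):
    print(f"class {c}: density = {d:.4f}")
print(f"total area = {total_area:.4f}")
```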

Dotplot

Dotplot: Chart where each observation is shown as a dot above its value on a number line.

Histogram or Dotplot?

Line Plot

Line Plot: Like a dot plot, but you draw lines from the top of each plot.

Measures of Center

Mean: Arithmetic Mean

\text{Mean} = \frac{\text{Sum of All Observations}}{\text{\# of Observations}}

Sample Mean (\bar{x}): Mean calculated for sample data.

Population Mean (\mu): Mean calculated for population, usually unknown.

\mu = \frac{\sum x_i}{N} \quad \text{and} \quad \bar{x} = \frac{\sum x_i}{n}

Example: Calculating mean

Q: Find the mean of the data set:

27, 31, 30, 32, 34, 35

A:

\begin{aligned} \bar{x} &= \frac{27 + 31 + 30 + 32 + 34 + 35}{6} \\ &= \frac{189}{6} \\ &= 31.5 \end{aligned}

Thus, the mean is 31.5.
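
The same calculation in Python, as a minimal sketch: the sum divided by the number of observations, cross-checked against the standard library's `statistics.mean`.

```python
from statistics import mean

data = [27, 31, 30, 32, 34, 35]

# Sample mean: sum of all observations / number of observations.
x_bar = sum(data) / len(data)

print(x_bar)        # 31.5
print(mean(data))   # same value from the standard library
```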

Median: The middle value of a sorted data set.

\begin{aligned} \text{If $n$ is odd: }& x_{ \frac{n+1}{2} } \\ \text{If $n$ is even: }& \frac{ x_{ \frac{n}{2} } + x_{ \frac{n}{2} + 1 } }{ 2 } \end{aligned}

Note: Order Notation

\text{Raw Data: } x_1, x_2, x_3, ..., x_n \to \text{Sorted Data: } x_{ ( 1 ) }, x_{ ( 2 ) }, x_{ ( 3 ) }, ..., x_{ ( n ) }
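
The two cases of the median formula can be sketched as a small function. The name `median` and the sample inputs are illustrative; note that the formula counts positions from 1 in the sorted data, while Python lists count from 0.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(values)          # raw data -> sorted data
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1) / 2 is 0-based index n // 2.
        return s[mid]
    # 1-based positions n / 2 and n / 2 + 1 are 0-based indices mid - 1 and mid.
    return (s[mid - 1] + s[mid]) / 2


print(median([27, 31, 30, 32, 34, 35]))  # n even: (31 + 32) / 2 = 31.5
print(median([10, 12, 17, 20, 50]))      # n odd: 17
```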

Measures of Position

Quartiles

Quartiles: Summary measures that divide a sorted data set into four equal parts.

Q_1: Median of the lower half of the data.

Q_2: Median of the entire data set.

Q_3: Median of the upper half of the data.

Example: Calculating quartiles

Q: Find the quartiles of the data set:

1, 2, 8, 12, 14, 17, 20, 22, 23, 24, 25

A:

Q_2 is 17 (it is the center of the data set)

Q_1 is 8 (we ignore 17 when getting the center of the lower half)

Q_3 is 23 (we ignore 17 when getting the center of the upper half)

Note: If n had been even (that is, if there were two numbers in the center), those two numbers would be included in their respective halves when calculating Q_1 and Q_3.
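
The median-of-halves rule described here can be sketched as follows. The function name `quartiles` is illustrative, and the sketch implements only the rule in these notes; statistical libraries often use other quartile conventions and may return slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # If n is odd, the middle value is left out of both halves.
    upper = s[half + 1:] if n % 2 == 1 else s[half:]
    return median(lower), median(s), median(upper)


data = [1, 2, 8, 12, 14, 17, 20, 22, 23, 24, 25]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 8 17 23
```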

Percentiles

kth percentile (P_k): k\% of the data is smaller than this value.

Example: If you scored in the 90th percentile (P_{90}) on a test, 90% of the test scores are below yours.

\boxed{ P_k = \text{Value of the $\frac{kn}{100}$th term in a sorted data set} } \\~\\ x_{\frac{kn}{100}}

Quartiles Notated as Percentiles: Q_1 = P_{25} \\ Q_2 = P_{50} \\ Q_3 = P_{75} \\

Example: Calculating percentile

Q: Find the 30th percentile of the data set:

4 8 9 15 18 19 20 22 25 28 30

A:

n=11, k=30

First, let’s find the position of the 30th percentile (remember, an index must be a whole number): \frac{kn}{100} = \frac{30 \times 11}{100} = 3.3 \approx 3

Our answer is the value at the position we just found: P_{30} = x_3 = 9

Thus, the 30th percentile of the data set is 9.


Q: Find the 65th percentile of the data set:

4 8 9 15 18 19 20 22 25 28 30

A:

n=11, k=65

First, let’s find the position of the 65th percentile (remember, an index must be a whole number): \frac{kn}{100} = \frac{65 \times 11}{100} = 7.15 \approx 7

Our answer is the value at the position we just found: P_{65} = x_7 = 20

Thus, the 65th percentile of the data set is 20.
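
Both percentile examples can be reproduced with a short sketch. It assumes the position kn/100 is rounded to the nearest whole number (which matches 3.3 → 3 and 7.15 → 7 above) and that positions are counted from 1; the function name `percentile` is illustrative, and library routines generally use interpolation rules that give different answers.

```python
import math


def percentile(values, k):
    """Value at position k*n/100 (rounded, 1-based) in the sorted data."""
    s = sorted(values)
    n = len(s)
    position = k * n / 100
    # Assumption: round to the nearest whole number, never below position 1.
    index = max(1, math.floor(position + 0.5))
    return s[index - 1]          # convert the 1-based position to a 0-based index


data = [4, 8, 9, 15, 18, 19, 20, 22, 25, 28, 30]
print(percentile(data, 30))  # 9
print(percentile(data, 65))  # 20
```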

Measures of Spread

Measures of Spread: Quantities that describe the variation in a data set.

Why?: Measures of Center don’t reveal the whole picture of the distribution of a data set.

Range & IQR

Range: The total width of the distribution.

\boxed{ \text{Range} = x_{(n)} - x_{(1)} } \\~\\ \small\textit{max - min}

Example: Calculating range

Q: Find the range of the data set

10, 12, 17, 20, 50

A: 50 - 10 = 40

Interquartile Range (IQR): The width of the middle 50% of the data.

\text{IQR} = Q_3 - Q_1

Example: Calculating IQR

Q: Find the IQR of the data set:

33, 35, 37, 40, 41, 42, 44, 46, 50

A:

33, \textcolor{red}{35, 37,} 40, \textcolor{green}{41}, 42, \textcolor{blue}{44, 46,} 50

Here Q_2 = 41, Q_1 = \frac{35 + 37}{2} = 36, and Q_3 = \frac{44 + 46}{2} = 45.

Thus, the IQR is 45 - 36 = 9
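
Both spread measures from this section can be checked with a minimal sketch. The helper name `iqr` is illustrative, and the quartiles are found with the median-of-halves rule used in these notes rather than a library routine.

```python
from statistics import median


def iqr(values):
    """IQR = Q3 - Q1, with quartiles taken as medians of the lower and upper halves."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 == 1 else s[half:]   # skip the middle value if n is odd
    return median(upper) - median(lower)


range_data = [10, 12, 17, 20, 50]
data_range = max(range_data) - min(range_data)   # max - min
print(data_range)                                # 40

iqr_data = [33, 35, 37, 40, 41, 42, 44, 46, 50]
print(iqr(iqr_data))                             # 45 - 36 = 9
```

In the first data set the single large value 50 stretches the range considerably. The IQR is less affected by such extreme values because it only looks at the middle half of the data.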

Deviation

Deviation: Distance from a data point to the mean.

\boxed{ \text{Deviation} = x_i - \bar{x} }

Sample Variance (s^2): An average of squares of deviations

\boxed{ \text{Sample Variance: } s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n-1} \left( \sum_{i=1}^n x_i^2 - n \bar{x}^2 \right) } \\~\\ \small\textit{1st form: Layman's version, 2nd form: Efficient version}

Layman’s Version: Find the deviation of each data point, square each deviation, add the squared values, and divide by n-1.

Example: Calculating deviation and sample variance

Q: Find the deviation for each data point in the data set:

1.6, 2.4, 3.5, 4.3, 4.8

A:

Mean (\bar{x}) = \frac{1.6 + 2.4 + 3.5 + 4.3 + 4.8}{5} = 3.32

| Value | Deviation (d_i)       | d_i^2  |
|-------|-----------------------|--------|
| 1.6   | 1.6 - 3.32 = -1.72    | 2.9584 |
| 2.4   | 2.4 - 3.32 = -0.92    | 0.8464 |
| 3.5   | 3.5 - 3.32 = 0.18     | 0.0324 |
| 4.3   | 4.3 - 3.32 = 0.98     | 0.9604 |
| 4.8   | 4.8 - 3.32 = 1.48     | 2.1904 |
| Sum   | 0                     | 6.9880 |

\text{Sample Variance (Layman's Form)} = \frac{6.9880}{n - 1} = \frac{6.9880}{5 - 1} = 1.747

Example: Calculating sample variance (efficient form)

Q: Find the sample variance of the data set using the efficient form:

1.6, 2.4, 3.5, 4.3, 4.8

A:

Mean (\bar{x}) = \frac{1.6 + 2.4 + 3.5 + 4.3 + 4.8}{5} = 3.32

| Data (x_i) | x_i^2           |
|------------|-----------------|
| 1.6        | 1.6^2 = 2.56    |
| 2.4        | 2.4^2 = 5.76    |
| 3.5        | 3.5^2 = 12.25   |
| 4.3        | 4.3^2 = 18.49   |
| 4.8        | 4.8^2 = 23.04   |
| Sum        | 62.1            |

Recall the formula: \boxed{ s^2 = \frac{1}{n-1} \left( \sum_{i=1}^n x_i^2 - n \bar{x}^2 \right) }

In the above table, we found that \sum_{i=1}^n x_i^2 = 62.1, and we can now plug in the rest of the formula:

\begin{aligned} s^2 &= \frac{1}{n-1} \left( \sum_{i=1}^n x_i^2 - n \bar{x}^2 \right) \\ &= \frac{1}{n-1} \left( 62.1 - n \bar{x}^2 \right) \\ &= \frac{1}{5-1} \left( 62.1 - 5 \times 3.32^2 \right) \\ &= \frac{1}{4} \left( 62.1 - 55.112 \right) \\ &= 1.747 \end{aligned}

This formula saves a lot of time over the layman’s form.
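
To confirm that the two forms agree, the sketch below computes the sample variance of the example data both ways and compares the results with the standard library's `statistics.variance`, which also divides by n - 1. The variable names are illustrative.

```python
from statistics import variance

data = [1.6, 2.4, 3.5, 4.3, 4.8]
n = len(data)
x_bar = sum(data) / n

# Layman's form: sum of squared deviations, divided by n - 1.
s2_deviations = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Efficient form: (sum of squares - n * mean^2), divided by n - 1.
s2_efficient = (sum(x * x for x in data) - n * x_bar ** 2) / (n - 1)

print(round(s2_deviations, 4))   # 1.747
print(round(s2_efficient, 4))    # 1.747
print(round(variance(data), 4))  # 1.747
```

The efficient form is convenient by hand, but in floating-point arithmetic it subtracts two nearly equal numbers when the mean is large relative to the spread, so the deviation form is usually the safer choice in code.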

Sample Standard Deviation (s): s = \sqrt{s^2}

Example: Calculating sample standard deviation

Q: Given that s^2 = 1.747, find s.

A: s = \sqrt{1.747} = 1.3217
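
A minimal check of this step: the square root of the variance, compared with `statistics.stdev` applied to the data set from the variance examples.

```python
import math
from statistics import stdev

s2 = 1.747                 # sample variance from the previous examples
s = math.sqrt(s2)          # sample standard deviation
print(round(s, 4))         # 1.3217

data = [1.6, 2.4, 3.5, 4.3, 4.8]
print(round(stdev(data), 4))   # same value, computed directly from the data
```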

Summary:

Boxplot

Boxplot: Shows center, spread, and skew of a data set.


  1. Can be frequency or relative frequency.

  2. Very useful in medical sciences when looking at data sets with different n sizes.