Statistical Testing – Why, What, and How?

Statistical testing, or hypothesis testing, is a popular statistical technique for making data-driven decisions. This article briefly describes why we need statistical testing, what can be accomplished with statistical tests, and the basic mechanism of how they work.

Why do we need statistical testing?

To answer this, let’s first state a few statistical preliminaries. A population is the set of all items under study. Most often we are interested in phenomena that involve gigantic populations: all internet users, all voters, all manufactured microprocessors, etc. A census is the enumeration of all items in a population. In many cases a census is infeasible or prohibitively expensive. The alternative to a census is sampling. A sample is a subset of the population, and it is succinctly summarized by a statistic, e.g., a mean or a median.

Now let’s look at an example. Suppose we are changing the layout of the internal corporate website of our small organization. Since the only users of this website are our current employees, the population consists of our current employees. A census could be used to evaluate whether, on average, the new website is preferred over the previous one. This is the realm of descriptive statistics: computing quantities such as the mean, median, and mode of the full population for analysis.

Now consider evaluating a new layout for our sales website. In this case the items under study are all potential customers. To be certain that the new layout is a positive change, we would need a census. Imagine cloning all potential customers into two parallel universes: the customers in the first universe are presented with the old layout, those in the second with the new layout, and the sales volume is recorded in both. If the sales volume for the new layout is greater than that for the old layout, we can declare the new layout a positive change without any uncertainty!

It is obvious that a census is impossible for deciding on the sales website layout: the set of all potential customers is unknown (bounded by about 7 billion), getting every potential customer to glance at our sales website would be prohibitively expensive, and creating parallel universes is impossible. However, we could randomly sample a subset of all potential customers, i.e., a subset of our sales website visitors, and present the new layout to them. Can we use this sample to draw conclusions about the new layout? If the average sales volume for the visitors in the sample is $2\times$, $3\times$, or $4\times$ the baseline average for the old layout, can we conclude that the new layout will result in higher sales volume at the population level? What if our sample consists of only one customer? How about a sample size of ten, a hundred, a thousand? What if, by chance, the selected visitors happen to be big spenders? We are asking questions about a population but have observed only a sample.

To generalize from a sample statistic to a population parameter we need inferential statistics. Statistical testing uses sample data to answer questions about the population. These answers are not certain, but they have a reasonably high probability of being correct.

What questions could one ask about a population?

It depends on the type of data we are dealing with: continuous, binary, categorical, ordinal, etc. In all cases it is assumed that the sample is drawn randomly and is representative of the population. Some of the most common types of questions that statistical testing can answer about a population are described below.

Continuous Data

A sample of continuous data consists of values drawn from a continuum, e.g., [1.96, 10.23, 8.12, …].

  • A mean versus a target value. Assume we have a hypothesized target value, say that the average latency of loading this blog is less than 450 milliseconds. Statistical testing could use a small sample of visitors to support or counter this claim with a reasonable level of certainty.
  • Two means. Assume two groups of students: one group learns a technical concept by reading a textbook chapter while the other learns the same material by watching a video lecture. Both groups then write an exam. Given the exam scores, statistical testing can be used to check whether a significant difference exists between the two learning methods.
  • Correlation. Assume you observe a correlation between two continuous quantities, e.g., salary bonuses and the number of deals closed, in a random sample. Does this correlation exist in the population? A sketch of the last two tests in Python follows this list.
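As a quick illustration, here is a minimal Python sketch of the two-means and correlation tests using scipy; the exam scores, bonuses, and deal counts are made-up numbers for demonstration only.

```python
# Hypothetical data: exam scores for a textbook group and a video group,
# plus (bonus, deals-closed) pairs for the correlation question.
from scipy import stats

textbook = [72.5, 81.0, 68.4, 75.2, 79.9, 70.1, 83.3, 74.6]
video = [78.2, 85.1, 74.0, 80.5, 88.7, 76.3, 82.9, 79.4]

# Two-sample t-test: is there a significant difference between the means?
t_stat, p_value = stats.ttest_ind(textbook, video)
print(f"two-sample t-test: t = {t_stat:.2f}, p = {p_value:.3f}")

bonuses = [1.0, 2.5, 0.5, 3.0, 1.5, 2.0]  # bonus paid, in thousands
deals = [3, 7, 2, 9, 5, 6]                # deals closed by the same person

# Pearson correlation test: a small p-value suggests the correlation
# observed in the sample also exists in the population.
r, p_corr = stats.pearsonr(bonuses, deals)
print(f"correlation: r = {r:.2f}, p = {p_corr:.3f}")
```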

Binary Data

A sample of binary data consists of only two values, 0 and 1. As such, the information contained in a binary sample is considerably less than that of a continuous sample. A binary sample can be summarized by a single number, the proportion of 1s. For example, the sample [1, 0, 1, 0, 0] maps to a proportion of 40% 1s. Some statistical tests for binary data are given below.

  • Proportion versus a value. Suppose our best estimate tells us that the proportion of households with three cell phones is 30%. Before launching an ad campaign or a product for this 30% segment, we could survey a few households. The proportion of households with three cell phones in this sample can be used to infer whether the proportion in the population really is 30%, as sketched after this list.
  • Proportions of two groups. Consider the proportion of defective devices in samples from two manufacturing processes. A hypothesis test for proportions can be used to determine whether one manufacturing process is any better than the other.
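As a sketch of the proportion-versus-value test, scipy's exact binomial test can check a sample proportion against the hypothesized 30%; the survey counts below are hypothetical.

```python
# Exact binomial test: in a hypothetical survey, 21 of 100 households
# reported three cell phones. Is this consistent with a population
# proportion of 30%?
from scipy import stats  # requires scipy >= 1.7 for binomtest

result = stats.binomtest(k=21, n=100, p=0.30)
print(f"p-value = {result.pvalue:.3f}")  # small p casts doubt on the 30% claim
```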

Count Data

  • Rate versus a target. As an example of a statistical test on count data, consider an online server. Based on theoretical analysis, the server should be able to complete at least 1 million requests per day. A hypothesis test can be used to determine whether this claim holds up against observed counts, as sketched below. A statistical test can also be used to compare two alternative services to determine whether a difference between them exists.
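A minimal sketch of the rate-versus-target test, assuming the daily request count follows a Poisson distribution; the observed count is invented for illustration.

```python
# One-sided exact Poisson test: the server is claimed to complete at
# least 1,000,000 requests per day. Suppose we observe 997,500 completed
# requests in one day. How likely is a count this low if the claim holds?
from scipy import stats

claimed_rate = 1_000_000  # requests per day under the claim
observed = 997_500        # hypothetical single-day observation

p_value = stats.poisson.cdf(observed, claimed_rate)  # P(X <= observed)
print(f"p-value = {p_value:.4f}")  # small p => the claimed rate is suspect
```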

Categorical Data

  • Association between two categorical variables. Similar to the correlation test for continuous data, an association test can be used to determine whether two categorical variables are associated with each other, e.g., is there a meaningful association between the color of the website presented to a customer and the method of payment the customer chooses? A sketch follows.
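A sketch of this association test using the chi-square test of independence in scipy; the contingency table of website color versus payment method is fabricated.

```python
# Chi-square test of independence: rows are website colors shown,
# columns are payment methods chosen (hypothetical counts).
from scipy import stats

table = [
    [120, 90, 40],   # blue layout:  card, wallet, invoice
    [95, 110, 55],   # green layout: card, wallet, invoice
]

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# small p => color and payment method are likely associated
```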

How do statistical tests work?

In this section we go over the basic mechanism of how statistical tests work. We first present a few more statistical preliminaries and then walk through a one-sample t-test for the mean.

Assume we are trying to uncover some truth. We perform $N$ experiments to get a sample of observations $X_1, X_2, \dots, X_N$. We assume the truth remains fixed and our experiments are independent. Thus the observations are assumed to be independent and identically distributed according to a distribution $\mathcal{D}$ with fixed mean $\mu$ and variance $\sigma^2$: $X_i \sim \mathcal{D}(\mu, \sigma^2)$. The average of the observations is given by
$$\overline{X} = \frac{1}{N} \sum_i X_i$$

Now imagine we conduct a million trials of these $N$ experiments to get a million samples, each with its own average. In this case the mean and variance of $\overline{X}$ across these trials are given by
$$\mu_{\overline{X} } = \mathbb{E}\big[\overline{X}\big] = \mathbb{E}\big[\frac{1}{N} \sum_i X_i\big] = \frac{1}{N} \sum_i\mathbb{E}\big[ X_i\big] = \mu$$
$$\sigma_{\overline{X} }^2 = \text{Var}\big[\overline{X}\big] = \text{Var}\big[\frac{1}{N} \sum_i X_i\big] = \frac{1}{N^2} \sum_i\text{Var}\big[ X_i\big] = \frac{1}{N^2} \sum_i\sigma^2 = \frac{\sigma^2}{N},$$
where $\mu$ and $\sigma^2$ are the expectation and variance of the distribution from which the samples were obtained. What does the distribution of $\overline{X}$ look like? If we can say something about this sampling distribution, we might not need to perform a million trials!
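Instead of literally running a million trials, we can verify these two identities numerically; here is a minimal numpy simulation, with an arbitrary choice of exponential for $\mathcal{D}$ (mean 2, variance 4).

```python
# Simulate many samples of size N and check that the average of the
# sample means is ~mu and their variance is ~sigma^2 / N.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 100_000

# Exponential with scale 2 has mean mu = 2 and variance sigma^2 = 4.
X = rng.exponential(scale=2.0, size=(trials, N))
means = X.mean(axis=1)

print(means.mean())  # ~2.00 (= mu)
print(means.var())   # ~0.08 (= sigma^2 / N = 4 / 50)
```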

Central limit theorem

The central limit theorem states that for large $N$ the distribution of $\overline{X}$ converges to a Gaussian
$$ \overline{X} \rightarrow \mathcal{N}\big(\mu, \frac{\sigma^2}{N}\big).$$
Re-arranging to obtain a standard Gaussian distribution with zero mean and unit variance,
$$\text{Z-score} = \frac{\overline{X}\; -\; \mu}{\frac{\sigma}{\sqrt{N}}} \rightarrow \mathcal{N}\big(0, 1\big),$$
where the quantity on the left is known as the Z-score.

Central limit theorem. Irrespective of the distribution of $X$ (Bernoulli in this case), the sampling distribution of $Z=\frac{\overline{X}\; -\; \mu}{\frac{\sigma}{\sqrt{N}}}$ is Standard Normal for large N.
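To illustrate the caption above: even with Bernoulli observations, the Z-scores of the sample means behave like a standard normal for large $N$; for instance, about 95% of them should fall within $\pm 1.96$. A small simulation sketch:

```python
# CLT demo: each X_i is Bernoulli(0.3), yet the Z-score of the sample
# mean is approximately standard normal for large N.
import numpy as np

rng = np.random.default_rng(1)
p, N, trials = 0.3, 100, 50_000
mu, sigma = p, np.sqrt(p * (1 - p))  # Bernoulli mean and std deviation

X = rng.binomial(1, p, size=(trials, N))
Z = (X.mean(axis=1) - mu) / (sigma / np.sqrt(N))

print((np.abs(Z) < 1.96).mean())  # ~0.95, as for a standard normal
```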

Sampling Distribution

A statistic is a scalar that summarizes a sample. The Z-score described above is a sample statistic. Each statistic comes with a corresponding sampling distribution. Some examples of statistics and their sampling distributions are given below.

| Statistic | Formula | Sampling distribution |
| --- | --- | --- |
| z-statistic | $\frac{\overline{X} - \mu}{\sigma / \sqrt{N}}$ | standard normal |
| t-statistic | $\frac{\overline{X} - \mu}{s / \sqrt{N}}$, where $s$ is the sample standard deviation | t-distribution |
| F-statistic | $\frac{\text{between-group variance}}{\text{within-group variance}}$ | F-distribution |

Test statistics and their sampling distributions.

Statistical tests use some elements of logic, particularly the technique of proof by contradiction. We therefore review proof by contradiction briefly before finally putting everything together to perform a statistical test.

Proof by Contradiction

In logic, a proof by contradiction validates a proposition by showing that assuming the proposition to be false leads to a contradiction. Consider the following example.

Proposition: There is no largest number.
Proof:
1. Assume there is a largest number and call it $L$. Then there is no number larger than $L$.
2. Consider the number $L+1$. This number is larger than $L$, a contradiction.
3. Since assuming that there is a largest number leads to a contradiction, we conclude that there is no largest number.

Please note that deriving a contradiction in step 2 validates the proposition. But if we fail to derive a contradiction, that does not mean we have invalidated the proposition; in that case we cannot conclude anything about it.

Example Test: t-Test for the mean

We are ready to look at our first statistical test, using a hypothetical example. Assume that the government has published data on the nationwide average height of adults; let’s assume this average to be 168 centimeters. For planning some infrastructure at our college we need this average height. However, there are claims that the average height at our college differs significantly from the national average of 168 cm. How can we evaluate this claim?

We will try to make the case for the average height being different from 168 cm in a manner similar to a proof by contradiction. We hypothesize that the average height is 168 cm; we call this the null hypothesis $\mathbb{H}_0$. Next we collect a sample of heights from, say, $N=100$ people, which gives us our sample mean. We then use statistics to look for a contradiction: an anomaly, a very rare situation. How rare? At least as rare as a constant $\alpha$ known as the significance level, usually set to a chance of 5 out of 100. We succeed if we can show that the sample we observed would be very rare if the null hypothesis were true. If we fail to produce this rare condition, we abort. Otherwise, we conclude that the null hypothesis leads to a very rare phenomenon (the analogue of a contradiction in logic) and is therefore likely wrong. This would mean that the opposite of the null hypothesis, the alternate hypothesis, is correct: the average height is different from 168 cm.

  • Null Hypothesis $\mathbb{H}_0$: The average height is equal to 168 cm.
  • Alternate Hypothesis $\mathbb{H}_1$: The average height is not equal to 168 cm.
  • We set the strength of evidence using the significance level to be $\alpha = 0.05$.
  • We set a budget for the experiment: $N=100$
  • Sample: we ran an experiment and measured the heights of $N=100$ people.
  • We calculated the t-statistic for the sample: $t = \frac{\overline{X} - 168}{s/\sqrt{N}}$.

The t-statistic turned out to be $2.1$. Is this value rare enough? We know the distribution of the t-statistic, so we can use the t-distribution to determine how likely a value of 2.1 is.

t-distribution. The t-value of 2.1 corresponds to a p-value (integral of the green hatched areas under the curve) of 0.036. Since the p-value is less than the significance level of $\alpha=0.05$, we can conclude that the sample we observed is indeed rare — perhaps almost an anomaly!

As can be seen, the p-value, which is the probability of observing a t-value at least as extreme as 2.1, is 0.036. Thus observing such a t-value is quite rare, rarer than the standard of 0.05 that we set before beginning our experiment. Assuming the null hypothesis (average is 168 cm) to be true leads to a rare condition, akin to a contradiction. Therefore we reject the null hypothesis and declare the alternate hypothesis correct: the average height is not equal to 168 cm.
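As a sketch, this p-value can be reproduced from the t-distribution in Python; with $N = 100$ the t-statistic has $N-1 = 99$ degrees of freedom. (Given the raw heights, scipy.stats.ttest_1samp would compute both the t-statistic and the p-value in one call.)

```python
# Two-sided p-value for t = 2.1 with N - 1 = 99 degrees of freedom.
from scipy import stats

t_value, dof = 2.1, 99
p_value = 2 * stats.t.sf(t_value, dof)  # probability mass in both tails
print(f"p-value = {p_value:.3f}")       # ~0.04, below alpha = 0.05
```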

Designing the Test

Each statistical test has configuration parameters that need to be selected carefully based on domain knowledge and the problem at hand. For example: how much error can we tolerate in our tests? How large a sample is enough? And how large an effect, i.e., a difference from the null hypothesis, is worth detecting?

How large an error to tolerate?

Everything else held constant, the significance level $\alpha$ directly determines the type I error of the test. For example, $\alpha=0.05$ means that if the null hypothesis is true and we repeat the test 100 times, about 5 of the 100 tests might falsely reject it (false positives). Reducing the significance level decreases false positives and improves confidence in a positive test result.

However, everything else held constant, decreasing $\alpha$ increases the false negative rate: there may be a real effect that we fail to detect because the significance level is too stringent. This tradeoff is shown in the figure below.

Error rates of statistical tests. The false positive probability is shaded in red and the false negative probability is shaded in yellow. As the false positive probability is reduced the statistical power, i.e., the ability of the test to correctly detect an effect (shaded in green) also reduces.
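A small simulation makes the link between $\alpha$ and the false positive rate concrete: if the null hypothesis is true and we test at $\alpha = 0.05$, about 5% of repetitions reject it. A sketch with made-up normal data whose true mean equals the null value:

```python
# Type I error demo: sample from a population whose true mean IS 168,
# run the t-test many times, and count how often it falsely rejects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, N, repeats = 0.05, 100, 10_000
rejections = 0

for _ in range(repeats):
    heights = rng.normal(loc=168, scale=7, size=N)  # H0 is true here
    _, p = stats.ttest_1samp(heights, popmean=168)
    if p < alpha:
        rejections += 1

print(rejections / repeats)  # ~0.05: false positive rate matches alpha
```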

How large a sample size to use?

We use statistical tests because we want to use as few samples as possible instead of surveying the whole population. This raises the question: how small a sample is acceptable? The first constraint on the sample size comes from the central limit theorem: the sample should be large enough to satisfy the statistical assumptions made in the derivation of the test. Typically, samples of size 30 or more satisfy the requirements of the central limit theorem.

The next consideration is statistical power. Everything else held constant, a larger sample size can detect a much smaller effect size. More on this in the next section.

Sample sizes. Assuming an effect size of 0.5, a sample size of 10 results in a large missed detection probability (yellow region) whereas a larger sample size results in much smaller probability of missed detection.

How large a difference to detect?

How large a difference between two quantities do we want to detect? This is a decision to be made using domain or business knowledge. In our average height problem, for example, we might only care about differences greater than 6 cm. Everything else held constant, increasing the relevant effect size reduces the probability of missed detection, i.e., the type II error.

Effect sizes — larger effect size results in smaller type II error.

Trade offs

Usually the significance level is set by convention to 0.05 or 0.01, and it determines the type I error. The effect size is set by the problem at hand. The statistical power determines how strongly you want to avoid type II errors, that is, missing an effect that exists. This leaves the sample size, which can be calculated from the previous three quantities. To understand these trade-offs, let's look at the t-statistic once more.

$$ \text{t-statistic} = \frac{\overline{X} - \mu}{s / \sqrt{N}} = \frac{(\overline{X} - \mu)\,\sqrt{N}}{s} = \frac{\text{effect size} \times \sqrt{\text{sample size}}}{\text{sample standard deviation}}$$

If the sample standard deviation is fixed, a higher t-value (and hence a detected effect) requires either a larger effect size or a larger sample size; note that the sample size enters through its square root.
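Putting the four quantities together, the required sample size can be solved for numerically; here is a sketch using the power calculator in statsmodels (assuming a one-sample t-test and the TTestPower API).

```python
# Solve for the sample size needed to detect a standardized effect size
# of 0.5 at alpha = 0.05 with 80% power, for a one-sample t-test.
from statsmodels.stats.power import TTestPower

n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required sample size: {n:.1f}")  # ~34 observations
```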

Feedback and Further Reading

Thanks for reading this far! Please do share your feedback if you found anything useful, confusing, or incorrect. If you want to read further on statistical testing, please consult the following books.

  1. Hypothesis Testing: An Intuitive Guide For Making Data Driven Decisions by Jim Frost
  2. All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
