The Chi-square Goodness of Fit Test

Purpose

To determine whether the observed counts of a nominal/categorical variable differ significantly from the counts predicted under a null hypothesis. If the observed counts are derived from a categorised continuous variable, they can be compared to the counts predicted by a theoretical distribution, e.g., the normal distribution.

Research Question Examples
Is a die fair? That is, do the observed outcomes correspond to a uniform distribution?

Can a sample be assumed to have been drawn from a normally distributed population? (In practice, you would probably use the Shapiro-Wilk test of normality rather than the Chi-square test to determine this.)

Can a coin be assumed to be fair given a particular number of heads when tossed 30 times? (In practice, you would probably use the Binomial Test rather than the Chi-square test.)

Given that the prevalence of adult smoking in the UK is 14.7%, how likely would a prevalence of 20% be in a sample of 100 adults?
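
As an illustration of the last question, here is a minimal sketch in R. It assumes that a 20% prevalence in a sample of 100 adults means 20 smokers and 80 non-smokers; the variable names (observed_smoking, null_probs, smoking_test) are made up for this example.

```r
# Assumed counts: 20% of a sample of 100 adults gives 20 smokers, 80 non-smokers
observed_smoking <- c(smokers = 20, non_smokers = 80)

# Proportions under the null hypothesis (UK adult smoking prevalence of 14.7%)
null_probs <- c(0.147, 0.853)

# Goodness of fit test with two categories, so 1 degree of freedom
smoking_test <- chisq.test(observed_smoking, p = null_probs)
smoking_test
```

With only two categories this is close in spirit to a one-sample test of a proportion, so a binomial test (binom.test in R) would be a reasonable alternative.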

Requirements

  • Independent random sampling
  • Categorical data
  • Expected count of at least 5 in each category (a common rule of thumb)
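
The expected-count requirement can be checked before running the test. The sketch below uses the route counts from the worked example later in this section. The name expected is an assumption made for illustration, and the threshold of 5 is the usual rule of thumb rather than a strict limit.

```r
# Observed route choices and null-hypothesis probabilities from the worked example
observed <- c(4, 5, 8, 15)
exp.probs <- c(0.25, 0.25, 0.25, 0.25)

# Expected count per category = total sample size x category probability
expected <- sum(observed) * exp.probs
expected

# TRUE if every category meets the rule of thumb (at least 5 expected)
all(expected >= 5)
```

Note that the requirement applies to the expected counts, not the observed ones, so the observed count of 4 for route A does not violate it. If any expected count is below 5, chisq.test issues a warning that the chi-squared approximation may be incorrect.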

Background
Tolman et al. (1946), investigating maze learning in rats, wanted to determine whether rats would show a preference for a particular route when presented with four alternative routes (A to D). A total of 32 rats were presented with the choice of routes, and the following route choices were observed:

            Chosen Route
            A    B    C    D
Observed    4    5    8   15

Whilst it is evident that there was a preference within the sample for route D, did this represent a preference in the wider population, or was it a consequence of sampling variation?

We will evaluate the null hypothesis that there is no preference for a particular route in the population; that is, the probability of selecting each route is the same (0.25), giving each route a predicted count of 8 (32 × 0.25). The alternative hypothesis is that the probabilities of selecting a route are not all equal.
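
To make the logic of the test concrete, the statistic can be computed by hand as the sum of (observed - expected)² / expected over the four routes. The following is a sketch of that arithmetic in R. The names null_probs, expected, chi_sq, df and p_value are chosen for illustration only.

```r
# Observed route choices (A, B, C, D)
observed <- c(4, 5, 8, 15)

# Under the null hypothesis each route has probability 0.25
null_probs <- rep(0.25, 4)
expected <- sum(observed) * null_probs   # 8 rats per route

# Chi-square statistic: sum of squared deviations scaled by the expected counts
chi_sq <- sum((observed - expected)^2 / expected)   # (16 + 9 + 0 + 49) / 8 = 9.25

# Degrees of freedom = number of categories - 1
df <- length(observed) - 1

# p-value from the upper tail of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p.value = p_value)
```

Route D contributes 49 / 8 of the total 9.25, so most of the departure from the null hypothesis comes from that one category.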

We will use the chisq.test command, specifying the observed counts (4, 5, 8, 15) and the probabilities associated with each route under the null hypothesis (0.25, 0.25, 0.25, 0.25). The complete R commands are shown below.

R Command
Enter the following commands (you can copy and paste them into RStudio).

# Let's set up a vector of observed counts and assign it to the variable observed
observed <- c(4, 5, 8, 15)

# Let's set up a vector of expected probabilities and assign it to the variable exp.probs
exp.probs <- c(0.25, 0.25, 0.25, 0.25)

# Run the test
chisq.test(observed, p = exp.probs)

When executed, this command will generate the output shown below.

Interpreting the Output & Reporting the Analysis

Figure 1: R Output

Notes

  • The p-value is approximately 0.026 (the value implied by χ² = 9.25 on 3 degrees of freedom), which is less than 0.05, so the finding is significant
  • This means we reject the null hypothesis and accept the alternative hypothesis; that is, the population proportions are not all equal to 0.25
  • When reporting most statistical tests we need to indicate:
      • the statistic (χ² = 9.25)
      • the sample size (N = 32) and/or degrees of freedom (3)
      • the p-value
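  • The chi-square test is inherently non-directional, which is consistent with the non-directional hypothesis: the p-value comes from the upper tail of the chi-square distribution, but a departure from the expected counts in either direction increases the statistic.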
  • <- is the assignment operator in R
  • In fact, as the probabilities under the null hypothesis were equal, we didn’t need to specify them, as equal probabilities are the default. Accordingly, either of the following commands would have generated the same output: chisq.test(observed) or chisq.test(c(4,5,8,15))
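
The values needed for reporting can also be pulled out of the test result directly instead of being read off the printed output. The sketch below assumes the result is stored in a variable called result (a name chosen here for illustration) and also checks that the default call gives the same answer.

```r
observed <- c(4, 5, 8, 15)

# Store the test result so its components can be reused
result <- chisq.test(observed, p = rep(0.25, 4))

# Equal probabilities are the default, so this call should be equivalent
default_result <- chisq.test(observed)
all.equal(result$p.value, default_result$p.value)

# Components typically needed for a report
result$statistic   # chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # p-value

# One possible way to assemble a reporting string
report <- sprintf(
  "chi-square(%d, N = %d) = %.2f, p = %.3f",
  as.integer(result$parameter),
  as.integer(sum(observed)),
  unname(result$statistic),
  result$p.value
)
report
```

The layout of the reporting string is only one convention. Adjust it to the style guide you are following.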

References
Tolman, E. C., Ritchie, B. F., & Kalish, D. (1946). Studies in spatial learning. I. Orientation and the short-cut. Journal of Experimental Psychology, 36(1), 13.