Summary of Maxwell, Kelley & Rausch (2008): Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation

Common sense suggests that the larger the sample, the better the results of a study. In reality, however, most researchers do not have access to unlimited participants and must therefore know how to determine the best sample size for their research goals. Maxwell, Kelley and Rausch's (2008) review of sample size planning techniques in psychological research illustrates that determining the appropriate sample size for a study is far more intricate than one might think, and that planning can serve distinct ends: achieving statistical significance (power), obtaining precise effect estimates (narrow confidence intervals), and ensuring the accuracy of results. In an ideal research design, all three of these goals should be considered when planning sample size.

It is important to note that statistical power is only relevant when hypothesis testing is involved, since power is, by definition, the probability of rejecting the null hypothesis when the alternative hypothesis is true. Power and accuracy are also not one and the same; both should be considered (along with precision) in sample size planning, because conveying the degree of uncertainty in an effect size estimate is essential to producing trustworthy results, whether the effect is large or small. The authors emphasize that even a sample size large enough to guarantee adequate statistical power may still not be large enough to guarantee accurate parameter estimates of the effect. They discuss a framework for accuracy-oriented sample size planning called Accuracy in Parameter Estimation (AIPE), which plans the sample size so that the confidence interval around the estimate is sufficiently narrow, and which encourages reporting confidence intervals alongside statistical significance to show how accurately the effect has been estimated.
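
To make the AIPE idea concrete, here is a minimal sketch (an illustration of ours, not code from the paper) of planning a sample size so that the confidence interval for a single mean is sufficiently narrow. It assumes the population standard deviation sigma is known, say from pilot data, and targets the expected interval width; the AIPE methods the authors discuss go further, for example by adding a desired degree of assurance that the observed interval will actually be that narrow.

```python
from scipy import stats

def n_for_ci_width(sigma, omega, alpha=0.05):
    """Smallest n whose two-sided (1 - alpha) CI for a single mean
    has expected full width <= omega (sigma assumed known from pilot data)."""
    n = 2
    while True:
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
        width = 2 * t_crit * sigma / n**0.5  # expected full CI width at this n
        if width <= omega:
            return n
        n += 1

# e.g. sigma = 15 and a 95% CI no wider than 5 units requires n ~ 141
print(n_for_ci_width(sigma=15, omega=5))
```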

The authors then review targeted statistical power and AIPE techniques from the literature, organized by psychological research design (see the article for references to the specific techniques):

  • Comparing the means of two independent groups (via a two-group t-test): For power, the smaller the population standardized mean difference, the larger the sample size must be to reach a specified level of power. For accuracy, the larger the population standardized mean difference, the larger the sample size must be to achieve a specified confidence interval width (see the sketch after this list).
  • Adjusting for multiple-comparison procedures on mean differences (via ANOVA): The adjustment must be built into sample size planning; otherwise the resulting sample size will be too small for both adequate power and accuracy.
  • Multiple regression (via a 2×2 framework): Sample size planning techniques for multiple regression depend largely on whether the effect size of interest is omnibus or targeted, as each requires a different sample size.
  • General linear multivariate model: Although effective power and AIPE techniques exist for this model, most assume fixed predictors, even though most predictor variables in psychology are continuous and random. The authors outline techniques for both fixed- and random-predictor studies.
  • Exploratory factor analysis: The authors caution that the rules of thumb in the literature for this type of study should not be trusted, and that sample size planning procedures should take the variables' communalities into account.
  • Confirmatory factor analysis and structural equation modelling: Although AIPE techniques have not yet been developed for this domain, the authors outline sample size planning for power using likelihood ratio chi-square tests (to evaluate exact model fit) or fit indices (to quantify how well the model fits the data).
  • Longitudinal data analysis (via latent growth curve (LGC) models for analysing change): The authors outline techniques for both continuous and discrete outcomes. For the former, there are techniques for detecting group differences, patterns of attrition, and the effect of missing data on power; for the latter, sample size tables help detect treatment effects between two groups based on the number of repeated observations, group response rates, and the intraclass correlation.
  • Generalized linear models (for categorical variables, contingency tables, and variables that are not normally distributed): The authors outline a version of the Wald test for assessing power when testing multiple parameters simultaneously. They also caution that when a continuous variable is categorized (recoded as ordinal), the sample size must be increased to offset the resulting loss of power.
  • Cluster randomized trials (comparing clusters of individuals, such as classrooms, schools or neighbourhoods): For both power and accuracy, a sufficient number of clusters is required even when the number of participants within each cluster is high. The authors mention a multilevel modelling method for determining power, as well as a technique for determining how many clusters and how many participants per cluster are needed to detect a meaningful treatment effect.
  • Survival analysis (outcome of interest is the duration of time until a particular event occurs): The authors mention the Weibull model and Cox regression model to determine appropriate sample size for both power and accuracy.
  • Mixture modelling (decomposing an observed (composite) distribution into multiple unobserved (latent class) distributions): Sample size planning here depends on the statistical method used, but results are generally better with larger samples. However, the authors caution that a large sample can sometimes overestimate the number of latent classes present; theory therefore needs to play a role in determining that number.
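
To illustrate the power side of the first bullet, here is a minimal sketch using the TTestIndPower class from statsmodels (a tool of our choosing, not one named in the article) to solve for the per-group sample size of a two-group t-test at several assumed standardized mean differences d:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group n for 80% power with a two-sided test at alpha = .05
analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # small, medium, large standardized mean differences
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative='two-sided')
    print(f"d = {d}: n = {n:.0f} per group")
# d = 0.2 needs ~394 per group, while d = 0.8 needs only ~26:
# smaller effects demand larger samples for the same power.
```

As the article notes, the AIPE pattern runs the other way: for a fixed width of the standardized confidence interval, larger values of d call for larger samples.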

The authors also note that for any research goal, design, or model, a simulation approach (the Monte Carlo technique) can be used before conducting the study to determine an appropriate sample size for power and accuracy. The technique involves generating random data from an assumed population of interest, applying the planned statistical analysis, and repeating the process a large number of times at different sample sizes until the smallest adequate sample size is found.
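
As a minimal sketch of this approach (our illustration, not the authors' code), the loop below estimates the power of a two-group t-test by simulation and increases n until a target power is reached; tracking confidence interval width instead of rejections would adapt the same loop to accuracy planning.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, delta, sigma=1.0, alpha=0.05, reps=5000):
    """Estimate the power of a two-sample t-test by Monte Carlo:
    generate data under the assumed population, test, and count rejections."""
    rejections = 0
    for _ in range(reps):
        a = rng.normal(0.0, sigma, n_per_group)
        b = rng.normal(delta, sigma, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / reps

# Increase n until simulated power reaches the target (80% for an assumed d = 0.5)
for n in range(10, 201, 10):
    power = simulated_power(n, delta=0.5)
    if power >= 0.80:
        print(f"n = {n} per group -> estimated power {power:.2f}")
        break
```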

Although the authors emphasize that sample size is not the be-all and end-all for increasing the power and accuracy of a study's results, it remains a very important aspect of research design, especially for detecting smaller effects. However, the sample size required for adequate power and accuracy can often exceed the resources available to the researcher. In such cases, the authors recommend relaxing the required confidence interval width, using within-subject designs, or introducing covariates into the analysis. They also suggest planning with meta-analysis in mind (alongside a priori power analysis), since multiple smaller studies can often provide much more power in combination than a single larger study with the same total sample size. However, they caution that meta-analyses often suffer from publication bias (the “file drawer effect”) arising from unpublished studies and selective reporting of results in the literature. They recommend multi-site studies to counteract this issue, as these are less prone to the file drawer effect.

According to the authors, the literature places too little emphasis on sample size planning for power and accuracy: most published studies test multiple hypotheses, often with only weakly significant results. Those that report large effects often do so without confidence intervals conveying precision or accuracy, which can distort the apparent power of such studies. The authors refer to this phenomenon as underpowered research, since most studies are not equipped to detect small-to-medium effects that may still be scientifically or practically relevant. This creates a false sense of certainty in the literature and can lead readers to misinterpret discrepancies between similar studies as contradictions when they may simply reflect sampling variability. The authors also caution against the unrealistic expectation in psychological research that a single study can provide definitive answers on a particular issue or treatment approach. In their view, knowledge should be treated as cumulative rather than definitive, built up through published studies over time that apply effective sample size planning techniques. Knowledge is only power when those techniques are applied.

Reference: Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563.
