
 

Statistical Power in the COSP Studies
Coordinating Center Discussion Paper

August 4, 1999

Introduction

Many of our discussions of cross-site research design issues have made reference to the impact of our decisions on statistical power.  Key issues of measurement, reliability, design, cross-site differences in programs and the selection of participants are important largely because of their effects on statistical power. The purposes of this paper are to lay out the terminology and basic concepts of power analysis in fairly non-technical ways, to make the importance of having adequate power and some of the issues surrounding the concepts more accessible to consumers and nonstatisticians, and to explore the potential implications of alternative design and logistical decisions for the outcomes of the research.  It attempts to clarify the relationships among sample size, variability, statistical tests, etc., and our operating assumption that worries about power should not always be addressed by trying to increase sample size.

We also present, in the Appendix, preliminary estimates of the power to detect likely effects of the COSP programs.  These are based on planned sample sizes in the individual sites, the three clusters of COSP types, and all 8 sites if it were possible to pool their outcome data.  Throughout, we draw on a number of excellent discussions of statistical power and design issues in treatment or program effectiveness research (Cohen, 1988; Dennis, 1997; Lipsey, 1990; 1994) and more specifically comparative program evaluations in field settings (Boruch, 1997; Murray, 1997; Posavec, 1998).  We also summarize a number of points made by COSP investigators and consultants during Steering Committee and Logistics Subcommittee discussions.  Because we have attempted to paraphrase technical discussions, there is some loss of precision in the language, but even experienced researchers sometimes find the constellation of concepts surrounding power difficult to think and talk about in ways that are intuitive and not controversial.  The aim here is to provide a basis for further discussion, which may include more differentiated analyses directed toward specific issues only touched on here.

The Meaning of Statistical Power

In the context of a comparative evaluation of two programs or services, statistical power is the probability that a statistical test will lead to a conclusion that there is a difference between groups on an outcome measure, when in fact there really is such a difference.   It is the probability of correctly detecting a real difference.  Power is what we want to maximize in a research design in order to make it sensitive to actual program effects.  When power is high, we reduce the probability of committing what is known as a “Type II error,” which is incorrectly concluding that there is no real difference between the two programs, when in fact there is.  (In the vivid, poetic language of statistics, the other type of error it’s possible to make is called “Type I error”.  If Type II is a “false negative,” Type I is a “false positive”: concluding that there is a difference between the groups when there really isn’t.)  We all want the COSP analyses to be empowered, in this sense of having a high probability of detecting any change in participants that is really “out there.”  In the technical language of null-hypothesis statistical testing:

A Type I error is committed when we reject the null hypothesis when it is true.  The probability of a Type I error is denoted by α (alpha).
A Type II error is committed when we accept the null hypothesis when it is false and the research hypothesis is true.  The probability of a Type II error is denoted by β (beta).
Statistical Power is the probability of rejecting the null hypothesis when it is false.  It is equal to 1-β.

 

 

 

 

Why We Need It

We all want to draw the correct conclusions about the COSP programs, for the research to be “valid” in this general sense.  It would seem to be a basic feature of professionally designed research that it be able to draw the correct conclusions, once the evidence is in.  But, surprisingly, many program evaluations and other applied research studies have been conducted and published reporting “null results” or “no difference” findings (failures to find better outcomes in an innovative program or service) when an analysis of the research designs shows that they did not have the statistical power to detect effects that would be likely to have some practical importance.   For example, an important review of a large number of published evaluations across several areas (Lipsey et al. 1985) found that many (or the majority) of those reporting null results were woefully underpowered: they had little or no chance of detecting the small-to-moderate effects typical of new programs.  This explanation for the preponderance of no-difference findings led the authors to title the influential article summarizing their work:  “Evaluation: The state of the art and the sorry state of the science.”  Cohen’s (1992) review of a wide range of research areas found that attention to securing adequate statistical power had not really increased substantially over the previous 20 years.   It is not unreasonable to hope that current studies like those in COSP can do better.

An earlier theme in COSP multisite research discussions was the importance of considering modifications to research designs because we want to do everything possible to avoid being unfair to the COSP programs by not identifying true program effects.   Since the early days of program evaluation and field experimentation, researchers have become much more conscious of how crucially important it is to have adequate power in a program evaluation.  They believe that we have a heavy responsibility to program beneficiaries, program planners and providers, funders, and policymakers, to do everything in our power to ensure that publicly funded studies of programs in general, and especially of new programs that show great promise, have adequate statistical power.  Underpowered studies could detour us from paths that would have proved fruitful, not only hampering the multisite knowledge development effort about differences among COSPs and the effectiveness of various ingredients, but unfairly damaging the prospects of COSPs overall.

In fact, the potential to increase the power to detect effects and variation in program effects is one of the main justifications for doing a multisite study.  Although the researchers at each of the sites have conducted power analyses as part of their proposals, the government funds multisite studies not only to increase understanding of the effects of differing programs like the several types of COSPs, in their different community settings and with differing participants, but also to increase the possibility of asking, and getting the correct answer to, questions about overall differences in the effects of COSPs, across a wide range of outcomes, and possibly for subgroups of participants.  Even if single sites have adequate power to detect overall differences on many outcomes, it is clear that at least some sites will not be able to determine statistically whether, for example, COSPs lead to better outcomes for women than for men.  There just won’t be a sufficient number of both men and women who are followed to yield statistically stable results for subgroup comparisons.

How do we get it? Factors that Determine Power

When people talk about statistical power, they are most often talking about “N”, sample size, the number of participants in the programs being compared. But sample size is only one of four interrelated quantities. The others are the alpha level [all the “p<.05”’s you see in research reports], the effect size we want to detect, and the level of power itself, for which the minimum acceptable value is most often set at .8 (but even that much power means that a real effect will only be “detected” statistically 80% of the time). Fixing any three of these determines the fourth.
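The way these quantities fit together can be sketched in a few lines of Python. This is only an illustration, not the method used for the Appendix estimates: it assumes a two-sided comparison of two equal-sized groups and uses the normal approximation to the t test. The function name `power_two_group` and the example values (effect size 0.5, 64 participants per group) are hypothetical.

```python
from statistics import NormalDist

def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    Assumes equal group sizes and uses the normal approximation
    to the two-sample t test.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)               # cutoff set by alpha
    shift = effect_size * (n_per_group / 2) ** 0.5  # expected test statistic
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical design: effect size 0.5, 64 participants per group.
print(round(power_two_group(0.5, 64), 2))
```

When the effect size is zero, the function returns alpha itself, which is the Type I error rate: the only "detections" are false positives.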

Sample size.  Under normal circumstances, the relationship between power and sample size is so close that questions about power are almost always about what N (how many participants) is necessary to attain a desired level of power.  But power is one feature of the “null-hypothesis statistical testing game.”  We should also be aware that there are many methodologists and statisticians who advocate breaking out of this paradigm altogether, but because it is so well established in medical and services research, we probably are stuck with it to a considerable extent, at least when doing initial comparative research on innovative programs. The framework of probabilities and distributions and tests of hypotheses and probabilities of error specified under the null-hypothesis statistical test paradigm is the source of the concern about power, and “N” is one key piece.  But the other pieces of the framework also play key roles in determining power.

Alpha level.  Alpha is “set” by the researchers designing the study.  It is defined as the probability of detecting a spurious effect (“false positive”) where in fact none exists.  (In the vivid, poetic, and mnemonic language of statistics, this is called “Type I” error.)  Power is by definition the complement of beta.  Beta is defined as the probability of committing a “Type II error”, or falsely concluding that the groups being compared are not different.  Specifying a smaller alpha indicates less willingness to risk “false positives.”  However, for a given sample size, and given expectations about likely “effect sizes” (discussed in more detail later), smaller alphas mean less power to detect real effects.  Conversely, larger alphas, while running greater risks of false positives, make “statistical significance” easier to attain, so that there are fewer false negatives.

Some of these technical factors, including alpha, are pretty much fixed, in practice, by convention, although they would be free to vary if we determined not to be bound by the conventions of the kind of research we are doing.  Alpha is most often set at .05 (5% chance of a false positive) but that’s a risk level that’s theoretically within the power of a researcher or group of researchers to set otherwise, at whatever they determine to be an appropriate level for the decision-making context of the study.  If we are willing to accept a greater risk of a “false positive” conclusion in favor of a difference, we can set alpha higher, and reduce the probability of committing the other error—increase power. 
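The tradeoff between alpha and power can be made concrete with a rough sketch. It assumes a two-sided, two-group comparison under the normal approximation; the effect size (0.3) and group size (100 per group) are hypothetical values chosen only for illustration, not COSP estimates.

```python
from statistics import NormalDist

def power_two_group(effect_size, n_per_group, alpha):
    """Approximate power, two-sided two-group test, normal approximation."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical design held fixed while alpha varies.
EFFECT_SIZE = 0.3
N_PER_GROUP = 100

for alpha in (0.01, 0.05, 0.10):
    power = power_two_group(EFFECT_SIZE, N_PER_GROUP, alpha)
    beta = 1 - power  # probability of a Type II error
    print(f"alpha={alpha:.2f}  power={power:.2f}  beta={beta:.2f}")
```

With everything else held fixed, each increase in alpha buys some additional power, at the price of a higher risk of a false positive.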

In fact, some of the leading writers on power in program evaluations (Lipsey, 1990, 1994; see also Posavec, 1998; Schneider and Darcy, 1984) urge just this: that power can and should be increased on occasion by increasing alpha, accepting a greater risk of falsely commending a program.  This may be the case, for example, when prior research has generally shown similar programs to be effective, or when the consequences of missing a real program effect are judged to be very costly.  One point of view is that, if both types of error are equally serious on the basis of real-world subject-matter concerns, efforts should be made to adjust the design so that alpha and beta, the probabilities of both types of error, will also be approximately equal, and one way to do this is to increase alpha.  But so much research on services and programs uses the .05 convention for alpha, and maintaining both alpha and beta at low levels requires large sample sizes and effect sizes, so we are probably stuck with alpha=.05 and some other level of beta, unless we can provide a strong justification for setting alpha at a higher level.  Power will probably have to be enhanced in other ways.

Effect Size (“the problematic parameter”)

The best way to make clear why other factors in the design and implementation of our study will have effects on power, on the ability to find the effects we’re looking for, may be to look at the one other piece of the power framework, the thing we’ve been calling the “effect size” without further explanation of what we mean.  For a given COSP outcome measure, an effect size is just:

Effect size = (Mean_COSP - Mean_TMHS) / (Variation in the outcome measure)

(This is not the strictly mathematical form, in which the denominator is actually one or another “estimate of the population standard deviation” on the outcome measure, but the words may help to keep the importance of the amount of observed variation clearly in mind.)  Note that the effect size expresses the difference between two programs on an outcome measure relative to the total variation on the outcome: It’s a way of saying how large the difference between an “average person” in a COSP and an average person who is receiving only TMHS is, compared with all the differences between people in the study on any particular outcome (say “empowerment”).  (Another, less human way of describing an ES is a “signal-to-noise ratio”, where the differences in outcomes caused by the program are the signal, and all unrelated variations in participants’ outcomes, attributable to all factors other than the differences between the programs, are the “noise”.  You could also think of this as a “program effect-to-individual differences ratio.”)
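As a concrete illustration, one common version of the effect size (Cohen's d, which divides the difference in means by the pooled standard deviation) can be computed as below. This is a minimal sketch: the choice of the pooled standard deviation as the denominator is an assumption, since the text leaves the exact estimate open, and the scores are made-up numbers, not COSP data.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical outcome scores for illustration only.
cosp_scores = [3, 4, 5, 6, 7]
tmhs_scores = [2, 3, 4, 5, 6]
print(round(cohens_d(cosp_scores, tmhs_scores), 2))
```

In this made-up example the groups differ by one point on average, while individual scores within each group spread over about 1.6 points, so the "signal" is a bit under two thirds the size of the "noise".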

It is evident that anything we do in designing or implementing a study of a program that either increases the average outcome differences between the programs or services being compared, or decreases the variation among participants, will tend to increase the effect size, and thereby the power to detect the difference.  More reliable measures and better standardization of procedures for administering and rating them, for example, will tend to yield less variation in individual scores (“less noise”), meaning that the denominator will be smaller, so the effect size will tend to be larger.  Including a more heterogeneous group of participants in the study also generally means that there will be more variation in outcomes (a bigger denominator) and smaller effect sizes.
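The effect of measurement reliability on power can be sketched as follows. The sketch rests on a standard result from classical test theory, which is an assumption added here rather than something stated in the paper: if measurement error is unrelated to the program, the observed effect size is roughly the "true" effect size multiplied by the square root of the outcome measure's reliability. The true effect size (0.4), group size (100), and reliabilities are hypothetical.

```python
from statistics import NormalDist

def observed_effect_size(true_effect_size, reliability):
    """Effect size after attenuation by measurement unreliability."""
    return true_effect_size * reliability ** 0.5

def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power, two-sided two-group test, normal approximation."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical true effect size and group size.
TRUE_EFFECT_SIZE = 0.4
N_PER_GROUP = 100

for reliability in (0.6, 0.8, 1.0):
    d_obs = observed_effect_size(TRUE_EFFECT_SIZE, reliability)
    power = power_two_group(d_obs, N_PER_GROUP)
    print(f"reliability={reliability:.1f}  observed ES={d_obs:.2f}  power={power:.2f}")
```

The same program effect yields a smaller observed effect size, and therefore less power, when the outcome is measured less reliably.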

The Power Game: Attempting to Optimize Power

Designing studies, picking instruments for a common protocol, deciding on eligibility criteria, are all moves in a serious game of “optimizing statistical power.”  The prize is the ability to be fair to the programs in the data analysis phase.  Because the game consists of problem solving under uncertainty and the art of the feasible, it involves trying to come up with acceptable tradeoffs, things you can do that can be credibly argued to give the study the best chance of statistically detecting program effects that are really “out there.”  Holding all but two of the parameters in the power relationships constant, one can work various tradeoffs.  In the most familiar case, one can vary N, and observe the effects on power using tables or charts in books, or one of the many computer programs available for determining power.  But it is also instructive to imagine varying some of the other factors, such as alpha level or the effect size, to see what would have to change to maintain the desired power.
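One such tradeoff is to hold N, alpha, and the desired power fixed and ask how large an effect would have to be in order to be detected. The sketch below does this under the normal approximation for a two-sided, two-group comparison with equal group sizes; the function name and the group sizes are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest effect size detectable with the desired power.

    Normal approximation, two-sided test, equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cutoff for the significance test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return (z_alpha + z_power) * (2 / n_per_group) ** 0.5

# Hypothetical group sizes.
for n in (25, 64, 100, 400):
    print(f"n per group={n:4d}  minimum detectable ES={minimum_detectable_effect(n):.2f}")
```

Read this way, a planned sample size translates directly into a statement about which program effects the study can and cannot be expected to find.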

Of the things you can do, within the framework, increasing N stands out.  (One of the commonly used power handbooks is called “How Many Subjects?”)  All others are about moving the means farther apart or reducing the size of that denominator, or both.

It is usual to focus on “N,” the number of people participating in the study. If you look at power charts in books, it’s clear that virtually any level of power can be attained to detect virtually any effect size, if only the N is large enough.   But the problems with that focus are that increasing the number of participants can be prohibitively costly, and may well be impossible.  Apart from the limited resources and time available for doing interviews and following up with participants, only a limited number of potential participants may be available in each site.  Still, within these limits some sites may be able to use service enhancement money to increase N, the number of consumers they bring into the study, if they are worried about power after discussion of power analyses.
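How quickly the required N grows as the effect size shrinks can be seen in a rough sketch. It assumes a two-sided, two-group comparison with equal group sizes and uses the normal approximation, so the results run a participant or two below the t-based figures in published power tables. The effect sizes shown are Cohen's conventional "small", "medium", and "large" benchmarks, used here only as illustrations.

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Participants needed in each group to reach the desired power.

    Normal approximation, two-sided test, equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Conventional benchmark effect sizes, for illustration only.
for d in (0.2, 0.5, 0.8):
    print(f"effect size={d:.1f}  n per group={required_n_per_group(d)}")
```

Because the required N depends on the inverse square of the effect size, halving the effect size roughly quadruples the number of participants needed.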

There has not yet been much discussion in the committees or the lists of sample sizes or number of participants.  That may change.  Other discussions and work on the common protocol along the way have focused on some of the other factors that make a difference in power.

Outcome measures. 

Some measures are more responsive to change than others, especially in the ranges we think are likely to be represented in the study. Even a measure of a construct like empowerment or hope that can be argued to have considerable validity may or may not be sensitive to change over the course of the study.  Efforts will be made to address this issue in analysis, but it should be generally understood that this was one reason for trying to include longer rather than shorter versions of scales, and to avoid categorical measures, which allow fewer possibilities for recording incremental change.  Some measures may also have problems with “floor” or “ceiling” effects, meaning that most participants may already be so close to one end of the scale at baseline that further improvements do not register at follow-up. Again it will be essential to look at these possibilities during analyses, but some of the wrangling over measures should be understood to have been about trying to ensure that those that were chosen would be sensitive to change. Limits to change are limits on effect sizes that involve change, and therefore reduce power to detect differences between groups.
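A ceiling effect can be illustrated with a toy calculation. Everything in it is hypothetical: a scale with a maximum score of 25, five made-up baseline scores, and a "true" improvement of 3 points for every participant.

```python
from statistics import mean

SCALE_MAX = 25   # hypothetical top score of the scale
TRUE_GAIN = 3    # hypothetical true improvement per participant

def observed_gains(baseline_scores, true_gain, scale_max):
    """Gain the scale can actually record when scores are capped at scale_max."""
    return [min(score + true_gain, scale_max) - score for score in baseline_scores]

# Made-up baseline scores, several already near the top of the scale.
baseline = [15, 20, 23, 24, 25]
gains = observed_gains(baseline, TRUE_GAIN, SCALE_MAX)
print(gains, mean(gains))
```

Participants who start near the top of the scale cannot register their full improvement, so the average recorded gain falls short of the true gain, which shrinks the numerator of the effect size.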

Concern was also expressed about the reliability of the various scales in the common protocol and there will be attempts to assess reliability in the pilot and after.  Greater unreliability means more “noise” in the form of variability in the denominators of effect sizes that is not related to program effects.  Although some unreliability is properly described as being a property of the instruments adopted in the common protocol, there are also some things that can be done to reduce the irrelevant variability in responses. Training interviewers to administer the protocol in similar ways and monitoring interview procedures for consistency both within and between sites will tend to reduce irrelevant variability, increase effect sizes, and therefore enhance power to detect differences.


Missouri Institute of Mental Health, 5400 Arsenal Street, St. Louis, Missouri 63139
Phone: 314-644-8787, Fax: 314-644-8834