Statistical Power in the COSP Studies Page 2

Choice of Analysis and Statistical Test

There has been talk about repeated measures and using “covariates” to improve the power of COSP analyses. The most common tests compare the difference between the groups on a statistic like the mean to the variability within groups, or estimate of sampling error. Some tests, such as non-parametric tests, use less of the information and have less power. Others, such as analysis of covariance, can increase power by “removing” the part of the within-group variance that is attributable to stable or pre-existing characteristics of individual participants, and testing the difference between the means against the remaining or residual variation. A participant’s score on the baseline measure is one commonly used summary of the many relevant ways that people can differ. It can be used as a “covariate,” statistically removing its effects, in an analysis that increases power over other comparisons. In effect, a person’s outcome score is compared, not to the entire group, but only to those who had similar baseline scores.

A similar technique is “blocking,” in which, say, men and women are considered distinct “blocks” of participants and are assigned randomly in equal numbers to two groups; the analysis then estimates and removes the effects of these “blocks” before assessing the differences between groups. It generally increases power somewhat less than including a continuously varying covariate. Repeated measures designs are another variation on this theme. If each person serves as his or her “own control” on several sequential measures, that person’s scores will be more similar to one another than to the scores of other participants. Estimating effects based on change from previous scores permits a similar reduction in the estimate of sampling error against which the differences are assessed in the statistical test.

The gain in power from using these analyses can be great. For example, a covariate that correlates .6 with an outcome can be expected to increase the effect size by about 25%, because the residual standard deviation shrinks to .8 of its original value (the square root of 1 − .6²). Several recent treatments of power issues have urged that pretest ANCOVAs or blocking be used in most comparative program research (Lipsey, 1990, 1994; Dennis, 1997). In addition to more stable characteristics and baseline values of outcomes, a measure of the total amount of key COSP ingredients experienced by a participant would also be a potential covariate, if it can be determined at the individual level. Including individual-level service and program-participation measures would also help to explain variations in outcome, that is, why some participants report benefitting less from the programs than others.
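The size of the gain from a covariate can be checked with a short calculation. The sketch below assumes the textbook ANCOVA approximation, in which a covariate correlating r with the outcome leaves the difference between means unchanged and shrinks the within-group standard deviation by a factor of sqrt(1 − r²). The function name is illustrative only and is not part of any COSP analysis plan.

```python
import math


def ancova_effect_size(es, r):
    """Effect size after adjusting for a covariate correlated r with the outcome.

    Assumption: the mean difference is unchanged and the within-group
    standard deviation shrinks by sqrt(1 - r**2).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return es / math.sqrt(1.0 - r ** 2)


# Proportional gain in effect size for several covariate-outcome correlations.
for r in (0.3, 0.5, 0.6, 0.7):
    gain = ancova_effect_size(1.0, r) - 1.0
    print(f"r = {r:.1f}: effect size increases by {gain:.0%}")
```

Under this approximation the gain grows quickly with r, which is why a baseline measure of the outcome itself is usually the most useful covariate.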

Program “Strength” and “Integrity” (or “Fidelity”)

The most typical form of the effect size is based on the difference between the means on an outcome.  So, anything in the implementation of the programs during the study that degrades the effects of the innovative program (COSP) or upgrades the traditional program (TMHS) will decrease the measured effect (the difference between the means), and reduce statistical power.  Program integrity usually refers to the constancy and appropriateness of the program delivered across participants.  If some providers did not implement the planned program as fully as others working in the same model, or even if one instance of a model in a cluster happened to be a “super-realization” of a particular type of COSP, it would tend to decrease the effect size by decreasing between-group differences, and so to decrease power for a given sample size.  

However, on the “flip side,” this is also the source of the concern about doing things to “screen” for characteristics that may lead to increased likelihood of “retention” or continued participation in the program or the study, discussed below. Specifically, an erosion of program integrity is, in effect, what happens at the individual level when there are high levels of “non-participation” for any reason in COSPs (nonengagement in the program, individual decisions that it is not appropriate, barriers to access, etc.). For some COSP programs (notably, the drop-in centers) there is inherent variation and self-selection in the “amount of program” participants receive, but even for these, there can be efforts to stabilize what is available to participants over the course of the study.

Concerns About Design and Implementation Decisions that May Damage Power

A number of scenarios have been identified in discussions of research logistical issues as potentially leading to damaging loss of power.   These include a choice not to participate fully in the program, resulting from inappropriate selection, or  a decision to withdraw; excessive heterogeneity in participant characteristics, both within and between sites; excessive variation in “degree of COSPness” (lack of fidelity to an emerging model of what ingredients all COSPs have in common); and the variation in outcomes to be expected among the three main types of COSPs represented in the study.

Less than “full” participation in a COSP.  In early discussions of the research logistics committee, a concern was expressed that “the power of the study to detect any effect produced by a COS will be significantly compromised if only a small proportion of study participants who are assigned to the COS actually follow through and use it...  The degree of risk is especially acute for drop-in centers.  Since it appears to be generally accepted that such programs typically attract and retain only a portion of the full range of people eligible for inclusion in the COSP, it might seem reckless to proceed without carefully considering how to address the problem up front.  There seem to be two ways to address the issue: refining the selection criteria, and optimizing the program's approach to retention of its members over time.  After some discussion of the desirability of trying to represent the full range of consumers of TMHS who meet clinical and TMHS-utilization criteria in each site, it was agreed that this selection strategy could lead to assigning many study participants to COSPs who would decide not to participate fully in the programs and would experience very little of what the programs have to offer.  Including these participants in the COSP groups for analysis would greatly “dilute” real contrasts in outcomes between those who experience more of the typical COSP program and those in TMHS, and thereby greatly reduce the power to detect differences between the groups.  There is an agreement in principle to select those consumers who express an interest in COSP or who are judged somewhat more likely to participate based on characteristics determined in the baseline interview.

Withdrawal from the study and Loss to Follow-up.  Although there are some ways to try to estimate what the outcomes would be like for those who choose not to stay in the study or for other reasons do not complete at least one follow-up interview, the number of participants assigned to the conditions will generally be larger than the number followed up.   It is the latter number that needs to be taken into account as an “effective” N in calculating power.  The sites vary considerably in their projected rates of loss to follow-up, and it may be appropriate to discuss and clarify the assumptions underlying these divergent projections from each site, with a view to finding ways of reducing loss to follow-up where it is expected to be a problem.

Excessive Heterogeneity Within and Across Sites.  There were discussions leading to the selection paper that emphasized the increased external validity or generalizability that might be achieved by including a representative sample of consumers from each site.  The objections to this strategy tended to focus on the potentially damaging effects on the internal validity of the comparisons and the loss of power if many of this broad sample were to decide that COSPs were not appropriate for them, or for other reasons choose not to participate after assignment.  But even if efforts to engage and retain a broader sample in the COSPs were successful, including a more heterogeneous group of participants in the study could also mean that there would be more variation in outcomes (a bigger denominator) and smaller effect sizes.

Note that some of these paths to reduced power are conjectural.  Unlike the relationship between N and power for a given effect size, we generally don’t know how much of a degradation in effect size an imagined or feared mechanism will lead to, so it is nearly impossible to settle, to everyone’s satisfaction, disagreements about whether a danger is real or important enough to take into account in planning.  Without bringing additional data from other studies to bear on the questions, we will have to rely on the collective judgment of the experienced researchers and consultants involved in COSP.

One exception is the quantification of the effects of nonparticipation by individuals in their assigned program, which was provided by our consultant, Jim Ware.  He indicated that if only a proportion p participate in the assigned program, the “effective N” is Np² (N multiplied by the square of p).  A non-technical way to think about this might be that, when all followed participants are included in the analyses (an “intent to serve” analysis), those who were invited or assigned to the COSP but did not participate in it are still there in the data, as a “no-effect” contribution to the COSP average.  So, to maintain the power of the original sample size, it would not suffice just to “replace” each of them with another participant.  Rather (and very roughly), it would be necessary first to add another participant to compensate for or balance out the presence of the person who received no program and can be expected to make no contribution to showing an effect of COSP, and then to add another participant to give the originally hoped-for contribution of a COSP effect to the estimate of the difference between the groups.
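The practical consequences of this rule are easy to tabulate. One way to see where it comes from, assuming that nonparticipants show no program effect at all, is that the intent-to-serve effect size shrinks from d to p × d, and since power depends on d × sqrt(N), the diluted comparison behaves like an undiluted one with N × p² participants. The sketch below simply applies the rule as stated; the function names and the example figures are hypothetical.

```python
def effective_n(n, p):
    """Effective sample size when only a proportion p participate (N * p**2)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be greater than 0 and at most 1")
    return n * p ** 2


def n_needed(target_effective_n, p):
    """Assigned N required so that the effective N equals the target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be greater than 0 and at most 1")
    return target_effective_n / p ** 2


# Hypothetical example: 100 participants assigned to a COSP.
for p in (1.0, 0.9, 0.7, 0.5):
    print(
        f"participation {p:.0%}: effective N = {effective_n(100, p):.0f}, "
        f"N needed to keep an effective N of 100 = {n_needed(100, p):.0f}"
    )
```

With half of those assigned participating, the rule implies an effective N one quarter of the assigned N, so the assigned sample would need to be four times as large to recover the original power.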

What Effect Sizes Can be Expected?

Power depends on the effect size to be detected.  Ideally, we would like to know what effect sizes can be expected and whether they are large enough to make a practical or policy difference.  Given a specified sample size, alpha, test, and desired power, one can say what minimum effect size can be statistically detected.  This is one form of the basic power analysis used to justify a study with a certain number of participants.  Whether the minimum detectable effect size is large enough to warrant implementing the design depends in part on whether effects that can realistically be expected to occur are at least this large, and whether effects of this magnitude are “worth detecting” in the sense of having practical significance for policy or program improvements.  If we do not want a power analysis to be an empty game of numbers, a study needs not only internal validity but also meaningfulness, producing differences with practical importance.  Fortunately, cost can be part of this, but only one part.  Some other outcomes may be judged to be important if the differences are big enough, even if they are not easy to associate with cost savings.  There are several different ways of assessing the minimum effect size that a design should be able to detect.
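The minimum detectable effect size for a simple two-group comparison can be sketched as follows. This is a rough planning aid, not the calculation used for any COSP site: it assumes two groups of equal size, a two-tailed test, and a normal approximation to the t test, which slightly understates the effect size required in small samples. The function name and the sample sizes shown are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized mean difference detectable with the given power.

    Assumptions: two equal groups, two-tailed test, normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_power = z.inv_cdf(power)              # quantile for the desired power
    return (z_alpha + z_power) * sqrt(2.0 / n_per_group)


# Illustrative group sizes only.
for n in (25, 50, 100, 200, 400):
    print(f"n per group = {n}: minimum detectable ES = "
          f"{minimum_detectable_effect(n):.2f}")
```

Because the detectable effect falls only with the square root of the sample size, halving it requires roughly four times as many participants.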

Actuarial.  This approach consists of looking at similar research to determine what effect sizes are typically produced, and judging the likelihood of generating a similar effect size against the studies that are most similar.  Overall, the most commonly used criteria for judging effect sizes were suggested by Cohen (1977, 1988) after reviewing a wide range of studies in psychology and education (“social science research”).  Based on the distribution of effects in these studies, Cohen suggested that an ES of .20 could generally be judged a “small” effect, while .50 could be regarded as “medium” and .80 as “large.”  But he also emphasized that these were crude categories, rough heuristics, put forward tentatively as “proposed conventions,” and acknowledged that the division of the observed distribution had no more reliable basis than his intuition (Cohen, 1988, p. 532).  He felt that they should probably not be leaned on too heavily in planning research in a particular field, and urged that the meaning of “small,” “medium,” and “large” should be thought of as relative to the specific field and research method being used in a study.  He did note that in new areas of research effect sizes are likely to be “small” (when they are not zero) because of a lack of knowledge of what to control in designs, and difficulties in developing good measures of the phenomena.  Others (have suggested that

Lipsey (1990) endorsed, and reported on the first phase of a major research synthesis implementing, the suggestion that an actuarial approach to planning studies would probably work best in areas where many meta-analyses had already been conducted, summarizing the effect sizes and coding many other methodological and programmatic characteristics across many similar studies and relating them to features such as the type of comparison on which the effect size was based (Cordray and Sonnefeld, 1985).  Lipsey reviewed and classified average effect sizes from 186 meta-analyses of mental health and education research.  Lipsey and Wilson (1993) greatly extended the actuarial approach by assembling and categorizing the results of 302 meta-analyses of research in a wide variety of what they called “treatment effectiveness research.”  The Lipsey and Wilson tables of average effect sizes are often referred to in justifying a design to detect a minimum effect size.  (Wilson continues to maintain a growing database of meta-analyses in a variety of program areas that can be used as a basis for planning and power analysis.)

The actuarial approach to judging effect sizes depends on the existence of enough comparative research judged sufficiently similar to the planned study to generate a high level of confidence that the results will be similar.  The Lipsey and Wilson review covers a wide range of programs.  But in some areas, for example studies of COSPs, we may judge that few comparative studies exist that are sufficiently similar to support claims about effect sizes to be expected in the COSP multisite.  It will then be arguable which set of effect sizes gives the best available estimate of what can be expected, and there may even be controversy among those familiar with the literature.

This may be so because few existing studies will have just the type of programs being studied.

It is currently being argued that COSPs have unique features or “ingredients” that have not yet been the focus of comparative evaluations.  It may be that there are few or even no studies that bear directly on the effect sizes to be expected in this project.  In addition, it is arguable that there are even fewer studies of COSP-like programs that include the type of “value added” comparison in the COSP multisite designs, where effect sizes are generally expected to be lower than in comparisons with “no treatment”.  For example, Lipsey and Wilson list meta-analyses in the “mental health: other...” category for “innovative outpatient programs vs. Traditional aftercare” (avg. ES=.36).  This seems the closest to some features of COSPs.  Others, including “[service] provided by paraprofessionals in mental health.. vs. untreated controls” (avg. ES=.60), “group assertion training” [no comparison specified] (avg. ES=1.5), and perhaps a few others such as “social skills training,” may be relevant in some ways in defining the range to be expected within a broader spectrum, but are clearly not close to COSPs in program features or types of comparison.  A recent meta-analysis of studies of 9 “psycho-social rehabilitation services in community support programs” (Dilk and Bond, 1996, as reported by Barton (1999)) found ESs ranging from .35 to .55, depending on outcome.  The effect sizes we might predict for COSPs would depend, among other factors, on the type of comparison (“value added” vs. waiting list or other “control”).  The latter generally yield larger effect sizes.  Because of this, we might suggest that the lower end of this range, .35, would be a reasonable maximum ES to expect for many COSP outcomes.  Any other relevant meta-analyses or sets of studies from which ESs could easily be extracted would help the remainder of the planning process.

Translation into “Real-World” or Policy-Relevant Units.  The basic dilemma is that high power requires a large effect size, a large sample size, or both.  But many interventions in the medical world, for example, are agreed to have great practical importance while producing only modest statistical effects, the most famous recent one being the effect of small doses of aspirin on reducing the risk of myocardial infarction.  In the context of a highly prevalent, life-threatening, and disabling disease for which very large numbers of people are at risk, the small effects in an ongoing clinical trial were judged to be so important that it would be unethical to continue the trial and deprive anyone of the potentially life-saving “program”.  Obviously, what counts as a practically important effect depends on the type of outcome as well as many other factors of the study’s context.

On some of the outcomes being measured in the COSP studies, it may be possible for some policy makers or program planners to specify in advance how much change or growth would be substantial enough to warrant extension of COSP programs or otherwise judge them “successful”.  One way of translating ESs into more meaningful terms has been found to support these judgments.  If some criterion of success can be specified, such as being above the median of all participants in the study on some measure or group of measures defined as “important” clinical or consumer-oriented outcomes, then an effect size can be translated into the percentage of the COSP groups who are “successful,” in comparison to the percentage of the TMHS groups.  It should be easier to specify what difference in proportion of successes would be regarded as worth paying attention to in policy contexts.  [More could be added on similar efforts to provide statistical translations of ESs into more meaningful metrics, including Cohen’s U3 and Rosenthal and Rubin’s BESD.]
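Two of these translations can be sketched in a few lines. The example assumes the usual textbook definitions: the binomial effect size display (BESD) reports “success” rates of .50 + r/2 and .50 − r/2, where r is obtained from the standardized mean difference d as d / sqrt(d² + 4) (a conversion that assumes groups of equal size), and Cohen’s U3 is the proportion of the comparison group falling below the average member of the program group, assuming normal distributions with equal variances. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def besd(d):
    """Binomial effect size display for a standardized mean difference d.

    Returns (program success rate, comparison success rate).
    Assumption: equal group sizes, so r = d / sqrt(d**2 + 4).
    """
    r = d / sqrt(d ** 2 + 4.0)
    return 0.5 + r / 2.0, 0.5 - r / 2.0


def cohen_u3(d):
    """Proportion of the comparison group below the program group's mean.

    Assumption: normal distributions with equal variances.
    """
    return NormalDist().cdf(d)


for d in (0.2, 0.35, 0.5, 0.8):
    program, comparison = besd(d)
    print(f"ES = {d}: BESD success {program:.0%} vs. {comparison:.0%}, "
          f"U3 = {cohen_u3(d):.0%}")
```

Seen this way, an ES of .35 corresponds to a difference of roughly 17 percentage points in “success” rates, which may be easier to weigh in a policy discussion than the effect size itself.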

Comparison with a Criterion Group.  There is another way of evaluating the magnitude of an effect, sometimes called the “criterion contrast” approach, which sounds formidably jargony, but may be the clearest way of seeing how large an effect can be expected.  It involves looking at two groups that we would all agree are different enough on the outcome measure to make a difference.  One possibility for COSP might be to examine pilot or other volunteered common-protocol data provided by a group of current COSP participants selected to be “shining examples” of recovery or the beneficial effects of COSPs, and then compare their baseline scores on key outcomes to pilot data from those in traditional services.  This kind of side study takes advantage of an opportunity that a multisite design provides: its larger pilots can be used to increase the interpretability of the outcomes.  Other studies wishing to employ new measures to evaluate innovative programs do not enjoy such an advantage.

Design Strategy to Enhance Power

Regardless of the technique used to specify the minimum effect size a study must have the power to detect, there seems to be general agreement among program evaluation methodologists that the only responsible thing to do is to attempt to overcome the possible limitations on power in every possible way during the design phase.  This is why some of both the COSP researchers and the program administrators have emphasized the desirability, or even the crucial importance, of not trying to select participants who would be representative of all traditional mental health consumers in a community who meet clinical and service-utilization criteria specified by the GFA.  Rather, they urged, the selection criteria and recruitment and orientation procedures should be used to try to enroll the subset of those “interested” in COSPs, whom providers are currently referring to the programs (or who are “voting with their feet” in self-referrals), and who would be more likely to stay.  In addition to representing the benefits for participants who have higher-intensity or fuller experiences of the COSP ingredients, a set of well-implemented selection criteria based on interest and suitability might also reduce cross-site heterogeneity in participant characteristics.  Again, permitting more participant heterogeneity would reduce power as well as raise questions about the suitability of pooling data.

Similarly, attention to reliability of measures and more planning for variance reduction analyses and statistical tests will tend to enhance power.  The biggest potential gain in power may result from pooling data across sites, if it is judged feasible and appropriate. 

Where do we go from here?

The preliminary power estimates presented in the Appendix are based on the proposals submitted by individual sites, and represent a common-denominator approach to power estimates for the 8 COSP sites.  They use each site’s N and expected attrition and a two-tailed alpha of .05, assume the most common test of differences between group means (a t or F test), and estimate the power to detect the three conventional “small” (.2), “medium” (.5), and “large” (.8) effect sizes, as well as the .35 effect size at the lower end of one possibly relevant set of previous studies, referred to earlier.  In addition to individual site power estimates, they include estimates of the power that might be obtained by pooling the sites, first in three clusters by COSP type (drop-in, peer support, or education/advocacy), and then overall, if all sites were pooled.  They do not attempt to estimate the power for subgroup comparisons.  Because descriptions of the pattern of attrition or withdrawal varied in specificity about whether program withdrawal or loss to follow-up was being estimated, they make the more optimistic assumption that the stated percentages represent loss to follow-up, and treat the remainder of the target as the “effective N”.  With no further assumptions or more differentiated analyses, it appears that some sites may have problems in achieving adequate power to detect “small” (.2) or .35 effects.
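The kind of calculation described here can be sketched as follows. This is not the Appendix calculation itself: the site sizes and attrition rates are hypothetical, the power function uses a normal approximation to the two-group t test with equal group sizes, and pooling is treated naively as adding effective sample sizes, ignoring any site or cluster effects.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(es, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group test of means.

    Assumptions: equal group sizes and a normal approximation to the t test.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = es * sqrt(n_per_group / 2.0)  # expected value of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def effective_n_after_attrition(n_assigned, loss_rate):
    """Participants expected to remain after loss to follow-up."""
    return n_assigned * (1.0 - loss_rate)


# Hypothetical sites: (participants assigned per group, expected loss rate).
sites = {"Site A": (60, 0.20), "Site B": (80, 0.25), "Site C": (50, 0.10)}
effect_sizes = (0.2, 0.35, 0.5, 0.8)


def report(label, n_per_group):
    powers = ", ".join(
        f"ES {es}: {power_two_groups(es, n_per_group):.2f}" for es in effect_sizes
    )
    print(f"{label} (effective n per group = {n_per_group:.0f}): {powers}")


pooled_n = 0.0
for name, (n_assigned, loss_rate) in sites.items():
    n_eff = effective_n_after_attrition(n_assigned, loss_rate)
    pooled_n += n_eff
    report(name, n_eff)
report("Pooled", pooled_n)
```

Varying the hypothetical inputs shows how strongly power for the smaller effect sizes depends on the effective N, and why pooling is attractive when it can be justified.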

Even the preliminary analysis suggests that, if their data can be pooled, at least two of the COSP types may be able to detect effect sizes only slightly larger than .2, and would have more than adequate power to detect effects as large as .35,  even with their currently planned sample sizes.  This level of power would probably be judged to be adequate by most standards. 

Clearly, “sensitivity analyses,” which test a variety of other assumptions about the parameters in the power equation in “what-if?” scenarios, are well warranted.  The sites’ descriptions of power analyses appear to reflect a range of site-specific assumptions about planned statistical methods, expected effect sizes, effect sizes considered “important” to detect, program withdrawal, and loss to follow-up.  Some proposals specified assumptions about these issues explicitly, while others left them more implicit.  Now that we have arrived at a common protocol, subject to piloting, it seems appropriate to attempt to specify more explicitly both site-specific and cross-site expectations regarding the various determinants of the project’s statistical power.  In conjunction with discussion of common ingredients of the COSPs and the way these will be measured, these issues will bear on the feasibility and desirability of pooling data both within and across COSP-type clusters.  Moreover, it is to be hoped that further discussion and specification of power-related issues within and across sites will facilitate refinement of estimates of needed sample size, development of strategies to deal with attrition, and other ways to improve the quality of both the research and the programmatic enterprise.

REFERENCES

Barton, R. (1999).  Psychosocial rehabilitation services in community support systems: A review of outcomes and policy recommendations.  Psychiatric Services, 50, 525-534.

Boruch, R.F. (1997).  Randomized Experiments for Planning and Evaluation.  Newbury Park CA: Sage Publications.

Cohen, J. (1988).  Statistical Power Analysis for the Behavioral Sciences (2nd ed.).  New York: Academic Press.

Cohen, J. (1992).  A power primer. Psychological Bulletin, 112, 155-159.

Cordray, D.S., and Sonnefeld, L.J. (1985).  Quantitative synthesis: An actuarial base for planning impact evaluations.  In D.S. Cordray (Ed.), Utilizing Prior Research in Evaluation Planning.  New Directions for Program Evaluation, No. 27.  San Francisco: Jossey-Bass.

Dennis, M. (1997).  Practical power analysis for substance abuse health services research.  In K. Bryant, M. Windle, and S.G. West (Eds.), The Science of Prevention: Methodological Advances from Alcohol and Substance Abuse Research.

Dilk, M.N., and Bond, G.R. (1996).  Meta-analytic evaluation of skills training research for persons with severe mental illness.  Journal of Consulting and Clinical Psychology, 64, 1337-1346.

Lipsey, M.W. (1990).  Design Sensitivity: Statistical Power for Experimental Research.  Newbury Park CA: Sage Publications.

Lipsey, M.W. (1994).  Design sensitivity: Statistical power for applied experimental research.  In L. Bickman and D. Rog (Eds.), Handbook of Applied Social Research Methods.  Newbury Park CA: Sage Publications, 1990.

Lipsey, M., and Wilson, D. (1993).  The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis.  American Psychologist, 48(12), 1181-1209.

Lipsey, M.W., Crosse, S., Dunkle, J., Pollard, J., and Stobart, G. (1985).  Evaluation: The state of the art and the sorry state of the science.  In Utilizing Prior Research in Evaluation Planning, edited by D.S. Cordray.  New Directions for Program Evaluation, no. 34.  San Francisco: Jossey-Bass.

Murray, D. (1997).  Design and Analysis of Group-Randomized Trials.  NY: Oxford University Press.

Posavec, E.J. (1998).  Toward more informative uses of statistics: Alternatives for program evaluators.  Evaluation and Program Planning, 21, 243-254.

Schneider, A.L. and Darcy, R.E. (1984) Policy implications of using significance tests in evaluation research.  Evaluation Review, 8, 573-582.

Missouri Institute of Mental Health, 5400 Arsenal Street, St. Louis, Missouri 63139
Phone: 314-644-8787  Fax: 314-644-8834