Carol L. Richards1, Sharon Wood-Dauphinee2 and Francine Malouin1
1Department of Rehabilitation and Centre for Interdisciplinary Research in Rehabilitation and Social Integration, Laval University, Quebec City and 2School of Physical and Occupational Therapy, McGill University, Montreal, Quebec, Canada
The objective of this chapter is to give an overview of basic principles guiding the development and application of outcome measures. This chapter starts by introducing basic concepts related to the development of outcome measures, and demonstrates their applications in stroke rehabilitation. Specifically, the first section includes a theoretical discussion of reliability, validity and responsiveness, and how to approach interpretation. This discussion is based on classical test theory. For each property the theory is applied to development of a measure of "participation". Future trends related to computer-adapted testing (CAT) are also briefly described. In the second section, the evaluation of "walking competency" after stroke is used to illustrate the selection of appropriate outcome measures for this population, as well as their relation to "participation". These measures include self-reported scales, performance-based ratings and laboratory assessments. This chapter concludes with examples of how laboratory-based gait assessments and measures of brain reorganization help explain changes in clinical scales and performance-based measures.
"Outcomes are the end results of medical care: what happened to the patient in terms of palliation, control of illness, cure or rehabilitation" (Brook et al.,
1976). Outcome measures used in stroke rehabilitation can be divided into three categories: scales that assess constructs such as function, mobility and quality of life, performance-based measures that evaluate such areas as gait speed and upper limb dexterity, and measures of brain plasticity. This chapter introduces basic concepts related to the development of outcome measures and demonstrates their applications in stroke rehabilitation. To avoid a redundant review of traditional scales (Wood-Dauphinee et al., 1994; Finch et al., 2002), a section on how to create an evaluative scale that measures "participation" is provided. This information will enable the reader to critique published scales when selecting them for clinical practice or research. In the second part, the evaluation of "walking competency" after stroke is used as an example to illustrate the selection of appropriate outcome measures for this population as well as their relation to participation. This chapter ends with discussions as to how laboratory-based gait assessments and measures of brain reorganization help explain changes in clinical scales and performance-based measures.
Measures used to evaluate the outcomes of rehabilitation for individuals with stroke generally reflect physical, psychologic or social characteristics of people. Physical attributes or abilities are most easily assessed as they are observed directly. They can be evaluated using electromechanical devices or functional status scales. Psychologic and social characteristics are more difficult to evaluate because they are concepts and often cannot be observed directly. To assess a concept, we must make inferences from what we observe and decide if, for instance, an individual is independent, depressed, motivated, receiving sufficient support or coping with life's challenges. When one cannot directly observe these concepts or behaviors, they are termed constructs (Portney and Watkins, 2000). Other constructs we might want to assess include impairment, ability/disability, community mobility, health status, self-efficacy, fitness, participation or quality of life. Such constructs are evaluated using standardized scales.
Development and testing of evaluative scales
A series of well-defined steps is necessary for the development and testing of a standardized scale (Juniper et al., 1996; Streiner and Norman, 2003). First one must decide what the instrument is to measure and the type of measure to be developed. We are interested in creating an evaluative scale (Kirshner and Guyatt, 1985) to assess "participation" of an individual or a group at baseline, and again at one or more points later on, principally to determine if change has occurred. In optimal circumstances, one would begin with a conceptual framework such as the International Classification of Functioning, Disability and Health (ICF) (World Health Organization, 2001), which proposes multiple interactions between the health condition, body functions and structures, activities, participation in life, environmental and personal factors. Such a framework names broad categories, suggests relationships among categories and anchors a measure to an extensive body of knowledge. It also provides a guide to evaluating the validity of the measure, and adds to its interpretability.
In the absence of a theoretical framework, or even in addition to it, pertinent information is obtained from a review of literature to identify existing instruments assessing similar constructs that may incorporate useful items. Ideas are also obtained from patients, significant others in their lives and involved health professionals. Individual un- or semi-structured interviews and/or focus groups are conducted with these persons to gather information that is sufficiently specific to provide the basis for scale items. For the "participation" example, one might solicit issues related to caring for oneself or one's home, moving around the home or the community, traveling, communicating and interacting with family, friends and colleagues, and engaging in work and leisure activities as well as about factors in the social or physical environments that impede or assist "participation". The overall goal of this step is to amass a large number of ideas that will form the basis for writing many items that represent the construct of interest.
The goal of this step is to select the items that are the most suitable for the scale. Knowledgeable individuals who, preferably, have not been involved in their creation must carefully review all items. Each item is initially judged qualitatively to eliminate those that are unclear. Decisions are then made about the response options and the time frame of the questions. For example, the time frame may refer to today, the past week, etc. and the response options may be dichotomous (yes/no), may contain several ordinal categories or may be a visual analog scale (VAS). A categorical scale needs descriptors for each choice and a VAS needs descriptive anchors at each end.
The items are then administered to a convenience sample of approximately 100 people who reflect the characteristics of the individuals who will be using the scale (Juniper et al., 1996). If possible, a subset of this group (30-40 people) can be asked to complete the items a second time. These data provide information useful for further item reduction. Item frequency distributions tell us about the spread of responses across the sample, and provide the first hint as to whether or not the item will be responsive. Items that are highly correlated with another item may be redundant, and items with very low correlations to other items may not belong in the scale. Items with similar means and standard deviations (SD) can be summed to provide subscale or total scores (Likert, 1952). Items with missing data may indicate that the content was unclear or offensive to the respondent. Item reliabilities may also be estimated using data from the group that completed the items twice. Kappa statistics are often used for this purpose (Cohen, 1960). If criteria for excluding items are set a priori, these data allow a quantitative approach to reducing the number of items.
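To make the item-reliability step concrete, the sketch below computes Cohen's kappa for one dichotomous item completed twice by the same respondents. The data and function name are hypothetical illustrations, not part of any published protocol; the formula is the standard chance-corrected agreement statistic (Cohen, 1960).

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa: agreement between two administrations, corrected for chance."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)
    # chance agreement from the marginal proportion of each category
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# hypothetical test-retest responses to one yes/no item
time1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
time2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(time1, time2)   # 0.5 for these data
```

An a priori criterion (e.g. retain items with kappa above 0.6) would then drive the quantitative item reduction described above.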
A pre-test on a small sample that reflects the characteristics of subjects in the target population (Juniper et al., 1996) is conducted to reduce problems related to understanding the content, or to the format of the measure. It should be completed by five to ten subjects using the same mode of administration as intended for routine use. Following completion of the scale, subjects should be debriefed. It is currently advocated that each item be examined using cognitive testing procedures (Collins, 2003). This process ensures that respondents really understand the item and that all respondents understand it in the same way. This process may need to be repeated until no further changes are necessary.
Testing the reliability, validity and responsiveness of the new scale
The measurement properties of the scale need to be examined on a new sample of subjects with stroke.
Reliability reflects the degree to which a measure is free from random error, or in other words, the degree to which the observed score is different from the true score (Scientific Advisory Committee, 2002). Traditionally, reliability has been categorized as either reflecting internal consistency or stability. Internal consistency denotes precision, how well the items are inter-correlated or measure the same characteristic. To examine internal consistency, one administration of the scale to an appropriate sample of subjects is needed. Coefficient alpha (Cronbach, 1951) is the test statistic most often used. It provides an estimate of the amount of error that is due to the sampling of items, by assessing the correlation between the scale and another hypothetical scale selected from the same group of items at the same time.
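Coefficient alpha can be computed directly from an item-score matrix as alpha = k/(k−1) × (1 − Σ item variances / variance of the total score). The sketch below uses hypothetical responses to a three-item subscale; it is a minimal illustration of Cronbach's (1951) formula, not a reproduction of any data from this chapter.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an (n_subjects x k_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# hypothetical responses of five subjects to a three-item subscale
responses = [[2, 3, 2],
             [4, 4, 5],
             [1, 2, 1],
             [5, 5, 4],
             [3, 3, 3]]
alpha = cronbach_alpha(responses)   # about 0.95 for these data
```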
In general, stability of a scale is examined over time (test-retest and intra-rater (including the situation in which the subject self-completes the scale twice)) or between raters. Two administrations of the scale are required. The interval between administrations should be sufficiently short to minimize changes in the subjects but far enough apart to avoid the effects of learning, memory or fatigue. For intra- and inter-rater reliability tests, rater training prior to testing is usually provided. Inter-rater reliability is most easily assessed if all raters can observe the subject simultaneously but independently, for example through the use of videotapes. Scales that require interactions between the subject and the rater may not lend themselves to this approach, and multiple tests need to be conducted. In this situation issues related to the testing intervals are again important. The intra-class correlation coefficient (Cronbach, 1957) is the preferred test statistic as it assesses agreement, using estimates of variance obtained from an analysis of variance (ANOVA). Different versions, reflecting different test situations, are available (Portney and Watkins, 2000). For "participation", internal consistency and test-retest reliability would be assessed.
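One common version, ICC(2,1) (two-way random effects, absolute agreement, single rating), can be obtained directly from the ANOVA mean squares. The sketch below is a minimal illustration using a hypothetical two-rater score matrix; the function name and data are assumptions for demonstration only.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects ANOVA, absolute agreement, single rating."""
    y = np.asarray(scores, dtype=float)
    n, k = y.shape                   # subjects x raters
    grand = y.mean()
    rows = y.mean(axis=1)            # subject means
    cols = y.mean(axis=0)            # rater means
    msr = k * np.sum((rows - grand) ** 2) / (n - 1)     # between-subjects
    msc = n * np.sum((cols - grand) ** 2) / (k - 1)     # between-raters
    resid = y - rows[:, None] - cols[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))      # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# hypothetical balance-type scores from two raters for four subjects
ratings = [[41, 43], [50, 52], [33, 36], [47, 46]]
icc = icc_2_1(ratings)
```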
Reliability coefficients range from 0 to 1.00, and are interpreted by their closeness to 1.00. A coefficient of 0.85 tells us that the data contain 85% true variance and 15% error variance. Commonly accepted minimal standards for reliability coefficients are 0.70 for use with groups and 0.90 for use with individuals, as in the clinical setting. Lower coefficients yield confidence intervals that are too wide for monitoring an individual over time (Scientific Advisory Committee, 2002).
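The width of those confidence intervals follows from the standard error of measurement, SEM = SD × √(1 − r). A minimal sketch, assuming a hypothetical scale with a 10-point SD, shows why 0.70 is too imprecise for monitoring an individual:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

def ci95_halfwidth(sd, reliability):
    """Half-width of an approximate 95% confidence interval around an observed score."""
    return 1.96 * sem(sd, reliability)

# hypothetical scale with a 10-point SD
for r in (0.70, 0.90):
    print(r, round(ci95_halfwidth(10, r), 1))
# r = 0.70 gives roughly +/- 10.7 points; r = 0.90 gives roughly +/- 6.2 points
```

With r = 0.70, an observed score of 50 is only localized to roughly 39-61, clearly too wide for tracking one patient over time.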
Validity is defined as the extent to which a scale measures what it claims to measure, and it can be divided into three main types: content, criterion-related and construct (Portney and Watkins, 2000; Scientific Advisory Committee, 2002). Content validity signifies that the items contained in the measure provide a representative sample of the universe of content of the construct being measured. For instance, a measure of "participation" should contain items reflecting each of the domains listed earlier in this chapter. Content validity is usually judged by the thoroughness of the item generation and reduction processes previously described. A subjective judgment is required as there is no statistical test to assess this type of validity.
Criterion-related validity is based on the relationship between a new scale and a criterion measure, a "gold standard", which is examined at the same point in time or in the future. The most difficult aspect of this type of validation is finding a "gold standard" that reflects the same criterion and is measurable. A gold standard for "participation" is probably not available, but one might choose the reintegration to normal living index (Wood-Dauphinee et al., 1988) or the impact on participation and autonomy questionnaire (Cardol et al., 2002) as measures tapping a similar construct to evaluate concurrent-criterion validity. The new scale and one of the existing measures would be administered at the same time to an appropriate sample of community-dwelling individuals with stroke and the scores would be correlated to assess the strength of the association. Predictive-criterion validity attempts to demonstrate that a new measure can successfully predict a future criterion. For instance, a measure of "participation" could be tested in the stroke sample noted earlier to determine if it would predict institutionalization over time. This type of validation would use regression to determine if the "participation" scores obtained, for example, a month after discharge could predict the living situation over the next 2 years.
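The concurrent-validity step reduces, statistically, to correlating scores from the new scale with scores from the chosen comparison measure. The sketch below uses entirely hypothetical scores for a small community-dwelling sample:

```python
import numpy as np

# hypothetical scores: new "participation" scale and an existing index
# administered at the same time to the same stroke sample
new_scale      = [55, 62, 48, 70, 66, 40, 58, 75]
existing_index = [50, 60, 45, 72, 64, 38, 55, 78]

# Pearson correlation as the measure of strength of association
r = np.corrcoef(new_scale, existing_index)[0, 1]
```

A strong positive correlation would support, but not by itself establish, concurrent-criterion validity.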
Construct validity examines evidence that a measure performs according to theoretical expectations by examining relationships to different constructs. One might hypothesize that "participation" among community-dwelling stroke survivors was positively and moderately correlated with the ability to drive a car. It might also be negatively correlated with the absence of a significant other who was willing to plan and execute activities with the stroke survivor (convergent construct validity). One could also hypothesize that the extent of functional recovery following the stroke would be more highly correlated with "participation" than would the presence of shoulder pain (divergent construct validity). These hypotheses would be tested on appropriate samples of stroke survivors by evaluating both "participation" and the other constructs at the same point in time. Known groups, another form of construct validation, examines the performance of a new measure through the use of existing external reference groups. For example, one might expect that stroke survivors living (1) at home with a willing and able caregiver, (2) at home with paid assistance or (3) in an institution would demonstrate different levels of "participation". By collecting "participation" data on each of the groups and comparing mean scores across the groups via ANOVA one can determine if the measure can discriminate as hypothesized. Finally, because the "participation" measure is to be based on the ICF framework, validity could also be evaluated by testing hypotheses that support the conceptual framework. Environmental or personal factors that assist or limit the stroke survivor in terms of participating could be explored, as could the relationships between "activities" or "body structures and functions" and "participation".
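The known-groups comparison described above can be sketched as a one-way ANOVA across the three living situations. The scores below are hypothetical, and the F statistic is computed from first principles for illustration:

```python
import numpy as np

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA across independent groups."""
    data = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand = data.mean()
    k, n = len(groups), len(data)
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# hypothetical "participation" scores (0-100) for the three living situations
home_with_caregiver = [78, 85, 80, 90, 82]
home_with_paid_help = [65, 70, 60, 72, 68]
institution = [40, 45, 38, 50, 42]
f_stat = one_way_anova_f([home_with_caregiver, home_with_paid_help, institution])
```

An F statistic exceeding the critical value (about 3.89 for 2 and 12 degrees of freedom at the 0.05 level) would indicate that the measure discriminates among the groups as hypothesized.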
To finish this brief section on validity it is important to note that validation is never completed. One must always ask if a measure is valid for a certain population or in a specific setting. When developing a new measure most investigators select a few tests of validity according to available time and resources.
While reliability and validity are traditional psychometric properties with a long history in the social sciences, responsiveness is a relative newcomer. In fact, there has been considerable discussion as to whether it is simply another aspect of validity (Hays and Hadorn, 1992; Stratford et al., 1996; Terwee et al., 2003). Nonetheless, a taxonomy for responsiveness that incorporates three axes (who is to be studied, which scores are to be contrasted and what type of change is to be assessed) has been proposed (Beaton et al., 2001). The different types of change being assessed have served to categorize various definitions of responsiveness. These have been grouped as the ability to detect change in general, clinically important change and real change in the concept being measured (Terwee et al., 2003). For this chapter we have selected a definition from the third group, "the accurate detection of change when it has occurred" (De Bruin et al., 1997), because it encompasses all types of change (Beaton et al., 2001).
In addition to many definitions there are also many approaches to measuring responsiveness. Most depend on assessing patients longitudinally during a period of anticipated change. For example, a person with mild stroke returning home would be expected to gradually resume participation in former activities and roles. A protocol to evaluate responsiveness is strengthened when one can collect data on both a group that is expected to change and one that is not (Guyatt et al., 1987; Tuley et al., 1991; Scientific Advisory Committee, 2002). In addition, patients who perceive positive or negative changes need to be separated from those who report no change. The various statistical approaches to quantifying responsiveness, along with their strengths and weaknesses, have been extensively presented in recent literature (Husted et al., 2000; Crosby et al., 2003; Terwee et al., 2003). Examples include the use of a paired t-test to test the hypothesis that a measure has not changed over time; calculations of the effect size, standardized response mean (SRM) or Guyatt's responsiveness statistic, as well as others that are based on sample variation; and those based on measurement precision, such as the index of reliable change and the standard error of measurement. The developer needs to choose one or more approaches based on the data collection protocol.
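The two most common sample-variation statistics differ only in the denominator: the effect size divides mean change by the baseline SD, while the SRM divides it by the SD of the change scores. The sketch below illustrates both with hypothetical gait-speed data; neither the data nor the function names come from the studies cited in this chapter.

```python
import numpy as np

def effect_size(baseline, followup):
    """Cohen-style effect size: mean change divided by the baseline SD."""
    change = np.asarray(followup, float) - np.asarray(baseline, float)
    return change.mean() / np.std(baseline, ddof=1)

def standardized_response_mean(baseline, followup):
    """SRM: mean change divided by the SD of the change scores."""
    change = np.asarray(followup, float) - np.asarray(baseline, float)
    return change.mean() / change.std(ddof=1)

# hypothetical comfortable gait speeds (m/s) before and after rehabilitation
speed_t1 = [0.40, 0.50, 0.30, 0.60, 0.45]
speed_t2 = [0.60, 0.70, 0.40, 0.80, 0.70]
srm = standardized_response_mean(speed_t1, speed_t2)
```

Because the denominators differ, the two statistics computed on the same data generally yield different values, which is why they are not interchangeable for interpretation.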
Following a stroke we expect considerable recovery during the first 6 weeks (Richards et al., 1992; Skilbeck et al., 1983; Jorgensen et al., 1995), after which time the gradient flattens somewhat, but there is considerable improvement until around 6 months from the index event. At that time recovery tends to plateau, although further gains are clearly possible, particularly in the functional and social realms. For testing responsiveness of the measure of "participation" one could enroll subjects upon return to their home setting and monitor them for a defined number of months. At each evaluation point, in addition to completing the scale, patients and their caregivers should independently be asked if they are "better, the same or worse" since the previous evaluation. Given sufficient subjects, this will allow like responders to be grouped for analysis. People saying they are better form a positive response group and those who report being worse make up a deteriorating group. As noted above, there are many statistical approaches from which to choose. To date, in addition to descriptive data, the different versions of the effect size based on sample variation are most commonly found in the literature. One should be aware, however, that because of the different methods of calculation they are not interchangeable for interpretative purposes.
Interpretability means the capacity to assign a qualitative meaning to a quantitative score (Ware and Keller, 1996). As clinicians and researchers have only a short history of trying to interpret the meaning of scale scores as compared to clinical tests, contextual relationships are required to facilitate understanding. Two basic approaches have been proposed to help interpret scalar data (Lydick and Epstein, 1993). The first, a distribution-based approach, relies on the statistical distributions of the results of a study, usually an effect size as calculated from the magnitude of change and the variability of the subjects at baseline or over time. The term effect size, coined by Cohen (1988), is a standardized, unitless measure of change in a group. Limitations of this approach include differing variability across groups being tested and the meaning of the values obtained. While Cohen suggested that 0.2, 0.5 and 0.8 represent small, medium and large effects, respectively, these values are somewhat arbitrary (Guyatt et al., 2002).
The second approach, termed anchor-based, examines the relationship of the change score on the instrument being tested to that of an independent measure that is well known, associated with the measure being tested and clinically meaningful (Guyatt et al., 2002). Population norms, severity classifications, symptom scores, global ratings of change by patients or physicians, and the minimum important difference (MID) have all been used. The MID is defined as "the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive costs, a change in the patient's management" (Jaeschke et al., 1989). Investigators have proposed several methods of calculating MIDs (Jaeschke et al., 1989; Redelmeier et al., 1996; Wyrwich et al., 2002), discussions of which are beyond the scope of this chapter. It is sufficient to say that the magnitude of MIDs tends to vary across measures. On a more positive note, there is some empirical evidence that in certain situations the distribution- and anchor-based approaches provide similar information (Norman et al., 2001; Guyatt et al., 2002). Neither of these approaches is without limitations, but each provides a framework for investigation and contributes information important for understanding scale scores.
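One simple anchor-based calculation, in the spirit of Jaeschke et al. (1989), estimates the MID as the mean change score among patients who rate themselves as slightly improved on a global rating. The sketch below is a hypothetical illustration of that idea only; the data, rating categories and function name are assumptions.

```python
import numpy as np

def anchor_based_mid(change_scores, global_ratings, anchor="slightly better"):
    """Mean change among patients whose global rating matches the anchor category."""
    change = np.asarray(change_scores, dtype=float)
    ratings = np.asarray(global_ratings)
    return change[ratings == anchor].mean()

# hypothetical change scores paired with patients' global ratings of change
changes = [0, 1, 4, 5, 3, 12, 10, -2]
ratings = ["same", "same", "slightly better", "slightly better",
           "slightly better", "much better", "much better", "worse"]
mid = anchor_based_mid(changes, ratings)   # mean of 4, 5 and 3 -> 4.0
```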
For the scale developer, a group of psychometric experts (Scientific Advisory Committee, 2002) has recommended that several types of information would be useful in interpreting scale scores. If the measure is to be used with stroke survivors, possibilities include data on the relationship of scores or change scores to ratings of the MID by those with stroke, their significant others or their clinicians, information on the relationship of scores on the new measure to other measures used with people who have had a stroke, comparative data on the distribution of scores from population groups, and eventually, results from multiple studies that have used the instrument and reported findings. This will increase familiarity with the measure and thus, assist interpretation.
During the past 25 years the authors of this chapter have used these steps, with modifications, to develop and test a number of measures for use with individuals who have sustained a stroke: the reintegration to normal living index (Wood-Dauphinee et al., 1988); the balance scale (Berg et al., 1989, 1992a, b, 1995; Wood-Dauphinee et al., 1997), the stroke rehabilitation assessment of movement (STREAM) (Daley et al., 1997, 1999; Ahmed et al., 2003); a fluidity scale for evaluating the motor strategy of the rise-to-walk test (Malouin et al., 2003) and the preference-based stroke index (Poissant et al., 2003).
The prior paragraphs outlined the development of a scale based on classical test theory as advocated by many psychometricians during the last century. In the 1990s, several health researchers (Fisher, 1993; Bjorner and Ware, 1998; McHorney, 1997; Revicki and Cella, 1997) recognized the need for a new approach to creating health scales, primarily because a number of limitations to the psychometric approach were known. Foremost was the difficulty of any one scale covering the entire spectrum of a construct such as physical functioning (Revicki and Cella, 1997). Indeed, a scale with a very large number of items would be needed to encompass the entire continuum of activities for people with stroke, considering that they range from having extremely severe limitations to only minor limitations and a high level of functioning. This scale would be time consuming and perhaps distressing for persons to complete as they are faced with items far beyond or below their capabilities (Cleary, 1996; McHorney, 1997). In addition, traditional scaling techniques do not allow one to separate the properties of the items from the subject's abilities (Hambleton et al., 1991; Hays, 1998). In other words, scores of the test sample determine how much of the trait is present and this, in turn, is related to test norms (Streiner and Norman, 2003). Theoretically at least, the known psychometric properties of a test apply only to those in the testing sample. This means that if the measure is to be used with others who are different, the psychometric properties need to be reexamined. Finally, a statistical disadvantage is that only an overall reliability estimate is possible with a traditionally developed scale, but we know that measurement precision changes according to the level of the subject's ability (Hays, 1998).
To address these issues and others, statistical and technical methods, namely item response theory (IRT), computer adapted testing (CAT) and computer-assisted scale administration, have been combined to create a modern approach to scale construction and delivery. To start, all known items that assess the same underlying construct (i.e. mobility or self-care or physical activities outside the home) are collected, most often from existing measures. New items reflecting missing areas may also be written. All items undergo careful analyses to examine dimensionality and item fit, and then they are calibrated onto a common ruler to create a measure with one metric. Items are further analyzed to determine if the construct is adequately represented, both statistically and clinically (Cella et al., 2004). Additional validation, field testing and analyses lead to the creation of a "data bank" of items relating to one domain. Thus, a data bank is a group of items that is able to define and quantify one construct (Revicki and Cella, 1997). Information describing the inter-item relationships as well as the relationships of responses to the various items is available. A framework for developing an item bank has been described by Cella and colleagues (2004).
To use CAT with an individual patient, a computer algorithm is employed to select the items for the subject, according to his or her level of performance. At the beginning of the test, the subject responds to a series of items and an initial classification of performance is made. For example, a person with stroke may be classified as having a low level of functional performance. After confirming this classification in a second series of items of similar difficulty, the computer algorithm adapts the third series of items to reflect the subject's level of performance, and this process continues until his or her performance stabilizes. This means that the time to complete the test is usually shortened because after determining the general level of performance, subjects are presented only with items close to their level of ability. It also means that different subjects take different versions of the same test, but results can be compared across subjects using IRT. IRT is a statistical framework related to a subject's performance on individual items and on the test as a whole, and how this performance is associated with the abilities assessed by test items (Hambleton and Jones, 1993). It is based on two assumptions: that the items tap only one construct, and that the probability of answering in a manner that reflects an increasing amount of the trait (e.g. physical function) is unrelated to the probability of answering any other item positively for subjects with the same amount of the trait (Streiner and Norman, 2003). A subject's response to each item is described by a curve, called an item characteristic curve. This curve depicts the relationship between the actual response to the item and the underlying ability, and is thus a measure of item effectiveness. IRT places each subject on the continuum underlying the construct of interest, allowing the comparison across subjects noted previously.
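The item-selection logic of CAT can be sketched with the simplest IRT model, the one-parameter (Rasch) model, in which each item has a single difficulty parameter and the next item administered is the unused one carrying the most Fisher information at the current ability estimate. Everything below (the item bank, difficulties and ability estimate) is hypothetical and illustrates only one step of the adaptive loop, not a complete CAT engine.

```python
import math

def p_positive(theta, b):
    """1PL (Rasch) item characteristic curve: probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability level theta: p * (1 - p)."""
    p = p_positive(theta, b)
    return p * (1.0 - p)

def next_item(theta_hat, bank, used):
    """One CAT step: choose the unused item most informative at the current estimate."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, bank[i]))

# hypothetical item bank: difficulty (in logits) of five mobility items
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]
chosen = next_item(0.5, bank, used={2})   # item of difficulty 0.0 already given
```

Because information peaks where the item difficulty is closest to the subject's ability, the algorithm naturally presents only items near the subject's level, which is what shortens the test.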
Before finishing this section of this chapter, it should be noted that the review criteria (Scientific Advisory Committee, 2002), referenced frequently in the paragraphs related to classical scale development, were expanded from a previous version to reflect modern test theory principles and methods. Given the current stage of development as well as the large numbers of subjects and technical support needed to create the new scales, it is predicted that scales developed according to the methods of classical test theory will be used well beyond the next decade. Nonetheless, dynamic testing is being developed in several areas and we can expect to read more and more about it in the near future. It is important that researchers in rehabilitation are involved with research teams creating dynamic measures so that constructs useful in rehabilitation are included. Already, the US National Institutes of Health has announced two initiatives related to IRT. One is from the National Institute of Neurological Disorders and Stroke, and it is expected to create item banks for use in clinical research for people with neurologic conditions. Stroke researchers should benefit from the new technology.
1.4 Principles guiding the selection of outcomes to measure walking competency after stroke
The term walking competency (Salbach et al., 2004) is used to describe an ensemble of abilities related to walking that enables the individual to navigate the community proficiently and safely. Elements of walking competency include being able to: walk fast enough to cross the street safely (Robinett and Vondran, 1988; Perry et al., 1995), walk far enough to accomplish activities of daily living (Lerner-Frankiel et al., 1986), negotiate sidewalk curbs independently, turn the head while walking without losing balance, react to unexpected perturbations while walking without loss of stability, and demonstrate anticipatory strategies to avoid or accommodate obstacles in the travel path (Shumway-Cook and Woollacott, 1995). Thus walking competency is linked to the accomplishment of basic every-day tasks, leisure activities and participation in life.
Reliability, validity and responsiveness of gait speed
Gait speed at a comfortable pace is likely the best-known measure of walking performance (Wade, 1992; see Volume II, chapter 3). Timed walking tests (over 5, 10 or 30 m) are easy to carry out and when standardized instructions are used, the inter-rater (Holden et al., 1984; Wade et al., 1987) and test-retest reliability (Holden et al., 1984; Evans et al., 1997) of measures of walking speed in persons with stroke are high. The construct validity of walking speed is also very good. In persons with stroke, comfortable walking speed has been shown to be positively correlated to strength (r = 0.25-0.67) of the lower extremity (Bohannon, 1986; Bohannon and Walsh, 1992), to balance (r = 0.60; Richards et al., 1995), to motor recovery (r = 0.62; Brandstater et al., 1983), to functional mobility (r = 0.61; Podsiadlo and Richardson, 1991), and negatively correlated to spasticity of the lower extremity (Norton et al., 1975; Lamontagne et al., 2001; Hsu et al., 2003). Moreover, subjects who walk faster tend to have a better walking pattern (Wade et al., 1987; Richards et al., 1995).
Thus, in terms of reliability and validity, the 10-m walk (10mW) test at natural or free pace is a very good measure, but is it always the most responsive measure? For example, should maximum gait speed also be tested to assess the capacity of persons with stroke to produce a burst of speed to, say, cross a busy street (Nakamura et al., 1988; Suzuki et al., 1990; Bohannon and Walsh, 1992)? Others believe measuring gait speed over 5 m is enough. For the sake of argument, let us define "measure of choice" as the measure that is most responsive to change as determined by the SRM. Salbach et al. (2001) examined the responsiveness of four different timed gait tests in 50 persons, tested an average of 8 and 38 days after stroke. They found the 5-m walk (5mW) test at comfortable pace was most responsive, followed by the 5mW at maximum pace, the 10mW at comfortable pace and the 10mW at maximum pace. Responsiveness of a measure of physical performance cannot be generalized, however, because it is related to stroke severity. Table 1.1 compares data from two studies, one with subjects in the early (Salbach et al., 2001) and the other in a sub-acute phase (Richards et al., 2004) post-stroke. It gives estimates of the magnitude of change that can be expected in clinical measures over 8 weeks.
As demonstrated by the Salbach et al. data, the 5mW was more responsive than the timed up and go (TUG) (Podsiadlo and Richardson, 1991) or the balance scale (Berg et al., 1989, 1992a, b, 1995). On the other hand, in the Richards et al. data, the Barthel index (Mahoney and Barthel, 1965) ambulation subscore was the most sensitive measure, followed by the balance scale and the TUG. In this group of subjects, walking speed was less sensitive, likely due to the more severe disability as indicated by the walking speed and TUG time. The balance scale scores, however, are comparable between groups at both evaluations and the SRMs indicate that it is the second most responsive measure for both groups.
To further examine the relation between stroke severity, as gauged by walking speed, and responsiveness, data from the sub-acute stroke subjects in the Richards et al. (2004) study were subdivided
Table 1.1. Magnitude of change over 8-week period in persons with acute and sub-acute stroke.
Column headings: Measure (max. score); for the acute group (Salbach et al., 2001): Baseline, Post-therapy, SRM; for the sub-acute group (Richards et al., 2004): Baseline, Post-therapy, Change, Change (%), SRM. (Table body not reproduced.)