Abstract
Notwithstanding the broad utility of COPs (classroom observation protocols), there has been limited documentation of the psychometric properties of even the most popular COPs. This study attempted to fill this void by closely examining the item- and domain-level IRR (inter-rater reliability) of a COP used in a federally funded striving readers program. Reliability measures (e.g., joint probability of agreement, Cohen's kappa, polychoric correlation, and intra-class correlation coefficients) were selected according to the scale of each item set. Results indicate that most items in physical environment, cognitive demand, and student engagement can be assessed with moderate reliability, whereas items in classroom climate and instructional modes yielded mixed estimates. Recommendations are provided for improving similar instruments.
Keywords: COP (classroom observation protocol), IRR (inter-rater reliability), adolescent literacy, program evaluation
Introduction
A COP (classroom observation protocol) is an instrument used to assess the quality of teaching and learning in the classroom, to identify how well resources and the learning environment contribute to learning, and to suggest areas for improvement and development. Notwithstanding the broad potential utility of COPs, it does not suffice that such instruments are merely consistent internally or over reasonable periods of time. Rather, for COPs to be useful in evaluating teacher PD (professional development), observers of the same class session must concur substantially on the degree to which the instructor's classroom behaviors, methods, and modes of interaction with students conform to a preexisting concept of good teaching. In other words, observation protocols whose ratings are idiosyncratic to the observer rather than the instructor can be limited and misleading for evaluation purposes. Unfortunately, there has been very limited documentation of the psychometric properties of even the most popular COPs currently used in evaluations of instructional and teacher PD programs nationwide. In particular, there is little consensus about which statistical measures are best suited to analyzing the IRR (inter-rater reliability) of this type of instrument. As funders demand more rigorous evaluations of government-funded educational programs and interventions, one way evaluators can meet these demands is by using the most appropriate statistical measures for estimating the psychometric properties of specific protocols. The present study addresses this issue through a critical appraisal of a COP used in a federally funded striving readers program. The COP was developed to inform implementation fidelity ratings of a school-wide PD model designed to support middle school content-area teachers' implementation of literacy strategies in ways that support the academic achievement of students attending high-poverty urban middle schools.
Background
The SRP (Striving Readers Project) under study, situated in a large, high-poverty urban school district in the South, is one of eight programs sponsored by the US Department of Education to address the needs of struggling adolescent readers; it includes school-wide and targeted interventions plus rigorous evaluations of each component. The CLA (Content Literacy Academy), a school-wide PD model for content-area teachers, provided 180 hours of intensive training over two years to increase teachers' knowledge about, and use of, research-based reading strategies to improve students' achievement in reading and the core content areas (mathematics, science, social studies, and English language arts), especially for students attending high-poverty urban middle schools. The SR-COP (Striving Readers Classroom Observation Protocol) was developed by Research for Better Schools, a non-profit research and development organization in Philadelphia, to record and rate observations of Striving Readers classroom lessons as part of the evaluation of the CLA. The instrument was adapted from the CETP (Collaborative for Excellence in Teacher Preparation) classroom observation tool developed by Lawrenz, Huffman, and Appeldoorn (2002) at the University of Minnesota.
The COP items were organized into seven domains:
(1) Physical environment;
(2) Materials/technology;
(3) Classroom climate;
(4) Instructional modes;
(5) Literacy strategies;
(6) Cognitive demand;
(7) Level of student engagement.
When inter-rater agreement is low, there are usually two explanations: (1) the scale is defective; or (2) raters need to be retrained on the rating criteria. One of the main challenges in estimating reliability for the SR-COP is that items in different domains are scaled differently. For physical environment, all five items are rated on a 1-4 Likert scale. For materials/technology, there are 12 dichotomously scaled (Yes/No) items. For classroom climate, there are six categorical items rated 1, 2, 3, 4, or DK (do not know). For instructional modes, literacy strategies, cognitive demand, and level of student engagement, observers indicated the use of specific modes of instruction, literacy strategies, and levels of cognitive demand and student engagement in each of the four 10-minute intervals of the class, transcribing detailed field notes that were then used to complete the SR-COP data matrix. Except for cognitive demand and student engagement, observers could choose more than one strategy (each with an associated code) to describe instruction in each interval.
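Because no single statistic fits all of these scales, a concrete illustration may help. The following minimal sketch (Python; the ratings are hypothetical and not actual SR-COP data) computes two of the measures applied to the dichotomous items: the joint probability of agreement and Cohen's (1960) kappa, which corrects agreement for chance.

from collections import Counter

def percent_agreement(r1, r2):
    """Joint probability of agreement: share of items both raters coded identically."""
    assert len(r1) == len(r2)
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's (1960) kappa for nominal codes: observed agreement corrected for chance."""
    n = len(r1)
    po = percent_agreement(r1, r2)  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal code frequencies
    pe = sum((c1[c] / n) * (c2[c] / n) for c in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

# Hypothetical Yes/No codes from one observer pair on the 12
# materials/technology items of a single observed lesson.
rater_a = ["Y", "Y", "N", "Y", "N", "N", "Y", "Y", "N", "Y", "N", "Y"]
rater_b = ["Y", "Y", "N", "Y", "Y", "N", "Y", "N", "N", "Y", "N", "Y"]

print(f"agreement = {percent_agreement(rater_a, rater_b):.2f}")
print(f"kappa     = {cohens_kappa(rater_a, rater_b):.2f}")

On these toy codes the raters agree on 10 of 12 items (0.83), yet kappa falls to about 0.66 once chance agreement is removed, which is why raw agreement alone can flatter an instrument with few response categories.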
Methods
The SR-COP was used by 10 pairs of evaluators to collect data about classroom implementation related
acceptable psychometric properties that are used in instructional and PD program evaluation, we will leave the important questions about what works to the ill-informed advocates or opponents of education reform. It is a privilege to initiate studies of this kind to ensure that high-quality process and outcome measures are applied in government-funded evaluation projects that are intended to help the public make wise decisions.
References
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Crewson, P. E. (2001). A correction for unbalanced kappa tables. SUGI (SAS Users Group International) Paper 194-26. Retrieved July 17, 2008, from http://www2.sas.com/proceedings/sugi26/p194-26.pdf
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: John Wiley.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619.
Fleiss, J. L., & Davies, M. (1982). Jackknifing functions of multinomial frequencies, with an application to a measure of concordance. American Journal of Epidemiology, 115, 841-845.
Kendall, M. G. (1948). Rank correlation methods. London: Charles Griffin & Company Limited.
Lawrenz, F., Huffman, D., & Appeldoorn, K. (2002). Classroom observation videotape guide. Minneapolis, MN: College of Education and Human Development, University of Minnesota.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Walter, S. D., Eliasziw, M., & Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine, 17, 101-110.
Appendix A
Table A1
A Comparison of Various Measures of IRR
IRR measure | Description | Scale of items | Pros | Cons
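As a concrete companion to this comparison, the sketch below (Python; the ratings are hypothetical, not drawn from SR-COP data) implements one of the listed measures, the ICC formulated as ICC(2,1) in Shrout and Fleiss (1979): a two-way random-effects, absolute-agreement, single-rater coefficient suited to the 1-4 Likert items.

def icc_2_1(data):
    """ICC(2,1) of Shrout & Fleiss (1979): two-way random effects,
    absolute agreement, single-rater reliability.
    data: one row per rated target, one column per rater."""
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ms_r = ss_rows / (n - 1)                                 # targets mean square
    ms_c = ss_cols / (k - 1)                                 # raters mean square
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical 1-4 Likert ratings of one item across five observed
# lessons (rows); columns are the two observers in a pair.
ratings = [[3, 3], [4, 3], [2, 2], [4, 4], [1, 2]]
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")

Here the two observers never differ by more than one scale point across the five lessons, and the resulting ICC of roughly 0.83 would conventionally be read as good single-rater reliability.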