MedPharmRes
Journal of the University of Medicine and Pharmacy at Ho Chi Minh City
MedPharmRes, 2018, Vol. 2, No. 1
© 2018 MedPharmRes
Original article
Inter-rater reliability of a professionalism OSCE developed in family
medicine training, University of Medicine and Pharmacy
Pham Duong Uyen Binha*, Pham Le Ana, Tran Diep Tuana, Jimmie Leppinkb
aUniversity of Medicine and Pharmacy at Ho Chi Minh City, Vietnam;
bSchool of Health Professions Education, Maastricht University, the Netherlands.
Received January; accepted April; published online April.
Abstract: A POSCE was developed and administered in 2015 to assess six professional attributes of Family Medicine (FM) residents at the University of Medicine and Pharmacy (UMP), Vietnam. This study explores the inter-rater reliability of the FM POSCE developed in this context when analytic rubrics were applied. Background: Past POSCEs showed rater variability when global marking items and holistic rating were applied. Analytic rubrics, unlike the holistic type, provide more rationale for assigning a given score and might therefore influence rater variability. Nonetheless, little is known about the extent to which switching to this rubric type might influence the inter-rater reliability of a POSCE. Methods: Before the FM professionalism module (pretest) and after this module (posttest), 36 and 42 FM residents took the POSCE, respectively. The raters in the pretest were 12 teachers of the FM training center. Four faculty members from other faculties were added late to the posttest, alongside the 12 original raters. Rater training occurred at two different times: the first session, before the pretest, involved only the 12 FM raters; the second, before the posttest, involved only the 4 late-recruited raters. During the POSCE, one pair of raters observed all performances at each station. Inter-rater reliability was measured by the differences in total scores between the raters of each pair, using the paired t-test and the Pearson correlation coefficient. Results: In the POSCE pretest, no significant difference was found between raters' scores in most pairs of raters, in contrast with the posttest. Most differences were noticed in pairs in which one of the raters was late-recruited. In the pretest, moderate to strong positive correlations between raters' mean scores were found (r = 0.55-0.85); a similar range was seen in the posttest (r = 0.47-0.87), although the correlations slightly weakened. Discussion and conclusion: The FM POSCE has high inter-rater reliability when analytic grading rubrics are used. An analytic rubric might help minimize discrepancies among raters. Moreover, rater training might have been another influential factor in the raters' consensus.
Key words: professionalism, OSCE, inter-rater reliability, analytic rubrics, raters' training.
1. INTRODUCTION
For the last three decades, Objective Structured Clinical Examinations (OSCEs) have been used to assess clinical competence, medical knowledge, interpersonal and communication skills, and professionalism in health professions education. Despite the apparent advantages of the professionalism OSCE (POSCE) over self-reported questionnaires and work-based assessments, the psychometrics of this standardized exam have been an emerging topic in the literature. The reliability of the assessment is crucial, particularly when the aim of the POSCE is to provide the rationale for judging medical novices' professionalism, as is often the case in medical school assessments [2].
In particular, some data on the reliability of POSCEs that determine residents' acquisition of professionalism have been reported in several studies [4, 9]. Inter-rater reliability is one of the most examined estimates of the reliability of the
*Address correspondence to Pham Duong Uyen Binh at
University of Medicine and Pharmacy at Ho Chi Minh City, Vietnam;
E-mail: binhpham2599@gmail.com
DOI: 10.32895/UMP.MPR.2.1.20
POSCE, as making inferences from performance ratings requires the management of rater effects [1]. Findings from past POSCEs showed inter-grader variability among different raters when grading the same professional behaviors. Nonetheless, little is known about whether a POSCE developed in Vietnam yields acceptable inter-rater reliability or fewer differences among raters. Therefore, the aim of this study was to investigate the inter-rater reliability of a POSCE developed in the context of FM training in Vietnam.
2. METHOD
The POSCE
The POSCE was developed and conducted at the Training Center of Family Medicine, University of Medicine and Pharmacy (UMP), Ho Chi Minh City. It was administered at two different times: at the end of September 2015, before the module of Counseling and Professionalism, and at the beginning of November 2015, in the FM orientation course.
Examiners
Only faculty raters were recruited for the POSCE. In the pretest, 12 faculty members who were teaching faculty, holding both an MD and an MSc in the field of FM with at least … years of clinical practice and teaching, were invited. In the posttest, the 12 pretest raters and 4 late-recruited raters from the unit of Preventive Medicine (UMP) were invited. None of the raters had prior experience in rating a professionalism OSCE.
Rater training occurred at two different times: the first session took place before the pretest, only for the 12 FM raters; the second took place before the posttest, for the 4 late-recruited raters.
Examination procedure
All candidates rotated through six stations. At each station, FM residents interacted with a Standardized Patient (SP) who portrayed a scripted ailment in a specific scenario. Two raters were assigned to grade performances at each station. During the encounter, raters completed an evaluation form containing marking items and the 3-point rubrics pertaining to those items. The grading rubric comprised 3 anchors: 2 - meets standard; 1 - borderline; 0 - below standard. Behavioral descriptors were provided for each anchor of the rubrics.
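The evaluation form described above can be sketched as a small data structure: each marking item receives one of the three anchors (0, 1, or 2), and a rater's station total is the sum over items. The item names below are purely illustrative, not the actual POSCE marking items.

```python
# The 3-point analytic rubric anchors described in the text.
RUBRIC_ANCHORS = {2: "meets standard", 1: "borderline", 0: "below standard"}

def score_station(item_scores):
    """Sum one rater's item scores for a station, validating each anchor."""
    for item, score in item_scores.items():
        if score not in RUBRIC_ANCHORS:
            raise ValueError(f"{item}: {score} is not a valid rubric anchor")
    return sum(item_scores.values())

# Hypothetical marking items for a confidentiality scenario.
ratings = {
    "greets the patient": 2,
    "explains confidentiality limits": 1,
    "maintains privacy during the encounter": 2,
}
print(score_station(ratings))  # → 5
```

With two raters per station completing the same form, the per-pair totals produced this way are the inputs to the reliability analyses reported later.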
Examiners’ training
Raters’ training was provided before pretest and posttest
consisting of four steps as follows.
Steps 1 and 2: Overview and briefing
The raters viewed all scenarios and the scoring rubrics before the training sessions. The author of the cases and a content expert (the Director of the FM training center) clarified any details of the cases, item lists, and the analytic rubrics.
Step 3: Familiarization with scoring criteria at each mastery level
Each group of raters participated in six one-hour training sessions, one for each scenario. In each session, raters used the scoring rubrics to rate performances in three randomly shown video clips. These clips intentionally demonstrated performances at three different mastery levels for each case, which was unknown to the raters.
Step 4: Discussion
After completing their scoring, the raters compared their scores with the others', item by item. Differences in the score assigned to each item for the same encounter were discussed, as were differences between the raters' and the expert's scores for the same video clip. This enabled examiners to achieve consensus on what constituted a below-standard, borderline, or meets-standard performance of certain behaviors.
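The item-by-item comparison in Step 4 can be sketched as a small helper that surfaces exactly which items two raters disagreed on for the same encounter, so the discussion can focus on those items. The item names are illustrative assumptions, not the study's actual marking items.

```python
def disagreements(rater_a, rater_b):
    """Return the items where two raters assigned different rubric anchors,
    mapped to the pair of anchors (rater_a's score, rater_b's score)."""
    return {item: (rater_a[item], rater_b[item])
            for item in rater_a
            if rater_a[item] != rater_b[item]}

# Two raters' item scores for the same hypothetical encounter.
a = {"opens the discussion": 2, "shows empathy": 1, "admits limitation": 0}
b = {"opens the discussion": 2, "shows empathy": 2, "admits limitation": 0}
print(disagreements(a, b))  # → {'shows empathy': (1, 2)}
```

Flagged items like these, together with the expert's reference scores, are what the training discussion would then resolve.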
Statistical analysis
Descriptive statistics of the scores given by the raters were calculated using SPSS 20. Inter-rater reliability was measured by the differences in mean scores between raters, using the paired t-test, and inter-rater agreement was measured using the Pearson correlation coefficient.
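As a minimal sketch of the two statistics the study computed in SPSS, the paired t statistic and the Pearson correlation between two raters' total scores can be written with the standard library alone. The score lists are invented for illustration; SPSS (or scipy.stats) would additionally report the p-values shown in the tables.

```python
import math

def paired_t_statistic(x, y):
    """t statistic of the paired t-test on two raters' total scores."""
    d = [a - b for a, b in zip(x, y)]       # per-candidate score differences
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

def pearson_r(x, y):
    """Pearson correlation coefficient between two raters' scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative station totals for one pair of raters over six candidates.
rater1 = [10, 8, 9, 11, 7, 10]
rater2 = [9, 8, 10, 11, 6, 9]
print(round(paired_t_statistic(rater1, rater2), 2))  # → 1.0
print(round(pearson_r(rater1, rater2), 2))           # → 0.88
```

A t statistic near zero with high r, as in this toy pair, is the pattern the study reads as good inter-rater reliability: no systematic severity difference and strong rank agreement.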
Ethical statements
Informed consent
All participants were informed that their results would be analyzed for an evaluation study. They were also assured that their identities would be kept confidential. All participants gave verbal consent to join the study; their approval was recorded on the exam days by having them sign the registration sheet before the exam.
3. RESULTS
Table 1 portrays the values of the paired-sample t-test for each pair of raters' total scores assigned to one performance in the pretest and posttest. No significant difference was found between raters' mean scores in most pairs of raters in the OSCE pretest. Significant differences were found mostly in scoring the scenarios of "Keeping confidentiality", "Breaking bad news", "Altruism", and "Self-awareness of limitation". However, in the OSCE posttest, differences in mean scores between raters were found in eight out of twelve pairs. Notably, raters' differences occurred in all scenarios.
Table 1: Results of the paired-sample t-test for raters' total scores in pretest and posttest

| Pair | Raters | Scenario | Pretest t | df | Sig. (2-tailed) | Posttest t | df |
|---|---|---|---|---|---|---|---|
| 1 | rater1scen1 - rater2scen1 | Keeping confidentiality | 0.08 | 18 | 0.93 | -1.31 | 14 |
| 2 | rater3scen1 - rater4scen1 | Keeping confidentiality | -8.45 | 23 | <0.001 | -2.85 | 15 |
| 3 | rater1scen2 - rater2scen2 | Responsibility in community | -3.41 | 23 | <0.001 | -0.93 | 16 |
| 4 | rater3scen2 - rater4scen2 | Responsibility in community | -1.47 | 18 | 0.15 | 1.38 | 12 |
| 5 | rater1scen3 - rater2scen3 | Disclosing medical errors | 0.29 | 23 | 0.77 | 1.05 | 16 |
| 6 | rater3scen3 - rater4scen3 | Disclosing medical errors | -7.68 | 17 | <0.001 | 1.53 | 15 |
| 7 | rater1scen4 - rater2scen4 | Breaking bad news | 6.42 | 19 | <0.001 | -3.75 | 15 |
| 8 | rater3scen4 - rater4scen4 | Breaking bad news | 6.52 | 11 | <0.001 | 4.46 | 16 |
| 9 | rater1scen5 - rater2scen5 | Making altruistic decision | 7.54 | 23 | <0.001 | 0 | 14 |
| 10 | rater3scen5 - rater4scen5 | Making altruistic decision | -1.84 | 15 | 0.09 | 4.46 | 16 |
| 11 | rater1scen6 - rater2scen6 | Admitting limitation | -6.28 | 23 | <0.001 | 4.17 | 19 |
| 12 | rater3scen6 - rater4scen6 | Admitting limitation | -8.51 | 17 | <0.001 | 1.68 | 14 |
Table 2: Paired-sample correlations in pretest and posttest

| Pair | Raters | Pretest r | Sig. | Posttest r | Sig. |
|---|---|---|---|---|---|
| 1 | rater1scen1 & rater2scen1 | 0.69 | 0.004 | 0.65 | 0.002 |
| 2 | rater3scen1 & rater4scen1 | 0.55 | 0.03 | 0.05 | 0.78 |
| 3 | rater1scen2 & rater2scen2 | 0.57 | 0.02 | 0.75 | 0.00 |
| 4 | rater3scen2 & rater4scen2 | 0.79 | 0.00 | 0.82 | 0.00 |
| 5 | rater1scen3 & rater2scen3 | 0.78 | 0.00 | 0.87 | 0.00 |
| 6 | rater3scen3 & rater4scen3 | 0.81 | 0.00 | 0.67 | 0.002 |
| 7 | rater1scen4 & rater2scen4 | 0.84 | 0.00 | 0.68 | 0.001 |
| 8 | rater3scen4 & rater4scen4 | 0.85 | 0.00 | 0.75 | 0.005 |
| 9 | rater1scen5 & rater2scen5 | 0.81 | 0.00 | 0.59 | 0.003 |
| 10 | rater3scen5 & rater4scen5 | 0.85 | 0.00 | 0.80 | 0.00 |
| 11 | rater1scen6 & rater2scen6 | 0.82 | 0.00 | 0.45 | 0.03 |
| 12 | rater3scen6 & rater4scen6 | 0.81 | 0.00 | 0.65 | 0.00 |
Table 2 presents the correlations between raters' total scores. Moderate to strong positive correlations between raters' mean scores were found in the pretest: mean scores in most pairs were strongly correlated, except in pairs two and three, where the correlation was positive but moderate. In the posttest, mean scores in most pairs remained strongly correlated; however, a very weak correlation was found between raters' mean scores in pair two.
4. DISCUSSION
We found a strong consistency in grading and correlation between raters' scores for residents' performances in the POSCE. This suggests that the POSCE is able to consistently measure candidates' professional behaviors across different raters.
The finding from this study implies the important role of analytic rubrics in achieving high consensus among raters. When using holistic rubrics, raters are believed to use their intuition to rapidly decide which category a performance falls into [6]. However, raters still analyzed what they had observed and later applied their personal experiences when assigning scores, since a holistic rubric provides raters with few cues as to what constitutes a professional behavior [4]. This might increase subjectiveness and thus cause more differences among raters in the evaluation of professional behaviors [4]. Therefore, the analytic rubrics, comprising case-relevant marking items and behavioral descriptors that guide the raters' judgment, might have lessened the raters' bias and improved inter-rater agreement.
A lack of consensus among raters might reduce their consistency in assigning scores [8]. This study supports that argument: most differences in total scores occurred in pairs in which one rater was late-recruited for the POSCE posttest. Prior to the posttest, only these raters were involved in the training. The lack of a joint discussion between the original and the later raters on how to assign scores might have caused the gaps, despite similar training on grading professional behaviors.
This study suggests that analytic rubrics, together with several features of rater training, might improve raters' consistency. First, a practical session should be included, in which raters are exposed to samples of candidates' real performances to practice rating with the rubrics. Video clips can be an effective means for practice if they clearly demonstrate performances at each mastery level on the rubrics. At the end of the practice session, it is essential for all raters to compare their scores with the others' for the same encounters and openly discuss the reasons for any discrepancies in scoring [8]. This can trigger a consensus-reaching process, which is valuable in bridging the gaps between raters.
Nonetheless, this is a cross-sectional study: it is impossible to conclude to what extent the abovementioned factors influenced the inter-rater reliability. Moreover, there might be other factors that affect inter-rater reliability, such as raters' professional backgrounds. Therefore, future studies should investigate multiple factors and the extent to which they affect raters' consensus in rating in a POSCE. Understanding these factors would help better manage rater effects in the POSCE and other performance-based assessments of professionalism.
5. CONCLUSION
The FM POSCE is able to consistently measure candidates' professional behaviors across different raters. Using analytic rubrics, together with features of rater training that facilitate raters' practice of rating and discussion of discrepancies in scoring, might help improve inter-rater reliability.
REFERENCES
1. I. Bartman, S. Smee, M. Roy. A method for identifying extreme OSCE
examiners. The clinical teacher. 2013;10(1):27-31.
2. M. T. Brannick, H. T. Erol-Korkmaz, M. Prewett. A systematic review of the reliability of objective structured clinical examination scores. Medical education. 2011;45(12):1181-9.
3. W. C. Husser. Medical professionalism in the new millennium: A physician charter. Journal of the American College of Surgeons. 2003;196(1):115-8.
4. K. M. Mazor, M. L. Zanetti, E. J. Alper, D. Hatem, S. V. Barrett, V. Meterko, et al. Assessing professionalism in the context of an objective structured clinical examination: an in-depth study of the rating process. Medical education. 2007;41(4):331-40.
5. V. T. Nhan, C. Violato, P. Le An, T. N. Beran. Cross-Cultural Construct
Validity Study of Professionalism of Vietnamese Medical Students.
Teaching and learning in medicine. 2014;26(1):72-80.
6. M. J. Peeters. Measuring rater judgments within learning assessments—
Part 2: A mixed approach to creating rubrics. Currents in Pharmacy
Teaching and Learning. 2015;7(5):662-8.
7. G. Ponnamperuma, J. Ker, M. Davis. Medical professionalism: teaching, learning and assessment. South-East Asian Journal of Medical Education. 2012;1(1):42-8.
8. E. Schwartzman, D. I. Hsu, A. V. Law, E. P. Chung. Assessment of
patient communication skills during OSCE: Examining effectiveness
of a training program in minimizing inter-grader variability. Patient
education and counseling. 2011;83(3):472-7.
9. Y. Y. Yang, F. Y. Lee, H. C. Hsu, W. S. Lee, C. L. Chuang, C. C. Chang, et al. Validation of the behavior- and concept-based assessment of professionalism competence in postgraduate first-year residents. Journal of the Chinese Medical Association. 2013;76(4):186-94.
Figure 1: Raters’ training process