Evaluating the Reliability and Validity of an English Achievement Test for Third-Year Non-major students at the University of Technology, Ho Chi Minh National University and some suggestions for changes

Rationale for choosing this topic

English has come to play a particularly important role in the rapid development of science, technology and international relations, which has resulted in a growing need for English language learning and teaching in many parts of the world. English has become a compulsory subject in national education in many countries, and Vietnam has considered the learning and teaching of English a major strategic tool for developing human resources and keeping up with other countries. Therefore, at every level of education, from primary school to university and postgraduate study, learners must learn or want to learn English, either as a compulsory subject or as a means of accessing information technology and finding a good job. English teaching and learning is thus essential for job training.

Fully aware of the importance of the English language, the University of Technology, Ho Chi Minh National University has encouraged and required its students to learn it as a compulsory subject during the first three academic years. English has therefore been taught at the University of Technology since it was established, with the aim of equipping students with an essential tool for engaging with the wider world. However, little attention is paid to evaluating what students actually acquire when they learn a foreign language, how well they use what they have been taught, and what level of English they have reached. The evaluation only counts the percentage of students who pass the English tests and therefore says nothing about the validity, reliability or discrimination of the tests; the results of the English tests are not fully or effectively used. In addition, during my time as a teacher of English at the University of Technology, I have heard teachers and learners complain about the English achievement test in terms of its content and structure. As a result, the English section has decided to renew the item bank in order to make it more valid and more reliable.

For these reasons, the author was encouraged to undertake this study, entitled "Evaluating the Reliability and Validity of an English Achievement Test for Third-year Non-major students at the University of Technology, Ho Chi Minh National University and some suggestions for changes", with the intention of finding out how valid and reliable the test is. More importantly, the writer hopes that the results of the study can then be applied to improve the current testing practice and to create a genuinely reliable item bank. The study is also intended to encourage both teachers and learners in their teaching and learning.

Design of the study

The thesis is organized into four major chapters. Chapter 1, Introduction, presents basic information: the rationale, the aims, the method, the research questions and the design of the study. Chapter 2, Literature Review, reviews the theoretical background for evaluating a test, covering language testing, the criteria of good tests, theoretical ideas on test reliability and validity, and achievement tests. Chapter 3, The Study, is the main part of the thesis; it describes the context of the study and presents the detailed results obtained from the collected tests and the findings in response to the research questions. Chapter 4, Conclusion, offers conclusions and practical implications for improving the test. In this part, the author also proposes some suggestions for further research on the topic.

…that a score in a language test reflects communicative language ability. However, test scores are also affected by factors other than communicative language ability, namely:

Test method facets: these are systematic to the extent that they are uniform from one test administration to another (Appendix 1).

Personal attributes: these include individual characteristics such as cognitive style and knowledge of particular content areas, as well as group characteristics such as sex, race and ethnic background. They are also systematic.

Random factors: these are unsystematic factors, including unpredictable and largely temporary conditions such as the test-taker's mental alertness or emotional state.

Thus, a test is considered reliable if it meets the following conditions: the results obtained by the same candidate at two different times are consistent; candidates are not allowed too much freedom; clear and explicit instructions are provided; two or three administrators award the same test scores; and the test results measure the learners' true ability.

The reliability of a test is indicated by the reliability coefficient, which is calculated by the following formula:

(1) Rt = [N / (N − 1)] × [1 − X(N − X) / (N × SD²)] (Henning, 1987)

(in which Rt: reliability coefficient, N: number of items, X: mean of all scores, SD: standard deviation of the test scores)

Rt is expressed as a number ranging between 0 and 1.00, with r = 0 revealing no reliability and r = 1.00 indicating perfect reliability. An acceptable reliability coefficient should not be below 0.90; a lower value indicates inadequate reliability. For instance, r = 0.90 on a test means that 90% of the test score is accurate while the remaining 10% consists of measurement error. If r = 0.60, only 60% of the test score is reliable and the other 40% may be due to error. Thus, the higher the reliability coefficient, the lower the standard error; and the lower the standard error, the more reliable the test scores.

Types of reliability estimates

According to Henning (1987), there are several types of reliability estimates, each influenced by different sources of measurement error, which may arise from bias in item selection, bias due to the time of testing, or examiner bias. These three major sources of bias are addressed by corresponding methods of reliability estimation:
a. Selection of specific items: Parallel Form Reliability; Internal Consistency Reliability estimates (Split-Half Reliability); Rational Equivalence
b. Time of testing: Test-Retest Method
c. Examiner bias: Inter-Rater Reliability

Parallel form reliability indicates how consistent test scores are likely to be if a person takes two or more forms of a test. A high parallel form reliability coefficient indicates that the different forms of the test are very similar, which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably. The formula for this method may be expressed as follows:

(2) Rtt = rA,B (Henning, 1987)

(in which Rtt: reliability coefficient; rA,B: correlation of form A with form B of the test when administered to the same people at the same time)

Internal consistency reliability indicates the extent to which items on a test measure the same thing.
A high internal consistency reliability coefficient for a test indicates that the items of the test are very similar to each other in content. It is important to note that the length of a test can affect internal consistency reliability.

Split-half reliability is one variety of the internal consistency methods. The test may be split in a variety of ways; the two halves are then scored separately and correlated with each other. The formula for the split-half method, with the Spearman-Brown correction, may be expressed as follows:

(3) Rtt = 2rA,B / (1 + rA,B) (Henning, 1987)

(in which Rtt: reliability estimated by the split-half method; rA,B: the correlation of the scores from one half of the test with those from the other half)

Rational equivalence is another method which provides a coefficient of internal consistency without having to compute reliability estimates for every possible split-half combination. This method focuses on the degree to which the individual items are correlated with each other and is expressed by the Kuder-Richardson Formula 20 (Henning, 1987):

(4) Rtt = [N / (N − 1)] × [1 − Σpq / SD²]

(in which N: number of items, p: proportion of examinees answering an item correctly, q = 1 − p, SD²: variance of the total test scores)

Test-retest reliability indicates the repeatability of test scores with the passage of time. This estimate also reflects the stability of the characteristics or constructs being measured by the test. The formula for this method is as follows:

(5) Rtt = r1,2 (Henning, 1987)

(in which Rtt: the reliability coefficient using this method; r1,2: the correlation of the scores at time one with those at time two for the same test taken by the same person)

Inter-rater reliability is used when scores on the test are independent estimates by two or more judges or raters. In this case reliability is estimated from the correlation of the ratings of one judge with those of another, adjusted for the number of raters:

(6) Rtt = N × rA,B / [1 + (N − 1) × rA,B]

(in which Rtt: inter-rater reliability; N: the number of raters whose combined estimates form the final mark for the examinees; rA,B: the correlation between the raters, or the average correlation among the raters if there are more than two)

One way to improve the reliability of a test is to become aware of the test characteristics that may affect reliability, among which are test difficulty, discriminability and item quality.

Test difficulty is calculated by the following formula:

(7) p = Cr / N

(in which p: difficulty, Cr: sum of correct responses, N: number of examinees)

According to Heaton (1988: 175), the scale for test difficulty is as follows:
p = 0.81-1.00: very easy (81%-100% correct responses)
p = 0.61-0.80: easy (61%-80% correct responses)
p = 0.41-0.60: acceptable (41%-60% correct responses)
p = 0.21-0.40: difficult (21%-40% correct responses)
p = 0.00-0.20: very difficult (0%-20% correct responses)

Discriminability
The formula for item discriminability is given as follows:

(8) D = (Hc − Lc) / n

(in which D: discriminability, Hc: number of correct responses in the high group, Lc: number of correct responses in the low group, n: number of examinees in each group)

The range of discriminability is from 0 to 1: the greater the D index, the better the discriminability. The item properties of a test can be summarized as in Table 2.2 below.

Table 2.2 Item properties
Difficulty: 0.00-0.33 difficult | 0.33-0.67 acceptable | 0.67-1.00 easy
Discriminability: 0.00-0.30 very poor | 0.30-0.67 low | 0.67-1.00 acceptable
(Henning, G., 1987)

These indices set the ground for judging the difficulty and discriminability of the final achievement test chosen by the author.
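The formulas above lend themselves to a few lines of code. Below is a minimal Python sketch of the reliability estimate of Formula 1 (assuming, from its variables N, X and SD, that it is the Kuder-Richardson 21 formula) together with the item difficulty and discriminability indices of Formulas 7 and 8; all input values are hypothetical illustration data, not the thesis data.

```python
def kr21_reliability(n_items, mean, sd):
    """Formula 1 (read here as KR-21): reliability from item count, mean, and SD of total scores."""
    return (n_items / (n_items - 1)) * (1 - (mean * (n_items - mean)) / (n_items * sd ** 2))

def item_difficulty(responses):
    """Formula 7: p = sum of correct responses / number of examinees (1 = correct, 0 = wrong)."""
    return sum(responses) / len(responses)

def item_discriminability(high_group, low_group):
    """Formula 8 (as reconstructed): D = (Hc - Lc) / n, with n examinees in each group."""
    return (sum(high_group) - sum(low_group)) / len(high_group)

if __name__ == "__main__":
    # Hypothetical 40-item test with mean raw score 24 and standard deviation 6.
    print(round(kr21_reliability(40, 24.0, 6.0), 2))      # about 0.75

    # Hypothetical scored responses to one item, 10 examinees in each group.
    high = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]                 # Hc = 9 correct in the high group
    low = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]                  # Lc = 4 correct in the low group
    print(item_difficulty(high + low))                    # 13/20 = 0.65
    print(item_discriminability(high, low))               # (9 - 4)/10 = 0.5
```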
2.3.2 Test Validity
It should be noted that different scholars think of validity in different ways. Heaton (1988: 159) provides a simple but complete definition: "the validity of a test is the extent to which it measures what it is supposed to measure". Hughes (1989: 22) claims that "A test is said to be valid if it measures accurately what it is intended to measure". According to the Standards for Educational and Psychological Testing (1985: 9), "Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences from the test scores. Test validation is the process of accumulating evidence to support such inferences". Thus, to be valid, a test needs to assess learners' ability in the specific area that, according to its stated aim, it is intended to measure. For instance, a listening test with written multiple-choice options may lack validity if the printed choices are so difficult to read that the exam actually measures reading comprehension as much as it does listening comprehension. Validity is classified into the following subtypes:

Content validity
This is a non-statistical type of validity that involves "the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured" (Anastasi & Urbina, 1997: 114). A test has content validity built into it by careful selection of which items to include. Items are chosen so that they comply with the test specification, which is drawn up through a thorough examination of the subject domain. Content validity is very important in evaluating the validity of a test, in that "the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure" (Hughes, 1989: 22).

Construct validity
A test has construct validity if it demonstrates an association between the test scores and the prediction of a theoretical trait. Intelligence tests are one example of measurement instruments that should have construct validity. Construct validity is viewed from a purely statistical perspective in much of the recent American literature (e.g. Bachman and Palmer, 1981a). It is seen principally as a matter of a posteriori statistical validation of whether a test has measured a construct that has a reality independent of other constructs. To establish whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested (Carmines & Zeller, 1991: 23).

Face validity
A test is said to have face validity if it looks as if it measures what it is supposed to measure. Anastasi (1982: 136) points out that face validity is not validity in the technical sense; it refers not to what the test actually measures but to what it superficially appears to measure. Face validity is very closely related to content validity: while content validity depends on a theoretical basis for judging whether a test assesses all domains of a certain criterion, face validity concerns whether a test appears to be a good measure or not.
Criterion-related validity
Criterion-related validity is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has already been demonstrated to be valid. In other words, the concept is concerned with the extent to which test scores correlate with a suitable external criterion of performance. Criterion-related validity consists of two types (Davies, 1977): concurrent validity, where the test scores are correlated with another measure of performance, usually an older established test, taken at the same time (Kelly, 1978; Davies, 1983), and predictive validity, where test scores are correlated with some future criterion of performance (Bachman and Palmer, 1981).

2.3.3 Reliability and Validity
Reliability and validity are the two most vital characteristics of a good test. However, the relationship between them is rather complex. On the one hand, it is possible for a test to be reliable without being valid: a test can give the same result time after time yet not measure what it was intended to measure. For example, an MCQ test could be highly reliable as a test of individual vocabulary items, but it would not be valid if it were taken to indicate the students' ability to use the words productively. Bachman (1990: 25) says, "While reliability is a quality of test scores themselves, validity is a quality of test interpretation and use". On the other hand, if a test is not reliable, it cannot be valid at all. To be valid, as Hughes (1988: 42) puts it, "a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all". For example, in a writing test, candidates may be required to translate a text of 500 words into their native language. This could well be a reliable test, but it cannot be a valid test of writing. Thus, there will always be some tension between reliability and validity, and the tester has to balance gains in one against losses in the other.

2.4 Achievement test
Achievement tests play an important role in school programs, especially in evaluating the language knowledge and skills students acquire during a course, and they are widely used at different school levels. Achievement tests are also known as attainment or summative tests. According to Henning (1987: 6), "achievement tests are used to measure the extent of learning in a prescribed content domain, often in accordance with explicitly stated objectives of a learning program". These tests may be used for program evaluation as well as for certification of learned competence. It follows that such tests normally come directly after a program of instruction. Davies (1999: 2) shares this idea: "achievement refers to the mastery of what has been learnt, what has been taught or what is in the syllabus, textbook, materials, etc. An achievement test therefore is an instrument designed to measure what a person has learnt within or up to a given time". Similarly, Hughes (1989: 10) states that achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives. Achievement tests are usually administered after a course to the group of learners who took it.
Sharing the same view of achievement tests as Hughes, Brown (1994: 259) suggests that "An achievement test is related directly to classroom lessons, units or even total curriculum". Achievement tests, in his opinion, "are limited to particular material covered in a curriculum within a particular time frame". There are two kinds of achievement tests: final achievement tests and progress achievement tests.

Final achievement tests are those administered at the end of a course of study. They may be written and administered by ministries of education, official examining boards, or members of teaching institutions. Clearly the content of these tests must be related to the courses with which they are concerned, but the nature of this relationship is a matter of disagreement among language testers. According to some testing experts, the content of a final achievement test should be based directly on a detailed course syllabus or on the books and other materials used. This has been referred to as the syllabus-content approach. It has an obvious appeal, since the test contains only what the pupils are thought to have actually encountered and can thus be considered, in this respect at least, a fair test. The disadvantage is that if the syllabus is badly designed, or the books and other materials are badly chosen, the results of the test can be very misleading: successful performance on the test may not truly indicate successful achievement of course objectives.

The alternative approach is to base the test content directly on the objectives of the course, which has a number of advantages. Firstly, it forces course designers to be explicit about course objectives. Secondly, performance on the test shows how far the pupils have achieved those objectives. This in turn puts pressure on those responsible for the syllabus and for the selection of books and materials to ensure that these are consistent with the course objectives. Tests based on course objectives work against the perpetuation of poor teaching practice, something which course-content-based tests, almost as if part of a conspiracy, fail to do. It is the author's belief that test content based on course objectives is much preferable: it provides more accurate information about individual and group achievement and is likely to promote a more beneficial backwash effect on teaching.

Progress achievement tests, as the name suggests, are intended to measure the progress that learners are making. Since "progress" here means progress towards the achievement of course objectives, these tests, too, should be related to objectives and should represent a clear progression towards the final achievement test based on course objectives. If the syllabus and teaching methods are appropriate to the objectives, progress tests based on short-term objectives will fit well with what has been taught; if not, there will be pressure to create a better fit. If it is the syllabus that is at fault, it is the tester's responsibility to make clear that it is the syllabus, not the tests, that needs to change. In addition to the more formal achievement tests, which need careful preparation, teachers should feel free to devise their own ways of making rough checks on students' progress and keeping students on their toes. Since such tests will not form part of formal assessment procedures, their construction and scoring need not be geared so strictly towards the intermediate objectives on which the more formal progress achievement tests are based.
However, they can reflect a particular "route" that an individual teacher is taking towards the achievement of objectives.

Summary
In this chapter, the writer has presented a brief literature review that sets the ground for the thesis. Given the limited time and volume of this thesis, the writer focuses only on evaluating the reliability and the validity of a chosen achievement test; therefore, this chapter deals only with those points on which the thesis is based.

CHAPTER 3: THE STUDY
This chapter is the main part of the study. It provides the practical background for the study and an overview of English teaching, learning and testing at the University of Technology, Ho Chi Minh National University. More importantly, it presents the data analysis of the chosen test and the findings drawn from the analysis.

3.1 English learning and teaching at the University of Technology, Ho Chi Minh National University
3.1.1 Students and their backgrounds
Students at the University of Technology are at different levels of English because of their different backgrounds. Typically, those who come from big cities and towns have a better command of English than those from rural areas, where foreign language learning receives little attention. In addition, some students had over ten years of English before entering university, some have studied it for only a few years, and others have never learned English at all. Moreover, the entry requirements of the University of Technology are rather low, because applicants do not have to take any entrance exams; instead, they only submit their dossiers for consideration and evaluation. As a result, their attitude towards learning English in particular, and other subjects in general, is not very positive.

3.1.2 The English teaching staff
The English section of the University of Technology is a small section with only five teachers. They are responsible for teaching both Basic English and English for Specific Purposes (ESP) in Computing. All the English teachers have been trained in Vietnam and none of them has studied abroad. One of them has obtained a Master's degree in English; three are taking an MA course. They prefer using Vietnamese in class, as they find it easier to explain lessons in Vietnamese given the limitations of the students' English ability. Furthermore, they are fully aware of the need to adapt suitable methods for teaching homogeneous classes, and they have been applying technology in their ESP teaching. This results in the students' high involvement in the lessons.

3.1.3 Syllabus and its objectives
The English syllabus for Information Technology students, which has been in use for over five years, was designed by the teachers of the English section at the University of Technology. It is divided into two phases: Basic English (Phase 1) and ESP (Phase 2). Phase 1, which lasts the first three semesters with 99 forty-minute periods, is covered by the Lifelines series, in which the students pay attention only to the reading skill and grammar. Phase 2, comprising the three final semesters with 93 forty-minute periods in total, is wholly devoted to ESP. It should be noted that the notion of ESP in this context simply means the English language combined with content for Information Technology. In Phase 2, the students work with Basic English for Computing, which consists of twenty-eight units providing background knowledge and vocabulary for computing. The book covers the four skills of listening, speaking, reading and writing, plus language focus.
The reading texts in the course book are meaningful and useful to the students because they first revise the students' knowledge and language items and then supply background knowledge and a source of vocabulary relating to their major, Information Technology. Table 3.1 illustrates how the syllabus is allocated to each semester.

Table 3.1 Syllabus content allocation
Semester | 45-minute periods | Teaching content | Course book
1 | 33 | Reading and grammar | Lifelines Elementary
2 | 33 | Reading and grammar | Lifelines Elementary
3 | 33 | Reading and grammar | Lifelines Pre-Intermediate
4 | 39 | Reading, grammar and vocabulary | Basic English for Computing
5 | 27 | Reading, grammar and vocabulary | Basic English for Computing
6 | 27 | Reading, grammar and vocabulary | Basic English for Computing

The two course books used over the six semesters include the four skills of reading, writing, listening and speaking, but reading and grammar receive more attention because of the objectives of the course.

Table 3.2 Syllabus goal and objectives
Common goal: To equip the students with basic English grammar and the general background of computing English necessary for their future career.
Objectives, Semesters 1-3: To revise the students' grammar knowledge and help them use it fluently in preparation for the following semesters.
Objectives, Semesters 4-6: To supply the students with the basic knowledge and vocabulary of computing; to consolidate the students' reading skills and instruct them in translation; and to help the students read, comprehend and translate English materials in computing.

The application of teaching methods encounters a variety of difficulties, such as the students' habit of passive learning, low motivation, big classes, etc. A clear goal has been set up by the teaching staff for the whole syllabus, and this goal is realized through the specific objectives for each semester.

3.1.4 The course book: "Basic English for Computing"
The book was written by Glendinning, E.H. and McEwan, J. and published in 2003 by Oxford University Press, with the following key features:
* A topic-centred course that covers key computing functions and develops learners' competence in all four skills.
* Graded specialist content combined with key grammar, functional language, and subject-specific lexis.
* Simple, authentic texts and diagrams that present up-to-date computing content in an accessible way.
* Tasks that encourage learners to combine their subject knowledge with their growing knowledge of English.
* A glossary of current computing terms, abbreviations, and symbols.
* A Teacher's Book that provides full support for the non-specialist, with background information on computing content and an answer key.
The book was designed to cover all four skills, each followed by a language focus. However, because of the objectives of the ESP course taught at the University of Technology, only the reading skill and grammar are focused on, as mentioned above. The detailed content of the book can be found in Appendix 2. The book appears good, with authentic and meaningful texts, and the final achievement tests are often based closely on the content of the course book.

3.2 English testing at the University of Technology
3.2.1 Testing situation
English tests for students at the University of Technology are designed by the staff of the English section. Each teacher on the staff is responsible for the test items for a given semester; all the materials are then fed into a common item bank managed by software on a server.
Before the examinations, the person in charge of preparing the tests uses the software to combine items from the item bank and print out the test papers. All the tests are designed in line with the syllabus-content approach. All in all, the students are required to take six formal tests throughout their courses. Within the limited scope of the study, the writer focuses on the third-year final test (the sixth-semester test), which is the last test the students have to take. The current English testing situation at the University of Technology has several noteworthy points:
* Students are often given the test format long before the actual test, which leads to test-oriented learning.
* Students do not have their test papers returned with feedback and corrections, so they hardly know what their strong and weak points are.
* Students can copy answers from one another during the tests in spite of examiner supervision, so their true abilities are not always reflected.
* Some tests still contain basic errors such as spelling errors, extremely easy or difficult items, badly designed layout, etc.
* Test items are not pre-tested before live tests.

3.2.2 The current final third-year achievement test (English 6)
General information:
* Final Achievement Test, Semester 6, English 6
* Time allowance: 90 minutes
* Testees: most of the third-year students at the University of Technology
* Supervisors: teachers from the University of Technology
English Test 6 is a syllabus-based achievement test whose content is taken from teaching points delivered in the last three semesters (4, 5 and 6). The test covers a wide range of computing knowledge, vocabulary, grammar, reading, writing and translation skills. Table 3.3 describes English Test 6, with its seven parts and marking scale, as below:

Table 3.3 Specification of Test 6
Part I – Vocabulary and Grammar; input: sentences; task: 10 four-option multiple-choice items; marks: 15
Part II – Reading comprehension; input: narrative text relating to computing, approx. 300-400 words; task: 5 four-option multiple-choice items; marks: 25
Part III – Reading and Vocabulary; input: narrative text relating to computing, approx. 150-200 words; task: 10 open-cloze items; marks: 15
Part IV – Writing; input: incomplete sentences; task: 5 sentence-building items; marks: 15
Part V – Writing; input: incomplete sentences; task: 5 sentence-transformation items; marks: 15
Part VI – English-Vietnamese translation; input: sentences in English; task: 2 sentences; marks: 10
Part VII – Vietnamese-English translation; input: sentences in Vietnamese; task: 2 sentences; marks: 5
Total: 100
(For the specific test, see Appendix 3)

As explained above, the students are expected to apply their reading skills, grammar and vocabulary in preparation for the final examination, so the test is aimed at assessing both knowledge and skills. In the first part of the test, the students have to use their background knowledge, vocabulary and language items relating to computing. Part II requires the students to read an ESP passage and then choose the best option for each question. In Part III, the students have to choose words from among those given to complete the text; this part also tests the students' reading comprehension and vocabulary. Parts IV and V require the students to use their knowledge of grammar to make meaningful and correct sentences. Finally, the two last parts, on translation, are aimed at assessing the students' general understanding in terms of vocabulary, use of language and terminology.
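Section 3.2.1 describes software that mixes items from a common item bank to produce the test papers. The following is a minimal, hypothetical sketch of that kind of assembly step, using the part sizes from Table 3.3; the actual item-bank software used at the university is not named in the thesis, so none of the function or field names below are taken from it.

```python
import random

# Number of items each part of Test 6 requires, taken from Table 3.3.
SPEC = {"I": 10, "II": 5, "III": 10, "IV": 5, "V": 5, "VI": 2, "VII": 2}

def assemble_test(item_bank, seed=None):
    """Randomly draw the required number of items for each part from the bank."""
    rng = random.Random(seed)
    form = {}
    for part, n_items in SPEC.items():
        pool = item_bank[part]
        if len(pool) < n_items:
            raise ValueError(f"Part {part} needs {n_items} items but the bank holds {len(pool)}")
        form[part] = rng.sample(pool, n_items)
    return form

if __name__ == "__main__":
    # Placeholder bank: 20 dummy item identifiers per part.
    bank = {part: [f"{part}-{i:02d}" for i in range(1, 21)] for part in SPEC}
    print(assemble_test(bank, seed=42)["I"])   # the ten Part I items drawn for this form
```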
3.3 Research method
3.3.1 Data collection instruments
To analyse the data and evaluate the reliability and the validity of the final achievement test, the author combines the following instruments:
* Formula 1 (as shown in Chapter 2) to compute the reliability coefficient.
* Software: Item and Test Analysis Program (ITEMAN) for Windows, Version 3.50, to analyse item difficulty and item discrimination and to evaluate construct validity.

3.3.2 Participants
The study is based on the scores obtained from 127 test papers, equivalent to the total number of students taking English Test 6.

3.4 Data analysis
3.4.1 Initial statistics
Frequency distribution

Table 3.4 Frequency distribution in the final achievement test
Converted score (x) | Frequency (f) | fx
0 | 1 | 0
2 | 1 | 2
3 | 3 | 9
4 | 13 | 52
5 | 31 | 155
6 | 35 | 210
7 | 35 | 245
8 | 8 | 64
Total fx | | 737

The median refers to the score gained by the middle testee in the order of merit; in this case, the median is 4.5.
Mean = (0×1 + 2×1 + 3×3 + 4×13 + 5×31 + 6×35 + 7×35 + 8×8) / 127 = 737/127 = 5.8
The mode refers to the score that occurs most frequently; in this set of scores there are two modes, 6 and 7.

Measures of dispersion: the standard deviation (SD) and the range
The range is the difference between the highest and the lowest scores: Range = 8 − 0 = 8.
The standard deviation (SD) is another way of showing the spread of scores; it measures the degree to which the group of scores deviates from the mean. SD is calculated by the following formula:

SD = √[Σ(X − X̄)² / N]

(in which X: any observed score in the sample, X̄: the mean of all scores, N: number of scores in the sample)

3.4.2 Reliability estimates
The author employs Formula 1 to calculate the reliability coefficient (Henning, 1987):

Rt = [N / (N − 1)] × [1 − X(N − X) / (N × SD²)]

(in which Rt: reliability coefficient, N: number of items, X: mean of all scores, SD: standard deviation of the test scores)

Rt is expressed as a number ranging between 0 and 1.00, with r = 0 revealing no reliability and r = 1.00 indicating perfect reliability. The obtained reliability index is 0.53, meaning that only 53% of the test score is reliable and the other 47% may be due to error. This value therefore indicates inadequate reliability.

3.4.3 Item analysis
To analyse item difficulty and item discriminability, the author employed the software Item and Test Analysis Program (ITEMAN) for Windows, Version 3.50. The complete analysis is attached in Appendix 4. The overall proportions of the item statistics, with their interpretation, are presented below.

Table 3.5 Item properties in the final achievement test (Test 6)
Difficulty: 0.00-0.33 (difficult) – 21% | 0.33-0.67 (acceptable) – 33% | 0.67-1.00 (easy) – 46%
Discriminability: 0.00-0.30 (very poor) – 23% | 0.30-0.67 (rather poor) – 51% | 0.67-1.00 (acceptable) – 26%
Point biserials: 0.00-0.24 (very poor) – 5% | 0.25-1.00 (acceptable) – 95%

It can be seen that 33% of the items are of acceptable difficulty (0.33-0.67), while the remaining 67% are either too difficult or too easy. In addition, the correlations between item responses and total scores (point biserials) seem satisfactory. However, only 26% of the items show acceptable discriminability; the other 74% do not discriminate between students' ability levels, so the overall discriminability is low. After calculating all the indices of item properties, the extreme items were screened out and analysed to gain an insight into the nature of the problems.
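For reference, the initial statistics in Section 3.4.1 can be reproduced directly from the frequency distribution in Table 3.4. The sketch below uses the tabulated frequencies; the standard deviation is not reported in this excerpt, so its printed value is simply what the formula yields from those frequencies.

```python
import math

# Score frequencies from Table 3.4 (converted score: number of examinees).
freq = {0: 1, 2: 1, 3: 3, 4: 13, 5: 31, 6: 35, 7: 35, 8: 8}

n = sum(freq.values())                                      # 127 test papers
mean = sum(x * f for x, f in freq.items()) / n              # 737 / 127, about 5.8
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / n
sd = math.sqrt(variance)

print(f"N = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
print("range =", max(freq) - min(freq))                     # 8 - 0 = 8
```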
The items are discussed from top to bottom, part by part, for easy reference, with possible explanations in Table 3.6. The items given in Table 3.6 are typical of the extreme items in Test 6. The explanations are based on a review of the course book and informal exchanges of opinion with the teachers in the English section. After the initial analysis of the test items, the writer may conclude that Part VI is the most difficult because it contains only subjective items. Part II appears to be the easiest. Parts IV and VII are slightly difficult for the students, as they require the students to apply writing skills based on the input knowledge. Part V is of acceptable difficulty and Part II is fairly easy.

Table 3.6 Extreme items with possible explanations
Part I, question 4: difficulty 0.94 (too easy) – common verbs
Part I, question 5: difficulty 0.96 (too easy) – common computing definition
Part II, questions 1-5: difficulty 0.69-0.98 (too easy) – frequently practised
Part III, question 8: difficulty 0.88 (too easy) – common phrasal verb
Part IV, question 5: difficulty 0.15 (too difficult) – complexity of the -ing clause
Part V, questions 1-5: difficulty 0.31-0.59 (acceptable)
Part VI, question 1: difficulty 0.20 (too difficult) – Vietnamese-English translation is under-practised
Part VI, question 2: difficulty 0.14 (too difficult) – Vietnamese-English translation is under-practised
Part VII, question 2: difficulty 0.13 (too difficult) – difficult terms and complexity of sentence structure

Part VI (Vietnamese-English Translation), which is made up of subjective items, needs separate analysis. Only 20% and 14% of the students perform at an acceptable level on questions 1 and 2 respectively. In Part VII (English-Vietnamese Translation), most students gain their scores on the first question, with a percentage of 62. The other question needs careful consideration of its content and marking, as suggested by ITEMAN 3.5.

3.4.4 Validity
To judge the validity of the test, the author relies on the scale intercorrelations among the parts of the final achievement test (Test 6). This index is calculated with the help of the software ITEMAN 3.5.

Table 3.7 Scale intercorrelations among the seven subtests (parts) of Test 6
      1      2      3      4      5      6      7
1   1.000  0.293  0.366  0.414  0.187  0.150  0.256
2   0.293  1.000  0.365  0.105  0.298 -0.232 -0.069
3   0.366  0.365  1.000  0.143  0.235 -0.046  0.284
4   0.414  0.105  0.143  1.000  0.378  0.231  0.116
5   0.187  0.298  0.235  0.378  1.000  0.302  0.101
6   0.150 -0.232 -0.046  0.231  0.302  1.000  0.234
7   0.256 -0.069  0.284  0.116  0.101  0.234  1.000

3.5 Discussion and findings
Based on the above findings, the evaluation of the final achievement test (Test 6) can be made in accordance with the three research questions, as follows:

3.5.1 Reliability
The final achievement test for the third-year non-major students at the University of Technology appears to have inadequate reliability, with a reliability coefficient of 0.53, as found in Section 3.4.2. Moreover, the item difficulty and item discriminability indices also indicate the unreliability of Test 6. The test is intended for both good and less competent students in the group and aims at checking whether, and how much, the students have acquired the basic knowledge and skills presented in the textbook; however, the difficulty level of the test as explored is deemed inappropriate. Based on Figure 3.1 (histogram of the score distribution) and Table 3.5 (item properties in the final achievement test), we can conclude that the examinees found the test too easy (46% of the items are easy) and that a high percentage of them obtained high scores. In addition, this is an achievement test, not a placement test, so it should not be required that all the questions be of acceptable discriminability.
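The intercorrelations in Table 3.7 come from ITEMAN 3.5, but the same kind of matrix is simply the set of Pearson correlations between the examinees' subtest scores, as the sketch below illustrates. The score matrix here is randomly generated placeholder data; in the study it would contain the 127 examinees' actual scores on Parts I-VII.

```python
import numpy as np

# Placeholder data: 127 examinees x 7 subtests (Parts I-VII). Real subtest scores
# would be substituted here to reproduce a table like Table 3.7.
rng = np.random.default_rng(0)
scores = rng.integers(0, 16, size=(127, 7)).astype(float)

# Pearson correlations between the columns (subtests) give the 7 x 7 matrix.
intercorrelations = np.corrcoef(scores, rowvar=False)
print(np.round(intercorrelations, 3))
```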
Nevertheless, in the case of this test, items of acceptable discriminability account for only 26%, which is not enough to discriminate between the more able and the less able students, while still allowing the less competent students to complete the course with some satisfaction rather than discouragement. Therefore, if discriminability is to be taken into consideration, more attention must be paid to it in the item-writing process: items with poor discrimination values should be removed and replaced.

3.5.2 Validity
In terms of content validity, the final achievement test appears valid. The test contains items chosen to comply with the test specification shown above, and it is closely related to the proposed aims: the first and third parts test grammar and vocabulary, the second part tests reading comprehension, Parts IV and V test writing skills through sentence-building and sentence-transformation exercises, and the last two parts aim at translation skills.

However, in terms of construct validity, the test seems to be lacking. It can be seen from Table 3.7 (scale intercorrelations among the seven subtests of Test 6) that the seven parts of the test are not highly correlated with each other; the intercorrelations range from −0.232 to +0.414. The table shows a relatively strong correlation between Part I and Part IV. In contrast, Part II does not correlate well with Part VI. This may be because Part I contains more easy items than Part VI; in addition, as analysed in Section 3.4.3, Part VI is the most difficult part of the test, although it shows a better correlation with Part V. Parts VI and VII lack correlation with the other parts because the translation tasks are not associated with the teaching objectives and teaching content. Although the students are taught through lessons in class, they are not intentionally taught translation skills; what they have done is orally translate reading texts into Vietnamese for better understanding. That is why they perform English-Vietnamese translation better than Vietnamese-English translation.

As far as the item properties are concerned, the final achievement test (Test 6) is made up of many weak items, that is, items with abnormal indices of difficulty and discriminability, or items which fail to measure what is intended. The analysis in Section 3.4.3 reveals that only 33% of the items are acceptably difficult, while 67% are either too difficult or too easy. These extreme items are not scattered through the test; most are gathered into groups, such as the easy items in Parts II and III and the difficult items in Parts VI and VII. Besides, half of the items have poor discrimination and fail to discriminate between test-takers' levels. In conclusion, the final achievement test is not valid, because over half of its items are weak, the correlations among its parts are weak, and it contains some parts that do not efficiently measure what is intended.

3.5.3 Final evaluations
Unfortunately, the final achievement test for the third-year non-major students shows inadequate reliability, with a rather low reliability coefficient (0.53) and poor difficulty and discriminability indices, and it is not sufficiently valid, containing many weak items and unbalanced parts. These shortcomings can be addressed through the changes recommended in the Conclusion.

3.5.4 Conclusion
The quantitative analysis reveals that the final achievement test is not an efficient assessment tool.
In practice, the administrators and teachers rely only on the initial score statistics to evaluate the test, exactly as was done in response to the first two research questions. The study shows that the test is neither reliable nor valid. This problem calls for a revision of the final achievement test and an improvement of the testing situation for students at the University of Technology.

CHAPTER 4: CONCLUSION
4.1 Conclusion and Implications
The study aims at evaluating the reliability and the validity of the final achievement test for the third-year students at the University of Technology. To address the stated aims, the study was conducted in three logically ordered phases. First, the writer gathered and critically read materials on the background knowledge to set the basis for the study. Then, the context of the study was described and discussed to provide a relevant frame for the research. Finally, the writer collected, analysed and evaluated the data in order to answer the research questions raised at the beginning of the study.

The study answered the first two research questions and reached a number of important conclusions about the assessment of the test's reliability and validity. These conclusions are as follows:

The test, with its chosen items, is compatible with the aims of the courses, the objectives of the syllabus and the test specification. However, given the negative skew of the score distribution analysed above, the evaluation of item difficulty and discriminability, and the low reliability coefficient of 0.53, the test is of inadequate reliability and cannot correctly discriminate students' true ability. Moreover, closer investigation shows that the test is not satisfactory enough because:
* It has unbalanced components in terms of difficulty.
* Over 70% of the items are weak; weak items are either too difficult or too easy.
* Several parts of the test are not associated with the teaching objectives and contents.
* It contains two highly subjective parts, the translation tasks, whose results depend too much on the markers and the administrators.

It was also found that the marking of the test is not consistent in rating subjective items and in applying rounding rules; this is true for the writing and translation tasks. For example, two test papers scored 5.5 were rounded either up to 6 or down to 5. The writing parts (sentence building and transformation) test the students' knowledge of grammar rather than their communication skills, and marks for these tasks are only awarded when the whole sentence is perfectly correct; otherwise, the item is marked zero (0). As a result, these parts are always regarded as difficult. These problems may result from an over-simplified process of test development, administration and rating.

In response to the problems of the final achievement test given above, the writer recommends some suggestions for improving the test design and administration.

Solution 1: Test Content
The easiest improvement is to simplify Parts VI and VII. This can be done by carefully choosing sentences, structures and vocabulary that are more familiar to the students. The content of the test can also be improved by:
* replacing the weak items in the multiple-choice and reading comprehension sections (Parts I and II);
* replacing item 5 in the sentence-building section, because it is too difficult for the students.
However, these changes apply only to the chosen test, not to the other final tests for the third-year students.
As stated in Section 3.2.1, the test used is one of the final achievement tests designed from the item bank. Thus, making changes to the test content requires careful and overall consideration of the test items before they are put into the bank.

Solution 2: Test Specifications Proposal
Careful consideration must be given to the designed purpose of a test, including the exact content of the objectives said to be measured and the type of examinee for whom the measurement is to occur (Henning, G., 1987). Accordingly, the test specification should be proposed in line with the teaching objectives and the text-based contents, and should be drawn up at a very early stage of test construction. The overall objectives of the course are that students better understand vocabulary, grammar and reading skills, first because these are associated with the teaching objectives, which are to help students read and understand ESP materials in English, and second because this knowledge is necessary for the students in their careers. Emphasis is placed on vocabulary, as it is an important part of the ESP teaching content. Grammar is not measured separately but indirectly through all the tasks. Tasks in the specification should be ordered from easy to difficult to motivate the students. In addition, the translation tasks should be omitted, because they are believed not to be very helpful.

Solution 3: Test Administration
Test administration involves a series of complicated procedures, so substantial improvement cannot be made instantly, as it entails reforms not only in testing practice but also in the whole education system. However, some practical actions can be taken immediately to enhance test administration in the short or long term. First, the test format should not be introduced to the students long before any test, as this leads to test-oriented learning; instead, the teachers should present the test format at the end of the course or right before the test to orient the students. Second, test makers should avoid writing ambiguous or overly complicated items: "The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended" (Hughes, A. 1989: 38). In the case of the English section at the University of Technology, the teachers should trial the tests first, then mark and re-mark them to identify weak items, and finally eliminate the weak items and replace them with better ones. Third, the test should consist of items that permit scoring that is as objective as possible. Multiple-choice items, open-ended items which have a unique (possibly one-word) correct response, cloze tests, matching items, etc., as suggested by Hughes (1989: 40), may permit objective scoring. Finally, test makers should be informed of a detailed marking scale with specific rounding rules, especially for subjective marking such as translation and writing.

We have gone through the major conclusions from the research and a number of implications for improvement which may be of help to the teachers at the University of Technology. The next section discusses some limitations of the thesis and proposes directions for further research.

4.2 Limitations of the study
Within the limited timeframe and limited volume of the study, the thesis has a number of unavoidable limitations.
The limitations can be summarized as follows:
* The writer relies only on the quantitative method to evaluate the reliability and validity of the final achievement test for the third-year non-major students at the University of Technology; thus, the thesis lacks feedback from test users and teachers that would allow a complete judgement on the test.
* An intensive investigation into the test content has not been carried out because of the limited time.
* The reliability evaluation is restricted to computing the coefficient and considering several minor factors such as item difficulty and item discriminability.
* The validity evaluation is limited to analysing the intercorrelations among the parts of the test; the part-whole relations and the intercorrelations among tests have not been addressed.
* The test evaluation is restricted to reliability and validity; other factors such as practicality, relevance, etc. have not been considered.
* The author has not proposed a more reliable and valid test after evaluating the current one, because of the time limit and the restrictions of the thesis.

4.3 Suggestions for further research
In order to generate a more comprehensive view of achievement tests at the University of Technology, a number of issues could be taken into account in further research:
* An investigation into the test results with more comprehensive research methods and all the relevant factors.
* An action research project on designing an achievement test so as to ensure and improve its reliability and validity.
* An investigation into the test administration process.
* An investigation into the backwash of the test.
* An investigation into the association between the computing content of the test and what has been taught, and to what degree.
* An investigation into how the test reflects the syllabus.
* An investigation into the item banking process.

To sum up, this research makes a positive contribution to the work of teachers, test writers and test users. It attempts to raise awareness among the teachers of English at the University of Technology of the need to develop valid and reliable assessment instruments. The author hopes that this research will encourage other researchers to engage enthusiastically in the field of language testing, especially in those areas identified for further research.

REFERENCES
Alderson, J.C., Clapham, C. and Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
American Psychological Association. (1985). Standards for Educational and Psychological Testing. Washington, DC: Author.
Anastasi, A. (1982). Psychological Testing. London: Macmillan.
Anastasi, A. and Urbina, S. (1997). Psychological Testing (7th ed.). Prentice Hall International (UK) Ltd.
Bachman, L.F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. (1996). Language Testing in Practice. Oxford: Oxford University Press.
Berkowitz, D., Wolkowitz, B., Fitch, R. and Kopriva, R. (2000). The Use of Tests as Part of High-Stakes Decision-Making for Students: A Resource Guide for Educators and Policy-Makers. Washington, DC: U.S. Department of Education. [Available online: ].
Carmines, E.G. and Zeller, R.A. (1991). Reliability and Validity Assessment. Newbury Park: Sage Publications.
Davies, A. (1996). Language Testing. Cambridge: Cambridge University Press.
Davies, A. et al. (1999). Dictionary of Language Testing. Cambridge: Cambridge University Press.
Glendinning, E.H. and McEwan, J. (2003). Basic English for Computing.
Oxford: Oxford University Press.
Heaton, J.B. (1988). Writing English Language Tests. London: Longman Group UK Ltd.
Henning, G. (1987). A Guide to Language Testing. Cambridge: Newbury House Publishers.
Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.
Lado, R. (1961). Language Testing. London: Longman.
McNamara, T. (2000). Language Testing. Oxford: Oxford University Press.
Palmer, A.S. and Bachman, L.F. (1981). Basic concerns in test validation. In Alderson, J.C. and Hughes, A. (eds.) (1981).
Vu Van Phuc (2002). Hand-outs and lecture notes.
Weir, C.J. (1990). Communicative Language Testing. Prentice Hall International (UK) Ltd.
