Sprogforum, No 23, Vol. 8, 2002

What is happening in language testing?
In a muted and modified way, and with some delay, these same features have also found their way into testing and assessment practices. The delay may partly be explained by the inherent conservativeness of testing as a norm-based activity. Partly, it is also due to the teacher-assessor's decisive role in initiating and planning assessment, combined with a conscious or subconscious fear of the strong demands of correctness and fairness that assessment involves. It is easier to follow old practices and believe that they are tried and trusted than to confront one's own fears and uncertainties about assessment. However, this situation is now changing.
Underneath the mainstream emphasis on measurement and standards, changes in assessment theory and practice have been going on in the past decades, and similarly to teaching, they have concerned all areas of approach, philosophy, and activity (an accessible overview of the developments is given e.g. by Birenbaum 1996). Though this may not have been in the theoretical headlines thirty years ago, the situation has changed in the past ten years, and it is becoming increasingly clear that there are several diverse currents in the core of language testing today. From the narrow confines of measurement, language testing has broadened into language assessment, a field with functional links with language education, several fields of applied linguistics, and educational assessment. Apart from topics of discussion, the change is also reflected in publication titles. The main journal in the area is still called Language Testing, but the volume on these topics in a recent encyclopedia was subtitled Language testing and assessment (Clapham and Corson (Eds.) 1997), and an ongoing series of current textbooks is called the Cambridge Language Assessment Series.
In this article, I will discuss some of the changes that have been going on in language testing and assessment in the "traditional" and the newer, broader sense. The paper has a practical focus on issues and implications that are relevant for language education. However, I will also include references to recent articles and reviews, which explain particular developments in more detail. As there is a whole article on the social and political dimensions of language testing in this issue, I will not discuss the developments on this topic here, despite its importance. Instead, I will focus on the other changes of philosophy and practice that recent developments in language testing have brought about.
Computers and the Internet provide an important context of language use for today's language learners. This is especially so in English language education, though applications, particularly specialized language learning applications, have also been developed for other languages. The exciting opportunity for language assessment that computerisation offers is the development of computer-adaptive tests and assessment instruments.
Computer-adaptive assessment tools contain pre-programmed sets of tasks, which may be organized according to topics, skills tested, and/or specific content areas such as formulaic question-response sequences or singular-plural distinctions. Additionally, the tasks are tagged with some measurement information. This most often indicates how difficult the item has been found to be in pretesting and analysis, and possibly also how effective it is in telling apart those learners whose general ability level is higher than the difficulty of the item from those whose general ability is lower (i.e., item discrimination).
The idea in computer adaptive tests is that task delivery adapts to an individual learner's responses. The learner first gets an intermediate task (or item, as tasks are frequently called in testing terminology). If he or she gets it right, the program delivers a harder item. If the response is again correct, the program chooses a harder item still. If the learner now gets this wrong, the program gives an easier item, beginning to home in on the learner's ability level. Another learner may get the same task at the beginning, but depending on his or her pattern of responses, may see quite a different set of items. Nevertheless, the program can use the measurement information attached to the items to give a final grade to both learners on the same scale.
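The delivery logic described above can be illustrated with a minimal sketch. It is a simplified "step up, step down" rule rather than the procedure of any particular testing system; the names `Item`, `run_adaptive_session` and `answer_fn`, as well as the difficulty values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    difficulty: float  # hypothetical difficulty value from pretesting


def run_adaptive_session(pool, answer_fn, n_items=5):
    """Deliver up to n_items items: harder after a correct response,
    easier after a wrong one. answer_fn(item) returns True or False."""
    remaining = sorted(pool, key=lambda it: it.difficulty)
    total = min(n_items, len(remaining))
    idx = len(remaining) // 2  # start with an item of intermediate difficulty
    history = []
    for _ in range(total):
        item = remaining.pop(idx)
        correct = answer_fn(item)
        history.append((item.item_id, correct))
        if not remaining:
            break
        if correct:
            # After pop(), position idx holds the next harder unused item
            # (or we fall back to the hardest one that is left).
            idx = min(idx, len(remaining) - 1)
        else:
            # Step down to the next easier unused item.
            idx = max(idx - 1, 0)
    return history


if __name__ == "__main__":
    pool = [Item("a", -2.0), Item("b", -1.0), Item("c", 0.0),
            Item("d", 1.0), Item("e", 2.0)]
    # A simulated learner who answers correctly whenever the item is not
    # harder than 0.5 on this (assumed) difficulty scale.
    print(run_adaptive_session(pool, lambda it: it.difficulty <= 0.5, n_items=3))
```

Operational adaptive tests typically choose the next item on the basis of a running ability estimate rather than a fixed step, but the principle of homing in on the learner's level is the same.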
The statistical basis for making computer adaptive tests is most often some form of Item Response Theory (IRT). IRT models are based on likelihood of correct response, in other words, probability theory. An accessible introduction to the use of item response theory in language testing is provided in Tim McNamara's 1996 book Measuring second language performance. There are also several introductory textbooks in statistics, which explain the theory that underlies the model; the search word 'item response theory' or 'IRT' will help locate some of them in a library near you. Moreover, the basics are also explained on some sites on the Web, such as ericae.net/scripts/cat/.
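To give an idea of what "likelihood of correct response" means in practice, one widely used IRT model, the two-parameter logistic model, can be written as below. The symbols are introduced here for illustration only: $\theta$ stands for the learner's ability, $b_i$ for the difficulty of item $i$, and $a_i$ for its discrimination.

$$
P(X_i = 1 \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
$$

When the learner's ability equals the item's difficulty ($\theta = b_i$), the probability of a correct response is 0.5. If all items are assumed to discriminate equally well ($a_i = 1$), the formula reduces to the one-parameter, or Rasch, model.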
The advantages of computer adaptive testing are that tests can be shorter, because learners do not spend time answering many items that are too easy or too difficult for them, and that results from tests given at different times are comparable. Once a pool of items has been piloted and analysed so that the item values are known, any sub-combination of them can be given to new test takers and their ability estimated. Test security is also improved when the content of any new test is unpredictable.
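How an ability estimate can be obtained from any subset of calibrated items may be sketched as follows. This is only one possible approach, assuming the simple one-parameter (Rasch) model and a plain grid search for the most likely ability value; the function names and the difficulty values are hypothetical.

```python
import math


def p_correct(theta, difficulty):
    """Probability of a correct response under the Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_ability(responses, lo=-4.0, hi=4.0, steps=801):
    """Return the ability value on a grid from lo to hi that makes the
    observed responses most likely.

    responses: list of (item_difficulty, answered_correctly) pairs.
    Note: if every response is correct (or every one is wrong), the
    estimate simply lands on the edge of the grid.
    """
    best_theta, best_loglik = lo, -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        loglik = 0.0
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            loglik += math.log(p if correct else 1.0 - p)
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta


if __name__ == "__main__":
    # Two learners who saw different items are still placed on one scale.
    learner_a = [(0.0, True), (1.0, True), (2.0, False)]
    learner_b = [(-2.0, True), (-1.0, False), (0.0, False)]
    print(estimate_ability(learner_a), estimate_ability(learner_b))
```

Because the estimate depends only on the known item values and the response pattern, the two learners in the usage example can be compared even though they answered different items.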
However, there are also drawbacks. The most important practical one is the amount of work needed to construct an item pool for an adaptive test. The numbers of learners that are needed for piloting items properly are in the hundreds, or for some statistical models in the thousands. In other words, construction of adaptive tests is not something that every teacher can easily do.
However, if potentially useful adaptive assessment systems already exist, teachers can try them out with their learners and evaluate their usefulness. The learners are likely to find it interesting to learn how different programs score them, but they might also be asked to reflect on other aspects of the assessment process such as enjoyment and experienced usefulness for supporting learning. One system that learners might be asked to try out is DIALANG, which is going to be available through www.dialang.org in an increasing number of languages in the course of 2002. The first languages will be available as soon as the beta version is technically stable enough for public testing. More computer-mediated assessment alternatives can be found for example through Glenn Fulcher's Resources in language testing page at www.surrey.ac.uk/ELI/ltr.html under the "links" section. The pages also contain a wealth of links to other language testing resources on the Internet, for example videotaped explanations of basic concepts in language testing.
Besides Item Response Theory and its application in adaptive testing, there have also been several other technical and statistical developments in language testing in the past 30 years. For those who are interested in these aspects of language assessment, the developments are summarised succinctly in articles by Bachman and Eignor (1997) and Bachman (2000). Apart from information, the articles also provide further references on particular statistical techniques.
In the area of language testing theory, the most significant developments in recent decades have happened in the area of validation. Validity has always been one of the central criteria of measurement quality in testing, but in the first half of the 20th century, it used to be considered an unproblematic concept. The question was whether the test was doing its job, and it was answered by correlating the scores with evaluations of the test takers' performance on the real life task that the test was expected to predict. The current concept of validity, however, is much broader than that. It is about the scientific and social defensibility of the test. This sounds grand, but in terms of test developers' work, it is actually quite practical. Validity is concerned with the meaning of the test scores. The central question is "What is this test measuring?" followed by "How do you know?"
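The earlier, correlation-based approach mentioned in this paragraph can be illustrated with a short sketch. It computes a Pearson correlation between test scores and ratings of later real-life performance; the function name and all numbers are invented purely for illustration and do not come from any actual test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical data: test scores and later ratings of real-life performance.
    test_scores = [42, 55, 61, 70, 88]
    criterion_ratings = [2, 3, 3, 4, 5]
    print(round(pearson_r(test_scores, criterion_ratings), 2))
```

A high coefficient was taken as a sign that the test "did its job". As the paragraph notes, the current concept of validity asks for considerably more than this single figure.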
Validation work begins from the test developers' definition of the skill that is to be assessed in the test. Operationally, the skill is implemented in the tasks, the assessment criteria, and the actual activities that result in giving the assessment. These need to be investigated to check that the skill that is supposedly assessed in the test is also actually addressed by the tasks. But current theory emphasizes that the developers need to go further than that: they need to be able to say what the results mean.
While this is quite a challenging demand, test developers are in an ideal position to describe the skills that they are assessing, because during development, they work with the test tasks and assessment criteria for a long time. In this work, a focus on the skill being measured, which is technically known as 'the construct', means that the developers pay conscious attention to how the skill that should be assessed can best be operationalized within the practical limitations of the assessment situation. "All" they need to do to make this work part of validation is write down their ideas and the reasons for their decisions. The work then continues by empirical investigations of whether the tasks and assessment criteria actually implement these intentions. In other words, validation involves several studies and many different methodologies, all aimed at showing what the test is testing.
However, since the focus of validation is not only on what the test is testing but also on the meaning of the scores, the social dimensions of assessment and scoring also become relevant. In a deep philosophical sense, this means recognizing that assessment is a human activity: it is conducted by people on the basis of considered judgement. In a societal sense, it means recognizing that test scores often form bases for social decisions. When scores are used for giving grades or for deciding whether someone is eligible for study at a university, for example, this has consequences for the test takers, and assessment developers are partly responsible for these consequences, according to current validity theory. They share the responsibility with the score users, for example university student selection boards, but they cannot avoid it completely.
Validation work is important because it is about quality and accountability. If the self-reflection aspect of validation is taken seriously, this helps developers improve their tests. However, in its current, complex form, validity theory is facing an interesting challenge. Namely, it is not clear who needs the complex validation reports and studies that current theory seems to call for. Who, other than another tester, is going to be able to read the reports and evaluate whether they mean that the test is carefully developed and that it tests what it says it tests? The current trend in validity theory is therefore to make the concept clear and operationalizable. Good examples of this work are Carol Chapelle's (1998, 1999) articles, which provide good reading especially for those who work with more formal tests such as end-of-course examinations or graduation examinations. Other strands in the current discussion focus on the social dimension of test use, and suggest that perhaps the most legitimate use of language assessments is the assistance of further language learning (e.g. McNamara 2001).
When validation is considered a process that is undertaken one step at a time, it leads to honest attempts at trying to do a good job at assessment; in other words, accountable assessment. The argument for continuing with simple and more complex validation studies is that quality matters. Moreover, doing away with tests would not do away with the need to make social decisions. They would just be made on some other basis. The aim with accountable assessment is to provide a principled basis to inform decisions. When decisions are informed by scores from language tests, the ones with the better scores usually get the benefits. This is where validation counts. It is important to know what is different between those with lower scores and those with higher scores so that it can be checked that the decisions are made on a justifiable basis.
As mentioned at the beginning of the article, the biggest changes in language assessment in recent decades have been about the scope of the field of language testing. Earlier, it was considered to be mostly about large-scale tests with strong emphasis on statistical analysis. While that is currently one of the main areas of activity and research, another equally important strand is learning-related assessment. The formats of assessment that are relevant here include various kinds of self-evaluation and peer evaluation, portfolio assessment, learning diaries, etc. In terms of testing theory, the development has meant that language testers have had to rethink their assumptions about assessment as an activity, as well as the quality criteria that apply in different contexts.
As an activity, learning-related assessment is more complex and flexible than traditional testing. According to the needs of different learning-assessment situations, arrangements can vary about
The kinds of quality criteria that drive the choices in learning-related assessment, according to McNamara (2001), include meaningfulness in the instructional process, facilitation of learning in a multitude of different ways, enhanced quality of teaching, and minimization of administrative burden on teachers. Spelled out, these criteria are self-evident to teachers, but they contrast quite clearly with the criteria that testers or educational administrators such as ministries of education have posed for assessment.
For testing theorists, the most important quality criterion is the current, broad concept of validity. This means that the skills being assessed should be defined and that the definition should be intellectually defensible; that there should be evidence of reliability; that there should also be other records and data supporting the validity of the test, such as evidence of quality control from the test development process and analyses of student performances to show that the intended skills are actually assessed in the tasks; and that the consequences of using the assessment should be considered before the procedure is used and monitored afterwards. For administrators, assessment and evaluation provide measures of accountability. They enable ministries to see how much progress is being made in teaching and learning, and thus, what governments and states are getting for their investment in education. McNamara points out that the needs of the three groups are partially in conflict.
In the past, practitioners and theorists have avoided the conflict by not considering learning-related assessment part of the "testing" or the "administrative" world. The latter two have consciously or subconsciously tended to support each other. Now that learning-related assessment is beginning to be part of the testing/assessment world, some work is needed to make sure that the unification serves the interests of both parties. This involves both challenges and promises. On the testing side, the challenge is that theorists need to consider what research is needed to understand the principles of learning-related assessment and support its legitimacy. McNamara (2001) suggests that, for instance, we know very little about the processes of reflection and analysis that teachers and students go through when they evaluate a piece of student work, and the processes may be different in different assessment situations. On the teaching side, the challenge is that there is a need for explicit reflection on the nature of student skills at different levels of ability and on the features of performance (and possibly assessment situation) that lead to different assessment outcomes. The promise is that with the new kinds of research, reflection and analysis, new knowledge about the principles that underlie our daily action is uncovered. This will hopefully improve our work and possibly also reduce the gap between theory and practice, which seems quite considerable at times.
Reflection on assessment principles and practices helps teachers put their implicit ideas and understandings about language ability into words. This is something that teachers can do alone or in small groups. Part of the work is self-analysis, but teachers may be able to get further in their reflection and analysis if they present their thoughts to a colleague or two. Minimally, the process consists of two steps, description and analysis/reflection.
To begin the description, the teacher should choose one test or assessment procedure that he or she has used recently and ask questions about it. It is helpful to write down the answers, though draft form is quite enough; no polished formulations are needed at this stage. A basic list of questions might be:
The analysis/reflection stage works on the draft texts produced in the previous stage. The idea is to focus on the teacher's concept of language ability a) related to a particular assessment instrument, and b) in general. The steps are:
Moreover, there is a need to analyse the philosophical principles that drive the quality criteria posed for different kinds of assessments. Some theoretical work in this area has already been done, and perhaps what is needed next is practitioner analysis by teachers and testers on what this means in practical terms. A good starting point for this work is Moss's (1992) analysis of psychometric and hermeneutic assessment philosophies.
Psychometric principles are what drive traditional testing. The aim in psychometrically driven testing is to predict future performance as accurately as possible. Thus, the tester looks for evidence of behavioural consistencies to support the prediction. Consistency can only be found if one thing is tested many times, and within limited testing times, this means that each individual point that is tested is quite small. The whole, i.e. language ability, is then assessed by summing up the small components. If some components are variable or context-dependent, they might be left out of the test because they do not contribute to the discovery of consistencies. Similarly, features of performance that do not appear in every test taker's performance might be ignored because they do not offer a reliable basis for prediction. Consistency and equality are also valuable properties in test administration. A premium is placed on comparability. The tasks, performance conditions, and assessment procedures must be the same for all test takers. This guarantees fairness of measurement in psychometric assessment.
In hermeneutic assessment, the overall aim is to understand the whole (that is, the nature of an individual learner's language ability) in the light of its parts. Case by case, the interpretation and evaluation are modified until all the features of the performance are accommodated. The tasks, rather than operationalizing a theoretical notion of skill, are representative of the individual, and different individuals might perform different tasks. A premium is placed on finding the most appropriate descriptions of individuals' skills, and therefore, rational argumentation among evaluators is valued. The evaluations of those who know the examinees well, such as teachers or long-time peers, are respected in particular, and the justification for the fairness of hermeneutic assessment comes from versatility of performance data and defensibility of the final, individualised assessment.
Clearly, the quality criteria associated with the above approaches to assessment are very different. While it may not be common to find extreme applications of either philosophy in the real world of language education today, it might be useful to keep the distinctions in quality criteria in mind, particularly in order to avoid applying the criteria of one mode to examples of the other, at least without questioning why one should do so. Thus, when we use large-scale tests that aim to predict future performance, it would be prudent to make sure that these instruments are truly standardized: reliable, comparable, consistent, and impartial. This can be shown through quantitative (and some qualitative) analyses. In the light of current validity theory, the consequences of test use should also be considered. In individualised, hermeneutically motivated assessment, the most important quality criteria are transparency and justification. The examinees should know what will be evaluated, and they should be aware of the range of possible tasks that they can choose from. From the evaluation process, there should be evidence that in each case, the evaluators put effort into accommodating the assessment interpretation to the individual, and that the final assessment is rationally defensible. Because of the likely context where hermeneutically motivated assessments would be used as well as the type of assessment information to be produced, statistical analyses are hardly relevant in quality assurance. Rather, teachers might consider presenting a series of assessment cases to a colleague, or in some contexts perhaps to the examinee, to see whether their assessments are defensible.
The above is the current stage of development in the theory of philosophies of assessment. The ideas now need application into the reality of educational assessment to see whether or to what extent they are helpful, and what more is needed to understand the values that guide the newly redefined world of language assessment.
There are a large number of advances and activities going on in language assessment today. The good news for the teaching world is that quite a large part of this work is beginning to be relevant for teachers, and serious attempts are being made to bring the worlds of teaching and testing closer together. In these arenas, contributions from teachers would be very welcome.
