Testing, assessment and evaluation

Gentle warning: this is a complex area which is littered (some might say infested) with terminology.  You may like to take it a bit at a time.


The changing face of testing: a little history

Over the years, what we test and how we test it in our profession have seen great changes.  Here are three citations that show how.

In what follows, it is not assumed that it is always communicative ability which we want to test but that's usually the case and definitely the way to bet.


Defining terms

Why the triple title?  Why testing and assessment and evaluation?  Well, the terms are different and they mean different things to different people.

If you ask Google to define 'assess', it returns "evaluate or estimate the nature, ability, or quality of".
If you then ask it to define 'evaluate', it returns "form an idea of the amount, number, or value of; assess".
The meaning of the verbs is, therefore, pretty much the same but they are used in English Language Teaching in subtly different ways.

When we are talking about giving people a test and recording scores etc., we would normally refer to this as an assessment procedure.
If, on the other hand, we are talking about looking back over a course or a lesson and deciding what went well, what was learnt and how people responded, we would prefer the term 'evaluate' as it seems to describe a wider variety of data input (testing, but also talking to people and recording impressions and so on).  Evaluation doesn't have to be very elaborate.  The term could be used to describe anything from nodding to accept an answer in class to formal examinations set by international testing bodies but, at that end of the cline, we are more likely to talk about assessment and examining.

Another difference in use is that when we measure success for ourselves (as in teaching a lesson) we are conducting evaluation; when someone else does it, it's called assessment.

In what follows, therefore, the terms are used to mean much the same thing but we will choose whichever term is appropriate to what we are discussing.

How about 'testing'?  In this guide 'testing' is seen as a form of assessment but, as we shall see, testing comes in all shapes and sizes.  Look at it this way:

As you see, testing sits uncomfortably between evaluation and assessment.  If testing is informal and classroom based, it forms part of evaluation.  A bi-weekly progress test is part of evaluation although learners may see it as assessment.  When testing is formal and externally administered, it's usually called examining.
Testing can be anything in between.  For example, an institution's end-of-course test is formal testing (not examining) and a concept-check question to see if a learner has grasped a point is informal testing and part of evaluating the learning process in a lesson.


Why evaluate, assess or test?

It's not enough to be clear about what you want people to learn and to design a teaching programme to achieve the objectives.  We must also have some way of knowing whether the objectives have been achieved.
That's called testing.

If you can't measure it, you can't improve it
Peter Drucker


Types of evaluation, assessment and testing

We need to get this clear before we can look at the area in any detail.

Initial vs. Formative vs. Summative evaluation
Initial testing is often one of two things in ELT: a diagnostic test to help formulate a syllabus and course plan or a placement test to put learners into the right class for their level.
Formative testing is used to enhance and adapt the learning programme.  Such tests help both teachers and learners to see what has been learned and how well, and help to set targets.  It has been called educational testing.  Formative evaluation may refer to adjusting the programme or helping people see where they are.  In other words, it may be targeted at teaching or learning (or both).
Summative tests, on the other hand, seek to measure how well a set of learning objectives has been achieved at the end of a period of instruction.
Robert Stake describes the difference this way: "When the cook tastes the soup, that's formative.  When the guests taste the soup, that's summative." (cited in Scriven 1991:169).
Informal vs. Formal evaluation
Formal evaluation usually implies some kind of written document (although it may be an oral test) and some kind of scoring system.  It could be a written test, an interview, an on-line test, a piece of homework or a number of other things.
Informal evaluation may include some kind of document but there's unlikely to be a scoring system as such and evaluation might include, for example, simply observing the learner(s), listening to them and responding, giving them checklists, peer- and self-evaluation and a number of other procedures.
Objective vs. Subjective assessment
Objective assessment (or, more usually, testing) is characterised by tasks in which there is only one right answer.  It may be a multiple-choice test, a True/False test or any other kind of test where the result can readily be seen and is not subject to the marker's judgement.
Subjective tests are those in which questions are open ended and the marker's judgement is important.
Of course, there are various levels of test on the subjective-objective scale.
Criterion-referencing vs. Norm-referencing in tests
Criterion-referenced tests are those in which the result is measured against a scale (e.g., by grades from A to E or by a score out of 100).  The object is to judge how well someone did against a set of objective criteria independently of any other factors.  A good example is a driving test.
Norm-referencing is a way of measuring students against each other.  For example, if 10% of a class are going to enter the next class up, a norm-referenced test will not judge how well they achieved a task in a test but how well they did against the other students in the group.  Some universities apply norm-referenced tests to select undergraduates.
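
To make the difference concrete, here is a minimal sketch in Python.  Everything in it is invented for illustration: the learners' names, the scores, the pass mark of 70 and the function names are assumptions, not part of any real test.  The first function judges each learner against a fixed criterion; the second ranks learners against each other and takes the top 10%, whatever their scores happen to be.

```python
def criterion_referenced(scores, pass_mark):
    """Return the learners whose score meets a fixed criterion."""
    return sorted(name for name, score in scores.items() if score >= pass_mark)


def norm_referenced(scores, proportion):
    """Return the top `proportion` of learners, ranked against each other."""
    places = max(1, round(len(scores) * proportion))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:places])


# Invented scores out of 100 for a class of ten learners.
scores = {
    "Ana": 82, "Ben": 74, "Chen": 91, "Dita": 68, "Emre": 77,
    "Fay": 59, "Gus": 88, "Hana": 71, "Ivan": 64, "Juno": 79,
}

passed = criterion_referenced(scores, pass_mark=70)    # everyone at or above 70
promoted = norm_referenced(scores, proportion=0.10)    # only the top 10% of the group

print("Met the criterion:", passed)
print("Top 10% of the group:", promoted)
```

With these figures, seven learners meet the criterion but only one goes up a class.  The sketch ignores tied scores, which a real norm-referenced procedure would need a rule for.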


Testing – what makes a good test?

You teach a child to read, and he or her will be able to pass a literacy test.
George W. Bush

The first thing to get clear is the distinction between testing and examining.

One more term (sorry):
The term 'backwash' or, sometimes, 'washback', is used to describe the effect on teaching that knowledge of the format of a test or examination has.  For example, if we are preparing people for a particular style of examination, some (perhaps nearly all) of the teaching will be focused on training learners to perform well in that test format.


Types of tests

There are lots of these but the major categories are:

| Test types | What the tests are intended to do | Example |
| --- | --- | --- |
| aptitude tests | test a learner’s general ability to learn a language rather than the ability to use a particular language | The Modern Language Aptitude Test (US Army) and its successors |
| achievement tests | measure students' performance at the end of a period of study to evaluate the effectiveness of the programme | an end-of-course or end-of-week etc. test (even a mid-lesson test) |
| diagnostic tests | discover learners' strengths and weaknesses for planning purposes | a test set early in a programme to plan the syllabus |
| proficiency tests | test a learner’s ability in the language regardless of any course they may have taken | public examinations such as FCE etc. but also placement tests |
| barrier tests | a special type of test designed to discover if someone is ready to take a course | a pre-course test which assesses the learner's current level with respect to the intended course content |

As far as day-to-day classroom use is concerned, teachers are mostly involved in writing and administering achievement tests as a way of telling them and the learners how successfully what has been taught has been learned.


Types of test items

Here, again, are some definitions of the terminology you need to think or write about testing.

alternate response
This sort of item is probably most familiar to language teachers as a True / False test.  (Technically, only two possibilities are allowed.  If you have a True / False / Don't know test, then it's really a multiple-choice test.)
multiple-choice
This is sometimes called a fixed-response test.  Typically, the correct answer must be chosen from three or four alternatives.  The 'wrong' items are called the distractors.
structured response
In tests of this sort, the subject is given a structure in which to form the answer.  Sentence completion items of the sort which require the subject to expand a sentence such as "He / come / my house / yesterday / 9 o'clock" into "He came to my house at 9 o'clock yesterday" are tests of this sort, as are writing tests in which the test-taker is constrained to include a list of items in the response.
free response
In these tests, no guidance is given other than the rubric and the subjects are free to write or say what they like.  A hybrid form of this and a structured response item is one where the subject is given a list of things to include in the response but that is usually called a structured response test, especially when the list of things to include covers most of the writing and little is left to the test-taker's imagination.

Ways of testing and marking

Just as there are ways to design test items and purposes for testing (see above), there are ways to test in general.  Here are the most important ones.

| Methodology | Description | Example | Comments |
| --- | --- | --- | --- |
| direct testing | testing a particular skill by getting the student to perform that skill | testing whether someone can write a discursive essay by asking them to write one | The argument is that this kind of test is more reliable because it tests the outcomes, not just the individual skills and knowledge that the test-taker needs to deploy |
| indirect testing | trying to test the abilities which underlie the skills we are interested in | testing whether someone can write a discursive essay by testing their ability to use contrastive markers, modality, hedging etc. | Although this kind of test is less reliable in testing whether the individual skills can be combined, it is easier to mark objectively |
| discrete-point testing | a test format with many items requiring short answers which each target a defined area | placement tests are usually of this sort with multiple-choice items focused on vocabulary, grammar, functional language etc. | These sorts of tests can be very objectively marked and need no judgement on the part of the markers |
| integrative testing | combining many language elements to do the task | public examinations contain a good deal of this sort of testing with marks awarded for various elements: accuracy, range, communicative success etc. | Although the task is integrative, the marking scheme is designed to make the marking non-judgemental by breaking down the assessment into discrete parts |
| subjective marking | the marks awarded depend on someone’s opinion or judgement | marking an essay on the basis of how well you think it achieved the task | Subjective marking has the great disadvantage of requiring markers to be very carefully monitored and standardised to ensure that they all apply the same strictness of judgement consistently |
| objective marking | marking where only one answer is possible – right or wrong | machine marking a multiple-choice test completed by filling in a machine-readable mark sheet | This obviously makes the marking very reliable but it is not always easy to break language knowledge and skills down into digital, right-wrong elements |
| analytic marking | the separate marking of the constituent parts that make up the overall performance | breaking down a task into parts and marking each bit separately (see integrative testing, above) | This is very similar to integrative testing but care has to be taken to ensure that the breakdown is really into equivalent and usefully targeted areas |
| holistic marking | different activities are included in the overall description to produce a multi-activity scale | marking an essay on the basis of how well it achieves its aims (see subjective marking, above) | The term holistic refers to seeing the whole picture and such test marking means that it has the same drawbacks as subjective marking, requiring monitoring and standardisation of markers |

Naturally, these types of testing and marking can be combined in any assessment procedure and often are.
For example, a piece of writing in answer to a structured response test item can be marked by awarding points for mentioning each required element (objective) and then given more points for overall effect on the reader (subjective).
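
A combined scheme of that sort might be sketched like this in Python.  The task, the list of required elements, the five-point impression scale and the function name are all invented for the illustration, and the objective part uses a crude substring match which a human marker would apply far more flexibly.

```python
def mark_structured_response(text, required_elements, impression_mark, max_impression=5):
    """Combine an objective count of required elements with a marker's impression mark."""
    if not 0 <= impression_mark <= max_impression:
        raise ValueError("impression mark is outside the agreed scale")
    lowered = text.lower()
    # Objective part: one point for each required element that is mentioned.
    objective = sum(1 for element in required_elements if element.lower() in lowered)
    # Subjective part: the marker's judgement of the overall effect on the reader.
    return {
        "objective": objective,
        "subjective": impression_mark,
        "total": objective + impression_mark,
    }


# An invented answer to an invented hotel-booking task.
answer = "I would like to book a double room for two nights and pay by card."
required = ["double room", "two nights", "breakfast", "pay by card"]

result = mark_structured_response(answer, required, impression_mark=4)
print(result)
```

Here the writer mentions three of the four required elements, so the objective part of the mark is fixed whoever does the marking; only the impression mark depends on the marker's judgement.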


Three fundamental concepts:
reliability, validity and practicality

  1. Reliability
    This refers, oddly, to how reliable the test is.  It answers this question:
    Would a candidate get the same result whether they took the test in London or Kuala Lumpur or if they took it on Monday or Tuesday?
    This is sometimes referred to as the test-retest test.  A reliable test is one which will produce the same result if it is administered again.  Statisticians reading this will immediately understand that it is the correlation between the two test results that measures reliability.
  2. Validity
    Two questions here:
    1. Does the test measure what we say it measures?
      For example, if we set out to test someone's ability to participate in informal spoken transactions, do the test items we use actually test that ability or something else?
    2. Does the test contain a relevant and representative sample of what it is testing?
      For example, if we are testing someone's ability to write a formal email, are we getting them to deploy the sorts of language they actually need to do that?
  3. Practicality
    Is the test deliverable in practice?  Does it take hours to do and hours to mark or is it quite reasonable in this regard?
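
For readers who are not statisticians, the test-retest idea under Reliability can be sketched in a few lines of Python.  The two sets of scores are invented, and the sketch assumes that the same six candidates took the same test on two occasions; the function simply computes the Pearson correlation between the two sittings, which is one common way (not the only one) of estimating reliability.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return covariance / sqrt(spread_x * spread_y)


# Invented scores for the same six candidates on two sittings of one test.
first_sitting = [52, 61, 70, 45, 88, 76]
second_sitting = [55, 59, 73, 48, 85, 79]

reliability = pearson(first_sitting, second_sitting)
print(f"test-retest correlation: {reliability:.2f}")
```

A figure close to 1 suggests that the test ranks candidates in much the same way on both occasions; a figure close to 0 suggests that the results owe more to chance than to ability.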

For examining bodies, the most important criteria are practicality and reliability.  They want their examinations to be trustworthy and easy (and cheap) to administer and mark.
For classroom test makers, the overriding criterion is validity.  We want a test to test what we think it tests and we aren't interested in getting people to do it twice or making it (very) easy to mark.

So:

  1. How can we make a test reliable?
  2. How can we make a test valid?

Reliability

If you have been asked to write a placement test or an end-of-course test that will be used again and again, you need to consider reliability very carefully.  There's no use having, e.g., an end-of-course test which produces wildly different results every time you administer it and if a placement test did that, most of your learners would end up in the wrong class.
To make a test more reliable, we need to consider two things:

  1. Make the candidates’ performance as consistent as possible.
  2. Make the scoring as consistent as possible.

How would you do this?  Think for a minute.


Validity

If you are writing a test for your own class or an individual learner or group of students for whom you want to plan a course, or see how a course is going, then validity is most important for you.  You will only be running the test once and it isn't important that the results are correlated to other tests.  All you want to ensure is that the test is testing what you think it's testing so the results will be meaningful.
There are five different sorts of validity to consider: face, content, predictive, concurrent and construct validity.

To explain:

Face validity
Students won't perform at their best in a test which they don't trust to assess properly what they can do.  For example, a quick chat in a corridor may tell you lots about a learner's communicative ability but the learner won't feel he/she has been fairly assessed (or assessed at all).
The environment matters, too.  Most learners expect a test to be quite a formal event held in silence with no cooperation between test-takers.  If the test is not conducted in this way, some learners may not take it as seriously as others and perform less well than they are able to in other environments.
Content validity
If you are planning a course to prepare students for a particular examination, for example, you want your test to represent the sorts of things you need to teach to help them succeed.
A test which is intended to measure achievement at the end of a course also needs to contain that which has been taught only and not have any extraneous material which has not been the focus of teaching.
Coverage plays a role here, too, because the more that has been taught, the longer and more comprehensive the test has to be.
Predictive validity
Equally, your test should tell you how well your learners will perform in the tasks you set and the lessons you design to help them prepare for the examination.
For example, if you want to construct a barrier test to see if people are able successfully to follow a course leading to an examination, you will want the test to have good predictive validity.  This is not easy to achieve because, until at least one cohort of learners have taken the examination, you cannot know how well the barrier test has worked.  Worse, you need to administer the barrier test to a wide range of learners and compare the results of the test with the examination results they actually achieved.  This will mean that the barrier test cannot be used to screen out learners until it has been shown to have good predictive validity so the endeavour may take months to come to fruition.
Concurrent validity
If, for example, you have a well established proficiency test, such as one administered by experienced examination boards, you may feel that you would be better served with a shorter test that gave you the same sort of data.
To establish concurrent validity, you need to administer both tests to as large a group as possible and then carefully compare the results.  Parallel results are a sign of good concurrent validity and you may be able to dispense with the longer test altogether.
This may be less important to you but if your test predicts well how learners perform in the examination proper, it will tell you more than if it doesn't.
Construct validity
A construct is something that happens in your brain and is not, here, to do with constructing a test.
To have high construct validity a test-maker must be able succinctly and consistently to answer the question:
    What exactly are you testing?
If you cannot closely and accurately describe what you are testing, you will not be able to construct a good test.
It is not enough to answer with something like:
    I am testing writing ability.
because that begs more questions:
    At what level?
    Concerning what topics?
    For which audiences?
    In what style?
    In what register?
    In what length of text?

and so on.
The better able the test designer is to pre-empt those questions by having well-thought-through answers to hand, the higher the level of construct validity the test will have.

Fresh starts

This gets a section to itself, not because it is at the same level of importance but because it affects both reliability and validity.

If test items are cumulative, the test-taker's performance in Task X will depend on how well they achieved Task X-1.  In other words, for example, a test which requires a learner to give answers showing comprehension of a reading or listening text and then uses those texts again to test discrete vocabulary items will not be:

  1. Very reliable because the test-taker may have got lucky and encountered a text with which they were particularly familiar or which happened to contain lexis they knew (among lots they did not know).
  2. Very valid because the response to the second task will depend on the response to the first task so we do not know if we are measuring the ability we want to test or the ability we have already tested.

For these reasons, good tests are usually designed so that each item constitutes a fresh start and no test-taker is advantaged by happening to have got lucky with one task that targets something they are particularly good at.

Discrimination

In the world of testing, discrimination is not always a bad thing.
Here, it refers to the ability which a test has to distinguish clearly and quite finely between different levels of learner.

If a test is too simple, most of the learners in a group will get most of it right which is good for boosting morale but poor if you want to know who is best and worst at certain tasks.
Equally, if a test is too difficult, most of the tasks will be poorly achieved and your ability to discriminate between the learners' abilities in any area will be compromised.

Ideally, all tests should include tasks which will only be fully achieved by the best in a group and allow you to see from the results who they are.  The item you include to do this will be called a discriminator.

Overall, the test has to be targeted at the level of the cohort of students for which it is intended, allowing no items which are too easy and none which are undoable by the majority.
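
One way to check whether items are doing this job is a simple item analysis, sketched below in Python.  The terms 'facility' (the proportion of test-takers who get an item right) and 'discrimination index' (the difference in facility between the strongest and weakest thirds of the group) come from standard item analysis and are not used elsewhere in this guide; the nine learners, their total scores and their responses to two items are invented.

```python
def item_facility(responses):
    """Proportion of test-takers who got the item right (1 = right, 0 = wrong)."""
    return sum(responses) / len(responses)


def discrimination_index(item_responses, total_scores):
    """Facility in the top third of the group minus facility in the bottom third."""
    group_size = max(1, len(total_scores) // 3)
    # Order the test-takers from highest to lowest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    top = [item_responses[i] for i in order[:group_size]]
    bottom = [item_responses[i] for i in order[-group_size:]]
    return item_facility(top) - item_facility(bottom)


# Invented data: total scores for nine learners and their results on two items.
total_scores = [38, 35, 33, 30, 27, 25, 22, 19, 15]
easy_item = [1, 1, 1, 1, 1, 1, 1, 1, 0]
discriminator = [1, 1, 1, 0, 0, 0, 0, 0, 0]

for name, item in [("easy item", easy_item), ("discriminator", discriminator)]:
    print(name, round(item_facility(item), 2),
          round(discrimination_index(item, total_scores), 2))
```

On these figures the easy item is answered correctly by nearly everyone and separates the strongest and weakest learners only weakly, whereas the discriminator is achieved only by the top three and separates the two groups completely.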


Finally, having considered all this, you need to construct your test.  How would you go about that?
Think for a moment and make a few notes.

Easy.



Related guides
assessing Listening Skills, assessing Reading Skills, assessing Speaking Skills, assessing Writing Skills: these guides assume an understanding of the principles and focus on skills testing
testing terminology: for a list of the most common terms used in this area and a link to a test for you
placement testing: this is a guide in the Academic Management section concerned with how to place learners in appropriate groups and contains a link to an example 100-item placement test
Bloom's taxonomy: this is a way of classifying the cognitive demands that types of test items place on learners


Of course, there's a test on all of this: some informal, summative evaluation for you.


Cambridge Delta

If you are preparing for Delta Module One, part of the free course for that contains a guide to applying the concepts to the examination question (Paper 2, Task 1).
If you are preparing for Delta Module Three, there's a guide to how to apply all this.


References:
Alderson, J. C & Hughes, A (Eds.), 1981, Issues in Language Testing, ELT Documents 111, British Council, available from http://wp.lancs.ac.uk/ltrg/files/2014/05/ILT1981_CommunicativeLanguageTesting.pdf [accessed October 2014]
Corder, S. P, 1973, Introducing Applied Linguistics, London: Penguin
Hughes, A, 1989, Testing for Language Teachers, Cambridge: Cambridge University Press
Oxford Dictionaries https://languages.oup.com/
Scriven, M, 1991, Evaluation thesaurus, 4th edition, Newbury Park, CA: Sage Publications
General references for testing and assessment.  You may find some of the following useful.  The text (above) by Hughes is particularly clear and accessible:
Alderson, J. C, 2000, Assessing Reading, Cambridge: Cambridge University Press
Carr, N, 2011, Designing and Analyzing Language Tests: A Hands-on Introduction to Language Testing Theory and Practice, Oxford: Oxford University Press
Douglas, D, 2000, Assessing Languages for Specific Purposes, Cambridge: Cambridge University Press
Fulcher, G, 2010, Practical Language Testing, London: Hodder Education
Harris, M & McCann, P, 1994, Assessment, London: Macmillan Heinemann
Heaton, J. B, 1990, Classroom Testing, Harlow: Longman
Martyniuk, W, 2010, Aligning Tests with the CEFR, Cambridge: Cambridge University Press
McNamara, T, 2000, Language Testing, Oxford: Oxford University Press

Rea-Dickins, P & Germaine, K, 1992, Evaluation, Oxford: Oxford University Press
Underhill, N, 1987, Testing Spoken Language: A Handbook of Oral Testing Techniques, Cambridge: Cambridge University Press