Wednesday, October 26, 2011

The pitfalls and plusses of multiple choice tests

Perhaps no test is regarded with greater suspicion than the multiple choice test. It measures trivial, disembodied facts and passive knowledge in a format that never arises in real life; it can be gamed; it's riddled with trick questions. These criticisms are justified, but not because multiple choice tests are inherently flawed. It's just that so many of the ones one comes across are bad.

They're bad because it's really hard to design good ones. Their format--passive selection of one of several short answers--makes it easy for test designers to fall into one of several traps. Most tempting is the "match the definition to the label" question, which often confuses labels with concepts and ends up testing trivial knowledge. Nearly as tempting is the "gotcha" question--often the easiest way to make a test difficult enough that scores fall into a convenient bell curve. Gotcha questions fall into one of two subtypes. The trick question is phrased in such a way that, unless you read it really carefully, it lures you towards a wrong answer; the "did you do the reading" question asks a highly specific, often trivial question that you'll only be able to answer if you did all the reading carefully and remember it in detail. Balancing out the gotcha questions are ones that make the test easier than it should be: testers often accidentally include questions whose answers are obvious even to outsiders, or sets of choices that collectively signal, to those skilled at gaming tests, which one is correct.

Consider the following questions I've culled from two different online introductory psychology quizzes. First we have two in the "match the definition to the label" category, both of which also tap common knowledge of common terminology:

The debate among psychologists regarding the relative contributions of environment and heredity to the developmental process is called
A) the critical period
B) the nature-nurture controversy
C) the stage controversy
D) behaviorism

The "two-way street" concept in childrearing suggests that
A) both mothers and fathers need to accept responsibility for childrearing
B) parents need to be consistent in their childrearing approaches with all their children
C) children act as important influences on their siblings
D) children's behavior affects their parents' behavior just as parents' behavior affects their children's behavior

This next one is similar. Its answer doesn't involve a common term like "nature-nurture" or "two-way street," but is still inferrable from the standard definitions of the various words that follow "self-":

A teacher asks students to keep a record of how much time they spend doing homework daily, and they find that their study time increases. What is this procedure called?
a) Self-assessment
b) Self-monitoring
c) Self-enhancement
d) Self-reinforcement

Yet another "match the definition to the label" question, rather than being obvious, is out to trick you:

The use of technology to present material that progresses in small steps toward a well-defined final goal and is sequenced so that students can answer correctly the majority of the time is called
a) applied behavior analysis.
b) computer-tailored instruction.
c) drill and practice.
d) programmed instruction.

If you didn't memorize the course terminology, the word "technology" might lead you to "computer-tailored instruction"; alternatively, "small steps" and "well-defined goal" might lead you to "applied behavior analysis."

Then we have a did-you-do-the-reading question which simultaneously manages to be gameable. Did you ever notice how often the "all of the above," "none of the above," and "some of the above" answers are the correct ones?

With regard to variation in development, the text asserts that
A) different children develop at different rates
B) children vary in their own rate of development from one period to the next
C) little variation exists between children beyond the age of seven
D) a and b above

Here are some other examples of this (I found no counterexamples in these tests):

In which of the following areas do adolescents have more challenges when compared with younger and older individuals?
A) parent-child conflicts
B) mood changes
C) risky behavior
D) all of the above

According to research comparing children in day-care centers versus children raised by mothers in their own homes, the biggest differences were found in the children's
A) physical health
B) intellectual development
C) attachment
D) none of the above

Returning to trick questions, another one uses "identical twins" to lure you towards "genetics." While even an outsider can rule out "imprinting," only if you remember the particular study on toilet training alluded to here will you know the correct answer:

Research on toilet training conducted with identical twins illustrates the importance of which developmental factor?
A) maturation
B) imprinting
C) nurture
D) genetics

Perhaps the biggest problem with multiple choice questions is that they so often tap superficial knowledge of labels rather than deep understanding of concepts. We've already seen four examples of this; here are two more:

The recognition that the volume of water remains the same whether it is in a short, wide beaker, or a long, narrow beaker is called
A) reversibility
B) conservation
C) decentering
D) formal operations

According to Kohlberg, at what level of moral development would a child most likely be concerned about pleasing his parents and teachers?
A) the preconventional level
B) the premoral level
C) the conventional level
D) the principled level

My reaction to these questions is: who cares what the recognition about water volume is called, and who cares what Kohlberg called the parent-pleasing level of moral development? Aren't there more interesting, understanding-tapping questions that one could ask about these issues? For example, couldn't one ask which developmental milestone co-occurs with, or might account for, the recognition about water volume?

In fact, there are good multiple choice questions, even within the two psychology tests I'm discussing here:

Which of the following is not developed during the infancy period?
A) object permanence
B) telegraphic speech
C) separation anxiety
D) transductive reasoning

Which of the following describes the correct developmental sequence of play?
A) parallel play, solitary play, cooperative play
B) solitary play, cooperative play, parallel play
C) solitary play, parallel play, cooperative play
D) cooperative play, solitary play, parallel play

Which of the following cognitive abilities improves throughout adulthood?
A) reasoning about everyday problems
B) knowledge of facts and word meanings
C) abstract problem solving and divergent thinking
D) general recall

Compared to individuals in their 20s, individuals in their 70s showed declines in
A) knowledge of word meanings
B) understanding mathematical concepts
C) solving life problems
D) fluid intelligence

Adult personalities are likely to change in each of the following areas except
A) enjoyment of being with other people
B) becoming more dependable
C) becoming more candid
D) becoming more accepting of life's hardships

It's just that--as I know from personal experience--it can take tremendous time and effort to ensure that a multiple choice question requires no more and no less than an accurate comprehension of meaningful, course-specific concepts.

Why exert the effort? Because if one has hundreds of students, multiple choice tests save much more time than they take to devise, freeing up precious hours for things other than assessment. And because there are times when the objectivity and standardizability of multiple choice tests make them far preferable to the more subjective (if more "authentic") alternatives (cf. the recent discussion on kitchentablemath on the essay section of the SAT Writing test).

Designed well, multiple choice tests can measure exactly what they're supposed to. If they didn't, few people would take the PSAT, SAT, and AP exams seriously, let alone treat them--as so many people and institutions so often do--as meaningful measures of aptitude or achievement.

1 comment:

C T said...

Sometimes in college, I was lucky enough to have courses with great multiple choice exams. Not only did I feel like they were fairly assessing me, but I learned while taking the tests and felt stretched intellectually. That some tests are written poorly is no reason to demonize them as some people do; just write better ones.