A Closer Look

The last time you took or administered a test, whether it was comps, a state licensing exam, a personality inventory or an employment aptitude test, you were likely more concerned with its results than the nuts and bolts of its creation. But for the psychologists of Div. 5 (Evaluation, Measurement and Statistics), those tests' development--and their reliability and validity--are the primary focus.

Div. 5 members' projects include developing computer programs to score essay tests, redesigning personality tests so they are harder to manipulate and studying racial differences in testing performance. The cutting-edge research of these and other psychologists will help ensure that the tests of the future are both fair and accurate.

"It's an exciting time in measurement and statistics, assessment and evaluation," says Lawrence Stricker, PhD, president of Div. 5 and senior associate in the Educational Testing Service's research and development division. "There have been revolutionary developments in the last 20 or 30 years in psychometrics and statistics. This measurement work is improving research and practice across psychology."

Computer-scored essay tests

Multiple-choice tests efficiently judge an individual's grasp of facts, but essay tests better measure how people will perform in real-life situations, says Div. 5's Mark Shermis, PhD, a leader in the development of automated essay scoring (AES).

"If we're looking at how well you persuade individuals, or how you take a position or what your line of reasoning is, then an essay can do a much better job of assessing that than filling out a bubble on a multiple-choice scan sheet," says Shermis, who is also professor and chair in the educational psychology department at the University of Florida.

AES, popularized about a decade ago, taps statistical models to score both writing ability and essay-content accuracy. Shermis develops a model by running several hundred essays through a tagging and parsing computer program that identifies variables related to constructs such as content, creativity, style, mechanics and organization. Then he regresses human raters' scores on those variables to tell the program which variables matter. Finally, he cross-validates by repeating the whole process with a second sample of about 200 essays.
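The article does not spell out the statistical details, but the process it describes — fit a regression of human scores on extracted text features, then check the fit on a held-out sample — can be sketched as follows. The feature set and all numbers here are simulated stand-ins, not Shermis's actual variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features a tagging/parsing program might extract for 300
# training essays (e.g., word count, sentence length, mechanics errors).
n_train, n_features = 300, 3
X_train = rng.normal(size=(n_train, n_features))
true_weights = np.array([1.5, 0.8, -0.5])
# Human raters' holistic scores (simulated for illustration).
y_train = X_train @ true_weights + rng.normal(scale=0.1, size=n_train)

# Fit the scoring model: regress human scores on the text features.
X_design = np.column_stack([np.ones(n_train), X_train])  # add intercept
weights, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)

# Cross-validate on a second sample of about 200 essays.
n_test = 200
X_test = rng.normal(size=(n_test, n_features))
y_test = X_test @ true_weights + rng.normal(scale=0.1, size=n_test)
y_pred = np.column_stack([np.ones(n_test), X_test]) @ weights

# Agreement between machine and human scores on the validation sample.
r = np.corrcoef(y_pred, y_test)[0, 1]
print(f"validation correlation: {r:.3f}")
```

A high correlation on the second, untouched sample is what justifies trusting the fitted weights on new essays; a model that only fits its own training essays would be useless for scoring.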

To judge content, some program developers use word nets, which group nouns, verbs, adjectives and adverbs into sets of cognitive synonyms, to link meaningfully related vocabulary, says Shermis. Another commonly used statistical technique is latent semantic analysis (LSA), which analyzes blocks of text to extract and represent the similarity of meaning among words and passages. For example, if Shermis was developing a program to score an essay question that asked about the differentiation between Freud's id, superego and ego, he might input text from an introductory psychology textbook into the computer, which would then use LSA to set up the relationship between the words and their context.
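At its core, LSA applies a truncated singular value decomposition to a term-by-passage count matrix, so that words appearing in similar contexts end up near each other in a low-dimensional "semantic" space. Here is a minimal toy sketch; the terms echo the Freud example above, but the counts are invented for illustration:

```python
import numpy as np

# Toy term-by-passage count matrix (rows: terms, columns: passages).
terms = ["id", "ego", "superego", "instinct", "conscience"]
counts = np.array([
    [2, 0, 0, 1],   # "id" co-occurs with "instinct"
    [1, 2, 1, 0],   # "ego" appears broadly
    [0, 1, 2, 0],   # "superego" co-occurs with "conscience"
    [2, 0, 0, 1],   # "instinct"
    [0, 1, 2, 0],   # "conscience"
], dtype=float)

# LSA: truncated SVD projects terms into a low-dimensional latent space.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # k-dimensional term representations

def similarity(a, b):
    """Cosine similarity between two terms in the latent space."""
    va, vb = term_vecs[terms.index(a)], term_vecs[terms.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Terms that occur in similar contexts get nearly identical vectors.
print(similarity("id", "instinct"))         # identical contexts, ~1.0
print(similarity("superego", "conscience"))
```

A real scoring program would build this matrix from a large reference text, such as the textbook Shermis describes, and compare a student's essay against the resulting semantic space rather than against exact keyword matches.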

But can computers really fairly judge writing, a fundamental component of human expression? In a chapter published in the "Handbook of Writing Research" (Guilford Press, 2005), Shermis and colleagues found that the reliability of AES evaluations equals or exceeds that of human raters. Shermis also found that computers and human raters who graded the same papers agreed on a given score 86 percent of the time, on average.

Though his research findings inspire confidence in AES, Shermis says it will take time for people to become comfortable with the new technology.

Faked personalities

Personality inventories are becoming a common part of many job applications, particularly for civil service, business and military positions, says Div. 5 member Fritz Drasgow, PhD, professor of industrial-organizational psychology at the University of Illinois at Urbana-Champaign. However, says Drasgow, personality tests are notoriously easy to fake. For example, a sample item that purports to judge conscientiousness might ask how much you agree with the statement: "I always make my deadlines."

"If you want the job, of course you're going to say you always meet your deadlines," says Drasgow.

To combat this difficulty in personality measurement, Drasgow designs computer adaptive assessments based on item-response theory (IRT), the idea that people's underlying traits or abilities influence the probability that they will answer test questions in a particular way. In the assessments Drasgow designs, which research has found to be harder to fake, test-takers choose which statement describes them better: "I always make my deadlines" or "I get along well with others."
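The logic of pairing IRT with forced-choice items can be sketched briefly. In one common simplification (in the spirit of the pairwise-preference models Drasgow and colleagues have worked on), each statement has its own item response function, and the probability of preferring one statement over the other is derived from the two functions. All parameter values below are hypothetical:

```python
import math

def prob_endorse(theta, a, b):
    """Two-parameter logistic IRT model: probability that a person with
    trait level theta endorses an item with discrimination a and
    location (difficulty) b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def prob_prefer(theta1, theta2, item1, item2):
    """Probability of picking statement 1 (tapping trait 1) over
    statement 2 (tapping trait 2), assuming exactly one is chosen."""
    p1 = prob_endorse(theta1, *item1)
    p2 = prob_endorse(theta2, *item2)
    return p1 * (1 - p2) / (p1 * (1 - p2) + p2 * (1 - p1))

# Hypothetical items matched for social desirability:
# "I always make my deadlines" (conscientiousness) vs.
# "I get along well with others" (agreeableness).
item_c = (1.2, 0.0)
item_a = (1.0, 0.0)

# A person high on conscientiousness but average on agreeableness
# should tend to pick the deadlines statement.
print(prob_prefer(1.5, 0.0, item_c, item_a))
```

Because both statements are desirable, simply "answering what the employer wants to hear" no longer works; the choice instead reveals which trait is relatively stronger, which is what makes these assessments harder to fake.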

The computer balances the social desirability of the statements so that both are generally desirable but tap different dimensions, such as agreeableness, emotional stability or openness to new experience, says Drasgow. Employers can customize their tests to identify candidates who are good matches, personality-wise, for a particular position. For example, for an accountant, attention to detail may be a more desirable trait than extraversion.

Drasgow sees the future of workplace personality testing as helping to improve the match between a person and a job. "Neither you nor the organization will be well suited to putting you in the wrong job," he says.

Racial differences in testing performances

Previous research has documented a disparity between whites and African Americans in performance on standardized tests of cognitive ability, says Div. 5 member Ann Marie Ryan, PhD, an organizational psychology professor at Michigan State University and editor of Personnel Psychology. One concern is whether these differences persist even among equally able people, she notes.

"Race and testing is such a prevalent concern of employers, who are interested in hiring a diverse work force," says Ryan.

To determine the underlying factors related to this racial discrepancy, Ryan and several of her former students, including Hannah-Hanh Nguyen, PhD, and Alex Ellis, PhD, investigate the differences between whites and African Americans in the use of test-taking strategies, as well as the effect of test-taking strategies, test-taking attitudes and stereotype threat on these groups' test performance. In research published in the Journal of Applied Social Psychology (Vol. 33, No. 1, pages 1-25), Ellis and Ryan found that African Americans use effective test-taking strategies, such as double-checking their work, at the same rate as whites. However, African Americans are also more likely to use ineffective test-taking strategies, such as the "when in doubt, choose C" method of answer selection.

"For a well-designed high-stakes test, those strategies won't help you any," says Ryan. "In fact, they can harm you because you are actually using more ineffective strategies."

In a follow-up study with Ryan in Human Performance (Vol. 16, No. 3, pages 261-293), Nguyen, now an assistant professor at California State University Long Beach, found that African Americans report more distractibility and off-task thoughts than whites, and more concern about their performance on the test and its consequences than about the task at hand.

Ryan and Nguyen are still investigating the origins of these test-taking strategy discrepancies, but they hypothesize that differences in access to test-preparation materials and the quality of preparation might contribute to them. They also examine stereotype threat as a possible factor in racial differences on test performance. Stereotype threat occurs when a negative stereotype adversely affects the performance of a person in that stereotyped group.

"These studies have implications for employers and anyone interested in recruiting ethnic-minority workers," adds Nguyen. "To improve their performance, it could be as easy as a quick review of test-taking strategies or a reminder to concentrate."


Further Reading

Each issue, the Monitor is highlighting the work of an APA division that has completed the five-year review process, which is conducted by the Committee on Division/APA Relations. 

Div. 5 at a glance

Div. 5 (Evaluation, Measurement and Statistics) promotes research and practical applications in several related fields: psychological assessment, evaluation, measurement and statistics. The division offers a Distinguished Dissertation Award, the Samuel J. Messick Award for Distinguished Scientific Research and the Jacob Cohen Award for Distinguished Contributions to Teaching and Mentoring. Members receive the division newsletter, The Score, and two quarterly journals, Psychological Assessment and Psychological Methods.

To join, download a membership application from www.apa.org/divisions/div5/Membership.html and mail it to the Div. 5 Administrative Office at the APA address.