VOLUME 30, NUMBER 11 December 1999
SCIENCE
Changes will improve quality of tests
New testing standards will encourage test users and developers to re-evaluate how they administer and design tests.
By Beth Azar
The latest revision of "Standards for Educational and Psychological Testing"--the "Rosetta stone" of testing development and administration--calls on test developers and administrators to support with strong evidence every claim they make about a test. And while some of the new requirements may be more burdensome than the old standards, they should improve the quality of tests and their use, say those familiar with the document released last month.
"Most of the new standards, and many of the changes made to the old ones, support credible test use and development," says psychologist Wayne Camara, PhD, an APA council representative of Div. 14 (Society for Industrial and Organizational Psychology), which voted to approve the standards during the August council meeting. "The document should help improve testing."
The standards, published jointly since 1954 by APA, the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME), provide guidelines based on the latest research for constructing, validating and administering tests. Conducted by a committee jointly appointed by the three organizations, the revision is intended to clarify the process test developers should use to validate their creations.
The standards are presented in a new format designed to encourage test developers and users to better scrutinize each standard and evaluate whether it might apply to them. Those changes should increase the accountability of test developers and administrators and encourage them to adhere more closely to those standards that apply to them, says Paul Sackett, PhD, of the University of Minnesota, who served as co-chair of the revision committee.
But some people will find the changes to the standards inconvenient because they force them to re-evaluate how they develop and administer tests, say people who have seen the final version.
"It will be an extra burden initially," says Camara, director of the Office of Research and Development at the College Board. "We'll need to read each and every standard, and document how we met one, or why we didn't meet another."
But, in general, the revised set of standards is an improvement over the last set, published in 1985, because it's more comprehensive and better represents the state of science in testing, says Camara.
Every standard counts
One of the more controversial changes to the standards document is the removal of labels ranking each standard "primary," "secondary" or "conditional." The original intent of the labels was to distinguish between standards that were mandatory for everyone--primary--and those only needed in particular situations--secondary and conditional.
Many people who design and administer tests criticize the change, arguing that it will force them to document their reasons for not adhering to any one standard. But too often, says Sackett, people "skipped the secondary standards altogether and skimmed the conditional ones." And that wasn't the intent.
The new format encourages people to carefully read each standard and evaluate whether it applies to their situation. For example, a test developer may read the standard that suggests considering test users' cultural backgrounds when developing test content and decide the standard doesn't apply to a test of simple numerical computation.
Some welcome the removal of the labels, saying it reinforces the importance of reviewing all the standards. Others believe it will make life more difficult. Several industrial and organizational psychologists who develop and administer tests, for example, commented that no longer ranking the standards could hurt them in court when someone claims a test's outcome is invalid: Under the new standards, the burden of proof will be on them to explain why they didn't comply with all standards.
"An advocate challenging your assessment system is going to raise every plausible argument that he can to raise doubt about your credibility," says industrial psychologist Jim Sharf, PhD, a consultant on developing, implementing and defending selection and appraisal systems.
"Sure, the standards says 'use your professional judgment,'" he adds. "But we know we're going to have [every standard] shoved at us in an adversarial contest. 'Did you adhere to standard 5.3? No? Why not?' And all we have to come back with is 'It was my judgment.' They've made our role more complicated by removing some tentative indication of which standards are of primary, secondary and conditional interest."
Sackett understands the concerns, but, he says, the revision committee worked hard to word the standards to make it clear to everyone--including courts of law--that the document is not meant as a checklist to be followed in a rigid fashion.
"The absence of designations such as 'primary' or 'conditional' should not be taken to imply that all standards are equally significant in any given situation," the standards state. "Depending on the context and purpose of test development or use, some standards will be more salient than others."
Redefining validity
Along with the change in how the standards are labeled, the section of the document on how to measure whether a test is "valid" has some people concerned. The drafting committee solidified a trend, beginning with the 1985 standards, to redefine validity. Traditionally people have spoken of three types of validity:
In the new standards, these measures are no longer considered "types of validity," but "types of evidence" for validating a test. Validity itself, says the new version, is the degree to which the accumulated evidence supports the specific interpretations that test developers, or users, claim they can make using a test's score.
For example, if a test developer says a test accurately measures a student's understanding of high-order math concepts, the developer should have evidence from well-done research studies to prove that. And if a test administrator says the test he or she is using will predict job performance, there should be studies to back that up.
"Test developers and users need to start with the question: What are the inferences we wish to draw from the test score?" says Sackett. "Then they accumulate evidence to support that."
In some sense, this change makes life easier for testing professionals, says Sackett: There are no "valid" tests, only valid uses of tests, so testing professionals need evidence to support only the specific inference they wish to make with a test.
Some testing researchers and administrators worry that the shift will mean they will need to revalidate old tests. But that's not the intention, says Sackett. As long as the evidence supports the inference that test developers and test administrators are making in creating a test, the test will still be valid for that use.
"There won't be much difference in how studies are done," adds standards committee member Ed Haertel, PhD, of Stanford University's department of education. "And there's tremendous latitude given to professional judgment. We're not asking people [who use the standards] to do something if they think it's unreasonable, as long as they explain why."