What do you do with a large, multivariable dataset when you have no clue how to tease out any relevant information? You could spend a lot of time testing one hypothesis after another, or you could let a computer do all that work for you, says University of Southern California psychologist and quantitative methods expert John McArdle, PhD.
In fact, using computers to uncover patterns in large datasets, a technique known as exploratory data mining, is already common in a number of fields, such as genetic research and engineering, and is now gaining traction with behavioral scientists as well.
For McArdle, data mining is one more tool psychologists can use to get the most out of their research results. "Everybody starts out with strong views about what data they should collect and who they collect it on for good reasons," he says, "but good reasons often don't work out. So the question was, is there anything more you can do with the data?"
McArdle explored this question at a July APA-sponsored Advanced Training Institute at USC in Los Angeles with 24 psychologists and grad students. He and his colleagues told the attendees to bring their datasets with them so they could work on real-life problems while they learned new techniques.
"Data mining is important in other fields, and psychologists are getting interested in it," says APA Deputy Executive for Science Howard Kurtzman, PhD, explaining why APA offered the course. "It will enable psychology to interact more strongly with other fields that already use these methods."
With data mining, researchers load their data into statistical programs, such as R (www.r-project.org) and CART (www.salford-systems.com), specify the variables and run the analysis. Within minutes, the program spits out patterns it has found among the variables. McArdle explained one example from his own research. He was looking for variables that were associated with age of onset for Alzheimer's disease, and he had a dataset made up of 75 variables for 800 people.
"Alzheimer's is a medical condition, people don't have a great idea of why this happens, and we believe it's heterogeneous; for some people, some routes to get to this end state are different than for other people," McArdle says. "This is the technique to use."
He plugged in his data and a couple of associations popped out: performance on a verbal recall test 10 years before the disease's onset, and the change in that performance over time. Basically, scoring lower than average on these tests is a good predictor for developing Alzheimer's disease within a few years. If he'd sorted through all those variables using the traditional hypothesis-driven approach, McArdle says he still probably would have found this solution, but it might have taken him weeks. "The computer is only being used to carry out solutions that the investigator would have carried out given enough time," he says.
But researchers have to be careful to make sure the relationships among the variables make sense. For instance, in the Alzheimer's study, one of the strongest predictors the computer found was whether or not the subject understood the questions being asked in the first place, which is useless as a predictor.
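To give a rough sense of what such programs do when they search for patterns, here is a minimal Python sketch of the core step in a CART-style regression tree: for each candidate variable, try every cut point, keep the split that most reduces squared error in the outcome, then rank the variables by that reduction. It is an illustration only, not McArdle's analysis and not the actual code of R or CART. The function names, the variable names (`signal`, `noise_1`, `noise_2`) and the data are all invented for the example, and the data are synthetic.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, outcome, predictor):
    """Return (threshold, error_reduction) for the best binary split.

    Candidate thresholds are the midpoints between adjacent distinct values
    of the predictor. If no split reduces the error, returns (None, 0.0).
    """
    total = sse([r[outcome] for r in rows])
    best = (None, 0.0)
    cutpoints = sorted({r[predictor] for r in rows})
    for lo, hi in zip(cutpoints, cutpoints[1:]):
        threshold = (lo + hi) / 2
        left = [r[outcome] for r in rows if r[predictor] <= threshold]
        right = [r[outcome] for r in rows if r[predictor] > threshold]
        gain = total - sse(left) - sse(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best


def rank_predictors(rows, outcome, predictors):
    """Rank predictors by the error reduction of their single best split."""
    ranked = [(name, *best_split(rows, outcome, name)) for name in predictors]
    return sorted(ranked, key=lambda item: item[2], reverse=True)


def make_synthetic_rows(n=300, seed=0):
    """Build made-up data in which only 'signal' influences 'outcome'."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        rows.append({
            "signal": signal,
            "noise_1": rng.gauss(0, 1),  # unrelated to the outcome
            "noise_2": rng.gauss(0, 1),  # unrelated to the outcome
            "outcome": 2.0 * signal + rng.gauss(0, 1),
        })
    return rows


if __name__ == "__main__":
    rows = make_synthetic_rows()
    ranking = rank_predictors(rows, "outcome", ["noise_1", "signal", "noise_2"])
    for name, threshold, gain in ranking:
        print(name, threshold, gain)
```

In this made-up data, only `signal` is built to influence the outcome, so it should come out on top. A real analysis would grow a full tree rather than a single split, check the findings on held-out data, and, as noted above, review whether each high-ranking variable makes substantive sense.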
It's precisely this automation that's made some psychologists hesitant to embrace the technology, McArdle says. If they don't understand why the algorithms are doing what they're doing, they can't trust the outcomes. Some psychologists have even warned others against using the technique, he says. That's why it's important for psychology graduate students to learn about these data-mining programs early, perhaps as part of an introductory statistics course, McArdle suggests.
"The biggest barrier was that they were told they shouldn't do this by a lot of people," he says, "and I agree they shouldn't do this first. But they should do it eventually, and they should do it before someone else does it with their data."
When the attendees at the training institute plugged in their own data and saw what happened, they became convinced, McArdle says.
"There were shrieks of joy when people found something, and they all shrieked at one point."
One of those shrieks came from Tiarney Ritchwood, a clinical psychology graduate student at the University of Alabama in Tuscaloosa. Ritchwood brought to the training session her dataset on more than 1,000 individuals with various conduct disorders. She says the exploratory techniques she learned helped her spot patterns in her data she hadn't noticed before.
"It was really helpful," Ritchwood says. And with her dissertation demanding an even larger dataset of 20,000 individuals, she says she'll use the techniques to guide her future studies.
Stacey Scott, PhD, a developmental psychologist doing postdoctoral research at Georgia Tech in Atlanta, says using data mining will expand the breadth of questions she'll be able to explore. In her longitudinal research into the psychological well-being of middle-aged and older adults, she regularly works with complex interactions among many variables. With traditional techniques, "you'd have to write this huge equation because you can't even really imagine how all these variables interact," Scott says. Using data mining at the training session, "I found some associations I hadn't really considered," she says.
While these techniques are relatively new for many psychologists, data mining is common in many other fields, McArdle says. Airplane designers use it to engineer the safest possible wing and cabin designs. Geneticists rely heavily on it because they don't have strong hypotheses about which genes produce which results; mining the genome for patterns is one of the best ways to find them.
Behavioral science is late to the game, but it has plenty of enormous datasets that psychologists could mine for patterns and relationships, McArdle says. Studies of Alzheimer's disease, depression, PTSD and other complex disorders have produced large datasets with numerous variables.
"These massive datasets have hundreds of thousands of variables on thousands of people and nobody touches them because they figure they have to go in with a very rigorous selection mechanism," McArdle says. "Maybe you have to go in with a problem first and see what emerges."
APA will offer another Advanced Training Institute on data mining in the summer of 2010. Information on this and other ATIs, which are offered every summer, is available online.