This page has been archived and is no longer being updated regularly.

Feature

Working with big data

Psychologists are embracing the use of "big data" in their research: datasets with thousands, even millions, of subjects that span decades and lifetimes and can capture rare events. Although some psychologists may be wary about not collecting their own data, they are finding that the massive sizes of these datasets enable them to probe different types of questions, as well as to save time and money.

"When you have something that is a relatively rare outcome, like suicide or even suicide attempts, to do the research in a traditional way, gathering assessments, following people over time, and so forth, is very, very costly and very difficult to do," says Columbia University psychologist Barbara Stanley, PhD, who is using records from the U.S. Department of Veterans Affairs (VA) to study suicidal behavior.

Salene Jones, PhD, a postdoctoral researcher at Group Health Institute in Seattle, is using data from the Women's Health Initiative. Spanning more than 20 years, the initiative followed more than 160,000 postmenopausal women and primarily analyzed the effects of hormone therapy, diet modifications, and calcium and vitamin-D supplementation on the rates of certain cancers, heart disease and osteoporosis. The project also recorded data on a number of other measures; Jones is probing the dataset to examine depression before and after cancer. Most cancer studies only have information on patients after their diagnosis. This large initiative identified women who developed cancer during the study, allowing Jones to track changes in their depression levels. To acquire the same data in the traditional way, Jones would need to evaluate a large number of people for depression, wait years to see if any developed cancer, and then re-evaluate them. Using the Women's Health Initiative is saving her significant time and money.

But even with a lot of measures, there is a limit to how much big data can address. For example, another of Jones's interests, worry, isn't examined in the dataset, so for that she'll perform a more classical experiment.

Many datasets collected for research in the United States are publicly available and downloadable from the Internet. For example, to access data from the Health and Retirement Study, researchers can go to the study's website, register as a user — agreeing to the conditions of use — and download data about income, health insurance, physical and cognitive health, and more on 26,000 adults over age 50. There are also user guides on the website for additional help. (See box for a list of free datasets compiled by APA.)

Psychologists can pay to access other datasets. Indiana University psychology professor Brian D'Onofrio, PhD, for example, pays a company that integrates health insurance claims and collaborates with colleagues in Stockholm to access Swedish national registers. Together, these sources give him big data on youth, his area of research. D'Onofrio uses both to study attention-deficit hyperactivity disorder and to disentangle the effects of medications from those of the disorder itself on outcomes like substance abuse and suicide.

Indeed, such collaboration is key. Research with large datasets requires work with others. "This is big team science," says D'Onofrio. He notes his research group contains experts in computer science, database management and advanced biostatistics.

Through her postdoctoral fellowship, Jones identified senior investigators affiliated with the datasets she wanted to use. It's easy to reach out to the investigators and see if they are willing to collaborate, she says.

Those who work with large datasets also caution their colleagues about the amount of time it takes to analyze the data. It's true that time is saved by not collecting data, but processing it can negate those savings. Analyzing big datasets "is actually very labor intensive," says D'Onofrio. "No longer can you press return and just get the answer. Frequently, you press return and you have to come back in a day or two."

A more powerful computer might be needed to speed up processing of the largest datasets. And familiarity with software tools such as R, Python, and SQL is a must. Learning these tools can be intimidating, but a few resources are available, particularly for students.

Since 2009, APA has hosted an Advanced Training Institute on Exploratory Data Mining, a technique that identifies patterns in large datasets that may be missed with standard hypothesis-driven testing. University of Southern California psychologist John McArdle, PhD, suggested the topic because it wasn't taught in graduate school, and he also directs the weeklong program. Students are often surprised at the datasets' sizes and the statistical approaches used, but McArdle says the software is easy to learn and the student response has been great.

Jerry Davis, PhD, a professor at the University of Michigan Ross School of Business, created a similar program for social science students. He wanted students to take advantage of the "giant candy store of data out there." After two successful years, Davis says the Big Data Summer Camp has hit its stride. Students continue to meet throughout the year to help troubleshoot each other's projects.

Walter Sowden, a fourth-year psychology graduate student at Michigan who attended the camp, says that despite the steep learning curve, he was excited to incorporate big datasets into some of the traditional ways of looking at data. After all, "That's what you do as a scientist, you continuously learn. You've got to continue to grow with the technology."

There are also plenty of resources available online, from massive open online courses (MOOCs) on the different types of software to downloadable manuals and tutorial websites. In fact, the databases themselves often have help desks and statistical centers that can run some of the analyses.

Reaching out to other departments can also help psychologists learn to use these tools. "Seek out additional training in different areas, in different schools," says D'Onofrio. "Is there a school of public health? Can you take a biostatistics class?"

Davis recommends finding computer scientists or others on campus who are familiar with the software and buying them lunch. Sowden agrees: "I think what helps alleviate the intimidation is to find people who are in your same shoes, but are just a little bit farther ahead on the journey."

But perhaps the best way to learn is to just dive in, says Jones. "Something that I've really enjoyed in my postdoc is having this opportunity to work with big data," she says. "It really is a lot of fun."
