by Elodie Smith, Cornell University


A new protocol can detect and remove fake data created by bots and humans attempting to enroll in online research studies, preventing biased results and unwarranted payments to bad actors. It is the first such protocol specifically designed for data collected in rural communities.

The multistep protocol was inspired by a pandemic-era online study of health habits that suddenly generated hundreds of enrollment attempts, despite being based in a small rural town.

"When the study moved online, we became much more reliant on online recruitment and data collection techniques," said Karla Hanson, professor of practice in the Department of Public and Ecosystem Health in the College of Veterinary Medicine, and first author of the study, published in Methods and Protocols.

"It went from just a few individuals per day to hundreds overnight. It's implausible in a small rural town that several hundred people would enroll in our study all in one night."

To combat the issue, the researchers first removed any enrollment attempts that came from IP addresses outside the geographic study area, which filtered out 25% of attempts. However, this and other traditional automated techniques to remove fraudulent entries were insufficient for this study setting.
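A location filter of this kind can be sketched in a few lines of Python. This is only an illustration, not the researchers' implementation: it assumes the study area can be described as a list of IP network ranges, and the ranges, the field names and the helper functions `in_study_area` and `split_by_location` are hypothetical.

```python
import ipaddress

# Hypothetical study-area networks. These are reserved documentation ranges,
# used as placeholders; they are not data from the study.
STUDY_AREA_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_study_area(ip_text, networks=STUDY_AREA_NETWORKS):
    """Return True if the address falls inside any study-area network."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        # Assumption: a malformed address is treated as outside the area.
        return False
    return any(address in network for network in networks)


def split_by_location(attempts):
    """Split enrollment attempts into (kept, removed) by IP location."""
    kept, removed = [], []
    for attempt in attempts:
        if in_study_area(attempt["ip"]):
            kept.append(attempt)
        else:
            removed.append(attempt)
    return kept, removed
```

In practice the mapping from an IP address to a place usually comes from a geolocation service, which is outside the scope of this sketch.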

"We knew basic techniques, but none of them focus on rural areas specifically," Hanson said. "Some needed to be adapted to our population."

For example, another classic filtering tool limits enrollment to one person per IP address. But in rural settings where internet access is limited, Hanson said, many people in a household may share the same computer or use a public computer at a library.
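One way a strict one-per-IP rule might be relaxed for shared computers is to allow a small number of enrollments per address and send the rest to a person for review. The sketch below assumes that approach; the article does not say how the team adapted the rule, and the cap value, the field names and the function `apply_ip_cap` are hypothetical.

```python
from collections import defaultdict

# Hypothetical household-sized cap; not a value reported by the study.
MAX_PER_IP = 4


def apply_ip_cap(attempts, max_per_ip=MAX_PER_IP):
    """Accept up to max_per_ip attempts per IP and queue the rest for review.

    Attempts are assumed to arrive in submission order.
    """
    seen = defaultdict(int)
    accepted, needs_review = [], []
    for attempt in attempts:
        seen[attempt["ip"]] += 1
        if seen[attempt["ip"]] <= max_per_ip:
            accepted.append(attempt)
        else:
            # Over the cap: flag for a human check instead of rejecting outright.
            needs_review.append(attempt)
    return accepted, needs_review
```

Sending the overflow to manual review rather than rejecting it keeps a library computer or a large household from being excluded automatically.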

"To have a representative sample that was economically diverse," she said, "we needed to adapt that limitation."

After using automated tools, Hanson and colleagues turned to manual techniques, checking all submitted addresses against a postal database. "It was very time-consuming and expensive to do all these active validation tests," Hanson said. "And at each step, we found more fraudulent enrollment."

Payments offered to study participants attracted bots and led real people to try to enroll multiple times using fake identities. "When we called, sometimes people had no knowledge of the study, so they were considered fraudulent attempts and they were excluded from the study," Hanson said. "In some cases, the phone number did not even exist."

Ultimately, they found that 74% of the attempts were fraudulent. They also discovered that some screening criteria could be overzealous and exclude real participants. For example, some people who seemed to be legitimate participants reported weights that differed by 100 pounds between years one and two of the study. In those cases, the team verified the data over the phone.

"There is some caution to have when labeling a participant as fraudulent; some people do really lose a lot of weight," Hanson said. "There are also people who typed their weight wrong and we wanted to have a conversation with those participants and understand what was going on."

Similarly, some real participants entered a different date of birth in consecutive years. The team found that more than 40 of these cases were real participants, some of whom provided a fake date of birth out of concern about identity theft.
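The idea of following up on inconsistencies instead of excluding people automatically can be sketched as a check that only returns reasons to call a participant. This is an illustration under assumptions, not the published protocol: the record fields, the function `follow_up_reasons` and the use of the 100-pound difference as a threshold are hypothetical.

```python
# The 100-pound difference mentioned in the article, used here only as an
# illustrative threshold.
WEIGHT_CHANGE_LIMIT_LB = 100


def follow_up_reasons(year_one, year_two, weight_limit=WEIGHT_CHANGE_LIMIT_LB):
    """Return reasons to contact a participant; an empty list means none.

    A non-empty result marks the record for a phone call, not for exclusion.
    """
    reasons = []
    if abs(year_two["weight_lb"] - year_one["weight_lb"]) >= weight_limit:
        reasons.append("large weight change")
    if year_two["date_of_birth"] != year_one["date_of_birth"]:
        reasons.append("date of birth mismatch")
    return reasons
```

A record with a weight of 250 pounds in year one and 150 in year two, and the same date of birth both years, would return a single reason and be routed to a phone call.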

"We didn't trust people, but forgot that they, too, were suspicious of us," Hanson said.

While the published paper makes their multistep protocol accessible to other researchers, it also enables AI to learn about such screening techniques and trick future fraud detection systems. For this reason, the paper's authors describe categories of filtering techniques, but not the exact details of each approach.

"There will always be this ongoing race to keep ahead of the bots," Hanson said.

Nevertheless, Hanson believes the benefit of sharing these tools with other researchers outweighs the cost of releasing their findings publicly. Ultimately, Hanson said, while automated techniques are useful in reducing the time spent actively reviewing enrollment data, they are insufficient on their own.

"We need the human-to-human interaction with participants to ever be sure who they are," she said.

More information: Karla L. Hanson et al, Identifying and Removing Fraudulent Attempts to Enroll in a Human Health Improvement Intervention Trial in Rural Communities, Methods and Protocols (2024). DOI: 10.3390/mps7060093

Provided by Cornell University